# Ensemble Generation

## Overview

To start ensemble-based data assimilation one has to generate an ensemble of model states that represents both the state estimate (provided by the ensemble mean state) and the uncertainty of the state estimate, given by the ensemble spread. In case of ensemble Kalman filters, the uncertainty is estimate by the sample covariance matrix of the ensemble, i.e. the covariance matrix of the deviations of the ensemble states from the ensemble mean state.

There are different possibilities to generate an ensemble. With PDAF we currently provide support for one sampling method, the co-called second-order exact sampling from a model trajectory. This method was introduced by Pham et al. (see e.g. Pham, Monthly Weather Review 129 (2001) 1194-1207). The method is as follows:

- Run the model with which the data assimilation will be performed and write a trajectory of model snapshots into files.
- In a separate program read the model snapshots. Compute the temporal mean state and subtract it from all snapshots. Now compute the singular value decomposition of the resulting matrix. This is known as the EOF-decomposition. It provides the singular vectors (EOFS, empirical orthogonal functions) and singular values (EOF coefficients). The number of EOFs is usually
**r**if the trajectory contains**r+1**model snapshots. Store the singular vectors and singular values in a file. - Now perform the second-order exact sampling: This is usually in the ensemble-initialization routine, e.g.
`init_ens_pdaf.F90`. Here, read the singular values and vectors into arrays. Multiply each singular vector with the corresponding singular value. Multiply the resulting matrix by a random matrix that preserves the mean and the covriances. This particular matrix can be generated using Householder reflections. PDAF provides the routine`PDAF_seik_omega`to generate this matrix. Now multiply the result with the square-root of the ensemble size minus one. The resulting matrix holds ensemble perturbations. Thus the sum over each row is very. Finally add a desired central (mean) state to the matrix. This could e.g. by the model state at the time at which the data assimilation will be started. The ensemble size will be the number of singular vectors (**r**) plus one. Thus, to control the ensemble size (**m**), one can just read in the**m-1**leading singular vectors (those with the largest singular values) and perform the second-order exact sampling with these vectors.

The described sampling method is efficient because it uses the long time variability of the model dynamics to estimate the uncertainty. The trajectory and hence the full set of singular vectors hold the full information on the variability. If a large number of singular vectors is stored in step 2 into a file, one can freely choose the ensemble size up to **r+1**. Using the leading EOFs provides a systematic approximation of the model variability. Thus, a small ensemble will contain the leading patterns of model variability, which have the largest amplitudes. Increasing the ensemble will systematically add finer patterns.

## Ensemble generation with PDAF

In order to simplify the ensemble generation with second-order exact sampling, PDAF provides two routines:

### PDAF_eofcovar

For step 2, the routine PDAF_eofcovar helps to compute the singular value decomposition. For this task, one provides the routine with the matrix holding the model snapshots. One can also let the routine compute the temporal mean state, or one can compute and subtract the mean state before (this is particularly useful if one does not want to subtract the mean state over the full model trajectory). The routine computes the singular values decomposition and returns the singular vectors and values.

Using `PDAF_eofcovar`, one has to read in the trajectory. Then one calls the routine, and finally one stores the singular vectors and values into a file. This procedure is demonstrated for the Lorenz-96 model in `testsuite/src/lorenz96/tools/generate_covar.F90`.

For the documentation of the calling interface of `PDAF_eofcovar` see the detail page about the routine.

### PDAF_sampleens

For step 3, the routine PDAF_sampleens performs the second-order exact sampling. For this task, one provides the routine the singular vectors and values from step 2 and a central state for the ensemble mean. The routine will then generate and return the ensemble to the program that called `PDAF_sampleens`.

The ensemble generation is usually done when a data assimilation sequence is initialized. One read the singular vectors and values from the file generate in step 2. In addition one read the central state of the ensemble. then one calls `PDAF_sampleens` and obtains an ensmeble for the data assimilation.

There are two variants to use `PDAF_sampleens`:

- One can use the routine in the data assimilation program in the call-back routine that initializes the ensemble array for PDAF (usually
`init_ens_pdaf`). Thus one starts the data assimilation program. when`PDAF_init`is executed, it calls`init_ens_pdaf`. Here one can use`PDAF_sampleens`to generate the ensemble from the stored singular values and vectors and directly store it in the ensemble array provided by PDAF. Thus, one can directly start the data assimilation sequence. This variant is demonstrated for the Lorenz-96 model in`testsuite/src/lorenz96/init_ens_eof.F90`. - One can use a separate program that reads the singular values and vectors, and a central ensemble state and then calls
`PDAF_sampleens`to generate the ensemble. Then one can store the ensemble state in ensemble restart files. This variant can be useful for both the online- and offline modes of PDAF:

In the online mode, one would implement`init_ens_pdaf`as a routine reading the ensemble states from files. This is useful, e.g., for the case of a large scale data assimilation when the execution time for the forecasts and the analysis steps is so long that one needs to checkpoint and restart the data assimilation. In this case the checkpointing could write ensemble restart files in the same format as the ensemble generation program does. Thus, one can use`init_ens_pdaf`to read ensemble restart files either at the very beginning of a data assimilation sequence or when the assimilation program is restarted.

In the offline-mode the data assimilation program only contains the analysis step, while the forecasts are computed by the model program. Thus, the model program just reads model restart files, but does not execute`init_ens_pdaf`. In this case one would let the program that generates the ensemble used`PDAF_sampleens`write restart files for the model. The model will use these files to initialize the model fields so that the forecast for an ensemble state can be computed. At the end of a forecast phase, the model writes new restart files. The restart files for each ensemble state are then read by the data assimialtion program to compute the analysis step.

## Variants and options

### Mean state

The routine `PDAF_eofcovar` has the option to compute and subtract the temporal mean state before computing the singular value decomposition. Alternatively, one can subtract the mean before calling `PDAF_eofcovar`. The first variant is only useful if the temporal mean state over the full model trajectory should be subtracted. In many cases this would not be a good choice. For example if one runs an ocean model over one year the state trajectory will contain the seasonal cycle. However, the seasonal cycle is usually not an uncertainty (If one starts the data assimilation e.g. in January, one knows the typical conditions for January). In this case it will be useful to compute a running mean state, e.g. over 2-3 months and subtract this mean state from the snapshots. This computation is case-specific and can only be done before calling `PDAF_eofcovar`.

### Multivariate normalization

The multivariate normalization normalizes the different variability of different model variables. The routine `PDAF_eofcovar` allows to activate a multivariate normalization. The normalization is motivated by the fact that different model variables can show very different variability. For example, the temperature field in a global ocean model varies much more than the salinity or the velocities. Without multivariante normalization, the leading singular vectors can contain mainly information from those variables that show the largest variability. thus, if the ensemble is then generated from the leading singular vectors, the variability of some model variables might be underrepresented. The multivariate normalization avoid this case, because it normalizes the variability of each model variable to one before computing the singular value decomposition. Thus, the singular value decomposition treats each model fields equivalently. The model fields in the resulting singular vectors are then re-scaled to their original variability by `PDAF_eofcovar`. Due to the normalization, the leading singular vectors will contain contributions from all different model variables. thus, the different fields should be better represented in the ensemble states.

### Central ensemble state

Using second-order exact sampling one can freely choose the central ensemble state (i.e. the ensemble mean), which is usually used as the state estimate in ensemble Kalman filters. Usually, one should try to select the best state estimate one has available. A long-term mean state would be a very uninformed state. However, if one wants to access the data assimilation influence, also the long-time mean, e.g. the mean state over the whole trajectory in step 2 above could be a useful choice (This is e.g. used in the example of the Lorenz-96 model. However this model is chaotic without seasonality, so that this state does in deed represent the best estimate).

### Scaling the initial ensemble

using the model variability does not guarantee that the estimated uncertainty is realistic. While the sampling provides estimates of the covariances that result from the model dynamics, the variances might not have the expected amplitude. One can check this by comparing the estimated variances (i.e. the diagonal of the ensmeble sample error covariance matrix, which is computed in the examples like the Lorenz-96 model) with the deviation of the ensemble mean from the observations. Often, one can plot both quantities like model fields (e.g. if ocean surface temperature is observed). Both quantities will be different, but if they are on the same order of magnitude the uncertainty estimate is usually realistic. If the uncertainty is overestimated, one can reduce the ensemble spread (by multiplying the ensemble perturbations by a factor between 0 and 1). Likewise one can increase the ensmeble spread, if it underestimates the difference between the ensemble mean and the observations.