= Ensemble Generation =

[[PageOutline]]

== Overview ==

To start ensemble-based data assimilation one has to generate an ensemble of model states that represents both the state estimate (provided by the ensemble mean state) and the uncertainty of the state estimate, given by the ensemble spread. In case of ensemble Kalman filters, the uncertainty is estimate by the sample covariance matrix of the ensemble, i.e. the covariance matrix of the deviations of the ensemble states from the ensemble mean state.

There are different possibilities to generate an ensemble. With PDAF we currently provide support for one sampling method, the so-called second-order exact sampling from a model trajectory. This method was introduced by Pham et al. (see e.g. Pham, Monthly Weather Review 129 (2001) 1194-1207). The method is as follows:
1. Run the model with which the data assimilation will be performed and write a trajectory of model snapshots into files.
1. Generate an EOF-composition (PDAF provides the routine `PDAF_eofcovar` to support this step)
 a. In a separate program read the model snapshots. 
 a. Then compute the mean state over time and subtract it from all snapshots. 
 a. Now compute the singular value decomposition of the resulting matrix. This is known as the EOF-decomposition. It provides the singular vectors (EOFS, empirical orthogonal functions) and singular values (EOF coefficients). The number of EOFs is usually '''r''' if the trajectory contains '''r+1''' model snapshots. 
 a. Store the singular vectors and singular values in a file.
1. Now perform the second-order exact sampling. (PDAF provides the routine `PDAF_sampleens` tu support this step). This is usually done in the ensemble-initialization routine, e.g. `init_ens_pdaf.F90`:
 a. Read the singular values and vectors into arrays. 
 a. Multiply each singular vector with the corresponding singular value. 
 a. Multiply the resulting matrix by a random matrix that preserves the mean and the covariances. PDAF provides the routine `PDAF_seik_omega` to generate this matrix, which is used inside `PDAF_sampleens`. 
 a. Now multiply the result with the square-root of the ensemble size minus one. The resulting matrix holds ensemble perturbations. Thus, the sum over each row is zero. 
 a. Finally, add a desired central (mean) state of the ensemble to the matrix. This could, e.g., be the model state at the time at which the data assimilation will be started. 

The resulting ensemble size will be **N** if one uses **N-1** EOFs in step 3. Thus, to control the ensemble size ('''N'''), one can just read in the '''N-1''' leading singular vectors (those with the largest singular values) and perform the second-order exact sampling with these vectors. 

The described sampling method is efficient because it uses the long time variability of the model dynamics to estimate the uncertainty. The trajectory and hence the full set of singular vectors hold the full information on the variability. If a large number of singular vectors is stored in step 2, one can freely choose the ensemble size up to '''r+1'''. Using the leading EOFs provides a systematic approximation of the model variability. Thus, a small ensemble will contain the leading patterns of model variability, which have the largest amplitudes. Increasing the ensemble will systematically add finer patterns. 

One can refine the method, e.g. by a multivariate normalization that takes into account a distinct variability of different model fields, or by replacing the long-time mean by a running mean. Such aspects are discussed further below.

== Ensemble generation with PDAF ==

In order to simplify the ensemble generation with second-order exact sampling, PDAF provides two routines: 

=== PDAF_eofcovar ===

For step 2, the routine [wiki:PDAF_eofcovar PDAF_eofcovar] helps to compute the singular value decomposition. For this task, one provides the routine with the matrix holding the model snapshots. One can also let the routine compute the mean state over time, or one can compute and subtract the mean state before (this is particularly useful if one does not want to subtract the mean state over the full model trajectory). The routine computes the singular values decomposition and returns the singular vectors and values. 

Using `PDAF_eofcovar`, one has to read the trajectory from files. Then. one calls the routine, and finally one stores the singular vectors and values into a file. This procedure is demonstrated for the Lorenz-96 model in `models/lorenz96/tools/generate_covar.F90`.

For the documentation of the calling interface of `PDAF_eofcovar` see [wiki:PDAF_eofcovar the detail page about the routine]. 

=== PDAF_sampleens ===

For step 3, the routine [wiki:PDAF_sampleens PDAF_sampleens] performs the second-order exact sampling. For this task, one provides to the routine the singular vectors and values from step 2 and a central state for the ensemble mean. The routine will then generate and return the ensemble to the program that called `PDAF_sampleens`. 

The ensemble generation is usually done when a data assimilation sequence is initialized. One reads the singular vectors and values from the file generate in step 2. In addition, one reads the central state of the ensemble. Then, one calls `PDAF_sampleens`  and obtains an ensmeble for the data assimilation. 

There are two variants to use `PDAF_sampleens`: 
1. One can use the routine in the data assimilation program in the call-back routine that initializes the ensemble array for PDAF (usually `init_ens_pdaf`). Thus, one starts the data assimilation program.  `PDAF_init` will call `init_ens_pdaf`. Here, one can use `PDAF_sampleens` to generate the ensemble from the stored singular values and vectors and directly store it in the ensemble array provided by PDAF. Thus, one can directly start the data assimilation sequence. This variant is demonstrated for the Lorenz-96 model in `models/lorenz96/init_ens_eof.F90`. 
1. One can use a separate program that reads the singular values and vectors, and a central ensemble state and then calls `PDAF_sampleens` to generate the ensemble. Then one can store the ensemble state in ensemble restart files. This variant can be useful for both the online- and offline modes of PDAF: 
 * In the online mode, one would implement `init_ens_pdaf` as a routine reading the ensemble states from files. This is useful, e.g., for the case of a large scale data assimilation when the execution time for the forecasts and the analysis steps is so long that one needs to checkpoint and restart the data assimilation. In this case the checkpointing could write ensemble restart files in the same format as the ensemble generation program does. Thus, one can use `init_ens_pdaf` to read ensemble restart files either at the very beginning of a data assimilation sequence or when the assimilation program is restarted. 
 * In the offline-mode the data assimilation program only contains the analysis step, while the forecasts are computed by the model program. Thus, the model program just reads model restart files, but does not execute `init_ens_pdaf`. In this case, one would let the program that generates the ensemble using `PDAF_sampleens` write restart files for the model. The model will use these files to initialize the model fields so that the forecast for an ensemble state can be computed. At the end of a forecast phase, the model writes new restart files. The restart files for each ensemble state are then read by the data assimilation program to compute the analysis step. 

== Variants and options ==

=== Mean state === 

The routine `PDAF_eofcovar` has the option to compute and subtract the mean state over time before computing the singular value decomposition. Alternatively, one can subtract the mean before calling `PDAF_eofcovar`. The first variant is only useful if the time-mean state over the full model trajectory should be subtracted. In many cases this would not be a good choice. For example if one runs an ocean model over one year the state trajectory will contain the seasonal cycle. However, the seasonal cycle is usually not an uncertainty (If one starts the data assimilation, e.g., in January, one knows the typical conditions for January). In this case it will be useful to compute a running mean state, e.g. over 2-3 months and subtract this mean state from the snapshots. This computation is case-specific and can only be done before calling `PDAF_eofcovar`.

=== Multivariate normalization ===

The multivariate normalization takes into account the different variability of different model variables. The routine `PDAF_eofcovar` allows to activate a multivariate normalization. The normalization is motivated by the fact that different model variables can show very different variability. For example, the temperature field in a global ocean model varies much more than the salinity or the velocities. Without multivariate normalization, the leading singular vectors can contain mainly information from those variables that show the largest variability. Thus, if the ensemble is then generated from the leading singular vectors, the variability of some model variables might be under-represented. The multivariate normalization avoids this case, because it normalizes the variability of each model variable to one before computing the singular value decomposition. Thus, the singular value decomposition treats each model fields equally. The model fields in the resulting singular vectors are then re-scaled to their original variability by `PDAF_eofcovar`. Due to the normalization, the leading singular vectors will contain contributions from all different model variables. Thus, the different fields should be better represented in the ensemble states. 

=== Central ensemble state ===

Using second-order exact sampling, one can freely choose the central ensemble state  (i.e. the ensemble mean), which is usually used as the state estimate in ensemble-based Kalman filters. Usually, one should try to select the best state estimate one has available. A long-term mean state would be a very uninformed state. It is up to the user to choose a good initial central state.

=== Scaling the initial ensemble ===

Using the model variability does not guarantee that the estimated uncertainty is realistic. While the sampling provides estimates of the covariances that result from the model dynamics, the variances might not have a realistic amplitude. One can check this by comparing the ensemble-estimated variances (thus, the diagonal of the ensemble sample error covariance matrix, which is computed in the examples like the Lorenz-96 model. It can also be computed with the routine [wiki:DataAssimilationDiagnostics#PDAF_diag_variance PDAF_diag_variance].) with the deviation of the ensemble mean from the observations. Often, one can plot both quantities like model fields (e.g. if ocean surface temperature is observed). Both quantities will be different, but if they are on the same order of magnitude the uncertainty estimate is usually realistic. If the uncertainty is overestimated, one can reduce the ensemble spread (by multiplying the ensemble perturbations by a factor between 0 and 1). Likewise one can increase the ensemble spread, if it underestimates the difference between the ensemble mean and the observations.