Estimate simulation parameters — estimateParam • powsimR

This function estimates and returns parameters needed for power simulations.
The user needs to choose the following options at least: specify a gene expression matrix; the type of RNA-seq experiment, i.e. bulk or single cell; the recommended distribution is negative binomial (NB) except for single-cell full-length Smart-seq2 read data where we recommend zero-inflated NB (ZINB); the preferred normalisation method, we recommend scran for single cell and TMM or MR for bulk.
The other parameters are optional (additional data) or have preset values (gene and sample filtering). Please consult the detailed arguments description.

estimateParam(countData,
readData = NULL,
batchData = NULL,
spikeData = NULL,
spikeInfo = NULL,
Lengths = NULL,
MeanFragLengths = NULL,
RNAseq = c('bulk', 'singlecell'),
Protocol = c('UMI', 'Read'),
Distribution = c('NB', 'ZINB'),
Normalisation = c("TMM", "MR", "PosCounts", "UQ",
"scran", "Linnorm", "sctransform",
"SCnorm", "Census", "depth", "none"),
GeneFilter = 0.05,
SampleFilter = 5,
sigma = 1.96,
NCores = NULL,
verbose=TRUE)

Arguments

countData	is a (UMI) count `matrix` of gene expression. Rows correspond to genes, columns to samples. The gene names should be given as `rownames` without "_" in the names. The samples names should be given as `colnames`. The count matrix should only contain the expression of one group, e.g. wildtype / untreated control / one cell type population.
readData	is a the matching read count `matrix` of gene expression if `countData` is UMI and the same formatting should be applied. Default is `NULL` and users only need to supply the read count matrix if they plan to apply downsampling of UMI counts for simulations, see `Setup`.
batchData	is a `data.frame` for batch annotation. Rows correspond to samples. The first column should contain the batches, e.g. 'a', 'b', 'c', etc.
spikeData	is a count `matrix`. Rows correspond to spike-ins, columns to samples. The order of columns should be the same as in the `countData`. This is only needed for spike-in aware normalisation methods ('MR', 'Linnorm', 'scran', 'SCnorm', 'bayNorm', 'Census'), see Details.
spikeInfo	is a molecule count `matrix` of spike-ins. Rows correspond to spike-ins. The order of rows should be the same as in the `spikeData`. The column names should be 'SpikeID' and 'SpikeInput' for molecule counts of spike-ins. This is only needed for spike-in aware normalisation methods (), see Details.
Lengths	is a numeric vector of transcript lengths with the same length and order as the rows in countData. This variable is only needed for internal gene length corrections (TPM), see Details.
MeanFragLengths	is a numeric vector of mean fragment lengths with the same length as columns in countData. This variable is only needed for internal gene length corrections (TPM), see Details.
RNAseq	is a character value: "bulk" or "singlecell".
Protocol	is a character value defining the type of counts given in `countData`. Options are "UMI" (e.g. 10X Genomics, CEL-seq2) or "Read" (e.g. Smart-seq2).
Distribution	is a character value: "NB" for negative binomial or "ZINB" for zero-inflated negative binomial distribution fitting.
Normalisation	is a character value: 'TMM', 'MR', 'PosCounts', 'UQ', 'scran', 'Linnorm', 'SCnorm', 'Census', 'depth', 'none'. For more information, please consult the Details section.
GeneFilter	is a numeric vector indicating the minimal proportion of nonzero expression values for a gene across all samples to be considered expressed and used for normalisation and parameter estimation. The default is `0.05`, i.e. at least 5% of the expression values per gene need to be nonzero.
SampleFilter	is a numeric vector indicating the minimal number of MADs (median absolute deviation) away from the median number of features detected as well as sequencing depth across all samples so that outlying samples are removed prior to normalisation and parameter estimation. The default is `5`, i.e. at least 5 MADs away from the median. Choose higher values if you want to filter out less samples. This parameter is particularly important for single cells to ensure reliable parameter estimation. For more information, please consult `isOutlier`.
sigma	The variability band width for mean-dispersion loess fit defining the prediction interval for read count simulation. Default is 1.96, i.e. 95% interval. For more information see `loess.sd`.
NCores	The number of cores for normalisation method SCnorm and Census. The default `NULL` means 1 core.
verbose	Logical value to indicate whether to print function information. Default is `TRUE`.

Value

List object with the following entries:

Parameters

A list object containing the estimated moments for the full, dropped out genes, dropped out samples and filtered normalized count matrix. For more information please consult the details section and the plot made with plotParam.

Fit

A list object containing the fitting results of the mean-dispersion and mean-dropout relation as well as the estimated parameter data used for the fits. For more information please consult the details section and the plot made with plotParam.

totalS,totalG

Number of samples and genes provided with at least one read count.

DropOuts

List object containing logical vectors for gene and sample dropouts after applying gene frequency and sample outlier filtering.

The estimated library size factor per sample.

Lengths -..- SampleFilter

The chosen parameters settings.

Details

Normalisation Methods

TMM, UQ

employ the edgeR style normalization of weighted trimmed mean of M-values and upperquartile as implemented in calcNormFactors, respectively.

MR, PosCounts

employ the DESeq2 style normalization of median ratio method and a modified geometric mean method as implemented in estimateSizeFactors, respectively.

scran, SCnorm

apply the deconvolution and quantile regression normalization methods developed for sparse RNA-seq data as implemented in calculateSumFactors and SCnorm, respectively. Spike-ins can also be supplied for both methods via spikeData. Note, however that this means for scran that the normalisation as implemented in computeSpikeFactors is also applied to genes (general.use=TRUE).

Linnorm

apply the normalization method for sparse RNA-seq data as implemented in Linnorm.Norm. For Linnorm, the user can also supply spikeData.

sctransform

apply the normalization method developed for single-cell UMI RNA-seq data as implemented in vst.

Census

converts relative measures of TPM/FPKM values into mRNAs per cell (RPC) without the need of spike-in standards. Census at least needs Lengths for single-end data and preferably MeanFragLengths for paired-end data. The authors state that Census should not be used for UMI data.

depth

Sequencing depth normalisation.

none

No normalisation is applied. This approach can be used for prenormalized expression estimates, e.g. cufflinks, RSEM or salmon.

Examples

if (FALSE) {
# Single Cells
data("SmartSeq2_Gene_Read_Counts")
Batches <- data.frame(Batch = sapply(strsplit(colnames(SmartSeq2_Gene_Read_Counts), "_"), "[[", 1),
                      stringsAsFactors = FALSE, row.names = colnames(SmartSeq2_Gene_Read_Counts))
data("GeneLengths_mm10")
estparam <- estimateParam(countData = SmartSeq2_Gene_Read_Counts,
                          readData = NULL,
                          batchData = Batches,
                          spikeData = SmartSeq2_SpikeIns_Read_Counts,
                          spikeInfo = SmartSeq2_SpikeInfo,
                          Lengths = GeneLengths, MeanFragLengths = NULL,
                          RNAseq = 'singlecell', Protocol = 'Read',
                          Distribution = 'ZINB', Normalisation = "scran",
                          GeneFilter = 0.1, SampleFilter = 3,
                          sigma = 1.96, NCores = NULL, verbose = TRUE)

# Bulk
data("Bulk_Read_Counts")
data("GeneLengths_hg19")
estparam <- estimateParam(countData = Bulk_Read_Counts,
                          readData = NULL,
                          batchData = NULL,
                          spikeData = NULL,
                          spikeInfo = NULL,
                         Lengths = GeneLengths_hg19,
                          MeanFragLengths = NULL,
                          RNAseq = 'bulk', Protocol = 'Read',
                          Distribution = 'NB', Normalisation = "MR",
                          GeneFilter = 0.1, SampleFilter = 3,
                          sigma = 1.96, NCores = NULL, verbose = TRUE)

# plot the results of estimation
plotParam(estparam, Annot = FALSE)
}