simulateDE is the main function to simulate differential expression for RNA-seq experiments. The simulation parameters are specified with Setup. The user now needs to specify the RNA-seq Analysis Pipeline including preprocessing, normalisation and differential testing method. The return object contains DE test results from all simulations as well as descriptive statistics. The error matrix calculations will be conducted with evaluateDE.

simulateDE(SetupRes,
Prefilter = NULL,
Imputation = NULL,
Normalisation = c("TMM", "MR", "PosCounts", "UQ",
"scran", "Linnorm", "sctransform",
"SCnorm", "Census", "depth"),
Label = "none",
DEmethod = c("T-Test", "edgeR-LRT", "edgeR-QL",
"edgeR-zingeR", "edgeR-ZINB-WaVE",
"limma-voom", "limma-trend",
"DESeq2", "DESeq2-zingeR", "DESeq2-ZINB-WaVE",
"ROTS", "baySeq", "NOISeq", "EBSeq",
"MAST", "BPSC", "scDD", "DECENT"),
DEFilter = FALSE,
Counts = FALSE,
NCores = NULL,
verbose = TRUE)

Arguments

SetupRes

This object specifies the simulation setup. This must be the return object from Setup.

Prefilter

A character vector specifying the gene expression filtering method to be used prior to normalisation (and possibly imputation). Default is NULL, i.e. no filtering. Please consult the Details section for available options.

Imputation

A character vector specifying the gene expression imputation method to be used prior to normalisation. Default is NULL, i.e. no imputation. Please consult the Details section for available options.

Normalisation

Normalisation method to use. Please consult the Details section for available options.

Label

A character vector to define whether information about group labels should be used for normalisation. This is only implemented for scran and SCnorm. Possible options include the default "none" which means that no sample group information is considered for normalisation; "known" means that the simulated group labels are used and "clustering" which applies an unsupervised hierarchical clustering to determine the group labels (see quickCluster for details).

DEmethod

A character vector specifying the DE detection method to be used. Please consult the Details section for available options.

DEFilter

A logical vector indicating whether to run DE testing on filtered and/or imputed count data. Default is FALSE.

Counts

A logical vector indicating whether the simulated count matrix is also provided as output. Default is FALSE since the output can be quite large. Note that if DEFilter is TRUE, then the returned count matrix will countain the filtered and/or imputed count data.

NCores

integer positive number of cores for parallel processing. Default is NULL, i.e. 1 core.

verbose

Logical vector to indicate whether to show progress report of simulations. Default is TRUE.

Value

SimulateRes: Results of DE simulations

pvalue, fdr

3D array (ngenes * N * nsims) for p-values and FDR from each simulation. Note that FDR values will be empty and the calculation will be done by evaluateDE whenever applicable.

mu,disp,dropout

3D (ngenes * N * nsims) array for mean, dispersion and dropout of library size factor normalized read counts.

true.mu,true.disp,true.dropout

3D (ngenes * N * nsims) array for true mean, dispersion and dropout of simulated read counts.

true.depth,est.depth

True simulated and processed (after prefiltering and/or imputation if defined) sequencing depth per sample.

est.sf

Global library size factor estimates per sample.

est.gsf

3D array (ngenes * N * nsims) for size factor estimates. These are gene- and sample-wise estimates and only for SCnorm and Linnorm normalisation.

elfc,rlfc

3D array (ngenes * N * nsims) for log fold changes (LFC): elfc is for the DE tool estimated LFC; rlfc is for the LFC estimated from the normalised read counts.

time.taken

The time taken given by proc.time for each simulation, given for preprocessing, normalisation, differential expression testing and moment estimation.

SetupRes: Simulation specifications

DESetup - ... - estSpikeRes

Reiterating the simulated setup defined by Setup.

Pipeline

A list of chosen pipeline tools defined by above arguments.

Counts: Simulated Count Matrices

Counts

List of lists object where Counts[[Sample Size Setup]][[Simulation Run]] containing simulated counts. Note that this will only be returned when Counts is TRUE. In addition, if DEFilter is TRUE then the filtered/imputed counts are returned.

Details

Here you can find detailed information about preprocessing, imputation, normalisation and differential testing choices. For recommendations concerning single cell RNA-sequencing pipelines, we kindly refer the user to Vieth, et al (2019). A systematic evaluation of single cell RNA-seq analysis pipelines. Nature Communications, 10(1), 4667.
Prefiltering

CountFilter

removes genes that have a mean expression below 0.2.

FreqFilter

removes genes that have more than 80 % dropouts.

Imputation

scImpute

apply the imputation as implemented in scimpute

DrImpute

apply the imputation as implemented in DrImpute.

SAVER

apply the imputation as implemented in saver.

scone

apply the imputation as implemented in scone, defining 'house keeping genes' for the FNR model estimation as those that have less than 20 % dropouts and small variance (i.e. in the lower 20th quartile). If less than 25 genes could be identified, the genes with less than 5 % dropouts are used. If spike-in data is provided then these are used for the FNR model estimation.

MAGIC

apply the imputation as implemented in magic. Please note that the python package MAGIC needs to be installed to use this implementation.

Normalisation

TMM, UQ

employ the edgeR style normalization of weighted trimmed mean of M-values and upperquartile as implemented in calcNormFactors, respectively.

MR, PosCounts

employ the DESeq2 style normalization of median ratio method and a modified geometric mean method as implemented in estimateSizeFactors, respectively.

scran, SCnorm

apply the deconvolution and quantile regression normalization methods developed for sparse RNA-seq data as implemented in computeSumFactors and SCnorm, respectively. Spike-ins can also be supplied for both methods via spikeData. Note, however that this means for scran that the normalisation as implemented in calculateSumFactors is also applied to genes (general.use=TRUE).

Linnorm

apply the normalization method for sparse RNA-seq data as implemented in Linnorm.Norm. For Linnorm, the user can also supply spikeData.

sctransform

apply the normalization method developed for single-cell UMI RNA-seq data as implemented in vst.

Census

converts relative measures of TPM/FPKM values into mRNAs per cell (RPC) without the need of spike-in standards. Census at least needs Lengths for single-end data and preferably MeanFragLengths for paired-end data. The authors state that Census should not be used for UMI data.

depth

Sequencing depth normalisation.

Differential testing

T-Test

A T-Test per gene is applied using log transformed and normalized expression values (i.e. CPM or TPM).

limma-trend, limma-voom

apply differential testing as implemented in lmFit followed by eBayes on counts transformed by voom or by applying mean-variance trend on log CPM values in eBayes.

edgeR-LRT, edgeR-QL

apply differential testing as implemented in glmFit, glmLRT andglmQLFit, glmQLFTest, respectively.

DESeq2

applies differential testing as implemented in DESeq.

ROTS

applies differential testing as implemented in ROTS with 100 permutations on transformed counts (voom).

baySeq

applies differential testing as implemented in getLikelihoods based on negative binomial prior estimates (getPriors.NB).

NOISeq

applies differential testing as implemented in noiseqbio based on CPM values.

EBSeq

applies differential testing as implemented in EBTest.

MAST

applies differential testing as implemented in zlm for zero-inflated model fitting followed by lrTest on log CPM values.

BPSC

applies differential testing as implemented in BPglm on CPM values.

scDD

applies differential testing as implemented in scDD on CPM values.

DECENT

applies differential testing as implemented in decent.

edgeR-zingeR, DESeq2-zingeR

In a first step, the posterior probabilities of the zero-inflated negative binomial component are estimated (see zeroWeightsLS) and used to define a weight matrix for dispersion estimation in estimateDisp. For the edgeR approach, the generalized model as implemented in glmFit is fitted. This is followed by an adapted LRT for differential testing to account for the weighting (see glmWeightedF). For DESeq2, the generalized linear model coefficients are estimated using nbinomWaldTest and the weighting is done by setting the degrees of freedom for the T distribution.

edgeR-ZINB-WaVE, DESeq2-ZINB-WaVE

In a first step, a zero-inflated negative binomial regression model is fitted (see zinbFit) to estimate observational weights (see computeObservationalWeights) used for dispersion estimation in estimateDisp. For the edgeR approach, the generalized model as implemented in glmFit is fitted. This is followed by an adapted LRT for differential testing to account for the weighting (see glmWeightedF). For DESeq2, the generalized linear model coefficients are estimated using nbinomWaldTest and the weighting is done by setting the degrees of freedom for the T distribution.

See also

estimateParam and estimateSpike, for parameter specifications;
Setup for simulation setup;
evaluateDE, evaluateROC and evaluateSim for evaluation of simulation results.

Examples

if (FALSE) { # estimate gene parameters data("SmartSeq2_Gene_Read_Counts") Batches = data.frame(Batch = sapply(strsplit(colnames(SmartSeq2_Gene_Read_Counts), "_"), "[[", 1), stringsAsFactors = F, row.names = colnames(SmartSeq2_Gene_Read_Counts)) data("GeneLengths_mm10") estparam_gene <- estimateParam(countData = SmartSeq2_Gene_Read_Counts, readData = NULL, batchData = Batches, spikeData = NULL, spikeInfo = NULL, Lengths = GeneLengths_mm10, MeanFragLengths = NULL, RNAseq = 'singlecell', Protocol = 'Read', Distribution = 'ZINB', Normalisation = "scran", GeneFilter = 0.1, SampleFilter = 3, sigma = 1.96, NCores = NULL, verbose = TRUE) # estimate spike parameters data("SmartSeq2_SpikeIns_Read_Counts") data("SmartSeq2_SpikeInfo") Batches = data.frame(Batch = sapply(strsplit(colnames(SmartSeq2_SpikeIns_Read_Counts), "_"), "[[", 1), stringsAsFactors = F, row.names = colnames(SmartSeq2_SpikeIns_Read_Counts)) estparam_spike <- estimateSpike(spikeData = SmartSeq2_SpikeIns_Read_Counts, spikeInfo = SmartSeq2_SpikeInfo, MeanFragLength = NULL, batchData = Batches, Normalisation = 'depth') # define log fold change p.lfc <- function(x) sample(c(-1,1), size=x,replace=T)*rgamma(x, shape = 1, rate = 2) # set up simulations setupres <- Setup(ngenes = 10000, nsims = 10, p.DE = 0.1, pLFC = p.lfc, n1 = c(20,50,100), n2 = c(30,60,120), Thinning = c(1,0.9,0.8), LibSize = 'given', estParamRes = estparam_gene, estSpikeRes = estparam_spike, DropGenes = FALSE, sim.seed = 52679, verbose = TRUE) # run simulation simres <- simulateDE(SetupRes = setupres, Prefilter = "FreqFilter", Imputation = NULL, Normalisation = 'scran', Label = 'none', DEmethod = "limma-trend", DEFilter = FALSE, NCores = NULL, verbose = TRUE) # quick evaluation evalderes <- evaluateDE(simRes = simres) # plot evaluation plotEvalDE(evalRes = evalderes, rate = "marginal", quick = TRUE, Annot = FALSE) }