simulateDE is the main function to simulate differential expression for RNA-seq experiments.
The simulation parameters are specified with Setup
.
The user now needs to specify the RNA-seq Analysis Pipeline including preprocessing, normalisation and differential testing method.
The return object contains DE test results from all simulations as well as descriptive statistics.
The error matrix calculations will be conducted with evaluateDE
.
simulateDE(SetupRes, Prefilter = NULL, Imputation = NULL, Normalisation = c("TMM", "MR", "PosCounts", "UQ", "scran", "Linnorm", "sctransform", "SCnorm", "Census", "depth"), Label = "none", DEmethod = c("T-Test", "edgeR-LRT", "edgeR-QL", "edgeR-zingeR", "edgeR-ZINB-WaVE", "limma-voom", "limma-trend", "DESeq2", "DESeq2-zingeR", "DESeq2-ZINB-WaVE", "ROTS", "baySeq", "NOISeq", "EBSeq", "MAST", "BPSC", "scDD", "DECENT"), DEFilter = FALSE, Counts = FALSE, NCores = NULL, verbose = TRUE)
SetupRes | This object specifies the simulation setup.
This must be the return object from |
---|---|
Prefilter | A character vector specifying the gene expression filtering method
to be used prior to normalisation (and possibly imputation).
Default is |
Imputation | A character vector specifying the gene expression imputation method
to be used prior to normalisation.
Default is |
Normalisation | Normalisation method to use. Please consult the Details section for available options. |
Label | A character vector to define whether information about
group labels should be used for normalisation.
This is only implemented for scran and SCnorm.
Possible options include the default |
DEmethod | A character vector specifying the DE detection method to be used. Please consult the Details section for available options. |
DEFilter | A logical vector indicating whether to run DE testing on
filtered and/or imputed count data.
Default is |
Counts | A logical vector indicating whether the simulated count matrix is also provided as output.
Default is |
NCores | integer positive number of cores for parallel processing.
Default is |
verbose | Logical vector to indicate whether to show progress report of simulations.
Default is |
SimulateRes: Results of DE simulations
3D array (ngenes * N * nsims) for p-values and FDR from each simulation.
Note that FDR values will be empty and the calculation will be done by evaluateDE
whenever applicable.
3D (ngenes * N * nsims) array for mean, dispersion and dropout of library size factor normalized read counts.
3D (ngenes * N * nsims) array for true mean, dispersion and dropout of simulated read counts.
True simulated and processed (after prefiltering and/or imputation if defined) sequencing depth per sample.
Global library size factor estimates per sample.
3D array (ngenes * N * nsims) for size factor estimates. These are gene- and sample-wise estimates and only for SCnorm and Linnorm normalisation.
3D array (ngenes * N * nsims) for log fold changes (LFC): elfc is for the DE tool estimated LFC; rlfc is for the LFC estimated from the normalised read counts.
The time taken given by proc.time
for each simulation, given for preprocessing, normalisation, differential expression testing and moment estimation.
SetupRes: Simulation specifications
Reiterating the simulated setup defined by Setup
.
A list of chosen pipeline tools defined by above arguments.
Counts: Simulated Count Matrices
List of lists object where Counts[[Sample Size Setup]][[Simulation Run]]
containing simulated counts. Note that this will only be returned when Counts
is TRUE
. In addition, if DEFilter
is TRUE
then the filtered/imputed counts are returned.
Here you can find detailed information about preprocessing, imputation, normalisation and differential testing choices.
For recommendations concerning single cell RNA-sequencing pipelines,
we kindly refer the user to Vieth, et al (2019). A systematic evaluation of single cell RNA-seq analysis pipelines. Nature Communications, 10(1), 4667.
Prefiltering
removes genes that have a mean expression below 0.2.
removes genes that have more than 80 % dropouts.
Imputation
apply the imputation as implemented in scimpute
apply the imputation as implemented in DrImpute
.
apply the imputation as implemented in saver
.
apply the imputation as implemented in scone
, defining 'house keeping genes' for the FNR model estimation as those that have less than 20 % dropouts and small variance (i.e. in the lower 20th quartile). If less than 25 genes could be identified, the genes with less than 5 % dropouts are used. If spike-in data is provided then these are used for the FNR model estimation.
apply the imputation as implemented in magic
. Please note that the python package MAGIC needs to be installed to use this implementation.
Normalisation
employ the edgeR style normalization of weighted trimmed mean of M-values and upperquartile
as implemented in calcNormFactors
, respectively.
employ the DESeq2 style normalization of median ratio method and a modified geometric mean method
as implemented in estimateSizeFactors
, respectively.
apply the deconvolution and quantile regression normalization methods developed for sparse RNA-seq data
as implemented in computeSumFactors
and SCnorm
, respectively. Spike-ins can also be supplied for both methods via spikeData
. Note, however that this means for scran that the normalisation as implemented in calculateSumFactors
is also applied to genes (general.use=TRUE
).
apply the normalization method for sparse RNA-seq data
as implemented in Linnorm.Norm
.
For Linnorm
, the user can also supply spikeData
.
apply the normalization method developed for single-cell
UMI RNA-seq data as implemented in vst
.
converts relative measures of TPM/FPKM values into mRNAs per cell (RPC) without the need of spike-in standards.
Census at least needs Lengths
for single-end data and preferably MeanFragLengths
for paired-end data.
The authors state that Census should not be used for UMI data.
Sequencing depth normalisation.
Differential testing
A T-Test per gene is applied using log transformed and normalized expression values (i.e. CPM or TPM).
apply differential testing as implemented in lmFit
followed by eBayes
on counts transformed by voom
or by applying mean-variance trend on log CPM values in eBayes
.
apply differential testing as implemented in glmFit
, glmLRT
andglmQLFit
, glmQLFTest
, respectively.
applies differential testing as implemented in DESeq
.
applies differential testing as implemented in ROTS
with 100 permutations on transformed counts (voom
).
applies differential testing as implemented in getLikelihoods
based on negative binomial prior estimates (getPriors.NB
).
applies differential testing as implemented in noiseqbio
based on CPM values.
applies differential testing as implemented in EBTest
.
applies differential testing as implemented in zlm
for zero-inflated model fitting followed by lrTest
on log CPM values.
applies differential testing as implemented in BPglm
on CPM values.
applies differential testing as implemented in scDD
on CPM values.
applies differential testing as implemented in decent
.
In a first step, the posterior probabilities of the zero-inflated negative binomial component are estimated (see zeroWeightsLS
) and used to define a weight matrix for dispersion estimation in estimateDisp
. For the edgeR approach, the generalized model as implemented in glmFit
is fitted. This is followed by an adapted LRT for differential testing to account for the weighting (see glmWeightedF
). For DESeq2, the generalized linear model coefficients are estimated using nbinomWaldTest
and the weighting is done by setting the degrees of freedom for the T distribution.
In a first step, a zero-inflated negative binomial regression model is fitted (see zinbFit
) to estimate observational weights (see computeObservationalWeights
) used for dispersion estimation in estimateDisp
. For the edgeR approach, the generalized model as implemented in glmFit
is fitted. This is followed by an adapted LRT for differential testing to account for the weighting (see glmWeightedF
). For DESeq2, the generalized linear model coefficients are estimated using nbinomWaldTest
and the weighting is done by setting the degrees of freedom for the T distribution.
estimateParam
and estimateSpike
, for parameter specifications;
Setup
for simulation setup;
evaluateDE
, evaluateROC
and evaluateSim
for evaluation of simulation results.
if (FALSE) { # estimate gene parameters data("SmartSeq2_Gene_Read_Counts") Batches = data.frame(Batch = sapply(strsplit(colnames(SmartSeq2_Gene_Read_Counts), "_"), "[[", 1), stringsAsFactors = F, row.names = colnames(SmartSeq2_Gene_Read_Counts)) data("GeneLengths_mm10") estparam_gene <- estimateParam(countData = SmartSeq2_Gene_Read_Counts, readData = NULL, batchData = Batches, spikeData = NULL, spikeInfo = NULL, Lengths = GeneLengths_mm10, MeanFragLengths = NULL, RNAseq = 'singlecell', Protocol = 'Read', Distribution = 'ZINB', Normalisation = "scran", GeneFilter = 0.1, SampleFilter = 3, sigma = 1.96, NCores = NULL, verbose = TRUE) # estimate spike parameters data("SmartSeq2_SpikeIns_Read_Counts") data("SmartSeq2_SpikeInfo") Batches = data.frame(Batch = sapply(strsplit(colnames(SmartSeq2_SpikeIns_Read_Counts), "_"), "[[", 1), stringsAsFactors = F, row.names = colnames(SmartSeq2_SpikeIns_Read_Counts)) estparam_spike <- estimateSpike(spikeData = SmartSeq2_SpikeIns_Read_Counts, spikeInfo = SmartSeq2_SpikeInfo, MeanFragLength = NULL, batchData = Batches, Normalisation = 'depth') # define log fold change p.lfc <- function(x) sample(c(-1,1), size=x,replace=T)*rgamma(x, shape = 1, rate = 2) # set up simulations setupres <- Setup(ngenes = 10000, nsims = 10, p.DE = 0.1, pLFC = p.lfc, n1 = c(20,50,100), n2 = c(30,60,120), Thinning = c(1,0.9,0.8), LibSize = 'given', estParamRes = estparam_gene, estSpikeRes = estparam_spike, DropGenes = FALSE, sim.seed = 52679, verbose = TRUE) # run simulation simres <- simulateDE(SetupRes = setupres, Prefilter = "FreqFilter", Imputation = NULL, Normalisation = 'scran', Label = 'none', DEmethod = "limma-trend", DEFilter = FALSE, NCores = NULL, verbose = TRUE) # quick evaluation evalderes <- evaluateDE(simRes = simres) # plot evaluation plotEvalDE(evalRes = evalderes, rate = "marginal", quick = TRUE, Annot = FALSE) }