Setup options for RNA-seq count simulations

This function generates the settings needed for simulateDE. Firstly a set of differential expressed gene IDs with associated fold changes for a given number of genes, simulations and fraction of DE genes is generated. There are also a number of options relating to count simulations such as downsampling. Secondly, the estimated parameters for count simulations of genes (and spikes) are added.

Setup(ngenes=NULL, nsims=25,
p.DE = 0.1, pLFC = 1, p.G = 1,
p.B=NULL, bLFC=NULL, bPattern="uncorrelated",
n1=c(20,50,100), n2=c(30,60,120),
Thinning = NULL, LibSize='equal',
estParamRes, estSpikeRes=NULL,
DropGenes=FALSE,
setup.seed, verbose = TRUE)

Arguments

ngenes	is a numeric vector specifying the number of genes to simulate. Default is `NULL`, i.e. the number of genes that were deemed expressed after filtering. See `estimateParam` and `plotParam` for more information about the number of genes that were used fo estimation and fitting.
nsims	Number of simulations to run. Default is 25.
p.DE	Numeric vector between 0 and 1 representing the percentage of genes being differentially expressed due to phenotype, i.e. biological signal. Default is `0.1`.
pLFC	The log phenotypic fold change for DE genes. This can be: (1) a constant, e.g. 2; (2) a vector of values with length being number of DE genes. If the input is a vector and the length is not the number of DE genes, it will be sampled with replacement to generate log fold changes; (3) a function that takes an integer n, and generates a vector of length n, e.g. function(x) rnorm(x, mean=0, sd=1.5). Default is `1`. Please note that the fold change should be on `log2` scale!
p.G	Numeric vector indicating the proportion of replicates per group that express the phenotypic fold change. Default is `1`, this means all show the expressiond difference. For example, if `0.5` and `n1 = 10` and `n2 = 8`, then only 5 replicates in group 1 and 4 replicates in group 2 express the phenotypic fold change.
p.B	Numeric vector between 0 and 1 representing the percentage of genes being differentially expressed between batches. Default is `NULL`, i.e. no batch effect.
bLFC	The log batch fold change for all genes. This can be: (1) a constant, e.g. 2; (2) a vector of values with length being number of all genes. If the input is a vector and the length is not the number of total genes, it will be sampled with replacement to generate log fold changes; (3) a function that takes an integer n, and generates a vector of length n, e.g. function(x) rnorm(x, mean=0, sd=1.5). Note that the simulations of only two batches is implemented. Default is `NULL`, i.e. no batch effect. Please note that the fold change should be on `log2` scale!
bPattern	Character vector for batch effect pattern if `p.B` is non-null. Possible options include: "uncorrelated", "orthogonal" and " correlated". Default is `"uncorrelated"`.
n1, n2	Integer vectors specifying the number of biological replicates in each group. Default values are n1=c(20,50,100) and n2=c(30,60,120). The vectors need to have the same number of entries, i.e. length.
Thinning	Numeric vector specifying the downsampling. It has to be the same length and order as the vector `n1`. This is an implementation of `thinCounts`. The vector entries should be a proportion between 0 and 1, e.g. `0.75` means that each sample in that group will have on average 75% sequencing depth compared to the original sequencing depth as listed in `estimateParam`. Note that no upsampling is possible, i.e. defining a proportion greater than 1. The default is `NULL`, meaning no downsampling.
LibSize	Size factors representing sample-specific differences/biases in expected mean values of counts: `"equal"` or `"given"`. The default is `"equal"`, i.e. equal size factor of 1. If the user defines it as `"given"`, the size factors are sampled from the estimated size factors of `estimateParam`.
estParamRes	The estimated simulation parameters for genes. This can be: (1) The output of `estimateParam`. (2) A string specifying the name of precalculated estimates, see details.
estSpikeRes	The spike-in simulation parameters generated by `estimateSpike`. These are needed for applying spike-in-dependent normalisation methods. Default is `NULL`, i.e. no spike-in count simulations.
DropGenes	By default, the estimated parameters and fit based on genes that were defined as expressed are used for simulations. By setting this parameter to `TRUE`, a fraction of genes will be dropouts. The dropout genes are defined in `estimateParam` using the `GeneFilter` option and can be plotted with `plotParam`. Default is `FALSE`, i.e. no gene expression dropouts.
setup.seed	Setup seed.
verbose	Logical vector to indicate whether to print function information. Default is `TRUE`.

Value

A complex list with the following entries:

DESetup

A list of simulation options relating to the Differential Expression Setup: number of genes simulated (ngenes); number of simulation iterations (nsims); IDs of DE genes (DEid) which is a list (length=nsims) of vectors (length=ngenes*p.DE); log fold change of genes (pLFC) which is a list (length=nsims) of vectors (length=ngenes); similarly for batch effects (Bid and bLFC); list containing simulation seeds (sim.seed).

SimSetup

A list of simulation options relating to the Simulation Setup: the number of samples per group (n1 and n2); simulating spike-ins (spikeIns); Thinning parameters (Thinning, thinSpike); Library Size Factors (LibSize); dropout genes (DropGenes and DropRate).

estParamRes

A list object containing the gene simulation parameters provided.

estSpikeRes

A list object containing the spike-in simulation parameters provided.

Examples

if (FALSE) {
# estimate gene parameters
data("SmartSeq2_Gene_Read_Counts")
Batches = data.frame(Batch = sapply(strsplit(colnames(SmartSeq2_Gene_Read_Counts),
                                    "_"), "[[", 1),
                     stringsAsFactors = F,
                     row.names = colnames(SmartSeq2_Gene_Read_Counts))
data("GeneLengths_mm10")
estparam_gene <- estimateParam(countData = SmartSeq2_Gene_Read_Counts,
                               readData = NULL,
                               batchData = Batches,
                               spikeData = NULL, spikeInfo = NULL,
                               Lengths = GeneLengths_mm10, MeanFragLengths = NULL,
                               RNAseq = 'singlecell', Protocol = 'Read',
                               Distribution = 'ZINB', Normalisation = "scran",
                               GeneFilter = 0.1, SampleFilter = 3,
                              sigma = 1.96, NCores = NULL, verbose = TRUE)
# estimate spike parameters
data("SmartSeq2_SpikeIns_Read_Counts")
data("SmartSeq2_SpikeInfo")
Batches = data.frame(Batch = sapply(strsplit(colnames(SmartSeq2_SpikeIns_Read_Counts),
                                    "_"), "[[", 1),
                     stringsAsFactors = F,
                     row.names = colnames(SmartSeq2_SpikeIns_Read_Counts))
estparam_spike <- estimateSpike(spikeData = SmartSeq2_SpikeIns_Read_Counts,
                                spikeInfo = SmartSeq2_SpikeInfo,
                                MeanFragLength = NULL,
                                batchData = Batches,
                                Normalisation = 'depth')
# define log fold change
p.lfc <- function(x) sample(c(-1,1), size=x,replace=T)*rgamma(x, shape = 1, rate = 2)
# set up simulations
setupres <- Setup(ngenes = 10000, nsims = 25,
                  p.DE = 0.1, pLFC = p.lfc,
                  n1 = c(20,50,100), n2 = c(30,60,120),
                  Thinning = c(1,0.9,0.8), LibSize = 'given',
                  estParamRes = estparam_gene,
                  estSpikeRes = estparam_spike,
                  DropGenes = TRUE,
                  setup.seed = 52679, verbose = TRUE)
}