This function generates the settings needed for simulateDE. Firstly a set of differential expressed gene IDs with associated fold changes for a given number of genes, simulations and fraction of DE genes is generated. There are also a number of options relating to count simulations such as downsampling. Secondly, the estimated parameters for count simulations of genes (and spikes) are added.

Setup(ngenes=NULL, nsims=25,
p.DE = 0.1, pLFC = 1, p.G = 1,
p.B=NULL, bLFC=NULL, bPattern="uncorrelated",
n1=c(20,50,100), n2=c(30,60,120),
Thinning = NULL, LibSize='equal',
estParamRes, estSpikeRes=NULL,
DropGenes=FALSE,
setup.seed, verbose = TRUE)

Arguments

ngenes

is a numeric vector specifying the number of genes to simulate. Default is NULL, i.e. the number of genes that were deemed expressed after filtering. See estimateParam and plotParam for more information about the number of genes that were used fo estimation and fitting.

nsims

Number of simulations to run. Default is 25.

p.DE

Numeric vector between 0 and 1 representing the percentage of genes being differentially expressed due to phenotype, i.e. biological signal. Default is 0.1.

pLFC

The log phenotypic fold change for DE genes. This can be: (1) a constant, e.g. 2; (2) a vector of values with length being number of DE genes. If the input is a vector and the length is not the number of DE genes, it will be sampled with replacement to generate log fold changes; (3) a function that takes an integer n, and generates a vector of length n, e.g. function(x) rnorm(x, mean=0, sd=1.5). Default is 1. Please note that the fold change should be on log2 scale!

p.G

Numeric vector indicating the proportion of replicates per group that express the phenotypic fold change. Default is 1, this means all show the expressiond difference. For example, if 0.5 and n1 = 10 and n2 = 8, then only 5 replicates in group 1 and 4 replicates in group 2 express the phenotypic fold change.

p.B

Numeric vector between 0 and 1 representing the percentage of genes being differentially expressed between batches. Default is NULL, i.e. no batch effect.

bLFC

The log batch fold change for all genes. This can be: (1) a constant, e.g. 2; (2) a vector of values with length being number of all genes. If the input is a vector and the length is not the number of total genes, it will be sampled with replacement to generate log fold changes; (3) a function that takes an integer n, and generates a vector of length n, e.g. function(x) rnorm(x, mean=0, sd=1.5). Note that the simulations of only two batches is implemented. Default is NULL, i.e. no batch effect. Please note that the fold change should be on log2 scale!

bPattern

Character vector for batch effect pattern if p.B is non-null. Possible options include: "uncorrelated", "orthogonal" and " correlated". Default is "uncorrelated".

n1, n2

Integer vectors specifying the number of biological replicates in each group. Default values are n1=c(20,50,100) and n2=c(30,60,120). The vectors need to have the same number of entries, i.e. length.

Thinning

Numeric vector specifying the downsampling. It has to be the same length and order as the vector n1. This is an implementation of thinCounts. The vector entries should be a proportion between 0 and 1, e.g. 0.75 means that each sample in that group will have on average 75% sequencing depth compared to the original sequencing depth as listed in estimateParam. Note that no upsampling is possible, i.e. defining a proportion greater than 1. The default is NULL, meaning no downsampling.

LibSize

Size factors representing sample-specific differences/biases in expected mean values of counts: "equal" or "given". The default is "equal", i.e. equal size factor of 1. If the user defines it as "given", the size factors are sampled from the estimated size factors of estimateParam.

estParamRes

The estimated simulation parameters for genes. This can be: (1) The output of estimateParam. (2) A string specifying the name of precalculated estimates, see details.

estSpikeRes

The spike-in simulation parameters generated by estimateSpike. These are needed for applying spike-in-dependent normalisation methods. Default is NULL, i.e. no spike-in count simulations.

DropGenes

By default, the estimated parameters and fit based on genes that were defined as expressed are used for simulations. By setting this parameter to TRUE, a fraction of genes will be dropouts. The dropout genes are defined in estimateParam using the GeneFilter option and can be plotted with plotParam. Default is FALSE, i.e. no gene expression dropouts.

setup.seed

Setup seed.

verbose

Logical vector to indicate whether to print function information. Default is TRUE.

Value

A complex list with the following entries:

DESetup

A list of simulation options relating to the Differential Expression Setup: number of genes simulated (ngenes); number of simulation iterations (nsims); IDs of DE genes (DEid) which is a list (length=nsims) of vectors (length=ngenes*p.DE); log fold change of genes (pLFC) which is a list (length=nsims) of vectors (length=ngenes); similarly for batch effects (Bid and bLFC); list containing simulation seeds (sim.seed).

SimSetup

A list of simulation options relating to the Simulation Setup: the number of samples per group (n1 and n2); simulating spike-ins (spikeIns); Thinning parameters (Thinning, thinSpike); Library Size Factors (LibSize); dropout genes (DropGenes and DropRate).

estParamRes

A list object containing the gene simulation parameters provided.

estSpikeRes

A list object containing the spike-in simulation parameters provided.

Examples

if (FALSE) { # estimate gene parameters data("SmartSeq2_Gene_Read_Counts") Batches = data.frame(Batch = sapply(strsplit(colnames(SmartSeq2_Gene_Read_Counts), "_"), "[[", 1), stringsAsFactors = F, row.names = colnames(SmartSeq2_Gene_Read_Counts)) data("GeneLengths_mm10") estparam_gene <- estimateParam(countData = SmartSeq2_Gene_Read_Counts, readData = NULL, batchData = Batches, spikeData = NULL, spikeInfo = NULL, Lengths = GeneLengths_mm10, MeanFragLengths = NULL, RNAseq = 'singlecell', Protocol = 'Read', Distribution = 'ZINB', Normalisation = "scran", GeneFilter = 0.1, SampleFilter = 3, sigma = 1.96, NCores = NULL, verbose = TRUE) # estimate spike parameters data("SmartSeq2_SpikeIns_Read_Counts") data("SmartSeq2_SpikeInfo") Batches = data.frame(Batch = sapply(strsplit(colnames(SmartSeq2_SpikeIns_Read_Counts), "_"), "[[", 1), stringsAsFactors = F, row.names = colnames(SmartSeq2_SpikeIns_Read_Counts)) estparam_spike <- estimateSpike(spikeData = SmartSeq2_SpikeIns_Read_Counts, spikeInfo = SmartSeq2_SpikeInfo, MeanFragLength = NULL, batchData = Batches, Normalisation = 'depth') # define log fold change p.lfc <- function(x) sample(c(-1,1), size=x,replace=T)*rgamma(x, shape = 1, rate = 2) # set up simulations setupres <- Setup(ngenes = 10000, nsims = 25, p.DE = 0.1, pLFC = p.lfc, n1 = c(20,50,100), n2 = c(30,60,120), Thinning = c(1,0.9,0.8), LibSize = 'given', estParamRes = estparam_gene, estSpikeRes = estparam_spike, DropGenes = TRUE, setup.seed = 52679, verbose = TRUE) }