This function generates the settings needed for simulateDE
.
Firstly a set of differential expressed gene IDs with
associated fold changes for a given number of genes, simulations and
fraction of DE genes is generated. There are also a number of options relating
to count simulations such as downsampling.
Secondly, the estimated parameters for count simulations of genes (and spikes) are added.
Setup(ngenes=NULL, nsims=25, p.DE = 0.1, pLFC = 1, p.G = 1, p.B=NULL, bLFC=NULL, bPattern="uncorrelated", n1=c(20,50,100), n2=c(30,60,120), Thinning = NULL, LibSize='equal', estParamRes, estSpikeRes=NULL, DropGenes=FALSE, setup.seed, verbose = TRUE)
ngenes | is a numeric vector specifying the number of genes to simulate.
Default is |
---|---|
nsims | Number of simulations to run. Default is 25. |
p.DE | Numeric vector between 0 and 1 representing
the percentage of genes being differentially expressed due to phenotype,
i.e. biological signal. Default is |
pLFC | The log phenotypic fold change for DE genes. This can be:
(1) a constant, e.g. 2;
(2) a vector of values with length being number of DE genes.
If the input is a vector and the length is not the number of DE genes,
it will be sampled with replacement to generate log fold changes;
(3) a function that takes an integer n, and generates a vector of length n,
e.g. function(x) rnorm(x, mean=0, sd=1.5).
Default is |
p.G | Numeric vector indicating the proportion of replicates per group
that express the phenotypic fold change. Default is |
p.B | Numeric vector between 0 and 1 representing the percentage of genes
being differentially expressed between batches.
Default is |
bLFC | The log batch fold change for all genes. This can be:
(1) a constant, e.g. 2;
(2) a vector of values with length being number of all genes.
If the input is a vector and the length is not the number of total genes,
it will be sampled with replacement to generate log fold changes;
(3) a function that takes an integer n, and generates a vector of length n,
e.g. function(x) rnorm(x, mean=0, sd=1.5).
Note that the simulations of only two batches is implemented.
Default is |
bPattern | Character vector for batch effect pattern if |
n1, n2 | Integer vectors specifying the number of biological replicates in each group. Default values are n1=c(20,50,100) and n2=c(30,60,120). The vectors need to have the same number of entries, i.e. length. |
Thinning | Numeric vector specifying the downsampling.
It has to be the same length and order as the vector |
LibSize | Size factors representing sample-specific differences/biases
in expected mean values of counts:
|
estParamRes | The estimated simulation parameters for genes. This can be:
(1) The output of |
estSpikeRes | The spike-in simulation parameters generated by |
DropGenes | By default, the estimated parameters and fit based on genes that
were defined as expressed are used for simulations.
By setting this parameter to |
setup.seed | Setup seed. |
verbose | Logical vector to indicate whether to print function information.
Default is |
A complex list with the following entries:
A list of simulation options relating to the Differential Expression Setup: number of genes simulated (ngenes); number of simulation iterations (nsims); IDs of DE genes (DEid) which is a list (length=nsims) of vectors (length=ngenes*p.DE); log fold change of genes (pLFC) which is a list (length=nsims) of vectors (length=ngenes); similarly for batch effects (Bid and bLFC); list containing simulation seeds (sim.seed).
A list of simulation options relating to the Simulation Setup: the number of samples per group (n1 and n2); simulating spike-ins (spikeIns); Thinning parameters (Thinning, thinSpike); Library Size Factors (LibSize); dropout genes (DropGenes and DropRate).
A list object containing the gene simulation parameters provided.
A list object containing the spike-in simulation parameters provided.
if (FALSE) { # estimate gene parameters data("SmartSeq2_Gene_Read_Counts") Batches = data.frame(Batch = sapply(strsplit(colnames(SmartSeq2_Gene_Read_Counts), "_"), "[[", 1), stringsAsFactors = F, row.names = colnames(SmartSeq2_Gene_Read_Counts)) data("GeneLengths_mm10") estparam_gene <- estimateParam(countData = SmartSeq2_Gene_Read_Counts, readData = NULL, batchData = Batches, spikeData = NULL, spikeInfo = NULL, Lengths = GeneLengths_mm10, MeanFragLengths = NULL, RNAseq = 'singlecell', Protocol = 'Read', Distribution = 'ZINB', Normalisation = "scran", GeneFilter = 0.1, SampleFilter = 3, sigma = 1.96, NCores = NULL, verbose = TRUE) # estimate spike parameters data("SmartSeq2_SpikeIns_Read_Counts") data("SmartSeq2_SpikeInfo") Batches = data.frame(Batch = sapply(strsplit(colnames(SmartSeq2_SpikeIns_Read_Counts), "_"), "[[", 1), stringsAsFactors = F, row.names = colnames(SmartSeq2_SpikeIns_Read_Counts)) estparam_spike <- estimateSpike(spikeData = SmartSeq2_SpikeIns_Read_Counts, spikeInfo = SmartSeq2_SpikeInfo, MeanFragLength = NULL, batchData = Batches, Normalisation = 'depth') # define log fold change p.lfc <- function(x) sample(c(-1,1), size=x,replace=T)*rgamma(x, shape = 1, rate = 2) # set up simulations setupres <- Setup(ngenes = 10000, nsims = 25, p.DE = 0.1, pLFC = p.lfc, n1 = c(20,50,100), n2 = c(30,60,120), Thinning = c(1,0.9,0.8), LibSize = 'given', estParamRes = estparam_gene, estSpikeRes = estparam_spike, DropGenes = TRUE, setup.seed = 52679, verbose = TRUE) }