With this function, the user can determine goodness of fit for each gene.

evaluateDist(countData, batchData = NULL,
spikeData = NULL, spikeInfo = NULL,
Lengths = NULL, MeanFragLengths = NULL,
RNAseq = "singlecell", Protocol,
Normalisation,
GeneFilter = 0.25, SampleFilter = 3,
FracGenes = 1,
verbose = TRUE)

Arguments

countData

is a count matrix (row=gene, column=sample). Please provide the measurements of one group only, e.g. the control group.

batchData

is a data.frame for batch annotation. Rows correspond to samples. The first column should contain the batches, e.g. 'a', 'b', 'c', etc.. This is only used for the normalisation.

spikeData

is a count matrix. Rows correspond to spike-ins, columns to samples. The order of columns should be the same as in the countData. This is only needed for spike-in aware normalisation methods ('MR', 'Linnorm', 'scran', 'SCnorm', 'bayNorm', 'Census'), see details section of estimateParam.

spikeInfo

is a molecule count matrix of spike-ins. Rows correspond to spike-ins. The order of rows should be the same as in the spikeData. The column names should be 'SpikeID' and 'SpikeInput' for molecule counts of spike-ins. This is only needed for spike-in aware normalisation methods ('MR', 'Linnorm', 'scran', 'SCnorm', 'bayNorm', 'Census'), see details section of estimateParam.

Lengths

is a numeric vector of transcript lengths with the same length and order as the rows in countData. This variable is only needed for internal gene length corrections (TPM), see details section of estimateParam.

MeanFragLengths

is a numeric vector of mean fragment lengths with the same length as columns in countData. This variable is only needed for internal gene length corrections (TPM), see details section of estimateParam.

RNAseq

is a character value: "bulk" or "singlecell". We recommended to use this evaluaiton for single cells only.

Protocol

is a character value defining the type of counts given in countData. Options are "UMI" (e.g. 10X Genomics, CEL-seq2) or "Read" (e.g. Smart-seq2).

Normalisation

is a character value: 'TMM', 'MR', 'PosCounts', 'UQ', 'scran', 'Linnorm', 'SCnorm', 'Census', 'depth', 'none'. For more information, please consult the details section of estimateParam.

GeneFilter

is a numeric vector indicating the minimal proportion of nonzero expression values for a gene across all samples to be considered expressed and used for estimating normalisation size factors. The default is 0.25, i.e. at least 25% of the expression values per gene need to be nonzero.

SampleFilter

is a numeric vector indicating the minimal number of MADs (median absolute deviation) away from the median number of features detected as well as sequencing depth across all samples so that outlying samples are removed prior to normalisation. The default is 5, i.e. at least 5 MADs away from the median. Choose higher values if you wsant to filter out less samples. This parameter is particularly important for single cells to ensure reliable parameter estimation. For more information, please consult isOutlier.

FracGenes

The fraction of genes to calculate goodness of fit statistics, default is 1, i.e. for all genes.

verbose

Logical value to indicate whether to print function information. Default is TRUE.

Value

List object with the results of goodness of fit and estimated parameters:

edgeR

Goodness-of-fit statistic, degrees of freedom and associated p-value using the deviance and residual degrees of freedom from glmFit. Furthermore, the AIC of the edgeR model fit using the residuals of zscoreNBinom.

GOF

The fitting results per distribution, including loglikelihood, goodness-of-fit statistics, AIC and predicted number of zeroes. The following distributions were considered: Poisson, negative binomial, zero-inflated poisson and negative binomial following the 'standard' (i.e. glm, glm.nb and zeroinfl implementation) and fitdist approach (see fitdist) and Beta-Poisson following Marioni or Hemberg parameterisation. Furthermore, model fit comparison by LRT for nested and Vuong Test for non-nested models.

Estimates

The estimated parameters of distribution fitting.

ObservedZeros

The number of zeroes and dropout rate per gene.

Examples

if (FALSE) { ## using example data set, but run it for fraction of genes data("CELseq2_Gene_UMI_Counts") evalDistRes <- evaluateDist(countData = CELseq2_Gene_UMI_Counts, batchData = NULL, spikeData = NULL, spikeInfo = NULL, Lengths = NULL, MeanFragLengths = NULL, RNAseq = "singlecell", Protocol = "UMI", Normalisation = "scran", GeneFilter = 0.1, SampleFilter = 3, FracGenes = 0.1, verbose = TRUE) plotEvalDist(evalDistRes) }