Model Diagnostics for RNAseq Data — evaluateDist • powsimR

With this function, the user can determine goodness of fit for each gene.

evaluateDist(countData, batchData = NULL,
spikeData = NULL, spikeInfo = NULL,
Lengths = NULL, MeanFragLengths = NULL,
RNAseq = "singlecell", Protocol,
Normalisation,
GeneFilter = 0.25, SampleFilter = 3,
FracGenes = 1,
verbose = TRUE)

Arguments

countData	is a count matrix (row=gene, column=sample). Please provide the measurements of one group only, e.g. the control group.
batchData	is a `data.frame` for batch annotation. Rows correspond to samples. The first column should contain the batches, e.g. 'a', 'b', 'c', etc.. This is only used for the normalisation.
spikeData	is a count `matrix`. Rows correspond to spike-ins, columns to samples. The order of columns should be the same as in the `countData`. This is only needed for spike-in aware normalisation methods ('MR', 'Linnorm', 'scran', 'SCnorm', 'bayNorm', 'Census'), see details section of `estimateParam`.
spikeInfo	is a molecule count `matrix` of spike-ins. Rows correspond to spike-ins. The order of rows should be the same as in the `spikeData`. The column names should be 'SpikeID' and 'SpikeInput' for molecule counts of spike-ins. This is only needed for spike-in aware normalisation methods ('MR', 'Linnorm', 'scran', 'SCnorm', 'bayNorm', 'Census'), see details section of `estimateParam`.
Lengths	is a numeric vector of transcript lengths with the same length and order as the rows in countData. This variable is only needed for internal gene length corrections (TPM), see details section of `estimateParam`.
MeanFragLengths	is a numeric vector of mean fragment lengths with the same length as columns in countData. This variable is only needed for internal gene length corrections (TPM), see details section of `estimateParam`.
RNAseq	is a character value: "bulk" or "singlecell". We recommended to use this evaluaiton for single cells only.
Protocol	is a character value defining the type of counts given in `countData`. Options are "UMI" (e.g. 10X Genomics, CEL-seq2) or "Read" (e.g. Smart-seq2).
Normalisation	is a character value: 'TMM', 'MR', 'PosCounts', 'UQ', 'scran', 'Linnorm', 'SCnorm', 'Census', 'depth', 'none'. For more information, please consult the details section of `estimateParam`.
GeneFilter	is a numeric vector indicating the minimal proportion of nonzero expression values for a gene across all samples to be considered expressed and used for estimating normalisation size factors. The default is `0.25`, i.e. at least 25% of the expression values per gene need to be nonzero.
SampleFilter	is a numeric vector indicating the minimal number of MADs (median absolute deviation) away from the median number of features detected as well as sequencing depth across all samples so that outlying samples are removed prior to normalisation. The default is `5`, i.e. at least 5 MADs away from the median. Choose higher values if you wsant to filter out less samples. This parameter is particularly important for single cells to ensure reliable parameter estimation. For more information, please consult `isOutlier`.
FracGenes	The fraction of genes to calculate goodness of fit statistics, default is 1, i.e. for all genes.
verbose	Logical value to indicate whether to print function information. Default is `TRUE`.

Value

List object with the results of goodness of fit and estimated parameters:

edgeR

Goodness-of-fit statistic, degrees of freedom and associated p-value using the deviance and residual degrees of freedom from glmFit. Furthermore, the AIC of the edgeR model fit using the residuals of zscoreNBinom.

GOF

The fitting results per distribution, including loglikelihood, goodness-of-fit statistics, AIC and predicted number of zeroes. The following distributions were considered: Poisson, negative binomial, zero-inflated poisson and negative binomial following the 'standard' (i.e. glm, glm.nb and zeroinfl implementation) and fitdist approach (see fitdist) and Beta-Poisson following Marioni or Hemberg parameterisation. Furthermore, model fit comparison by LRT for nested and Vuong Test for non-nested models.

Estimates

The estimated parameters of distribution fitting.

ObservedZeros

The number of zeroes and dropout rate per gene.

Examples

if (FALSE) {
## using example data set, but run it for fraction of genes
data("CELseq2_Gene_UMI_Counts")
evalDistRes <- evaluateDist(countData = CELseq2_Gene_UMI_Counts, batchData = NULL,
                            spikeData = NULL, spikeInfo = NULL,
                            Lengths = NULL, MeanFragLengths = NULL,
                            RNAseq = "singlecell", Protocol = "UMI",
                            Normalisation = "scran",
                            GeneFilter = 0.1, SampleFilter = 3,
                            FracGenes = 0.1,
                            verbose = TRUE)
plotEvalDist(evalDistRes)
}