With this function, the user can determine goodness of fit for each gene.
evaluateDist(countData, batchData = NULL, spikeData = NULL, spikeInfo = NULL,
  Lengths = NULL, MeanFragLengths = NULL, RNAseq = "singlecell", Protocol,
  Normalisation, GeneFilter = 0.25, SampleFilter = 3, FracGenes = 1,
  verbose = TRUE)
countData | is a count matrix (row=gene, column=sample). Please provide the measurements of one group only, e.g. the control group. |
---|---|
batchData | is a |
spikeData | is a count |
spikeInfo | is a molecule count |
Lengths | is a numeric vector of transcript lengths with the same length and order as the rows in countData.
This variable is only needed for internal gene length corrections (TPM), see details section of |
MeanFragLengths | is a numeric vector of mean fragment lengths with the same length as columns in countData.
This variable is only needed for internal gene length corrections (TPM), see details section of |
RNAseq | is a character value: "bulk" or "singlecell". We recommend using this evaluation for single cells only. |
Protocol | is a character value defining the type of counts given in |
Normalisation | is a character value: 'TMM', 'MR', 'PosCounts', 'UQ', 'scran', 'Linnorm',
'SCnorm', 'Census', 'depth', 'none'.
For more information, please consult the details section of |
GeneFilter | is a numeric vector indicating the minimal proportion of nonzero expression values
for a gene across all samples to be considered expressed and used for estimating normalisation size factors.
The default is 0.25. |
SampleFilter | is a numeric vector indicating the minimal number of MADs (median absolute deviation)
away from the median number of features detected as well as sequencing depth across all samples
so that outlying samples are removed prior to normalisation.
The default is 3. |
FracGenes | is the fraction of genes for which goodness-of-fit statistics are calculated. The default is 1, i.e. all genes. |
verbose | Logical value to indicate whether to print function information.
Default is TRUE. |
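To make the GeneFilter and SampleFilter arguments more concrete, the following is a minimal sketch in base R of how such filters could be applied to a count matrix. It is not the package's internal code: the simulated matrix, the helper `is_outlier`, the variable names and the choice of `>=` for the gene threshold are all assumptions made for illustration.

```r
# Toy count matrix: rows are genes, columns are samples (simulated, not real data).
set.seed(1)
counts <- matrix(rnbinom(200 * 20, mu = 2, size = 0.5), nrow = 200, ncol = 20,
                 dimnames = list(paste0("gene", 1:200), paste0("cell", 1:20)))

# Gene filter: keep genes whose proportion of nonzero counts across samples
# is at least the threshold (0.25 mirrors the documented default).
gene_filter <- 0.25
prop_nonzero <- rowMeans(counts > 0)
keep_genes <- prop_nonzero >= gene_filter

# Sample filter: flag samples lying more than `nmads` MADs away from the median.
# Hypothetical helper, not a function of the package.
is_outlier <- function(x, nmads) {
  abs(x - median(x)) > nmads * mad(x)
}

sample_filter <- 3
depth <- colSums(counts)           # sequencing depth per sample
n_features <- colSums(counts > 0)  # number of detected features per sample
keep_samples <- !(is_outlier(depth, sample_filter) |
                  is_outlier(n_features, sample_filter))

# Apply both filters before any normalisation step.
filtered <- counts[keep_genes, keep_samples, drop = FALSE]
dim(filtered)
```

Raising `gene_filter` keeps only genes detected in more samples, while lowering `sample_filter` removes samples more aggressively.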
List object with the results of goodness of fit and estimated parameters:
Goodness-of-fit statistic, degrees of freedom and associated p-value using the deviance and residual degrees of freedom from glmFit. Furthermore, the AIC of the edgeR model fit using the residuals of zscoreNBinom.
The fitting results per distribution, including loglikelihood, goodness-of-fit statistics, AIC and predicted number of zeroes. The following distributions were considered: Poisson, negative binomial, zero-inflated Poisson and zero-inflated negative binomial following the 'standard' (i.e. glm, glm.nb and zeroinfl implementation) and fitdist approach (see fitdist), and Beta-Poisson following Marioni or Hemberg parameterisation. Model fits are also compared, by LRT for nested and Vuong test for non-nested models.
The estimated parameters of distribution fitting.
The number of zeroes and dropout rate per gene.
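The quantities listed above can be illustrated for a single gene with a short sketch. It assumes simulated counts and uses `glm` and `MASS::glm.nb` as stand-ins. It does not reproduce the edgeR-based fit, the zero-inflated or Beta-Poisson models, or the Vuong test, and the helper `gof` and all variable names are hypothetical.

```r
library(MASS)  # provides glm.nb

set.seed(42)
# Simulated counts for one gene across 200 cells (overdispersed by construction).
y <- rnbinom(200, mu = 5, size = 0.8)

# Intercept-only Poisson and negative binomial fits.
fit_pois <- glm(y ~ 1, family = poisson)
fit_nb <- glm.nb(y ~ 1)

# Goodness of fit: compare the deviance to a chi-squared distribution
# with the residual degrees of freedom (hypothetical helper).
gof <- function(fit) {
  c(statistic = deviance(fit),
    df = df.residual(fit),
    p.value = pchisq(deviance(fit), df.residual(fit), lower.tail = FALSE))
}

# Predicted number of zeroes under each fitted model.
mu_pois <- exp(coef(fit_pois)[[1]])
mu_nb <- exp(coef(fit_nb)[[1]])
pred_zero_pois <- length(y) * dpois(0, lambda = mu_pois)
pred_zero_nb <- length(y) * dnbinom(0, mu = mu_nb, size = fit_nb$theta)

# Likelihood ratio test: the Poisson model is nested in the negative binomial.
# The chi-squared reference with 1 df is a common approximation; it is
# conservative because the Poisson sits on the boundary of the parameter space.
lrt_stat <- as.numeric(2 * (logLik(fit_nb) - logLik(fit_pois)))
lrt_p <- pchisq(lrt_stat, df = 1, lower.tail = FALSE)

# Observed number of zeroes and dropout rate for this gene.
n_zero <- sum(y == 0)
dropout <- mean(y == 0)

res <- list(
  gof = rbind(poisson = gof(fit_pois), negbin = gof(fit_nb)),
  loglik = c(poisson = as.numeric(logLik(fit_pois)), negbin = as.numeric(logLik(fit_nb))),
  aic = c(poisson = AIC(fit_pois), negbin = AIC(fit_nb)),
  predicted_zeroes = c(poisson = pred_zero_pois, negbin = pred_zero_nb),
  lrt = c(statistic = lrt_stat, p.value = lrt_p),
  zeroes = c(n = n_zero, dropout = dropout)
)
str(res)
```

For data simulated this way, the negative binomial fit should have the lower AIC and a predicted number of zeroes closer to the observed count than the Poisson fit.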
if (FALSE) {
## using example data set, but run it for fraction of genes
data("CELseq2_Gene_UMI_Counts")
evalDistRes <- evaluateDist(countData = CELseq2_Gene_UMI_Counts,
                            batchData = NULL,
                            spikeData = NULL, spikeInfo = NULL,
                            Lengths = NULL, MeanFragLengths = NULL,
                            RNAseq = "singlecell", Protocol = "UMI",
                            Normalisation = "scran",
                            GeneFilter = 0.1, SampleFilter = 3,
                            FracGenes = 0.1,
                            verbose = TRUE)
plotEvalDist(evalDistRes)
}