mitch is an R package for multi-dimensional enrichment analysis. At it's heart, it uses a rank-MANOVA based statistical approach to detect sets of genes that exhibit enrichment in the multidimensional space as compared to the background. Mitch is useful for pathway analysis of profiling studies with two to or more contrasts, or in studies with multiple omics profiling, for example proteomic, transcriptomic, epigenomic analysis of the same samples. Mitch is perfectly suited for pathway level differential analysis of scRNA-seq data.
if(!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("mitch")
library("mitch")
mitch has a function to import GMT files to R lists (adapted from Yu et al, 2012 in the clusterProfiler package). For example:
download.file("https://reactome.org/download/current/ReactomePathways.gmt.zip", destfile="ReactomePathways.gmt.zip")
unzip("ReactomePathways.gmt.zip")
genesets <- gmt_import("ReactomePathways.gmt")
mitch accepts pre-ranked data supplied by the user, but also has a function called mitch_import
for importing tables generated by Limma, edgeR, DESeq2, ABSSeq, Sleuth, Seurat and Muscat. By default, only the genes that are detected in all contrasts are included, but this behaviour can be modified. The below example imports two edgeR tables called "dge1" and "dge2". Where gene identifiers are coded in the row names.
x <- list("dge1"=dge1, "dge2"=dge2)
y <- mitch_import(x, DEtype="edger")
mitch can do unidimensional analysis if you provide it a single profile as a dataframe.
y <- mitch_import(df, DEtype="edger")
If the gene identifiers are not given in the rownames, then the column can be specified with the geneIDcol
parameter like this:
y <- mitch_import(df, DEtype="edger", geneIDcol="MyGeneIDs")
For edgeR, differential gene expression is scored using the directional nominal p-value.
S = -log10(p-value) * sgn(logFC)
For DESeq2, the test statistic is used. If this is not desired, then users can perform their own custom scoring procedure.
There are many cases where the gene IDs don't match the gene sets. To overcome this, mitch_import
also accepts a two-column table that relates gene identifiers in the profiling data to those in the gene sets.
?mitch_import
provides more instructions on using this feature.
The mitch_calc
function performs multivariate enrichment analysis of the supplied gene sets in the scored profiling data. At its simpest form mitch_calc
function accepts the scored data as the first argument and the genesets as the second argument. Users can prioritise enrichments based on small adjusted p-values, or by the observed effect size (magnitude of the enrichment score).
In general, effect size prioritisation will give you more specific and actionable results, that highlights relatively small gene sets with large magnitude of change. If you prioritise by significance, you will highlight larger gene sets with more subtle expression changes. As a general rule, our group typically presents enrichment results prioritised by effect size after excluding gene sets with FDR>0.05.
res <- mitch_calc(y, genesets, priority="significance")
res <- mitch_calc(y, genesets, priority="effect")
In some cases, you may want to prioritise the results based on the diversity of gene expression differences across multiple contrasts. This will highlight gene sets that have opposite and diverse s.dist scores across contrasts. This is particularly interesting for scRNA-seq analysis where you may want to know how different cell types respond to a stimulus, so you are interested in the gene sets with the biggest variation in s.dist scores across the contrasts.
res <- mitch_calc(y, genesets, priority="SD")
Note that SD prioritisation won't work for 1D analysis.
One mitch_calc()
is finished, you can peek at the top results with head
like this:
head(res$enrichment_result)
By default, mitch_calc
uses only thread for calculation, but you can speed it up using more
parallel threads, by setting the cores
option.
We suggest 8-12 parallel threads for returning results the fastest.
Using more than 12 doesn't yield faster results typically.
Keep in mind that the more parallel threads used, the more RAM is required.
res <- mitch_calc(y, genesets, priority="significance", cores=4)
By default, gene sets with fewer than 10 members present in the profiling data are discarded. This threshold can be modified using the minsetsize
option. There is no upper limit of gene set size.
res <- mitch_calc(y, genesets, priority="significance", minsetsize=5)
By default, in downstream visualisation steps, charts are made from the top 50 gene sets, but this can be modified using the resrows
option.
res <- mitch_calc(y, genesets, priority="significance", resrows=100)
Can be done like this:
mitch_report(res, "myreport.html")
Take a look at an example report.
In case you want the charts in PDF format, these can be generated as such:
mitch_plots(res, outfile="mycharts.pdf")
Take a look at an example plot set.
For one-dimensional analysis, you can generate a network plot to visualise the similarity of the driver gene members for the top (20) up- and downregulated sets meeting a significance cut-off (FDR<0.05). These cut-offs can be altrered.
networkplot(res,FDR=0.05,n_sets=20)
See ?networkplot()
for more details.
If you want to see which genes are contributing to the network chart, use the network_genes()
function.
network_genes(FDR=0.05,n_sets=20)
This type of data is notoriously sparse, so pseudobulk aggregation is recommended. Mitch works best if there are >1000 genes detected in each cell type (contrast). If there are <400 genes present it may cause mitch to run into an error in the MANOVA test due to insufficient degrees of freedom. It is recommended to run the analysis with all three prioritisation schemes (significance, effect and SD) to interpret the data.
Mitch can be applied to Infinium Beadarray methylation data if you have limma results for probes. Please see our solution workflow on Bioconductor.
If you have questions or need help with applying mitch to your work, please raise an issue.