|
| 1 | +(*** hide ***) |
| 2 | +// This block of code is omitted in the generated HTML documentation. Use |
| 3 | +// it to define helpers that you do not want to show in the documentation. |
| 4 | +#r "nuget: BioFSharp, 2.0.0-preview.3" |
| 5 | +#r "nuget: FSharp.Stats, 0.6.0" |
| 6 | +#I @"..\src\BioFSharp.Stats\bin\Release\netstandard2.0" |
| 7 | +#r "BioFSharp.Stats.dll" |
| 8 | + |
| 9 | +open BioFSharp |
| 10 | +open FSharpAux |
| 11 | + |
| 12 | +(** |
| 13 | +# Gene set enrichment analysis (GSEA) |
| 14 | +
|
| 15 | +When performing high-troughput experiments the question arises how to analyze the resulting huge data sets. |
| 16 | +Of course the analysis of each single item separately is possible but unefficient and cumberersome. |
| 17 | +If you want to compare the transcript/protein composition of a system before and after a treatment it is useful |
| 18 | +to make use of a gene set enrichment analysis (GSEA) to identify overrepresented transcript/protein groups. |
| 19 | +The grouping can be based on MapMan- or GO-annotations or other shared properties. |
| 20 | +For further information visit [GSEA publication](https://academic.oup.com/bioinformatics/article/23/4/401/181853). |
| 21 | +
|
| 22 | +## Method |
| 23 | +
|
| 24 | +Imagine an experiment, in which 1000 proteins were measured. Within these 1000 proteins there are 25 proteins |
| 25 | +of a specific functional group (heat shock response). |
| 26 | +By treating the organism(s) 200 proteins are significantly altered. 10 of the 25 heat responsive proteins |
| 27 | +can be found within these 200. |
| 28 | +
|
| 29 | +Null-Hypothesis: The number of identified altered items is due to chance. |
| 30 | +
|
| 31 | +Question: What is the probability, that these 10 or more proteins occuring in the 200 altered proteins by chance? |
| 32 | +
|
| 33 | +The hypergeomentric distribution can answer this question: |
| 34 | +
|
| 35 | +- N is the population size (1000 proteins) |
| 36 | +
|
| 37 | +- K is the number of success states in the population (25 heat responsive proteins) |
| 38 | +
|
| 39 | +- n is the number of draws (200 significantly altered proteins) |
| 40 | +
|
| 41 | +- k is the number of observed successes (10 or higher) |
| 42 | +
|
| 43 | +Result: 0.0161 = 1.6% |
| 44 | +
|
| 45 | +If the calculated probability is small, it is very unlikely, that the observed effect is due to chance. |
| 46 | +This implies, that under the given condition the heat response proteins are enriched. |
| 47 | +
|
| 48 | +###Split p value threshold |
| 49 | +
|
| 50 | +The discrete nature of the hypergeometric distribution prevents the significance to reach 0.05 exactly. Especially in small bin sizes there always is a range of $\alpha$-level space that must be sacrificed leading to a lower $\alpha$ than intended. |
| 51 | +To mitigate this loss of power you can use a mid p-value, that has the form of: |
| 52 | +
|
| 53 | + - $P(K>k) + 0.5P(K=k)$ or equivalently $0.5(P(K>k) + P(K\geq k))$ |
| 54 | +
|
| 55 | +The resulting p-values are not conservative any more. The actual alpha level may be higher than intended! |
| 56 | +
|
| 57 | +_References_ |
| 58 | + |
| 59 | + - Figure A1 in Schneider, Venn, and Mühlhaus; TMEA: A Thermodynamically Motivated Framework for Functional Characterization of Biological Responses to System Acclimation; Entropy 2020; https://doi.org/10.3390/e22091030 |
| 60 | + |
| 61 | + - Rivals et al.; Enrichment or depletion of a GO category within a class of genes: which test?; Bioinformatics 2006; doi:10.1093/bioinformatics/btl633 |
| 62 | + |
| 63 | + - Rubin-Delanchy et al.; Meta-Analysis of Mid-p-Values: Some New Results based on the Convex Order.; Journal of the American Statistical Association 2018; doi:10.1080/01621459.2018.1469994 |
| 64 | + |
| 65 | + - Agresti and Min; On Small-Sample Confidence Intervals for Parameters in Discrete Distributions; Biometrics 2001; https://doi.org/10.1111/j.0006-341X.2001.00963.x |
| 66 | +
|
| 67 | +##Usage |
| 68 | +
|
| 69 | +*) |
| 70 | + |
| 71 | +//Usage: |
| 72 | +//At first all items are converted into an OntologyItem type: |
| 73 | + |
| 74 | +open BioFSharp |
| 75 | +open BioFSharp.Stats |
| 76 | + |
| 77 | +//Example item |
| 78 | +//Protein: HSP70 |
| 79 | +//MapManAnnotation: GMM:29.6.2.3;GMM:20.2.1 |
| 80 | +//Significantly altered?: true |
| 81 | + |
| 82 | +//Creates an OntologyItem with an identifier, an Ontologyterm, a group index and an arbitrary and |
| 83 | +//optional field. The groupIndex is based on prior analyis and defines if the items belongs to a |
| 84 | +//specific group. If significantly altered proteins were identified, you can assign index '1' to |
| 85 | +//all altered proteins and '0' to all other. The choice of the index is up to you, but you have |
| 86 | +//to remind it for later steps. |
| 87 | +let item = OntologyEnrichment.createOntologyItem "Cre06.g250100" "protein.folding.chaperones and co-chaperones.HSP70s;stress.abiotic.heat" 1 "HSP70" |
| 88 | + |
| 89 | + |
| 90 | +//If more than one ontology term is possible, you can use this function to split the item, so that |
| 91 | +//every resulting item contains just one ontology term. |
| 92 | +let splitItems = OntologyEnrichment.splitMultipleAnnotationsBy ';' item |
| 93 | + |
| 94 | +//create a list of all items in your experiment |
| 95 | +let items = [splitItems(*...*)] |> Seq.concat |
| 96 | + |
| 97 | +//When dealing with MapMan annotations all levels of the descriptions should be considered. |
| 98 | +//"stress.abiotic.heat" -> ["stress.abiotic.heat"; "stress.abiotic"; "stress"] |
| 99 | +let expandedItems = OntologyEnrichment.expandOntologyTree items |
| 100 | + |
| 101 | + |
| 102 | +//now the dataset is ready for performing the enrichment analysis |
| 103 | +//deGroupIndex: which index should be considered as success? (in this case '1') |
| 104 | +//splitPValueThreshold: if a bin size is below this threshold, the resulting pvalue |
| 105 | +// is split up (-> increased to lower the impact of small bins) (default 5). |
| 106 | +//minNumberInTerm: what is the minimal number of items that should be |
| 107 | +// within a bin to consider the bin at all (default 2). |
| 108 | +//data: sequence of expanded and split ontology items |
| 109 | +let gseaResult = OntologyEnrichment.CalcOverEnrichment 1 (Some 5) (Some 2) expandedItems |
| 110 | + |
| 111 | + |
| 112 | +let filterAndSortEnrichedModules = |
| 113 | + gseaResult |
| 114 | + |> Seq.filter (fun x -> x.PValue < 0.05) |
| 115 | + |> Seq.sortByDescending (fun x -> x.NumberOfDEsInBin) |
| 116 | + |
| 117 | +//the result type contains the following fields: |
| 118 | +//GseaResult: |
| 119 | +// OntologyTerm : OntologyTerm(MapMan-Term/GO-Term/...) |
| 120 | +// ItemsInBin : Sequence of single items belonging to the ontology term |
| 121 | +// NumberOfDEsInBin : Number of significantly altered items in 'OntologyTerm' bin |
| 122 | +// NumberInBin : Number of items in 'OntologyTerm' bin |
| 123 | +// TotalNumberOfDE : Number of significantly altered items |
| 124 | +// TotalUniverse : Number of all items (expanded) |
| 125 | +// PValue : p value |
0 commit comments