Skip to content

Commit edfae01

Browse files
committed
update readme and port old docs
1 parent ae2b6d9 commit edfae01

File tree

4 files changed

+141
-24
lines changed

4 files changed

+141
-24
lines changed

README.md

+1-1
Original file line numberDiff line numberDiff line change
@@ -12,7 +12,7 @@ BioFSharp.Stats is part of the [BioFSharp]() extension package ecosystem.
1212

1313
| Build status (ubuntu and windows) | Test Coverage |
1414
|---|---|
15-
| ![](https://github.com/BioFSharp/BioFSharp.Stats/actions/workflows/build-and-test.yml/badge.svg) | [![codecov](https://codecov.io/gh/BioFSharp/BioFSharp.Stats/branch/main/graph/badge.svg)](https://codecov.io/gh/BioFSharp/BioFSharp.Stats) |
15+
| [![Build and test](https://github.com/BioFSharp/BioFSharp.Stats/actions/workflows/build-and-test.yml/badge.svg)](https://github.com/BioFSharp/BioFSharp.Stats/actions/workflows/build-and-test.yml) | [![codecov](https://codecov.io/gh/BioFSharp/BioFSharp.Stats/branch/main/graph/badge.svg)](https://codecov.io/gh/BioFSharp/BioFSharp.Stats) |
1616

1717

1818
From 2.0.0 onwards, this package will have a decoupled release schedule from the core `BioFSharp` package.

docs/GSEA.fsx

+125
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,125 @@
1+
(*** hide ***)
2+
// This block of code is omitted in the generated HTML documentation. Use
3+
// it to define helpers that you do not want to show in the documentation.
4+
#r "nuget: BioFSharp, 2.0.0-preview.3"
5+
#r "nuget: FSharp.Stats, 0.6.0"
6+
#I @"..\src\BioFSharp.Stats\bin\Release\netstandard2.0"
7+
#r "BioFSharp.Stats.dll"
8+
9+
open BioFSharp
10+
open FSharpAux
11+
12+
(**
13+
# Gene set enrichment analysis (GSEA)
14+
15+
When performing high-troughput experiments the question arises how to analyze the resulting huge data sets.
16+
Of course the analysis of each single item separately is possible but unefficient and cumberersome.
17+
If you want to compare the transcript/protein composition of a system before and after a treatment it is useful
18+
to make use of a gene set enrichment analysis (GSEA) to identify overrepresented transcript/protein groups.
19+
The grouping can be based on MapMan- or GO-annotations or other shared properties.
20+
For further information visit [GSEA publication](https://academic.oup.com/bioinformatics/article/23/4/401/181853).
21+
22+
## Method
23+
24+
Imagine an experiment, in which 1000 proteins were measured. Within these 1000 proteins there are 25 proteins
25+
of a specific functional group (heat shock response).
26+
By treating the organism(s) 200 proteins are significantly altered. 10 of the 25 heat responsive proteins
27+
can be found within these 200.
28+
29+
Null-Hypothesis: The number of identified altered items is due to chance.
30+
31+
Question: What is the probability, that these 10 or more proteins occuring in the 200 altered proteins by chance?
32+
33+
The hypergeomentric distribution can answer this question:
34+
35+
- N is the population size (1000 proteins)
36+
37+
- K is the number of success states in the population (25 heat responsive proteins)
38+
39+
- n is the number of draws (200 significantly altered proteins)
40+
41+
- k is the number of observed successes (10 or higher)
42+
43+
Result: 0.0161 = 1.6%
44+
45+
If the calculated probability is small, it is very unlikely, that the observed effect is due to chance.
46+
This implies, that under the given condition the heat response proteins are enriched.
47+
48+
###Split p value threshold
49+
50+
The discrete nature of the hypergeometric distribution prevents the significance to reach 0.05 exactly. Especially in small bin sizes there always is a range of $\alpha$-level space that must be sacrificed leading to a lower $\alpha$ than intended.
51+
To mitigate this loss of power you can use a mid p-value, that has the form of:
52+
53+
- $P(K>k) + 0.5P(K=k)$ or equivalently $0.5(P(K>k) + P(K\geq k))$
54+
55+
The resulting p-values are not conservative any more. The actual alpha level may be higher than intended!
56+
57+
_References_
58+
59+
- Figure A1 in Schneider, Venn, and Mühlhaus; TMEA: A Thermodynamically Motivated Framework for Functional Characterization of Biological Responses to System Acclimation; Entropy 2020; https://doi.org/10.3390/e22091030
60+
61+
- Rivals et al.; Enrichment or depletion of a GO category within a class of genes: which test?; Bioinformatics 2006; doi:10.1093/bioinformatics/btl633
62+
63+
- Rubin-Delanchy et al.; Meta-Analysis of Mid-p-Values: Some New Results based on the Convex Order.; Journal of the American Statistical Association 2018; doi:10.1080/01621459.2018.1469994
64+
65+
- Agresti and Min; On Small-Sample Confidence Intervals for Parameters in Discrete Distributions; Biometrics 2001; https://doi.org/10.1111/j.0006-341X.2001.00963.x
66+
67+
##Usage
68+
69+
*)
70+
71+
//Usage:
72+
//At first all items are converted into an OntologyItem type:
73+
74+
open BioFSharp
75+
open BioFSharp.Stats
76+
77+
//Example item
78+
//Protein: HSP70
79+
//MapManAnnotation: GMM:29.6.2.3;GMM:20.2.1
80+
//Significantly altered?: true
81+
82+
//Creates an OntologyItem with an identifier, an Ontologyterm, a group index and an arbitrary and
83+
//optional field. The groupIndex is based on prior analyis and defines if the items belongs to a
84+
//specific group. If significantly altered proteins were identified, you can assign index '1' to
85+
//all altered proteins and '0' to all other. The choice of the index is up to you, but you have
86+
//to remind it for later steps.
87+
let item = OntologyEnrichment.createOntologyItem "Cre06.g250100" "protein.folding.chaperones and co-chaperones.HSP70s;stress.abiotic.heat" 1 "HSP70"
88+
89+
90+
//If more than one ontology term is possible, you can use this function to split the item, so that
91+
//every resulting item contains just one ontology term.
92+
let splitItems = OntologyEnrichment.splitMultipleAnnotationsBy ';' item
93+
94+
//create a list of all items in your experiment
95+
let items = [splitItems(*...*)] |> Seq.concat
96+
97+
//When dealing with MapMan annotations all levels of the descriptions should be considered.
98+
//"stress.abiotic.heat" -> ["stress.abiotic.heat"; "stress.abiotic"; "stress"]
99+
let expandedItems = OntologyEnrichment.expandOntologyTree items
100+
101+
102+
//now the dataset is ready for performing the enrichment analysis
103+
//deGroupIndex: which index should be considered as success? (in this case '1')
104+
//splitPValueThreshold: if a bin size is below this threshold, the resulting pvalue
105+
// is split up (-> increased to lower the impact of small bins) (default 5).
106+
//minNumberInTerm: what is the minimal number of items that should be
107+
// within a bin to consider the bin at all (default 2).
108+
//data: sequence of expanded and split ontology items
109+
let gseaResult = OntologyEnrichment.CalcOverEnrichment 1 (Some 5) (Some 2) expandedItems
110+
111+
112+
let filterAndSortEnrichedModules =
113+
gseaResult
114+
|> Seq.filter (fun x -> x.PValue < 0.05)
115+
|> Seq.sortByDescending (fun x -> x.NumberOfDEsInBin)
116+
117+
//the result type contains the following fields:
118+
//GseaResult:
119+
// OntologyTerm : OntologyTerm(MapMan-Term/GO-Term/...)
120+
// ItemsInBin : Sequence of single items belonging to the ontology term
121+
// NumberOfDEsInBin : Number of significantly altered items in 'OntologyTerm' bin
122+
// NumberInBin : Number of items in 'OntologyTerm' bin
123+
// TotalNumberOfDE : Number of significantly altered items
124+
// TotalUniverse : Number of all items (expanded)
125+
// PValue : p value

docs/index.fsx

+13-21
Original file line numberDiff line numberDiff line change
@@ -1,32 +1,24 @@
11
(**
22
# BioFSharp.Stats
33
4-
you can write markdown in these comment brackets `(** ... *)`
5-
*)
4+
![Logo](docs/img/Logo_large.png)
65
7-
// and actual F# content in a literate script file
8-
let a = 42
6+
![NuGet Version](https://img.shields.io/nuget/v/BioFSharp.Stats?label=nuget(stable))
7+
![NuGet Version](https://img.shields.io/nuget/vpre/BioFSharp.Stats?label=nuget(prerelease))
98
10-
(**
11-
include evaluations like this:
12-
*)
9+
BioFSharp.Stats contains statistical functions with a clear biological focus such as Gene Set Enrichment Analysis (GSEA).
1310
14-
(***include-value: a***)
11+
BioFSharp.Stats is part of the [BioFSharp]() extension package ecosystem.
1512
16-
(**
17-
Include the compiled binaries of your project to use in the docs like this.
18-
Make sure to load any external references via nuget.
19-
*)
20-
#r "nuget: BioFSharp, 2.0.0-preview.3"
21-
#I @"..\src\BioFSharp.Stats\bin\Release\net8.0"
22-
#r "BioFSharp.Stats.dll"
2313
24-
open BioFSharp.Stats
14+
| Build status (ubuntu and windows) | Test Coverage |
15+
|---|---|
16+
| [![Build and test](https://github.com/BioFSharp/BioFSharp.Stats/actions/workflows/build-and-test.yml/badge.svg)](https://github.com/BioFSharp/BioFSharp.Stats/actions/workflows/build-and-test.yml) | [![codecov](https://codecov.io/gh/BioFSharp/BioFSharp.Stats/branch/main/graph/badge.svg)](https://codecov.io/gh/BioFSharp/BioFSharp.Stats) |
2517
26-
let o = BioTalk.helloBio "Kev"
2718
28-
(***include-value: o***)
19+
From 2.0.0 onwards, this package will have a decoupled release schedule from the core `BioFSharp` package.
20+
This means that the versioning will be independent and may not follow the same versioning scheme as `BioFSharp`.
21+
22+
The last coupled release was `2.0.0-preview.3`.
23+
*)
2924

30-
(**
31-
Read more about documentation with fsdocs here: https://fsprojects.github.io/FSharp.Formatting/
32-
*)

src/BioFSharp.Stats/PhylTree.fs

+2-2
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,7 @@ module PhylTree =
2020
loop 0. hCluster
2121

2222
/// <summary>
23-
/// Performs hierarchical clustering of the input TaggedSequences using the provided distance function and linker. Returns the result as a Phylogenetic tree.</summary>
23+
/// Performs hierarchical clustering of the input TaggedSequences using the provided distance function and linker. Returns the result as a Phylogenetic tree.
2424
/// </summary>
2525
/// <parameter name="branchTag">a tag to give the infered common ancestor branches (these are not tagged in contrast to the input sequence.)</parameter>
2626
/// <parameter name="distanceConverter">a converter function for the distance between nodes of the tree. Usually, a conversion to a string makes sense for downstream conversion to Newick format</parameter>
@@ -43,7 +43,7 @@ module PhylTree =
4343

4444

4545
/// <summary>
46-
/// Performs hierarchical clustering of the input TaggedSequences using the provided distance function. Returns the result as a Phylogenetic tree.</summary>
46+
/// Performs hierarchical clustering of the input TaggedSequences using the provided distance function. Returns the result as a Phylogenetic tree.
4747
/// </summary>
4848
/// <parameter name="distanceFunction">a function that determines the distance between two sequences e.g. evolutionary distance based on a substitution model</parameter>
4949
/// <parameter name="sequences">the input TaggedSequences</parameter>

0 commit comments

Comments
 (0)