-
Notifications
You must be signed in to change notification settings - Fork 0
Home
A polynomial-time algorithm has been described for computing probabilities of ranked gene tree topologies given species trees. This algorithm has not yet been implemented. Once, implemented, ranked gene tree probabilities could be used to infer species trees, although inferring species trees is beyond the scope of the project. The idea is to consider ranked gene tree topologies, where we distinguish the relative order of times of nodes on gene trees, but not the real-valued branch lengths.
- C
- GMP arbitrary precision library
-
Week 1 Investigate general structure of the algorithm, write pseudocode. Sketch a rough outline of the algorithm. Transform probability formulae to their logarithmic equivalents. Log calculations increase stability, this technique is usually used in Naive Bayes implementations (it also has potential for instruction level parallelism speedup when used with float/double). It’s based on storing (natural) logarithms of the probabilities, and using log identities. As a result, multiplication could be replaced with summation. For summation it’s possible to use the following formula: ln(a + c) = ln(a) + ln(1 + exp(ln(c) - ln(a))) // log is cheaper and more precise than exponentiation on x87 FPU. Deliverable: formulae in logarithmic form
-
Week 2 Implement lowest common ancestor (LCA) algorithm for calculation of g[i] values (see paper) + test it. Possible choices:
- O(N*logN) preprocessing + O(logN) query binary expansion method (easier and faster to implement)
- O(N) preprocessing + O(1) query algorithm (more efficient) Bender’s algorithm Deliverable: Implemented LCA procedure.
-
Week 3 Write test cases for the implemented LCA algorithm, and test it. Define tree data structures (adjacency lists or matrices), and implement them in C++. Implement input / output format specifications and parsers: read_tree(FILE *f, n) that reads a tree with n vertices (forma) from a file / stream pointer. Deliverable: LCA should pass test cases, newick format reader, print_tree procedure for debugging purposes (tree.h with types, and tree.c with implementation)
-
Week 4 Write a procedure for transforming general newick-read trees with edge lengths into format suitable for use with ranked tree algorithm. Deliverable: function transforming general tree with edge lengths into a ranked tree.
-
Week 5 Write K(i,z) implementation for calculating k value of <i,m[i],z> (see section 2.1 of paper). Review James Degnan code for unranked case to study specifics of floating point operations implementation, and other helpful cues. Deliverable: K(i,z) helper function implementation
-
Week 6 Implement function calculating g[i] array (equation 8 in the paper). Implement lambda(i, j) function needed later for conditional probabilities calculation (see step 3 in section 2.2 of the paper). Deliverable: Function g() for calculating values for the g[] array. Function lambda() for calculating lambda array (these are helper functions for general step of the algorithm).
-
Week 7 Assemble blocks to implement 4, 5, and 6 steps of the algorithm though recursive calls. Improve through memoization / dynamic programming. Deliverable: Preliminary implementation of the algorithm, which probably will be quite buggy.
-
Week 8 Midterm Evals When bugs are fixed, implement multiple tree handling in the input. Deliverable: Algorithm should work decently with no major bugs, and should handle inputs with multiple trees.
-
Week 9 Finish up the algorithm, make R package. Deliverable: R package.
-
Week 10 Investigate different fixed precision floating point handling techniques (avoiding floating point underflow is important). Use standard arbitrary-precision as a reference benchmark to estimate relative / absolute errors of different techniques. Implement compiler flag switch allowing for use of this high precision method with sacrifice of speed. Deliverable: Improved version of code with preprocessor flags allowing to choose faster fixed or more precise arbitrary precision floating-point arithmetic.
-
Week 11 Run floating point precision error estimation benchmarks with respect to code implemented at Week 8. If error rates are small enough for built-in data types choose from: float (32 bit, 6 dec. digit precision), double (64 bit, 16 dec. digit precision), long double (64 to 80 bit precision depending on FPU architecture 16 - 21 dec. digit precision), __float128 (128 bit non-standard, available on gcc 4.6+, 34 dec. digits precision) ordered by speed, especially if SIMD is used. Deliverable: Precision loss estimates for fixed machine depended floating point arithmetic (float, double, long double).
-
Week 12 Investigate possibilities for instruction level SIMD parallelism (SSE, AVX) to speed up inner loops for summation of probabilities if fixed point arithmetic is used. Implement thread pool class, add each recursive call as a separate job (OpenMP). Derive final solution, run tests, and write documentation. Submit final evaluations. Deliverable: Thread pool class having N threads that execute jobs from the queue (recursive calls)