forked from hyperboliccake/introgression
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathtodo.txt
57 lines (44 loc) · 2.13 KB
/
todo.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
1. Theory/simulations of robustness to ILS
should probably run these without split populations?
make sure parameters are giving expected sequence divergence
- figure out what this percentage is from 100 genomes sequences/par reference:
~99.5% id w/ cer ref
~89% id w/ par ref
(these numbers are calculated from only match/mismatch columns in alignments
(gaps are ignored) so it's probably a bit of an overestimate if anything...
unless shittiness of alignments outweighs that)
goals of simulations:
establish tpr/fpr for reasonable parameters with HMM method
pick reasonable parameters for HMM - average across different demographic models?
show ILS is unlikely to be an issue (related to low fpr)
2. HMM algorithm -> show advantages over previous approaches; evaulate power/FPR; boundaries
3. Apply to data (maybe some new data?)
- 100 genomes + SGRP cer
- 37 par SGRP
4. Empirical patterns
- how much introgression overall and per strain
- relationship between introgression levels among strains and phylogeny/geographic
- frequency dist
- size differences and inferences of timing
- functional variation/GO
- if par analyzed, similarities or diff compared to cer
- UNKNOWN
5. Experimental/function
- GWAS
current goals for pipeline:
- alignments for beer strains, sgrp paradoxus strains, chinese strains (?)
- simulations for something that looks more like bayanus (for unknown category)
- introgression into paradoxus as well
- extract directories into separate file
- master run.sh that goes through all the steps without problems
- automatically read in hmm parameters, based on medians from simulations
for grant:
1. show ILS not likely to be a problem
2. description of HMM and validation of performance
3. application to real data
- strain level summaries of introgressed sequence
- gene level summaries
- distribution of introgressed sequence size
- bed file of coordinates for one strain
4. inferring timing/number of introgression events based on length distribution
figure out why in simulations we seem to be predicting introgressed sequence in cerevisiae reference