-
Notifications
You must be signed in to change notification settings - Fork 7
/
Copy pathqpgraph.qmd
273 lines (199 loc) · 14.2 KB
/
qpgraph.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
---
title: "Estimating Admixture Graphs with qpGraph"
author: "Eleni Seferidou"
format: html
editor: visual
bibliography: references.bib
toc: true
date: 2024-10-17
---
## Method Overview
Qpgraph is used to reconstruct models of genetic relationships between different groups (that can be single individuals or larger populations) through a phylogenetic tree, which allows for admixture from one group into another. The method operates on a user-defined graph topology and estimates *f*2, *f*3, and *f*4-statistic values[^1] for all pairs, triples, and quadruples of groups in the provided dataset. After, these calculated statistics are compared to the expected – based on the graph topology the user has set - allele frequency correlation of the tested groups.
[^1]: For more information, see chapter 3. *Introduction to F3- and F4-Statistics*
For a given topology, `qpGraph`[^2] [@Patterson2012] provides branch lengths (and mixture proportions in the case of an admixture event) that produce the best fit to allele frequency statistics measured on the data. Groups that share a more recent common ancestor will co-vary more than others in their allele frequencies due to common genetic drift.
[^2]: [AdmixtureGraph](https://github.com/DReichLab/AdmixTools/blob/master/README.QPGRAPHhttps), part of Admixtools Package
:::{.callout-tip}
Keep in mind: The reconstructed graph is never meant to reflect a comprehensive population history of the region under study but the best fitted model to the limit of the available groups, the user's input topology and the method’s resolution.
:::
▶️▶️▶️ Instructions on how to install the AdmixTools package and qpGraph can be found [here](https://github.com/DReichLab/AdmixTools/blob/master/README.INSTALL).
## Data Preparation
- Include individuals with at least 100,000 SNPs covered.
- Genotype data in [EIGENSTRAT](https://reich.hms.harvard.edu/software/InputFileFormats) format as input.
- When working with ancient DNA, exclude a subset of SNPs[^3] that are transitions in CpG sites. Those Cytosines are prone to be methylated and if deamination (the typical chemical modification of aDNA) occurs at the same position Cytosine is directly converted into Thymine without becoming Uracil. Thus, the resulting CtoT modification cannot be removed with an enzymatic reaction like performing uracil-DNA glycosylase (UDG) treatment.\
This results in additional noise in aDNA data that can be reduced by excluding those SNPs, especially when analysing ancient samples that have been processed using different laboratory protocols.
[^3]: A snp list with CpG sites can be found [here](https://github.com/EleniSef/qpGraph)
▶️ With plink:
``` bash
plink --bfile packageprefix --exclude snplist --make-bed --out packageprefix.filtered
```
### Parfile
To run `qpGraph` you will need a parameter file with the following format:
``` bash
DIR: /data/qpGraph
genotypename: DIR/<filename>.geno
snpname: DIR/<filename>.snp
indivname: DIR/<filename>.ind
outpop: NULL
useallsnps: YES
blgsize: 0.05
forcezmode: YES
lsqmode: YES
diag:.0001
bigiter: 15
hires: YES
lambdascale: 1
initmix: 1000
inbreed: NO
fstdmode: YES
```
Going through the most important parameter options:
`outpop: NULL` does not use an outgroup population to normalize f-stats by heterozygosity e.g. selecting a group in the graph in which SNPs must be polymorphic.
`useallsnps: YES` each comparison uses all SNPs overlapping in that specific test, and not between all groups. The default option is `NO`; with which `qpGraph` will calculate the SNPs overlapping across all individuals in the provided dataset. This will result in a lower number of SNPs but will increase the comparability between the individual statistical tests.
`blgsize: 0.05` block size in Morgans for Jackknife.
`lsqmode: YES`[^4] least-squares mode (The default option is `NO`); sets the off-diagonal elements of the block-jackknife covariance matrix to 0; this provides stability for large graphs.
[^4]: Least-squares mode: minimizes the sum of the squares of the differences (the "errors") between the observed (measured) values and the values predicted by a model.
`diag:.0001`[^5] use the full matrix form of the objective function to avoid the basis dependence of the least-squares version of the computation.
[^5]: By using the full matrix representation of the objective function, one can avoid potential issues of **basis dependence** that might arise when using the **least-squares** approach – for fitting, leading to more reliable computations. Adding a small value (0.0001) to the diagonal elements of the matrix ensures numerical stability, especially when the matrix might otherwise have very small or zero values on the diagonal, which can cause issues in computations like matrix inversion.
`hires: YES` controls output when more decimals are desired.
`lambdascale: 1` preserves the standard scaling of the f-statistics without an extra denominator.
`initmix: 1000` This may be useful if you are worried the program may not have found the global optimum.
`inbreed: NO` if you have populations with 1 individual.
`fstdmode: YES` branch lengths in units of Fst.
▶️▶️▶️ For more parameter options look at the tool's page on [GitHub](https://github.com/DReichLab/AdmixTools/blob/master/README.QPGRAPH) and on this [Workshop](https://comppopgenworkshop2019.readthedocs.io/en/latest/contents/07_qpgraph/qpGraph.html)
### Graph Topology
In addition, you need to provide a file in which the graph topology for the tree is set. For this, prepare a tab-separated file, beginning with setting labels for each population you need included in the tree. **Tip !** in the output summary statistics, only the first three letters appear, so make the labels distinguishable!
In the second part of the graph topology file, you will need to provide the edges of the tree in a bifurcation structure. \
The model is based on an unrooted tree and while we show graphs with a selected outgroup as the root, the results should not depend on the root position. After, set the edges based on the direction you want to test. The structure to be followed here is to indicate from which node does the edge start from and to which node does it move towards (b_a0_a1), the node1 (a0) and the node2 (a1) - all separated with tabs.\
🔍 Example. For this example, we are using populations from the **Simons Genome Diversity Project** [@mallick2016]
Preparing the graph file for the following tree:
{fig-align="right"}
``` bash
root a0
label Mbuti.DG a1
label Han.DG a2
label Pima.DG a4
label Mayan.DG a6
label Piapoco.DG a10
label Surui.DG a12
label Karitiana.DG a14
edge b_a0_a1 a0 a1
edge b_a0_a3 a0 a3
edge b_a3_a2 a3 a2
edge b_a3_a5 a3 a5
edge b_a5_a4 a5 a4
edge b_a5_a7 a5 a7
edge b_a7_a6 a7 a6
edge b_a7_a9 a7 a9
edge b_a9_a8 a9 a8
edge b_a9_a11 a9 a11
edge b_a11_a10 a11 a10
edge b_a8_a12 a8 a12
edge b_a8_a14 a8 a14
```
### Run
Directly on the command line:
``` bash
qpGraph -p parfile_qpGraph.sh -g scaffold.graph -o scaffold.graph.ggg -d scaffold.graph.dot > scaffold.graph.out
```
Where `-p` add the path to the parameter file, `-g` add the path for your graph file (the one prepared above), `-o` refers to the output file (`.ggg`), `-d` will be the dot file (to visualise the produced tree) and finally, `.out` file is the combined output file ❗ **Keep the naming consistent for each run** ❗
- With run file:
``` bash
#!/usr/bin/env bash
par=DIR/parfile_qpGraph.sh # path to parfile
gp=DIR/scaffold.graph # graph file
cd DIR/output # path to output folder
qsub -cwd -V -b y -l h_vmem=40G -pe smp 4 qpGraph -p ${par} -g ${gp} -o #par.qp${gp}.ggg -d par.qp${gp}.dot > par.qp${gp}.out
```
- Get the pdf through the `.dot` file:
``` bash
dot -Tpdf scaffold.graph.dot > scaffold.graph.pdf
```
## Output
Includes:
`.out` file
- with the total number of individuals used for all groups and the total number of SNPs available from at least one individual.
- List of SNPs with no data.
- Summary of Fst, *F*2, *F*3, *F*4-statistics. In case where multiple tests deviate from 0, you can make use of those statistics to get information on where the tree topology does not work, and which populations attract each other more than indicated by the provided topology. After you can adjust the tree accordingly to achieve a better Z-score.
- Z-scores determined from standard errors obtained from jackknife resampling.Since we are looking for a tree that fits our data, a Z score of 0 means a perfect fit. Ideally, we are looking for Z-score between 3 and -3 to accept the graph topology.
- Worst f-statistic \[the outlier *f*4-statistics\]. The entries show the test groups, the fitted (from the defined graph topology) and the observed *f*4 values, the difference between them, the standard error obtained from [jackknife resampling](https://www.sciencedirect.com/topics/mathematics/jackknife-resampling), and lastly the Z-score for the test.
``` bash
nodata: rs75089321
outpop: NULL
population: 0 Mbuti.DG 4
population: 1 Han.DG 3
population: 2 Pima.DG 2
population: 3 Mayan.DG 2
population: 4 Piapoco.DG 2
population: 5 Surui.DG 2
population: 6 Karitiana.DG 3
before setwt numsnps: 918965 outpop: NULL
setwt numsnps: 656994
number of blocks for moving block jackknife: 711
snps: 656994 indivs: 18
number of blocks for block jackknife: 711
lambdascale: 1.000
outliers:
Fit Obs Diff Std. error Z
worst f-stat: Pim Pia Sur Kar 0.000000 -0.001301 -0.001301 0.000488 -2.666
```
`.ggg` file
- Drift lengths (in units of Fst multiplied by 1000) and admixture proportions (as percentages from each group) in the case where an admixture edge has been set.
``` bash
edge b_a0_a1 a0 a1 0.042317
edge b_a0_a3 a0 a3 0.042317
edge b_a3_a2 a3 a2 0.008442
edge b_a3_a5 a3 a5 0.023047
edge b_a5_a4 a5 a4 0.010935
edge b_a5_a7 a5 a7 0.002088
edge b_a7_a6 a7 a6 0.001838
edge b_a7_a9 a7 a9 0.002316
edge b_a9_a8 a9 a8 0.003027
edge b_a9_a11 a9 a11 0.001235
edge b_a8_a12 a8 a12 0.021636
edge b_a8_a14 a8 a14 0.017294
edge b_a11_a10 a11 a10 0.001235
edge b_a17_a16 a17 a16 0.009809
admix a17 a10 a12 0.971592 0.028408
```
`.dot` file (in pdf format)
- At the top you have the worst Z-score informing you which populations attract more than the tree suggests. If it is positive then in the structure a,b,c,d we have more attraction in a-c and b-d. If it is negative then we have more attraction in b-c and a-d.
- Numbers on solid lines are genetic drift with units of FST × 1000; if it is 0 then a tri-furcation is suggested, meaning that the branch can be collapsed to the node before.

**Once confirming that the base tree works we can keep adding the groups of interest by adding them as edges and checking the output for the statistics.**
## Adding an Admixture Edge
If there are indications of an admixture event (either from other analyses or from population history), the tree can be adjusted. To add an admixture edge, we need to adjust the Graph Topology and the graph file accordingly. For this, we need to set an extra edge (admix) specifying which populations have contributed to the admixture.
``` bash
root a0
label Mbuti.DG a1
label Han.DG a2
label Pima.DG a4
label Mayan.DG a6
label Piapoco.DG a10
label Surui.DG a12
label Karitiana.DG a14
Label New_population a16
edge b_a0_a1 a0 a1
edge b_a0_a3 a0 a3
edge b_a3_a2 a3 a2
edge b_a3_a5 a3 a5
edge b_a5_a4 a5 a4
edge b_a5_a7 a5 a7
edge b_a7_a6 a7 a6
edge b_a7_a9 a7 a9
edge b_a9_a8 a9 a8
edge b_a9_a11 a9 a11
edge b_a11_a10 a11 a10
edge b_a8_a12 a8 a12
edge b_a8_a14 a8 a14
admix a17 a10 a12
edge b_a17_a16 a17 a16
```

## Tips
- Start with a small, well-understood graph and then add populations (either unadmixed or admixed) one at a time in their best-fitting positions. This involves trying different branch points for the new population and comparing the results [@lipson2017].
- If a population is unadmixed and placed in the wrong position, the fit of the model will be poorer, and the inferred split point will move as far as it can in the correct direction, constrained only by the specified topology. If no placement provides a good fit, then we can infer an admixture event, in which case we test for the best-fitting split points of the two ancestry components.
- **Mindful of missing data** as all f-statistics are computed on the same set of markers (especially if `useallsnps: NO` is set.
- **Over-fitting** can be a problem, especially if we hypothesize many admixing events but only have data for a few populations.
- It is important to not have zero-length edges because it might signify that the modeled edge does not exist. Also, terminal edges for ancient groups, especially if composed by a single individual, are artificially long and should not be considered
- Make different directories for every population you want to add, to have more control over the effect it has on the tree topology.
- Using `QpGraph` to explore the phylogenetic topology of populations, whose population history is not known is not ideal and can be time consuming. Users recommend using this tool to test two hypotheses against each other and estimating which one has a better fit to the provided data.