\newif\iftwoside
% uncomment the next line for double-sided printing (twoside)
% \twosidetrue
\iftwoside
\documentclass[11pt, a4paper, notitlepage, twoside]{report}
\else
\documentclass[11pt, a4paper, notitlepage]{report}
\fi
%bibliography
\usepackage[a4paper,width=150mm,top=40mm,bottom=25mm]{geometry}
\usepackage[utf8x]{inputenc}
\DeclareUnicodeCharacter{8239}{}
%use arial
% \usepackage[scaled]{helvet}
% \renewcommand\familydefault{\sfdefault}
% \usepackage{lmodern}
\usepackage[sc]{mathpazo}
\usepackage[T1]{fontenc}
% \usepackage{underscore}
\usepackage{setspace}
\usepackage{graphicx}
\usepackage{fancyhdr}
% \pagestyle{fancy}
% \fancyhf{}
% \fancyhead[C]{\leftmark}
% \fancyfoot{}
% \rfoot{\thepage}
\fancypagestyle{main}{
\fancyhf{}
\renewcommand{\headrulewidth}{0.4pt}
\fancyhead[C]{\leftmark}
\fancyfoot[R]{\thepage}
}
\fancypagestyle{noheadt}{
\fancyhf{}
\renewcommand{\headrulewidth}{0pt}
\cfoot{\thepage}
}
\fancypagestyle{noheadi}{
\fancyhf{}
\renewcommand{\headrulewidth}{0pt}
\rfoot{\thepage}
}
% different left and right numbering
\iftwoside
\fancyfoot[RO,LE]{\thepage}
\else
\rfoot{\thepage}
\fi
\renewcommand{\footrulewidth}{0pt}
\usepackage{multirow}
\usepackage{colortbl}
\usepackage{lscape}
\usepackage{subfiles}
\graphicspath{{figure/}{../figure/}}
%use single spacing after period
\frenchspacing
% \usepackage[round]{natbib}
% numeric-style citations; swap with the line above for round author-year citations
\usepackage[numbers,square,comma,sort]{natbib}
%\bibliographystyle{abbrvnat}
\usepackage{chapterbib}
% \setlength{\bibsep}{0pt}
%reference as a section instead of a chapter
\renewcommand{\bibsection}{\section*{\bibname}}
\usepackage{hyperref}
\usepackage{xcolor}
\usepackage{pdflscape}
\usepackage{enumerate}
\usepackage[singlelinecheck=false]{caption}
\usepackage{longtable}
%for cv
\usepackage{pdfpages}
% break long url
\usepackage{xurl}
\urlstyle{same}
% no space between list of item
\usepackage[shortlabels]{enumitem}
\setlist[itemize]{noitemsep, topsep=0pt}
% add space between paragraph
% \setlength{\parskip}{\baselineskip}
% \usepackage{tocloft}
% \setlength[titles]{\cftfignumwidth}{2.55em}
\usepackage[titles]{tocloft}
\setlength{\cftfignumwidth}{3em}
% reduce size of the section
\usepackage{titlesec}
\titlespacing\subsection{0pt}{0pt}{-1em}
\hypersetup{
colorlinks,
linkcolor={red!70!black},
citecolor={blue!70!black},
urlcolor={magenta!70!black}
}
\renewcommand{\bibname}{References}
\renewcommand{\contentsname}{Table of Contents}
\begin{document}
\newcommand*{\BuildingFromMainFile}{}
\includepdf[noautoscale,pages=-]{frontback/coverpage.pdf}
\subfile{frontback/titlepage.tex}
\iftwoside
\newpage
\thispagestyle{empty}
\
\newpage
\fi
\pagenumbering{roman}
\newpage
\setcounter{tocdepth}{1}
\hypersetup{linkcolor=black}
\phantomsection
\addcontentsline{toc}{chapter}{Table of Contents}
{
\pagestyle{plain}
\tableofcontents
}
\newpage
\phantomsection
\addcontentsline{toc}{chapter}{List of Figures}
\listoffigures
\newpage
\phantomsection
\addcontentsline{toc}{chapter}{List of Tables}
\listoftables
\newpage
\phantomsection
\addcontentsline{toc}{chapter}{Summary}
\section*{\centering{\LARGE{Summary}}}
\thispagestyle{plain}
\hypersetup{linkcolor={red!70!black}}
%\linespread{1.6}
\setlength{\parskip}{\baselineskip}
\doublespacing
The assembly of the draft \emph{Bos taurus} reference genome was a milestone for genetics- and genomics-oriented research in cattle. The reference genome of domestic cattle was built from a single animal of the Hereford breed. However, the linear reference sequence does not represent the genetic diversity of global cattle breeds. This lack of diversity causes problems, particularly when DNA sequences from genetically distant animals are aligned and compared to the reference sequence, an issue widely known as reference bias. Pangenomes are an intriguing novel reference structure that captures the full spectrum of genetic diversity within a species. A rich, graph-based pangenome reference can integrate multiple genome assemblies and their sites of variation in a coherent and non-redundant data structure. This thesis investigated, for the first time, the utility of graph-based references for genomic analysis in a livestock population.
Chapter 2 assessed the feasibility of graph-based genomic analysis in cattle. Specifically, using whole-genome sequencing data of 49 Original Braunvieh cattle, a graph-based sequence variant genotyping approach was implemented with the \emph{Graphtyper} software and compared to two widely used methods (\emph{SAMtools} and \emph{GATK}) that rely on a strictly linear representation of the reference. A comparison between sequence variant and array-derived genotypes indicated that the graph-based approach outperformed both \emph{SAMtools} and \emph{GATK} with regard to genotype concordance, non-reference sensitivity, non-reference discrepancy, and Mendelian consistency of genotypes observed in parent-offspring pairs. These findings demonstrated that graph-based genotyping using \emph{Graphtyper} is accurate, sensitive, and computationally feasible in the cattle genome.
Chapter 3 reports on the construction of breed-specific and multi-breed genome graphs for four European cattle breeds (Original Braunvieh, Brown Swiss, Fleckvieh, and Holstein). The \emph{vg toolkit} was used to augment the linear Hereford-based reference sequence with variants that were prioritized based on allele frequency in the different breeds. Based on both real and simulated short-read sequencing data, this study showed that variant prioritization is crucial for building informative genome graphs. Intriguingly, adding many low-frequency and rare variants to the genome graphs compromised mapping accuracy. Moreover, this chapter demonstrated that multi-breed graphs and breed-specific graphs enable almost identical mapping improvements over a linear reference genome. Finally, the first whole-genome graph was constructed for the Brown Swiss cattle breed using 14 million variants. The application of this whole-genome graph facilitated accurate short-read mapping and unbiased sequence variant discovery.
\thispagestyle{plain}
Chapter 4 reports on integrating six reference-quality bovine genome assemblies into a unified multi-assembly graph using the \emph{minigraph} software. The pangenome contains 70 megabases that are not present in the current ARS-UCD1.2 \emph{Bos taurus} reference genome. Using complementary bioinformatics approaches, this chapter provides compelling evidence that these non-reference sequences contain functionally active and biologically relevant elements. Specifically, the analysis of transcriptome data revealed putatively novel genes, including some that are differentially expressed between individual animals. Moreover, variant discovery in the non-reference sequences revealed thousands of previously undetected polymorphic sites that capture genetic differentiation across cattle breeds. This chapter demonstrated that multi-assembly graphs make previously neglected genetic variation amenable to genetic investigation.
Overall, this thesis presents a novel analysis paradigm in livestock genomics by leveraging variation-aware reference structures. The analyses presented in this thesis provide a first step towards the transition from linear to graph-based reference structures in order to mitigate the inherent biases of the linear reference genome. Importantly, this thesis establishes a computational framework to integrate multiple genome assemblies and their sites of variation into a more diverse reference structure that is broadly applicable across species.
\newpage
\thispagestyle{plain}
\phantomsection
\addcontentsline{toc}{chapter}{Zusammenfassung}
\section*{\centering{\LARGE{Zusammenfassung}}}
The assembly of the \emph{Bos taurus} reference sequence was a milestone for genetic and genomic research in cattle. The reference sequence was generated from a single animal of the Hereford breed. However, the genetic diversity of the global cattle population cannot be represented in a single linear reference genome. This is particularly problematic when sequences from genetically distant animals are compared with the reference genome. Pangenomes are interesting novel reference structures that capture the full spectrum of genetic diversity of a species. Such graph-based reference structures can integrate multiple assemblies and their variable sites. This dissertation uses graph-based reference structures for genetic analyses in a livestock population for the first time.
In the second chapter, graph-based genomic analyses are carried out in cattle for the first time. The genome sequences of 49 Original Braunvieh cattle are searched for polymorphic sites using a graph-based approach. These sites are genotyped with the \emph{Graphtyper} software. The resulting genotypes are compared with genotypes determined by two widely used methods (\emph{SAMtools} and \emph{GATK}) that rely strictly on a linear reference sequence. A comparison with SNP-chip-based genotypes shows that the graph-based approach in \emph{Graphtyper} was superior to both \emph{SAMtools} and \emph{GATK} with regard to the concordance, sensitivity, specificity, and accuracy of the genotypes. It can be concluded that graph-based genotyping of cattle genomes with \emph{Graphtyper} is accurate, sensitive, and computationally feasible.
In the third chapter, breed-specific and multi-breed graph-based references are constructed and compared for four European cattle breeds (Original Braunvieh, Brown Swiss, Fleckvieh, and Holstein). The \emph{vg toolkit} was used to augment the linear reference sequence with variants that were selected according to their allele frequency. Using both real and simulated sequencing data, it was shown that prioritizing variants is decisive for informative graph-based reference genomes. Many rare variants, for instance, impaired the alignment of sequencing reads to the reference. In addition, this chapter shows that multi-breed and breed-specific reference graphs yield an almost identical improvement in read alignment over the linear reference sequence. Finally, the first genome-wide reference graph for the Brown Swiss breed was constructed with about 14 million sequence variants. This chapter shows that reference graphs improve the mapping of sequencing reads and thereby enable unbiased genotyping of sequence variants.
\thispagestyle{plain}
In the fourth chapter, six cattle genomes are merged into a multi-reference graph with the \emph{minigraph} program. This pangenome contains 70 megabases that are not present in the current \emph{Bos taurus} reference genome (ARS-UCD1.2). By applying complementary bioinformatics approaches, this chapter provides convincing evidence that these sequences, which are absent from the reference, contain functional and biologically relevant elements. They also contain thousands of previously unknown sequence variants that differ between cattle breeds. This chapter showed that multi-reference graphs can make DNA variation that has so far been disregarded accessible to genetic investigations.
This dissertation presents a new paradigm for the analysis of genomic data with non-linear reference structures. The analyses presented in this work are a first step towards moving from linear to graph-based reference genomes. This dissertation established fundamental and broadly applicable structures that make it possible to integrate multiple reference sequences and their variable sites into a non-linear data structure.
\newpage
\phantomsection
\addcontentsline{toc}{chapter}{Thesis Outline}
\section*{\LARGE{Thesis Outline}}
\thispagestyle{plain}
The thesis is structured as follows:
Chapter 1 provides a literature review to introduce the concepts of a reference genome, pangenome, graph-based pangenome, and applications of the pangenome. \\
Chapter 2 reports on genome graph-based variant discovery and genotyping in a livestock population. This chapter is published in \emph{Genetics Selection Evolution}. \\
% a first feasibility assessment of a genome graph-based variant discovery method in a livestock population. This chapter is published in \emph{Genetics Selection Evolution}. \\
Chapter 3 reports on the construction of the first whole-genome graphs in cattle and their application to read mapping and variant discovery. This chapter is published in \emph{Genome Biology}. \\
Chapter 4 reports on the construction of a bovine multi-assembly graph from six reference-quality assemblies and its application to investigate sequences not included in the current \emph{Bos taurus} reference genome. This chapter is published in \emph{Proceedings of the National Academy of Sciences of the United States of America (PNAS)}. \\
Chapter 5 provides a general discussion and an outlook for future research.
\iftwoside
\cleardoublepage
\newpage
\fi
\newpage
\pagestyle{main}
\pagenumbering{arabic}
\onehalfspacing
% general introduction
\chapter[General Introduction]{\LARGE{General Introduction}}
\label{chap:intro}
\bigskip
\include{chapters/intro}
\iftwoside
\cleardoublepage
\newpage
\fi
% paper 1 as chapter 2
\chapter[Genotyping From Variation-Aware Graphs]{\LARGE{Accurate sequence variant genotyping in cattle using variation-aware genome graphs}}
\label{chap:locgraph}
\subsection*{}
\onehalfspacing
\normalsize
{
\vspace{2em}
\setlength\parindent{0pt}
\large
\textbf{Danang Crysnanto}$^{1}$, Christine Wurmser$^{2}$, Hubert Pausch$^{1}$ \\
\vspace{0.5em}
$^1$ Animal Genomics, ETH Zurich, Zurich, Switzerland. \\
$^2$ Chair of Animal Breeding, TU München, Freising, Germany. \\
\bigskip
Published in \emph{Genetics Selection Evolution (2019) 51:21}
\bigskip
\begin{center}\fbox{\begin{minipage}{35em}
\emph{Contribution}: I participated in conceiving the study, analysing the results and writing the manuscript. I wrote the graph genotyping pipelines.
\end{minipage}}\end{center}
}
\include{chapters/chapter2}
\iftwoside
\cleardoublepage
\newpage
\fi
% paper 2 as chapter 3
\chapter[Unbiased Variant Analysis Using Genome Graphs]{\LARGE{Bovine breed-specific augmented reference graphs facilitate accurate sequence read mapping and unbiased variant discovery}}
\label{chap:wholegraph}
\subsection*{}
\normalsize
{
\vspace{2em}
\setlength\parindent{0pt}
\large
\textbf{Danang Crysnanto}$^{1}$, Hubert Pausch$^{1}$ \\
\vspace{0.5em}
$^1$ Animal Genomics, ETH Zurich, Zurich, Switzerland. \\
\bigskip
Published in \emph{Genome Biology (2020) 21:184}
\bigskip
\begin{center}\fbox{\begin{minipage}{35em}
\emph{Contribution}: I participated in conceiving the study, analysing the results and writing the manuscript. I wrote the whole-genome graph pipelines.
\end{minipage}}\end{center}
}
\include{chapters/chapter3}
\iftwoside
\cleardoublepage
\newpage
\fi
\chapter[A Pangenome Established From Six Assemblies]{\LARGE{Novel functional sequences uncovered through a bovine multi-assembly graph}}
\label{chap:multigraph}
\subsection*{}
\normalsize
{
\vspace{2em}
\setlength\parindent{0pt}
\large
\textbf{Danang Crysnanto}$^{1}$, Alexander S. Leonard$^{1}$, Zih-Hua Fang$^{1}$, Hubert Pausch$^{1}$ \\
\vspace{0.5em}
$^1$ Animal Genomics, ETH Zurich, Zurich, Switzerland. \\
\bigskip
Published in \emph{PNAS (2021) 118:20}
\bigskip
\begin{center}\fbox{\begin{minipage}{35em}
\emph{Contribution}: I participated in conceiving the study, analysing the results and writing the manuscript. I wrote the multi-assembly graph pipelines.
\end{minipage}}\end{center}
}
\onehalfspacing
\include{chapters/chapter4}
\iftwoside
\cleardoublepage
\newpage
\fi
\chapter[General Discussion]{\LARGE{General Discussion}}
\label{chap:discuss}
\include{chapters/discuss}
\newpage
\iftwoside
\cleardoublepage
\fi
% Appendixes
\chapter*{\centering{Supplementary Material \\ Chapter \ref{chap:locgraph}}}
\addcontentsline{toc}{chapter}{Supplementary Materials Chapter \ref{chap:locgraph}}
\singlespacing
\fancyhead[C]{APPENDICES}
\include{chapters/supp_chap2}
\chapter*{\centering{Supplementary Material \\ Chapter \ref{chap:wholegraph}}}
\addcontentsline{toc}{chapter}{Supplementary Materials Chapter 3}
\singlespacing
\fancyhead[C]{APPENDICES}
\include{chapters/supp_chap3}
\chapter*{\centering{Supplementary Material \\ Chapter \ref{chap:multigraph}}}
\addcontentsline{toc}{chapter}{Supplementary Materials Chapter 4}
\singlespacing
\fancyhead[C]{APPENDICES}
\include{chapters/supp_chap4}
\thispagestyle{plain}
\section*{\LARGE{Acknowledgements}}
\addcontentsline{toc}{chapter}{Acknowledgements}
\bigskip
\normalsize
\onehalfspacing
First, I would like to thank Prof. Hubert Pausch for having me as a doctoral student,
supervising me over the years, and providing a great environment for research.
I learned a lot about genomics, programming, problem solving, critical thinking, and scientific writing from your guidance.
Also, thank you for giving me the freedom to explore research ideas and the trust to organize my time.
I really appreciate all the opportunities that I was given: being included in other projects in the lab, receiving funding to attend courses, and being sent to many international conferences.
All of these have become extremely valuable experiences. \\
Thanks to Prof. Bernt Guldbrandtsen and Prof. David MacHugh, who agreed to review this thesis.\\
I would like to thank the current and former members of the Animal Genomics Group for being very supportive in my day-to-day life as a doctoral student.
A special mention goes to Dr. Alexander S. Leonard, who was extremely helpful in the last project and dedicated time to proofreading this thesis.
Thanks also to Maya and Meenu, who have been helpful peers since I started my doctoral studies.
I would also like to thank the staff at Agrovet Strickhof, who provided a great research facility.
Thank you to Dorota Niedzwiecka for organizing all administrative tasks to ensure my smooth stay in Zurich. \\
Lastly, I would like to thank my family, especially my wife, who accompanied me while I studied abroad.
\newpage
% cv
\newif\ifincludecv
\includecvtrue %comment out to remove cv
\ifincludecv
\newpage
\includepdf[noautoscale,pages=-]{frontback/cv.pdf}
\fi
% \includepdf[noautoscale,pages=-]{frontback/backpage.pdf}
\end{document}