
\newif\iftwoside
% uncomment the next line to print double-sided (twoside)
% \twosidetrue
\iftwoside
\documentclass[11pt, a4paper, notitlepage, twoside]{report}
\else
\documentclass[11pt, a4paper, notitlepage]{report}
\fi 


% page geometry and input encoding
\usepackage[a4paper,width=150mm,top=40mm,bottom=25mm]{geometry}
\usepackage[utf8x]{inputenc}
\DeclareUnicodeCharacter{8239}{}

%use arial 
% \usepackage[scaled]{helvet}
% \renewcommand\familydefault{\sfdefault} 

% \usepackage{lmodern}
\usepackage[sc]{mathpazo}
\usepackage[T1]{fontenc}
% \usepackage{underscore}

\usepackage{setspace}
\usepackage{graphicx}

\usepackage{fancyhdr}
% \pagestyle{fancy}
% \fancyhf{}
% \fancyhead[C]{\leftmark}
% \fancyfoot{}
% \rfoot{\thepage}

\fancypagestyle{main}{
    \fancyhf{}
    \renewcommand{\headrulewidth}{0.4pt}
    \fancyhead[C]{\leftmark}
    \fancyfoot[R]{\thepage}
}

\fancypagestyle{noheadt}{
    \fancyhf{}
    \renewcommand{\headrulewidth}{0pt}
    \cfoot{\thepage}
}

\fancypagestyle{noheadi}{
    \fancyhf{}
    \renewcommand{\headrulewidth}{0pt}
    \rfoot{\thepage}
}

% different left and right numbering 

\iftwoside
\fancyfoot[RO,LE]{\thepage}
\else
\rfoot{\thepage}
\fi 

\renewcommand{\footrulewidth}{0pt}
\usepackage{multirow}
\usepackage{colortbl}
\usepackage{lscape}
\usepackage{subfiles}
\graphicspath{{figure/}{../figure/}}

%use single spacing after period
\frenchspacing

% \usepackage[round]{natbib}
% numeric-style citations (comment out and enable the line above for author-year style)
\usepackage[numbers,square,comma,sort]{natbib}
%\bibliographystyle{abbrvnat}
\usepackage{chapterbib}
% \setlength{\bibsep}{0pt}
%reference as a section instead of a chapter
\renewcommand{\bibsection}{\section*{\bibname}}


\usepackage{hyperref}
\usepackage{xcolor}
\usepackage{pdflscape}
\usepackage{enumerate}
\usepackage[singlelinecheck=false]{caption}
\usepackage{longtable}
%for cv
\usepackage{pdfpages}
% break long url
\usepackage{xurl}
\urlstyle{same}

% no space between list items
\usepackage[shortlabels]{enumitem}
\setlist[itemize]{noitemsep, topsep=0pt}


% add space between paragraphs
% \setlength{\parskip}{\baselineskip}

% \usepackage{tocloft}
% \setlength[titles]{\cftfignumwidth}{2.55em}

\usepackage[titles]{tocloft}
\setlength{\cftfignumwidth}{3em}


% reduce the vertical spacing around subsection headings
\usepackage{titlesec}
\titlespacing\subsection{0pt}{0pt}{-1em}

\hypersetup{
    colorlinks,
    linkcolor={red!70!black},
    citecolor={blue!70!black},
    urlcolor={magenta!70!black}
}

\renewcommand{\bibname}{References}
\renewcommand{\contentsname}{Table of Contents}


\begin{document}

\newcommand*{\BuildingFromMainFile}{}

\includepdf[noautoscale,pages=-]{frontback/coverpage.pdf}
\subfile{frontback/titlepage.tex}

\iftwoside
\newpage
\thispagestyle{empty}
\ 
\newpage
\fi

\pagenumbering{roman}

\newpage

\setcounter{tocdepth}{1}
\hypersetup{linkcolor=black}

\phantomsection
\addcontentsline{toc}{chapter}{Table of Contents}
{
    \pagestyle{plain}
    \tableofcontents
}


\newpage

\phantomsection
\addcontentsline{toc}{chapter}{List of Figures}
\listoffigures 
\newpage

\phantomsection
\addcontentsline{toc}{chapter}{List of Tables}
\listoftables 
\newpage

\phantomsection
\addcontentsline{toc}{chapter}{Summary}
\section*{\centering{\LARGE{Summary}}}
\thispagestyle{plain}

\hypersetup{linkcolor={red!70!black}}
%\linespread{1.6}
\setlength{\parskip}{\baselineskip}
\doublespacing
The assembly of the draft \emph{Bos taurus} reference genome was a milestone for genetics- and genomics-oriented research in cattle. The reference genome of domestic cattle was built from a single animal of the Hereford breed. However, the linear reference sequence does not represent the genetic diversity of global cattle breeds. This lack of diversity causes problems, particularly when DNA sequences from genetically distant animals are aligned and compared to the reference sequence. This issue is widely known as reference bias. Pangenomes are intriguing novel reference structures that consider the full spectrum of genetic diversity within a species. A rich, graph-based pangenome reference can integrate multiple genome assemblies and their sites of variation in a coherent and non-redundant data structure. This thesis investigated, for the first time, the utility of graph-based references for genomic analysis in a livestock population.

Chapter 2 assessed the feasibility of graph-based genomic analysis in cattle. Specifically, using whole-genome sequencing data of 49 Original Braunvieh cattle, a graph-based sequence variant genotyping approach was implemented with the \emph{Graphtyper} software and compared to two widely used methods (\emph{SAMtools} and \emph{GATK}) that rely on a strictly linear representation of the reference. A comparison between sequence variant and array-derived genotypes indicated that the graph-based approach outperformed both \emph{SAMtools} and \emph{GATK} with regard to genotype concordance, non-reference sensitivity, non-reference discrepancy, and Mendelian consistency of genotypes observed in parent-offspring pairs. These findings demonstrated that graph-based genotyping using \emph{Graphtyper} is accurate, sensitive, and computationally feasible in the cattle genome.

Chapter 3 reports on the construction of breed-specific and multi-breed genome graphs for four European cattle breeds (Original Braunvieh, Brown Swiss, Fleckvieh, and Holstein). The \emph{vg toolkit} was used to augment the linear Hereford-based reference sequence with variants that were prioritized based on allele frequency in different breeds. Based on both real and simulated short-read sequencing data, this study showed that variant prioritization is crucial to build informative genome graphs. Intriguingly, adding many low-frequency and rare variants to the genome graphs compromised mapping accuracy. Moreover, this chapter demonstrated that multi-breed graphs and breed-specific graphs enable almost identical mapping improvements over a linear reference genome. Finally, the first whole-genome graph was constructed for the Brown Swiss cattle breed using 14 million variants. The application of this whole-genome graph facilitated accurate short-read mapping and unbiased sequence variant discovery.

\thispagestyle{plain}

Chapter 4 reports on integrating six reference-quality bovine genome assemblies into a unified multi-assembly graph using the \emph{minigraph} software. The pangenome contains 70 megabases that are not present in the current ARS-UCD1.2 \emph{Bos taurus} reference genome. Using complementary bioinformatics approaches, this chapter provides compelling evidence that these non-reference sequences contain functionally active and biologically relevant elements. Specifically, the analysis of transcriptome data revealed putatively novel genes, including some that are differentially expressed between individual animals. Moreover, variant discovery in the non-reference sequences revealed thousands of previously undetected polymorphic sites that capture genetic differentiation across cattle breeds. This chapter demonstrated that multi-assembly graphs make previously neglected genetic variation amenable to genetic investigation.

Overall, this thesis presents a novel analysis paradigm in livestock genomics by leveraging variation-aware reference structures. The analyses presented in this thesis provide a first step towards the transition from linear to graph-based reference structures in order to mitigate inherent biases of the linear reference genome. Importantly, this thesis establishes a computational framework to integrate multiple genome assemblies and their sites of variation into a more diverse reference structure that is broadly applicable across species.

\newpage

\thispagestyle{plain}
\phantomsection
\addcontentsline{toc}{chapter}{Zusammenfassung}
\section*{\centering{\LARGE{Zusammenfassung}}}

Das Assembly der \emph{Bos taurus}-Referenzsequenz war ein Meilenstein für genetische und genomische Forschungsfragen beim Rind. Die Referenzsequenz wurde von einem einzigen Tier der Rasse Hereford erzeugt. Allerdings kann die genetische Diversität der globalen Rinderpopulation nicht in einem einzigen linearen Referenzgenom repräsentiert werden. Das ist besonders dann problematisch, wenn Sequenzen von genetisch weit entfernten Tieren mit dem Referenzgenom verglichen werden. Pangenome sind interessante neuartige Referenzstrukturen, die das gesamte Spektrum der genetischen Diversität einer Spezies abbilden. Solche graph-basierten Referenzstrukturen können mehrere Assemblies sowie deren variable Positionen integrieren. Im Rahmen dieser Dissertation werden erstmals graph-basierte Referenzstrukturen für genetische Analysen in einer Nutztierpopulation verwendet.

Im zweiten Kapitel werden erstmals graph-basierte genomische Analysen beim Rind durchgeführt. Die Genomsequenzen von 49 Original Braunvieh Rindern werden mit einem graph-basierten Ansatz nach polymorphen Positionen durchsucht. Mit der \emph{Graphtyper}-Software werden diese Positionen genotypisiert. Die so erhaltenen Genotypen werden mit Genotypen verglichen, die mit zwei weit verbreiteten Methoden (\emph{SAMtools} und \emph{GATK}) bestimmt wurden, welche strikt auf eine lineare Referenzsequenz angewiesen sind. Im Vergleich mit SNP-Chip-basierten Genotypen zeigt sich, dass der graph-basierte Ansatz in \emph{Graphtyper} sowohl \emph{SAMtools} wie auch \emph{GATK} im Hinblick auf die Übereinstimmung, die Sensitivität, die Spezifität und die Genauigkeit der Genotypen überlegen war. Daraus lässt sich schlussfolgern, dass die graph-basierte Genotypisierung von Rindergenomen mit \emph{Graphtyper} genau, sensitiv und rechnerisch machbar ist.

Im dritten Kapitel werden rassespezifische und rassenübergreifende graph-basierte Referenzen für vier europäische Rinderrassen (Original Braunvieh, Brown Swiss, Fleckvieh und Holstein) aufgestellt und verglichen. Das \emph{vg toolkit} wurde verwendet, um die lineare Referenzsequenz mit Varianten zu erweitern, die hinsichtlich ihrer Allelfrequenz ausgewählt wurden. Sowohl mit realen wie auch simulierten Sequenzdaten konnte gezeigt werden, dass eine Priorisierung der Varianten für informative graph-basierte Referenzgenome ausschlaggebend ist. So beeinträchtigten viele seltene Varianten den Abgleich der ausgelesenen DNA-Abschnitte mit der Referenz. Zusätzlich zeigt dieses Kapitel, dass rassenübergreifende und rassespezifische Referenzgraphen hinsichtlich des Abgleichs der DNA-Abschnitte eine fast identische Verbesserung gegenüber der linearen Referenzsequenz aufweisen. Schlussendlich konnte der erste genomweite Referenzgraph für die Rasse Brown Swiss mit rund 14 Millionen Sequenzvarianten konstruiert werden. Dieses Kapitel zeigt, dass Referenzgraphen das Zuordnen von ausgelesenen DNA-Abschnitten verbessern und somit eine unverzerrte Genotypisierung von Sequenzvarianten ermöglichen.

\thispagestyle{plain}

Im vierten Kapitel werden sechs Rindergenome mit dem Programm \emph{minigraph} zu einem Multi-Referenz-Graphen vereinigt. Dieses Pangenom beinhaltet 70 Megabasen, welche im aktuellen \emph{Bos taurus} Referenzgenom (ARS-UCD1.2) nicht vorhanden sind. Durch die Anwendung von komplementären bioinformatischen Ansätzen liefert dieses Kapitel überzeugende Hinweise, dass diese in der Referenz nicht vorhandenen Sequenzen funktionelle und biologisch relevante Elemente enthalten. Ausserdem enthalten sie tausende bislang unbekannte Sequenzvarianten, die sich zwischen Rinderrassen unterscheiden. Dieses Kapitel zeigte, dass Multi-Referenz-Graphen bis anhin nicht berücksichtigte DNA-Variation für genetische Untersuchungen zugänglich machen können.

Diese Dissertation präsentiert ein neues Paradigma zur Analyse genomischer Daten mit nicht-linearen Referenzstrukturen. Die verschiedenen Analysen, welche in dieser Arbeit präsentiert werden, sind ein erster Schritt, um von linearen zu graph-basierten Referenzgenomen zu wechseln. In dieser Dissertation wurden grundlegende und breit anwendbare Strukturen geschaffen, die es erlauben, mehrere Referenzsequenzen und deren variable Positionen in eine nicht-lineare Datenstruktur zu integrieren.

\newpage

\phantomsection
\addcontentsline{toc}{chapter}{Thesis Outline}
\section*{\LARGE{Thesis Outline}}
\thispagestyle{plain}

The thesis is structured as follows:

Chapter 1 provides a literature review to introduce the concepts of a reference genome, pangenome, graph-based pangenome, and applications of the pangenome. \\

Chapter 2 reports on genome graph-based variant discovery and genotyping in a livestock population. This chapter is published in \emph{Genetics Selection Evolution}. \\
% a first feasibility assessment of a genome graph-based variant discovery method in a livestock population. This chapter is published in \emph{Genetics Selection Evolution}. \\ 

Chapter 3 reports on the construction of the first whole-genome graphs in cattle and their application to read mapping and variant discovery. This chapter is published in \emph{Genome Biology}. \\

Chapter 4 reports on the construction of a bovine multi-assembly graph from six reference-quality assemblies and its application to investigate sequences not included in the current \emph{Bos taurus} reference genome. This chapter is published in \emph{Proceedings of the National Academy of Sciences of the United States of America (PNAS)}. \\


Chapter 5 provides a general discussion and an outlook for future research.


\iftwoside
\cleardoublepage
\newpage
\fi

\newpage


\pagestyle{main}

\pagenumbering{arabic}


\onehalfspacing


% general introduction


\chapter[General Introduction]{\LARGE{General Introduction}}
\label{chap:intro}

\bigskip

\include{chapters/intro}

\iftwoside
\cleardoublepage
\newpage
\fi

% paper 1 as chapter 2

\chapter[Genotyping From Variation-Aware Graphs]{\LARGE{Accurate sequence variant genotyping in cattle using variation-aware genome graphs}}
\label{chap:locgraph}

\subsection*{}
\onehalfspacing
\normalsize

{
\vspace{2em}
\setlength\parindent{0pt}
\large

\textbf{Danang Crysnanto}$^{1}$, Christine Wurmser$^{2}$, Hubert Pausch$^{1}$ \\

\vspace{0.5em}

$^1$ Animal Genomics, ETH Zurich, Zurich, Switzerland. \\
$^2$ Chair of Animal Breeding, TU München, Freising, Germany. \\

\bigskip
Published in \emph{Genetics Selection Evolution (2019) 51:21}

\bigskip

\begin{center}\fbox{\begin{minipage}{35em}

\emph{Contribution}: I participated in conceiving the study, analysing the results and writing the manuscript. I wrote the graph genotyping pipelines. 
        
\end{minipage}}\end{center}

}

\include{chapters/chapter2}


\iftwoside
\cleardoublepage
\newpage
\fi


% paper 2 as chapter 3

\chapter[Unbiased Variant Analysis Using Genome Graphs]{\LARGE{Bovine breed-specific augmented reference graphs facilitate accurate sequence read mapping and unbiased variant discovery}}
\label{chap:wholegraph}

\subsection*{}
\normalsize

{
\vspace{2em}
\setlength\parindent{0pt}
\large

\textbf{Danang Crysnanto}$^{1}$, Hubert Pausch$^{1}$ \\

\vspace{0.5em}

$^1$ Animal Genomics, ETH Zurich, Zurich, Switzerland. \\

\bigskip
Published in \emph{Genome Biology (2020) 21:184}

\bigskip

\begin{center}\fbox{\begin{minipage}{35em}

\emph{Contribution}: I participated in conceiving the study, analysing the results and writing the manuscript. I wrote the whole-genome graph pipelines. 
        
\end{minipage}}\end{center}

}

\include{chapters/chapter3}

\iftwoside
\cleardoublepage
\newpage
\fi

\chapter[A Pangenome Established From Six Assemblies]{\LARGE{Novel functional sequences uncovered through a bovine multi-assembly graph}}
\label{chap:multigraph}

\subsection*{}
\normalsize

{
\vspace{2em}
\setlength\parindent{0pt}
\large

\textbf{Danang Crysnanto}$^{1}$, Alexander S. Leonard$^{1}$, Zih-Hua Fang$^{1}$, Hubert Pausch$^{1}$ \\

\vspace{0.5em}

$^1$ Animal Genomics, ETH Zurich, Zurich, Switzerland. \\

\bigskip
Published in \emph{PNAS (2021) 118:20}

\bigskip

\begin{center}\fbox{\begin{minipage}{35em}

\emph{Contribution}: I participated in conceiving the study, analysing the results and writing the manuscript. I wrote the multi-assembly graph pipelines. 
    
\end{minipage}}\end{center}

}

\onehalfspacing
\include{chapters/chapter4}

\iftwoside
\cleardoublepage
\newpage
\fi


\chapter[General Discussion]{\LARGE{General Discussion}}
\label{chap:discuss}

\include{chapters/discuss}


\newpage

\iftwoside
\cleardoublepage
\fi

% Appendices

\chapter*{\centering{Supplementary Material \\ Chapter \ref{chap:locgraph}}}
\addcontentsline{toc}{chapter}{Supplementary Material Chapter \ref{chap:locgraph}}
\singlespacing
\fancyhead[C]{APPENDICES}
\include{chapters/supp_chap2}

\chapter*{\centering{Supplementary Material \\ Chapter \ref{chap:wholegraph}}}
\addcontentsline{toc}{chapter}{Supplementary Material Chapter \ref{chap:wholegraph}}
\singlespacing
\fancyhead[C]{APPENDICES}
\include{chapters/supp_chap3}

\chapter*{\centering{Supplementary Material \\ Chapter \ref{chap:multigraph}}}
\addcontentsline{toc}{chapter}{Supplementary Material Chapter \ref{chap:multigraph}}
\singlespacing
\fancyhead[C]{APPENDICES}
\include{chapters/supp_chap4}

\thispagestyle{plain}

\section*{\LARGE{Acknowledgements}}
\addcontentsline{toc}{chapter}{Acknowledgements}
\bigskip

\normalsize
\onehalfspacing
First, I would like to thank Prof. Hubert Pausch for having me as a doctoral student, 
supervising me over the years and providing a great environment for the research. 
I learned a lot about genomics, programming, problem solving, critical thinking, and scientific writing from your guidance.
Also, thank you for giving me the freedom to explore research ideas and the trust to organize my time. 
I really appreciate all the opportunities that I was given: being included in other projects in the lab, receiving funding to attend courses, and being sent to many international conferences. 
All of these have become extremely valuable experiences. \\

Thanks to Prof. Bernt Guldbrandtsen and Prof. David MacHugh for agreeing to review this thesis.\\

I would like to thank the current and former members of the Animal Genomics Group for being very supportive in my day-to-day life as a doctoral student. 
A special mention goes to Dr. Alexander S. Leonard, who has been extremely helpful in the last project and dedicated time to proofreading this thesis. 
Thanks also to Maya and Meenu, who have been helpful peers since the start of my doctoral studies. 
I would also like to thank the staff at Agrovet Strickhof, who have provided a great research facility. 
Thank you to Dorota Niedzwiecka for organizing all administrative tasks to ensure my smooth stay in Zurich. \\

Lastly, I would like to thank my family, especially my wife, who has accompanied me while I studied abroad.

\newpage


% cv 

\newif\ifincludecv
\includecvtrue %comment out to remove cv
\ifincludecv
    \newpage
    \includepdf[noautoscale,pages=-]{frontback/cv.pdf}
\fi

% \includepdf[noautoscale,pages=-]{frontback/backpage.pdf}

\end{document}