diff --git a/content/_index.md b/content/_index.md
index b9713fe..25b2c7d 100644
--- a/content/_index.md
+++ b/content/_index.md
@@ -23,7 +23,6 @@ Yet, generational barriers outside of an individual's influence hinder a person'
Breadcrumbs represents [OASCI's](https://www.oasci.org/) initiative to fundamentally transform access to educational materials, embodying the spirit of discovery and the sharing of knowledge fragments that have traditionally been accessible only through the goodwill of others.
Breadcrumb websites are freely accessible resources, and we encourage contributions, improvements, and adaptations so long as they abide by the terms outlined in the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.
-
## License
Code contained in this project is released under the [MIT License](https://spdx.org/licenses/MIT.html) as specified in [`LICENSE_CODE`](https://github.com/oasci-bc/omics/blob/main/LICENSE_CODE.md).
diff --git a/content/alignment/_index.md b/content/alignment/_index.md
new file mode 100644
index 0000000..1d0af69
--- /dev/null
+++ b/content/alignment/_index.md
@@ -0,0 +1,177 @@
+---
+title: Sequence alignment
+type: docs
+---
+
+
+In bioinformatics, a sequence alignment is a way of arranging two or more sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between them.
+In many cases, the input set of query sequences are assumed to have an evolutionary relationship by which they share a linkage and are descended from a common ancestor.
+Aligned nucleotide or amino acid residue sequences are typically represented as rows within a matrix.
+Gaps are inserted between the residues to align identical or similar characters in columns.
+A pairwise alignment aligns two sequences, while a multiple sequence alignment (MSA) aligns three or more.
+
+
+
+> ![](https://upload.wikimedia.org/wikipedia/commons/b/b5/Histone_Alignment.png)
+>
+>
+> A multiple sequence alignment of mammalian histone proteins.
+> Sequences are the amino acids for residues 120-180 of the proteins.
+> Residues that are **conserved** across all sequences are highlighted in grey.
+> Below the protein sequences is a key denoting conserved sequence (*), conservative mutations (:), semi-conservative mutations (.), and non-conservative mutations ( ).
+> Conservative mutations produce chemically similar amino acids, while non-conservative mutations result in chemically distant residues.
+>
+> Credit: [Wikipedia](https://upload.wikimedia.org/wikipedia/commons/b/b5/Histone_Alignment.png).
+
+## Homology
+
+In biology, homology is similarity due to shared ancestry between structures or genes in different taxa.
+A typical anatomical example of homologous structures is the forelimbs of vertebrates, where the wings of bats and birds, the arms of primates, the front flippers of whales, and the forelegs of four-legged vertebrates like dogs and crocodiles are all derived from the same ancestral tetrapod structure.
+Evolutionary biology explains homologous structures as the result of descent with modification from a common ancestor, with each lineage adapting the structure to different purposes.
+
+Sequence homology between protein or DNA sequences is similarly defined in terms of shared ancestry.
+Two segments of DNA can have shared ancestry because of either a speciation event or a gene duplication event.
+Homology among proteins or DNA is inferred from their sequence similarity.
+Significant similarity is strong evidence that two sequences are related by divergent evolution from a common ancestor.
+Alignments of multiple sequences are used to discover homologous regions.
+
+### Interpretation
+
+Suppose two sequences in an alignment share a common ancestor.
+In that case, mismatches can be interpreted as point mutations and gaps as indels (insertion or deletion mutations) introduced in one or both lineages in the time since they diverged from one another.
+In protein sequence alignments, the degree of similarity between amino acids occupying a particular position in the sequence can be interpreted as a rough measure of how conserved a particular region is among lineages.
+The absence of substitutions, or the presence of only very conservative mutations (that is, the substitution of amino acids whose side chains have similar biochemical properties) in a particular region of the sequence, suggests that this region has structural or functional importance.
+
+### Conserved sequences
+
+In evolutionary biology, conserved sequences are identical or similar sequences of nucleic acids (DNA and RNA) or proteins found across species, within a genome, or between donor and receptor taxa.
+Conservation indicates that a sequence has been maintained by natural selection, likely because mutations in the sequence dramatically harm the organism's health and fitness.
+
+A highly conserved sequence has remained relatively unchanged far back up the phylogenetic tree and far back in geological time.
+Examples of highly conserved sequences include the RNA components of ribosomes, which are present in all domains of life, the homeobox sequences widespread among eukaryotes, and the tmRNA in bacteria.
+The study of sequence conservation overlaps with the fields of genomics, proteomics, evolutionary biology, phylogenetics, bioinformatics, and mathematics.
+
+Sequence similarity alone, however, does not mean that a sequence has been conserved or that two sequences are homologs.
+Convergent evolution can result in two unrelated sequences becoming similar.
+In some cases, two sequences can code for proteins or parts of proteins with similar functions but with no shared evolutionary history.
+
+> ![](https://upload.wikimedia.org/wikipedia/commons/9/94/Caspase-motif-alignment.png)
+>
+> MSA of the seven Drosophila caspases colored by motifs.
+> Note the long stretches of gaps (-) required to align the sequences.
+
+### Global and local alignments
+
+Global alignments, which attempt to align every residue in every sequence, are most valuable when the sequences in a set are similar and of roughly equal size.
+(This does not mean global alignments cannot start and/or end in gaps.)
+Local alignments are more useful for dissimilar sequences suspected to contain relatively small regions of similarity or similar sequence motifs within their larger sequence context.
+
+## Multiple sequence alignment
+
+Multiple sequence alignment extends pairwise alignment to incorporate more than two sequences simultaneously.
+Multiple alignment methods try to align all sequences in a given query set.
+Multiple alignments are often used in identifying conserved sequence regions across a group of sequences hypothesized to be evolutionarily related.
+Such conserved sequence motifs can be used in conjunction with structural and mechanistic information to locate enzymes' catalytic active sites.
+Alignments are also used to establish evolutionary relationships by constructing phylogenetic trees.
+Multiple sequence alignment is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even functionally significant individual amino acids or nucleotides.
+
+The term multiple sequence alignment also refers to the process of aligning such a set of sequences.
+MSAs require more sophisticated methodologies than pairwise alignment because they are far more computationally complex.
+Producing them was among the original challenges tackled by computational biologists.
+The utility of these alignments in bioinformatics has led to the development of various methods suitable for aligning three or more sequences.
+
+### Consensus sequences
+
+MSAs allow the determination of a consensus sequence.
+In molecular biology and bioinformatics, the consensus sequence (or canonical sequence) is the calculated sequence of most frequent residues, either nucleotide or amino acid, found at each position in a sequence alignment.
+Consensus sequences are often printed at the bottom of MSAs and can be elaborated on using sequence logos (see below).
+Developing software for pattern recognition is a significant topic in genetics, molecular biology, and bioinformatics.
+Specific sequence motifs can function as regulatory sequences controlling biosynthesis or signal sequences directing a molecule to a specific site within the cell or regulating its maturation.
+Since these sequences have an essential regulatory function, they are thought to be conserved across long periods of evolution.
+In some cases, evolutionary relatedness can be estimated by the amount of conservation of these sites.
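+
+As a small illustration, a simple way to derive a consensus is to take the most frequent residue in each column of an MSA. The toy alignment and helper name below are made up for the example:
+
+```python
+from collections import Counter
+
+
+def consensus(aligned_seqs):
+    # All sequences must already be aligned (equal length, gaps included).
+    return ''.join(
+        Counter(column).most_common(1)[0][0]  # most frequent residue per column
+        for column in zip(*aligned_seqs)
+    )
+
+
+msa = ["ATTGC", "ATCGC", "ATTGA"]  # hypothetical aligned sequences
+print(consensus(msa))  # ATTGC
+```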
+
+> ![](https://upload.wikimedia.org/wikipedia/commons/b/b4/Hemagglutinin-alignments.png)
+>
+> MSA of 27 avian influenza hemagglutinin protein sequences colored by residue conservation (top) and residue properties (bottom).
+
+### Phylogenetic use
+
+Multiple sequence alignments can be used to create a phylogenetic tree.
+This is possible for two reasons.
+The first is that functional domains known in annotated sequences can be used for alignment in non-annotated sequences.
+The other is that conserved regions known to be functionally important can be found.
+This makes it possible to use multiple sequence alignments to analyze and find evolutionary relationships through homology between sequences.
+Point mutations and insertion or deletion events (called indels) can be detected.
+
+By locating conserved domains, multiple sequence alignments can also be used to identify functionally important sites, such as binding sites, active sites, or sites corresponding to other key functions.
+When examining multiple sequence alignments, it is useful to consider several aspects of the sequences being compared.
+These aspects include identity, similarity, and homology.
+Identity means that the sequences have identical residues at their respective positions.
+Similarity, on the other hand, refers to the degree to which the residues at corresponding positions resemble one another.
+For example, in nucleotide sequences, purines are considered similar to other purines, and pyrimidines to other pyrimidines.
+Similarity suggests homology: the more similar two sequences are, the more likely they are to be homologous.
+This similarity in sequences can then help establish common ancestry.
+
+### Gallery
+
+Below are examples of MSAs.
+Try to determine what the color coding indicates, how highly conserved regions are within the alignments, and whether a consensus sequence is shown.
+
+> ![](https://upload.wikimedia.org/wikipedia/commons/7/79/RPLP0_90_ClustalW_aln.gif)
+>
+> First 90 positions of a protein multiple sequence alignment of instances of the acidic ribosomal protein P0 (L10E) from several organisms.
+> Generated with ClustalX.
+
+> ![](https://upload.wikimedia.org/wikipedia/commons/a/a7/WPP_domain_alignment.PNG)
+>
+> TODO:
+
+> ![](https://upload.wikimedia.org/wikipedia/commons/thumb/9/98/Muscle_alignment_view.png/1600px-Muscle_alignment_view.png)
+>
+> These are sequences being compared in a MUSCLE multiple sequence alignment (MSA).
+> The sequence names (leftmost column) are from various louse species, while the sequences themselves are in the second column.
+> Note the large insertion present in a single sequence.
+
+## Sequence logos
+
+In bioinformatics, a sequence logo represents the sequence similarity of nucleotides (in a strand of DNA/RNA) or amino acids (in protein sequences).
+A sequence logo is created from a collection of aligned sequences.
+It depicts the consensus sequence and diversity of the sequences.
+Sequence logos are frequently used to depict sequence characteristics such as protein-binding sites in DNA or functional units in proteins.
+
+A sequence logo consists of a stack of letters at each position, though sometimes one letter is much larger than the others.
+The relative sizes of the letters indicate their frequency in the sequences.
+The total height of the letters depicts the information content of the position, with more informative positions giving researchers more confidence in conclusions about a position in the logo.
+
+> ![](https://upload.wikimedia.org/wikipedia/commons/b/bf/KozakConsensus.jpg)
+>
+> A sequence logo showing the most conserved bases around the initiation codon from all human mRNAs (Kozak consensus sequence).
+> Note that the initiation codon is not drawn to scale.
+
+### Sequence motif
+
+The similarity in a sequence alignment or sequence logo may be due to relatively recent common ancestry or similar molecular function.
+Functionally similar sequences are termed sequence motifs.
+A sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and usually assumed to be related to the biological function of the macromolecule.
+
+### Consensus logo
+
+A consensus logo is a simplified variation of a sequence logo.
+A consensus logo is created from a collection of aligned protein or DNA/RNA sequences like a sequence logo.
+It conveys information about the conservation of each position of a sequence motif or sequence alignment.
+However, a consensus logo displays only conservation information and not explicitly the frequency information of each nucleotide or amino acid at each position.
+Instead of a stack made of several characters, denoting the relative frequency of each character, the consensus logo depicts the degree of conservation of each position using the height of the consensus character at that position.
+
+> ![](https://upload.wikimedia.org/wikipedia/commons/8/85/LexA_gram_positive_bacteria_sequence_logo.png)
+>
+> A consensus logo for the LexA-binding motif of several Gram-positive species.
+
+## Acknowledgements
+
+Some of this material was adapted with permission from the following sources:
+
+- [Sequence alignment](https://en.wikipedia.org/wiki/Sequence_alignment)
+- [Multiple sequence alignment](https://en.wikipedia.org/wiki/Multiple_sequence_alignment)
+- [Homology](https://en.wikipedia.org/wiki/Homology_(biology))
+- [Consensus sequence](https://en.wikipedia.org/wiki/Consensus_sequence)
+- [Sequence Logos](https://en.wikipedia.org/wiki/Sequence_logo)
diff --git a/content/alignment/dotplot/_index.md b/content/alignment/dotplot/_index.md
new file mode 100644
index 0000000..287d1d5
--- /dev/null
+++ b/content/alignment/dotplot/_index.md
@@ -0,0 +1,37 @@
+---
+title: Dot plot
+type: docs
+weight: 1
+---
+
+
+A dot plot is a graphical method used in bioinformatics to compare two biological sequences, revealing their similarity and differences at a glance.
+This technique plots one sequence along the x-axis and the other along the y-axis of a two-dimensional matrix.
+Points, or "dots," are marked on this matrix wherever the elements (nucleotides in DNA/RNA or amino acids in proteins) from one sequence match those from the other sequence based on a predefined criterion.
+It serves as a preliminary step in sequence analysis, helping researchers visualize alignments, repetitions, inversions, and regions of conservation without the computational complexity of more advanced alignment algorithms.
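+
+As a minimal sketch, the matrix can be built with the simplest possible criterion, an exact character match; the function and sequences below are illustrative only:
+
+```python
+def dot_plot(seq_x, seq_y):
+    # matrix[i][j] is True when seq_y[i] matches seq_x[j]
+    return [[x == y for x in seq_x] for y in seq_y]
+
+
+matrix = dot_plot("GATTACA", "GATC")
+for base, row in zip("GATC", matrix):
+    print(base, ''.join('*' if dot else '.' for dot in row))
+# G *......
+# A .*..*.*
+# T ..**...
+# C .....*.
+```
+
+In practice, dot plots often use a sliding window and a similarity threshold rather than single-character identity, which suppresses noise from chance matches.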
+
+> ![](https://upload.wikimedia.org/wikipedia/commons/3/33/Zinc-finger-dot-plot.png)
+>
+> This is a DNA dot plot of a human zinc finger transcription factor (GenBank ID NM_002383) showing regional self-similarity.
+> The main diagonal represents the sequence's alignment with itself; lines off the main diagonal represent similar or repetitive patterns within the sequence.
+>
+> Credit: [Wikipedia](https://en.wikipedia.org/wiki/Dot_plot_%28bioinformatics%29)
+
+## Interpretation
+
+When analyzing a dot plot for sequence alignment, the path begins at the upper left corner, which marks the start of both sequences being compared.
+The path ends at the lower right corner, which marks the end of both sequences, so the entire comparison is contained within the plot.
+The path through the dot plot is navigated via three principal directions:
+
+- eastward (right) along the x-axis, which indicates progression in the x-axis sequence without a corresponding match in the y-axis sequence, suggesting a gap or insertion in the sequence represented on the y-axis;
+- southward (down) along the y-axis, which signifies advancement in the y-axis sequence without a matching element in the x-axis sequence, hinting at a gap or insertion in the x-axis sequence;
+- and southeastward (diagonally down and right), which denotes a matching pair of elements, thus advancing both sequences in unison.
+
+Within this framework, any path that follows only these three directions from the plot's upper left to its lower right corner represents a potential alignment between the two sequences, passing through regions of matches (diagonal movements) and mismatches or gaps (lateral or downward movements).
+
+When the sequences being compared are highly similar or closely related, the optimal alignment path is expected to run mostly along the diagonal, reflecting a high proportion of matching positions.
+Clear, uninterrupted diagonal lines correspond to runs of consecutive matches, so for closely related sequences the alignment can be read directly off the plot.
+A continuous diagonal line (or lines) stretching from the upper left to the lower right corner allows the alignment to be "read" intuitively: by following the path's changes of direction, one can trace the sequence of matches and mismatches, including the positions of gaps.
diff --git a/content/alignment/multiple/_index.md b/content/alignment/multiple/_index.md
new file mode 100644
index 0000000..49f232e
--- /dev/null
+++ b/content/alignment/multiple/_index.md
@@ -0,0 +1,34 @@
+---
+title: Multiple sequence alignment
+type: docs
+weight: 4
+---
+
+> [!CAUTION]
+>
+> This page is a work in progress and is subject to change at any moment.
+
+Multiple Sequence Alignments (MSAs) are a cornerstone technique in bioinformatics.
+They enable scientists to align three or more biological sequences—such as proteins, DNA, or RNA—to unearth regions of similarity.
+These similarities can indicate functional, structural, or evolutionary relationships between the sequences, offering insights into the molecular mechanisms of life.
+
+## Significance
+
+MSAs play a pivotal role in molecular biology and genetics by highlighting conserved sequences across species or protein families.
+These conserved elements are often crucial for biological function or structure, implying their importance throughout evolutionary history.
+By comparing sequences, researchers can infer phylogenetic relationships, understand evolutionary processes, and predict the function of new genes.
+Additionally, MSAs assist in identifying sequence motifs—short, conserved subsequences critical for protein function or specific biochemical activities.
+
+## Procedure
+
+The process begins with collecting sequences that potentially share a common ancestry or function.
+Using bioinformatics tools such as ClustalW, MUSCLE, or T-Coffee, these sequences are aligned to identify similar regions.
+The alignment is adjusted to maximize matching characters (amino acids or nucleotides) while minimizing gaps or mismatches.
+This can involve sophisticated computational algorithms, considering evolutionary events like mutations, insertions, and deletions.
+
+## Applications
+
+- **Functional Annotation:** By aligning an unknown sequence with known sequences, researchers can predict the unknown's function based on shared motifs and conserved regions.
+- **Structural Prediction:** MSAs can indicate critical structural elements preserved across species, aiding in the structural prediction of proteins.
+- **Evolutionary Biology:** MSAs reveal the evolutionary relationships between sequences, helping to construct phylogenetic trees that trace lineage divergence and speciation events.
+- **Drug Discovery:** Identifying conserved regions can help in designing drugs that target these sequences, crucial in combating diseases with a genetic basis.
diff --git a/content/alignment/pairwise/_index.md b/content/alignment/pairwise/_index.md
new file mode 100644
index 0000000..449521a
--- /dev/null
+++ b/content/alignment/pairwise/_index.md
@@ -0,0 +1,14 @@
+---
+title: Pairwise sequence alignment
+type: docs
+weight: 3
+---
+
+
+Pairwise sequence alignment for DNA is a computational method to identify regions of similarity between two DNA sequences.
+This process is essential in bioinformatics for various purposes, such as identifying functional, structural, or evolutionary relationships between sequences.
+By aligning sequences, scientists can infer how closely related two DNA sequences are, predict the function of unknown genes, identify conserved sequences among different organisms, and much more.
+
+The alignment process involves arranging the sequences to identify regions of similarity and differences.
+It is executed by introducing gaps (-) in the sequences to maximize the alignment between matching characters (nucleotides: A, T, C, G) while minimizing mismatches and gaps.
+The goal is to achieve the highest possible level of similarity, quantified by a score calculated based on a scoring system. This system assigns scores for matches, mismatches, and gaps.
diff --git a/content/alignment/pairwise/global/_index.md b/content/alignment/pairwise/global/_index.md
new file mode 100644
index 0000000..cf4100a
--- /dev/null
+++ b/content/alignment/pairwise/global/_index.md
@@ -0,0 +1,229 @@
+---
+title: Global
+type: docs
+---
+
+
+Global pairwise sequence alignment is a computational technique used in bioinformatics to align two DNA sequences and identify regions of similarity that may indicate functional, structural, or evolutionary relationships between them.
+This method is essential for various applications, including gene identification, phylogenetic analysis, and the study of evolutionary biology.
+
+## Needleman-Wunsch algorithm
+
+The Needleman-Wunsch algorithm,[^needleman1970general] developed by Saul B. Needleman and Christian D. Wunsch in 1970, is a pioneering method in bioinformatics for aligning protein or nucleotide sequences.
+This algorithm was among the first applications of dynamic programming to compare biological sequences.
+It introduced a systematic way to tackle the complex problem of sequence alignment by breaking it down into smaller, manageable parts.
+
+The algorithm operates on optimizing the global alignment between two sequences.
+It implicitly considers all possible alignments, scoring each based on a predefined scoring system and identifying the alignment(s) with the highest score as the optimal solution.
+This approach ensures that the alignment spans the entire length of both sequences, accommodating matches, mismatches, and gaps (insertions or deletions) to maximize the overall alignment score.
+The process of the Needleman-Wunsch algorithm can be divided into three main steps: initialization, matrix filling, and traceback.
+
+### Initialization
+
+The initialization phase involves creating a scoring matrix, $D$, with one sequence (Seq2: "AATTC") along the top (columns) and the other (Seq1: "ATTAC") down the side (rows).
+
+The scoring matrix is initialized as follows:
+
+- Start with a cell $(0,0)$ having a score of $0$.
+- Each subsequent cell in the first row and column is assigned a gap penalty, which accumulates linearly.
+ For a gap penalty of $-1$, the initialization looks like this:
+
+| **D** | | **A** | **A** | **T** | **T** | **C** |
+|---|---|---|---|---|---|---|
+| | 0 | -1| -2| -3| -4| -5|
+| **A** | -1| | | | | |
+| **T** | -2| | | | | |
+| **T** | -3| | | | | |
+| **A** | -4| | | | | |
+| **C** | -5| | | | | |
+
+### Filling the Matrix
+
+To fill the matrix, we calculate the score for each cell using the following recurrence:
+
+$$
+D(i,j) = \max\begin{cases}
+D(i-1,j-1) + s(a_i,b_j) & \text{(diagonal, match/mismatch)} \\
+D(i-1,j) + d & \text{(up, gap)} \\
+D(i,j-1) + d & \text{(left, gap)}
+\end{cases}
+$$
+
+where $s(a_i,b_j)$ is the score of aligning character $a_i$ from Seq1 with character $b_j$ from Seq2, and $d$ is the gap penalty.
+For example, $s(A,A) = +1$ for a match and $s(A,T) = -1$ for a mismatch.
+Here the gap penalty is $d = -1$.
+
+Let's calculate the score for cell $(1,1)$, assuming $s(A,A) = +1$ and $d = -1$:
+
+- From diagonal ($D(0,0)$): $0 + 1 = 1$
+- From up ($D(0,1)$): $-1 + (-1) = -2$
+- From left ($D(1,0)$): $-1 + (-1) = -2$
+
+The maximum of these scores is $1$, so $D(1,1) = 1$.
+
+| **D** | | **A** | **A** | **T** | **T** | **C** |
+|---|---|----|----|----|----|----|
+| | 0 | -1 | -2 | -3 | -4 | -5 |
+| **A** | -1| 1 | 0 | -1 | -2 | -3 |
+| **T** | -2| 0 | 0 | 1 | 0 | -1 |
+| **T** | -3| -1 | -1 | 1 | 2 | 1 |
+| **A** | -4| -2 | 0 | 0 | 1 | 1 |
+| **C** | -5| -3 | -1 | -1 | 0 | 2 |
+
+### Traceback
+
+Once the matrix is filled, we find the optimal path by starting from the bottom-right corner and moving to the cell that contributed to its score until we reach $(0,0)$.
+The path to each cell indicates whether the characters are aligned (diagonal move), or a gap is introduced (up or left move).
+For instance, if the final cell's score was obtained from a diagonal move, it suggests a match or mismatch based on the alignment score.
+If the score came from moving up or left, it indicates a gap in Seq2 or Seq1, respectively.
+
+Given the sequences Seq1: "ATTAC" and Seq2: "AATTC", let's illustrate a simplified example of filling one more cell and then performing a traceback.
+For cell $(2,1)$, we consider:
+
+- Diagonal ($D(1,0) + (-1) = -2$) for a mismatch,
+- Up ($D(1,1) + (-1) = 0$) for a gap,
+- Left ($D(2,0) + (-1) = -3$) for a gap.
+
+Choosing the highest score, we fill $D(2,1)$ with $0$, introducing a gap.
+
+The traceback process would then start from $D(5,5)$, considering the entire matrix filled, and proceed to $(0,0)$, highlighting the optimal alignment.
+
+**Example 1:**
+
+```text
+- A T T A C
+# * * * # *
+A A T T - C
+```
+
+Score = 2
+
+| **D** | | **A** | **A** | **T** | **T** | **C** |
+|---|---|----|-----|----|----|----|
+| | **0** | **-1** | -2 | -3 | -4 | -5 |
+| **A** | -1 | 1 | **0** | -1 | -2 | -3 |
+| **T** | -2 | 0 | 0 | **1** | 0 | -1 |
+| **T** | -3 | -1 | -1 | 1 | **2** | 1 |
+| **A** | -4 | -2 | 0 | 0 | **1** | 1 |
+| **C** | -5 | -3 | -1 | -1 | 0 | **2** |
+
+**Example 2:**
+
+```text
+A - T T A C
+* # * * # *
+A A T T - C
+```
+
+Score = 2
+
+| **D** | | **A** | **A** | **T** | **T** | **C** |
+|---|---|-----|----|----|----|----|
+| | **0** | -1 | -2 | -3 | -4 | -5 |
+| **A** | -1 | **1** | **0** | -1 | -2 | -3 |
+| **T** | -2 | 0 | 0 | **1** | 0 | -1 |
+| **T** | -3 | -1 | -1 | 1 | **2** | 1 |
+| **A** | -4 | -2 | 0 | 0 | **1** | 1 |
+| **C** | -5 | -3 | -1 | -1 | 0 | **2** |
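+
+The procedure above can be summarized in a short Python sketch that fills the matrix and then traces back; it assumes the same scoring scheme (match $+1$, mismatch $-1$, gap $-1$), and the function name and tie-breaking order are illustrative rather than canonical:
+
+```python
+def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-1):
+    n, m = len(seq1), len(seq2)
+    # Initialization: the first row and column accumulate the gap penalty.
+    D = [[0] * (m + 1) for _ in range(n + 1)]
+    for i in range(1, n + 1):
+        D[i][0] = D[i - 1][0] + gap
+    for j in range(1, m + 1):
+        D[0][j] = D[0][j - 1] + gap
+    # Matrix filling: each cell takes the best of the diagonal, up, and left moves.
+    for i in range(1, n + 1):
+        for j in range(1, m + 1):
+            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
+            D[i][j] = max(D[i - 1][j - 1] + s,  # diagonal: match/mismatch
+                          D[i - 1][j] + gap,    # up: gap in seq2
+                          D[i][j - 1] + gap)    # left: gap in seq1
+    # Traceback: walk from the bottom-right corner back to (0, 0).
+    aligned1, aligned2 = [], []
+    i, j = n, m
+    while i > 0 or j > 0:
+        s = mismatch
+        if i > 0 and j > 0 and seq1[i - 1] == seq2[j - 1]:
+            s = match
+        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + s:
+            aligned1.append(seq1[i - 1])
+            aligned2.append(seq2[j - 1])
+            i, j = i - 1, j - 1
+        elif i > 0 and D[i][j] == D[i - 1][j] + gap:
+            aligned1.append(seq1[i - 1])
+            aligned2.append('-')
+            i -= 1
+        else:
+            aligned1.append('-')
+            aligned2.append(seq2[j - 1])
+            j -= 1
+    return ''.join(reversed(aligned1)), ''.join(reversed(aligned2)), D[n][m]
+
+
+print(needleman_wunsch("ATTAC", "AATTC"))
+# ('-ATTAC', 'AATT-C', 2), one of the two optimal alignments shown above
+```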
+
+### Example
+
+| | - | **G** | **C** | **A** | **T** | **G** | **C** | **G** |
+|---|---| ---|---|---|---|---|---|---|
+| - | | | | | | | | |
+| **G** | | | | | | | | |
+| **A** | | | | | | | | |
+| **T** | | | | | | | | |
+| **T** | | | | | | | | |
+| **A** | | | | | | | | |
+| **C** | | | | | | | | |
+| **A** | | | | | | | | |
+
+Each pairing of characters (a match, a mismatch, or an indel) is assigned a score, and the sum of the scores of all the pairings is the score of the whole alignment candidate.
+Different systems exist for assigning scores; some are outlined on the Scoring page.
+For now, the system used by Needleman and Wunsch will be used:
+
+- Match: +1
+- Mismatch or Indel: −1
+
+Start with a zero in the second row, second column.
+Move through the cells row by row, calculating the score for each cell.
+The score is calculated by comparing the scores of the neighboring cells to the left, top, and top-left (diagonal) of the current cell and adding the appropriate score for a match, mismatch, or indel.
+Calculate the candidate scores for each of the three possibilities:
+
+- The path from the top or left cell represents an indel pairing, so take the scores of the left and the top cell, and add the score for indel to each of them.
+- The diagonal path represents a match/mismatch, so take the score of the top-left diagonal cell and add the score for match if the corresponding bases (letters) in the row and column are matching or the score for mismatch if they do not.
+
+The resulting score for the cell is the highest of the three candidate scores.
+
+Since there are no 'top' or 'top-left' cells for the first row of scores, only the existing cell to the left can be used to calculate the score of each cell.
+Hence −1 is added for each shift to the right, as each shift represents an indel from the previous score.
+This results in the first row of scores being 0, −1, −2, −3, −4, −5, −6, −7.
+The same applies to the first column of scores, as only the existing score above each cell can be used.
+Thus the resulting table is:
+
+| D | - | **G** | **C** | **A** | **T** | **G** | **C** | **G** |
+|------ |---| --- |---|---|---|---|---|---|
+| - | 0 | -1 | -2 | -3 | -4 | -5 | -6 | -7 |
+| **G** | -1 | | | | | | | |
+| **A** | -2 | | | | | | | |
+| **T** | -3 | | | | | | | |
+| **T** | -4 | | | | | | | |
+| **A** | -5 | | | | | | | |
+| **C** | -6 | | | | | | | |
+| **A** | -7 | | | | | | | |
+
+The first case with existing scores in all 3 directions is the intersection of our first letters (in this case G and G).
+The surrounding cells are below:
+
+| D | - | **G** |
+|------ |---| --- |
+| - | 0 | -1 |
+| **G** | -1 | **X** |
+
+This cell has three possible candidate sums:
+
+- The diagonal top-left neighbor has score 0. The pairing of G and G is a match, so add the score for match: 0+1 = 1
+- The top neighbor has score −1 and moving from there represents an indel, so add the score for indel: (−1) + (−1) = (−2)
+- The left neighbor also has score −1, represents an indel and also produces (−2).
+
+The highest candidate is 1 and is entered into the cell:
+
+| D | - | **G** |
+|------ |---| --- |
+| - | 0 | -1 |
+| **G** | -1 | **1** |
+
+The cell which gave the highest candidate score must also be recorded so the alignment can be traced back later; this is usually drawn as an arrow (pointer) from the current cell back to the neighboring cell that produced its score.
+
+In the next example, the diagonal step for both X and Y represents a mismatch:
+
+| D | - | **G** | **C** |
+|------ |---| --- |---|
+| - | 0 | -1 | -2 |
+| **G** | -1 | 1 | **X** |
+| **A** | -2 | **Y** | |
+
+X:
+
+- Top: (−2)+(−1) = (−3)
+- Left: (+1)+(−1) = (0)
+- Top-Left: (−1)+(−1) = (−2)
+
+Y:
+
+- Top: (1)+(−1) = (0)
+- Left: (−2)+(−1) = (−3)
+- Top-Left: (−1)+(−1) = (−2)
+
+For both X and Y, the highest score is zero:
+
+| D | - | **G** | **C** |
+|------ |---| --- |---|
+| - | 0 | -1 | -2 |
+| **G** | -1 | 1 | **0** |
+| **A** | -2 | **0** | |
+
+
+
+[^needleman1970general]: Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. *Journal of molecular biology, 48*(3), 443-453. doi: [10.1016/0022-2836(70)90057-4](https://doi.org/10.1016/0022-2836(70)90057-4)
diff --git a/content/alignment/pairwise/local/_index.md b/content/alignment/pairwise/local/_index.md
new file mode 100644
index 0000000..7ba6c5b
--- /dev/null
+++ b/content/alignment/pairwise/local/_index.md
@@ -0,0 +1,220 @@
+---
+title: Local
+type: docs
+---
+
+
+
+The local alignment method seeks to find the regions of highest similarity within the sequences without necessarily aligning them from end to end.
+Local alignment is beneficial in identifying functional domains, motifs, or other conserved sequence elements within larger, more variable sequences.
+It allows researchers to pinpoint regions of similarity that may be biologically relevant, even when the overall sequence similarity is low.
+The objective is to identify the segments within the sequences that align the best, disregarding other regions that might not align well.
+
+## Smith-Waterman algorithm
+
+Developed by Temple F. Smith and Michael S. Waterman in 1981, the Smith-Waterman algorithm is a cornerstone of bioinformatics.
+It is designed specifically for local sequence alignment.
+It distinguishes itself from global alignment algorithms by focusing on finding the highest-scoring local alignments between two sequences.
+This is especially useful for comparing sequences with varying lengths or when only a portion is similar.
+The algorithm uses dynamic programming to calculate the best possible alignment score between subsequences, employing a scoring matrix that rewards matches and penalizes mismatches and gaps.
+
+Notably, the algorithm can be expanded to use two different penalties for gap opening and gap extension, addressing the biological reality that starting a new gap is energetically less favorable than extending an existing one.
+This differentiation in gap penalties helps to minimize gaps in the alignment, mirroring biological sequences more accurately.
+
+- **Match Score:** The score given when two characters at the current position in the compared sequences are the same.
+- **Mismatch Penalty:** The score deducted when the characters do not match.
+- **Gap Penalty:** The penalty applied for introducing a gap in one of the sequences.
+
+### Initialization
+
+When we start with the Smith-Waterman algorithm, we first set up a grid or matrix.
+This grid has one sequence listed across the top and the other down the side.
+The first step is to fill in the top row and the leftmost column with zeros.
+This is because we need a starting point for our calculations, and beginning with zeros allows us to build up scores from nothing, which makes sense for finding local alignments where the alignment could start anywhere within the sequences.
+
+### Matrix filling
+
+The matrix filling step is the core of the Smith-Waterman algorithm for local sequence alignment.
+Let's break it down, keeping the mathematical notation but explaining each part in detail.
+
+For each cell $(i,j)$ in our matrix $H$, we calculate the score using this formula:
+
+$$
+H(i,j) = \max \left\{ \begin{array}{l}
+0, \\
+H(i-1,j-1) + s(x_i, y_j), \\
+\max_{k\geq1}\{H(i-k,j) - W_k\}, \\
+\max_{l\geq1}\{H(i,j-l) - W_l\}
+\end{array} \right.
+$$
+
+Let's break this down step by step:
+
+1. **$H(i,j)$**: This is the score we're calculating for the cell at position $(i,j)$ in our matrix.
+2. **$\max\{\}$**: We're taking the maximum value of four options. This means we'll choose the highest scoring option among them.
+3. **Option 1: $0$**
+ - This allows us to start a new alignment if all other options result in a negative score.
+ - It's crucial for finding local alignments, as it lets us "reset" when similarity decreases.
+4. **Option 2: $H(i-1,j-1) + s(x_i, y_j)$**
+ - This represents a match or mismatch between characters $x_i$ and $y_j$.
+ - $H(i-1,j-1)$ is the score from the diagonal upper-left cell.
+ - $s(x_i, y_j)$ is the scoring function:
+ - It returns a positive value (e.g., +2) if $x_i$ and $y_j$ match.
+ - It returns a negative value (e.g., -1) if they mismatch.
+5. **Option 3: $\max_{k\geq1}\{H(i-k,j) - W_k\}$**
+ - This represents introducing a gap in the sequence on the top of the matrix.
+ - We're looking at cells directly above our current position.
+ - $k$ is the gap length.
+ This only comes into play when we have affine gap penalties where the score changes based on opening or extending a gap.
+ We would have to try all possible values of $k$ to see which would maximize the score.
+ - $W_k$ is the penalty for a gap of length $k$.
+6. **Option 4: $\max_{l\geq1}\{H(i,j-l) - W_l\}$**
+ - This is similar to Option 3, but for introducing a gap in the sequence on the left side of the matrix.
+ - We're looking at cells directly to the left of our current position.
+
+In other words:
+
+- If the letters from each sequence at that position are a match, we add to the score. We look back at the cell diagonally up-left from our current position, take its score, and add a certain number (let's say +2 for a simple example) because we found a match.
+- If the letters don't match, we might take a smaller score or subtract because it's not as good as a match.
+- We also consider skipping a letter in one of the sequences. This could happen if there's a gap in one sequence compared to the other. We take a small penalty for this, subtracting a little from our score.
+- The key here is we always pick the option that gives us the highest score, which could even be zero. We never let the score go negative because we're only interested in positive matches.
+
+The case where $k > 1$ (considering gaps longer than one in a single step) comes into play primarily when using affine gap penalties.
+However, it's crucial to understand that this is not typically how the basic Smith-Waterman algorithm is implemented.
+
+In the standard implementation with linear gap penalties:
+
+- $k$ is always 1
+- We only look at the immediately adjacent cell $(i-1, j)$
+- The formula simplifies to $H(i-1, j) - W$, where $W$ is a fixed gap penalty
+
+Affine gap penalties were introduced to more accurately model the biological reality of insertions and deletions.
+They use two different penalties:
+
+1. Gap opening penalty ($o$): A larger penalty for starting a new gap
+2. Gap extension penalty ($e$): A smaller penalty for extending an existing gap
+
+The penalty for a gap of length $k$ would be: $W_k = o + (k-1)e$
+
+In this model, considering $k > 1$ during the matrix filling step could theoretically lead to optimal alignments in certain scenarios.
+However, implementing this naively would be computationally expensive.
+
+In practice, algorithms that use affine gap penalties (like Gotoh's algorithm, an extension of Smith-Waterman) don't actually consider all possible $k$ values at each step.
+Instead, they use additional matrices to keep track of the best score ending with a gap in each sequence.
+This allows them to efficiently handle affine gap penalties without explicitly calculating $\max_{k\geq1}\{H(i-k,j) - W_k\}$ at each step.
+
+Considering $k > 1$ might be optimal in scenarios where:
+
+1. The gap opening penalty is significantly larger than the gap extension penalty.
+2. There's a long stretch of mismatches between the sequences.
+
+In such cases, it might be more favorable to open one long gap rather than several short ones.
+However, this is captured by the efficient implementation mentioned above without explicitly considering all k values.
+
+Consider aligning these sequences with affine gap penalties:
+
+- Sequence 1: ACGTACGT
+- Sequence 2: ACGCGT
+
+With penalties:
+
+- Match: +2
+- Mismatch: -1
+- Gap opening (o): -4
+- Gap extension (e): -1
+
+An optimal alignment might be:
+
+```text
+ACGTACGT
+ACG--CGT
+```
+
+This alignment has one gap of length 2, which scores better than two separate gaps:
+
+- One gap of length 2: -4 (opening) + -1 (extension) = -5
+- Two gaps of length 1: -4 (opening) + -4 (opening) = -8
+
+#### Practical Implementation
+
+In practice, especially for a basic implementation with linear gap penalties, you can simplify this to:
+
+$$
+H(i,j) = \max \left\{ \begin{array}{l}
+0, \\
+H(i-1,j-1) + s(x_i, y_j), \\
+H(i-1,j) - W, \\
+H(i,j-1) - W
+\end{array} \right.
+$$
+
+Where $W$ is your fixed gap penalty.
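+
+As a concrete sketch of this simplified recurrence, the matrix-filling step with a fixed (linear) gap penalty can be written as below; the function name, defaults, and the choice to keep only the last highest-scoring cell are illustrative:
+
+```python
+def smith_waterman_matrix(seq_a, seq_b, match=2, mismatch=-1, gap=2):
+    # seq_a labels the columns and seq_b labels the rows; gap is the penalty W
+    # subtracted for each gap character.
+    rows, cols = len(seq_b) + 1, len(seq_a) + 1
+    H = [[0] * cols for _ in range(rows)]
+    best_score, best_cell = 0, (0, 0)
+    for i in range(1, rows):
+        for j in range(1, cols):
+            s = match if seq_b[i - 1] == seq_a[j - 1] else mismatch
+            H[i][j] = max(0,                    # start a new local alignment
+                          H[i - 1][j - 1] + s,  # diagonal: match/mismatch
+                          H[i - 1][j] - gap,    # from above: gap in seq_a
+                          H[i][j - 1] - gap)    # from the left: gap in seq_b
+            if H[i][j] >= best_score:
+                best_score, best_cell = H[i][j], (i, j)
+    return H, best_score, best_cell
+
+
+H, score, cell = smith_waterman_matrix("ACTG", "ACG")
+print(score, cell)  # 4 (3, 4): the G/G cell used as the traceback start below
+```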
+
+#### Example Calculation
+
+Let's say we're aligning `ACTG` with `ACG` using these scores:
+
+- Match: +2
+- Mismatch: -1
+- Gap: -2
+
+For the cell comparing `T` from `ACTG` (a column) with `C` from `ACG` (a row), with the neighboring cells already filled (see the completed matrix in the example below):
+
+1. $0$ (start new)
+2. $H(i-1,j-1) + s(T, C) = 0 + (-1) = -1$ (mismatch)
+3. $H(i-1,j) - W = 0 - 2 = -2$ (gap in `ACTG`)
+4. $H(i,j-1) - W = 4 - 2 = 2$ (gap in `ACG`)
+
+The maximum of these is $2$, so $H(i,j) = 2$.
+
+#### Traceback
+
+The traceback step finds the optimal local alignment by starting from the highest-scoring cell in the matrix and tracing back through the cells used to calculate its score until a cell with a score of 0 is reached.
+
+- Identify the cell $H(i, j)$ with the highest score in the matrix.
+- From $H(i, j)$, move to one of the following cells that was used to calculate $H(i, j)$, according to the rule that maximizes the score:
+ - To $H(i-1, j-1)$ if the move was a diagonal (indicating a match or mismatch).
+ - To $H(i-k, j)$ or $H(i, j-l)$ if the move was vertical or horizontal, indicating a gap of length $k$ or $l$.
+- The traceback continues until a cell with a score of 0 is reached, indicating the start of the optimal local alignment.
+
+In other words:
+
+- We move from the highest score back towards the beginning, following the path that led to that high score. This could mean moving diagonally (which indicates a match), straight up, or straight left (which indicates a gap).
+- We stop tracing back when we hit a score of zero, which is our cue that this is where the best local alignment starts.
+
+### Example
+
+Consider two sequences to be aligned:
+
+- Sequence A: `ACTG`
+- Sequence B: `ACG`
+
+We will use the following scoring scheme:
+
+- Match score: +2
+- Mismatch penalty: -1
+- Linear gap penalty: -2
+
+Create a matrix $H$ with 4 rows and 5 columns: `ACG` (three characters) labels the rows and `ACTG` (four characters) labels the columns, with one extra row and column for the initial zeros.
+
+Fill the matrix based on the scoring rules. For simplicity, let's calculate a few cells:
+
+- For $H(1,1)$ (comparing `A` with `A`): the score is a match, so $H(1,1) = 2$.
+- For $H(2,1)$ (comparing `C` with `A`): a mismatch or gap would give a negative score, so following the rule that negative scores are reset, $H(2,1) = 0$.
+
+The filled matrix might look like this:
+
+| H | | A | C | T | G |
+|-----|---|---|---|---|---|
+| | 0 | 0 | 0 | 0 | 0 |
+| **A** | 0 | 2 | 0 | 0 | 0 |
+| **C** | 0 | 0 | 4 | 2 | 0 |
+| **G** | 0 | 0 | 2 | 3 | 4 |
+
+Starting from the highest-scoring cell, $H(3,4) = 4$ (the `G`/`G` cell), traceback reveals the alignment:
+
+- Sequence A: `ACTG`
+- Sequence B: `AC-G`
+
+The score of 4 corresponds to three matches (`A`, `C`, `G` at $+2$ each) and one gap (`-` at $-2$).
diff --git a/content/alignment/scoring/_index.md b/content/alignment/scoring/_index.md
new file mode 100644
index 0000000..bed7924
--- /dev/null
+++ b/content/alignment/scoring/_index.md
@@ -0,0 +1,71 @@
+---
+title: Scoring
+type: docs
+weight: 2
+---
+
+
+In DNA sequence alignment, scoring is crucial for quantifying the similarity or difference between two DNA sequences.
+This method helps identify the best possible alignment between sequences, revealing evolutionary relationships, gene function, and genetic variations.
+The scoring concept for nucleotide substitutions, insertions, deletions, and gaps is designed to reflect their biological significance.
+
+## Substitutions
+
+A nucleotide substitution occurs when one nucleotide (A, T, C, or G) is replaced by another in the sequence.
+
+> ![](https://www.pathwayz.org/Node/Image/url/aHR0cHM6Ly9pLmltZ3VyLmNvbS9vaXVaRVNHLnBuZw==)
+>
+> Credit: [Pathwayz](https://www.pathwayz.org/Tree/Plain/BASE+SUBSTITUTIONS)
+
+Scoring typically involves assigning positive scores to matches (identical nucleotides at the same position in both sequences) and negative scores to mismatches (different nucleotides at the same position).
+For example, a scoring system might use +1 for a match and -1 for a mismatch.
+This simple scheme rewards alignments of more similar sequences and penalizes those that differ.
+
+Some scoring systems differentiate between
+
+- transitions: substitutions between two purines (A ↔ G) or two pyrimidines (C ↔ T)
+- transversions: substitutions between a purine and a pyrimidine.
+
+Since transitions occur more frequently in evolution than transversions, some scoring schemes might penalize them less severely.
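+
+A small sketch of a substitution-scoring function that penalizes transitions less than transversions is shown below; the numeric values are illustrative rather than taken from any standard matrix:
+
+```python
+PURINES = {"A", "G"}
+PYRIMIDINES = {"C", "T"}
+
+
+def substitution_score(a, b, match=1, transition=-1, transversion=-2):
+    if a == b:
+        return match
+    # Both purines or both pyrimidines: a transition.
+    if (a in PURINES and b in PURINES) or (a in PYRIMIDINES and b in PYRIMIDINES):
+        return transition
+    # One purine and one pyrimidine: a transversion.
+    return transversion
+
+
+print(substitution_score("A", "G"))  # -1 (transition)
+print(substitution_score("A", "C"))  # -2 (transversion)
+```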
+
+## Insertions or deletions
+
+These events represent the insertion or deletion (also called indel) of nucleotides in a sequence relative to another.
+Scoring for indels involves introducing gaps in the alignment to maintain optimal sequence alignment.
+The scoring system usually penalizes such gaps to reflect the biological cost of indels.
+
+> ![](https://hackbrightacademy.com/content/uploads/2013/07/indel_examples_wild-type_3-bp-deletion_4-bp-insertion.png)
+>
+>
+> Credit: [Hackbright Academy](https://hackbrightacademy.com/blog/indel-finder-how-the-python-version-of-this-program-works/)
+
+## String distances
+
+### Hamming
+
+The Hamming distance between two strings of equal length is defined as the number of positions at which the corresponding symbols differ.
+Imagine you have two strings: GA**T**TA**C**A and GA**C**TA**T**A.
+The Hamming distance between these two strings would be two because there are two positions at which the nucleotides differ.
+This measure quantifies the minimum number of substitution operations needed to change one string into another.
+
+The primary limitation of the Hamming distance is its applicability only to sequences of the same length.
+This is because the Hamming distance measures the difference at specific positions between two sequences.
+If the sequences are of different lengths, the concept of position-wise comparison doesn't hold, making the Hamming distance calculation infeasible.
+Therefore, it cannot account for insertions or deletions common in DNA sequences and text strings, limiting its use in cases where these variations are significant.
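+
+A minimal sketch of the Hamming distance follows; it simply counts mismatched positions and raises an error for unequal lengths, since the distance is undefined in that case:
+
+```python
+def hamming_distance(a, b):
+    if len(a) != len(b):
+        raise ValueError("Hamming distance requires sequences of equal length")
+    # Count positions where the corresponding symbols differ.
+    return sum(1 for x, y in zip(a, b) if x != y)
+
+
+print(hamming_distance("GATTACA", "GACTATA"))  # 2
+```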
+
+### Levenshtein
+
+The Levenshtein distance, often called the edit distance, measures the minimum number of single-character edits required to transform one sequence into another.
+These edits can include insertions, deletions, or substitutions of characters.
+A key characteristic that sets the Levenshtein distance apart from other metrics like the Hamming distance is its ability to handle sequences of different lengths, making it incredibly versatile for comparing any two strings.
+
+For example, to transform the word "kitten" into "sitting," the Levenshtein distance would be 3:
+
+- Substitute "k" with "s"
+- Substitute "e" with "i"
+- Insert "g" at the end
+
+This metric provides a clear, numerical value representing the "distance" or difference between two sequences.
+
+Unlike metrics limited to comparisons of equal-length sequences, the Levenshtein distance accommodates differences in sequence length, broadening its applicability.
+By measuring insertions, deletions, and substitutions, the Levenshtein distance captures the full spectrum of possible edits, providing a comprehensive view of the similarity or dissimilarity between sequences.
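+
+A minimal dynamic-programming sketch of the Levenshtein distance is shown below, keeping only one row of the table at a time; the function name is illustrative:
+
+```python
+def levenshtein_distance(a, b):
+    # previous_row[j] holds the distance between the prefix of a seen so far and b[:j].
+    previous_row = list(range(len(b) + 1))
+    for i, x in enumerate(a, start=1):
+        current_row = [i]
+        for j, y in enumerate(b, start=1):
+            substitution = previous_row[j - 1] + (x != y)
+            deletion = previous_row[j] + 1
+            insertion = current_row[j - 1] + 1
+            current_row.append(min(substitution, deletion, insertion))
+        previous_row = current_row
+    return previous_row[-1]
+
+
+print(levenshtein_distance("kitten", "sitting"))  # 3
+```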
diff --git a/content/appendices/_index.md b/content/appendices/_index.md
new file mode 100644
index 0000000..b02e889
--- /dev/null
+++ b/content/appendices/_index.md
@@ -0,0 +1,4 @@
+---
+title: Appendices
+type: docs
+---
diff --git a/content/appendices/algorithms/_index.md b/content/appendices/algorithms/_index.md
new file mode 100644
index 0000000..4412576
--- /dev/null
+++ b/content/appendices/algorithms/_index.md
@@ -0,0 +1,5 @@
+---
+title: Algorithms
+type: docs
+toc: false
+---
diff --git a/content/appendices/algorithms/compression/_index.md b/content/appendices/algorithms/compression/_index.md
new file mode 100644
index 0000000..d5786ca
--- /dev/null
+++ b/content/appendices/algorithms/compression/_index.md
@@ -0,0 +1,5 @@
+---
+title: Compression
+type: docs
+toc: false
+---
diff --git a/content/appendices/algorithms/compression/bwt/_index.md b/content/appendices/algorithms/compression/bwt/_index.md
new file mode 100644
index 0000000..ac4d992
--- /dev/null
+++ b/content/appendices/algorithms/compression/bwt/_index.md
@@ -0,0 +1,298 @@
+---
+title: Burrows–Wheeler transform
+type: docs
+toc: false
+---
+
+
+The Burrows-Wheeler Transform (BWT) is a revolutionary algorithm in the field of data transformation and compression.
+Developed by Michael Burrows and David Wheeler in 1994, this algorithm reorganizes a string of characters in such a way that similar characters are grouped together, forming runs of the same character.
+This unique characteristic makes the BWT an invaluable tool in data compression, as it enhances the efficiency of subsequent compression algorithms by making the data more uniform and, thus, easier to compress.
+
+## Applications
+
+The primary application of the BWT is in data compression.
+By transforming data into runs of similar characters, the BWT prepares the data for more efficient compression by other algorithms.
+It's often used as a pre-processing step in compression algorithms to increase their effectiveness.
+The transformation makes the data more uniform, which in turn makes it easier for compression algorithms to reduce the size of the data without losing any information.
+
+In the realm of bioinformatics, the BWT has found significant applications, particularly in sequence alignment and genome assembly.
+The ability of the BWT to organize similar characters (or nucleotides, in the case of DNA/RNA) together makes it an excellent tool for identifying regions of similarity within long sequences of genomic data.
+This capability is crucial for aligning sequences to reference genomes, identifying genetic variations, and assembling short DNA sequences into longer ones.
+
+## Methodology
+
+### Rotation
+
+The first step in the BWT is to construct a matrix that includes all possible rotations of the input string.
+To do this, we take the input string and rotate it one character at a time, appending each rotation to a list that forms the rows of the matrix.
+
+??? note "Code"
+
+ ```python
+ def create_rotations(input_string):
+ rotations = []
+ length = len(input_string)
+ # Concatenate the string with itself to simplify rotation
+ temp_string = input_string + input_string
+ # Generate all rotations
+ for i in range(length):
+            rotations.append(temp_string[i:i+length])
+ return rotations
+ ```
+
+Here is an example for each stage of processing the input `banana$` using the Burrows-Wheeler Transform (BWT).
+
+!!! note
+
+    It's common to append a special character (like `$`) to the end of the input string to signify the end of the string.
+ This character should be unique and lexicographically smaller than any other character in the string to ensure it sorts properly.
+
+Below is the matrix consisting of all possible rotations of the input string `banana$`.
+We have **bolded** the portion of each rotation that is a suffix of the original string.
+
+**banana**\$
+**anana**\$b
+**nana**\$ba
+**ana**\$ban
+**na**\$bana
+**a**\$banan
+\$banana
+
+Each cyclical permutation shifts the characters of the string, effectively moving the first character to the end and sliding the rest one position to the left.
+This process preserves the "neighborhood" of characters, meaning that characters adjacent before the permutation remain close to each other in the permutations.
+
+### Sorting
+
+After creating the matrix of all possible rotations, the next step is to sort these rotations lexicographically (i.e., in dictionary order).
+This step reorganizes the matrix into a more structured form that is essential for the next step of the transform.
+
+??? note "Code"
+
+ ```python
+ def sort_rotations(rotations):
+ return sorted(rotations)
+ ```
+
+The sorted rotations of the input string, lexicographically:
+
+\$banana
+**a**\$banan
+**ana**\$ban
+**anana**\$b
+**banana**\$
+**na**\$bana
+**nana**\$ba
+
+When these cyclical permutations are sorted lexicographically, patterns emerge.
+Characters frequently occurring in the original text tend to group in the matrix of sorted permutations.
+This grouping is especially pronounced for repeated patterns or sequences in the text, making them more apparent and compressible.
+For example, we see **an** and **na** patterns are present and sorted near each other.
+
+### Extraction
+
+The final step in the Burrows-Wheeler Transform is to extract the last column of the sorted matrix.
+This column contains the transformed string, which tends to have runs of similar characters, making it more amenable to compression.
+
+??? note "Code"
+
+ ```python
+ def extract_last_column(sorted_rotations):
+ last_column = ''.join(rotation[-1] for rotation in sorted_rotations)
+ return last_column
+ ```
+
+The last column of this sorted matrix, which is the transformed string: **`annb$aa`**
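+
+Putting the three steps together with the sketch functions above reproduces the transform:
+
+```python
+rotations = create_rotations("banana$")
+bwt = extract_last_column(sort_rotations(rotations))
+print(bwt)  # annb$aa
+```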
+
+## Inversion
+
+Inversion of the BWT is a crucial feature that distinguishes it from other data transformation techniques.
+It allows for the original document to be regenerated from its BWT representation, which is essentially the last column of a sorted list of all cyclic rotations of the document.
+This process is reversible due to the unique properties of the BWT and does not require the original document or any additional information beyond the BWT output and the position of the original string in the sorted list.
+
+### Start with the Last Column
+
+Given the last column from the BWT output, the first task is to reconstruct the first column of the sorted rotations matrix. Since the last column contains all the characters of the original string, sorting these characters gives us the first column.
+The sorting needs to account for multiple occurrences of the same character by ensuring their relative order is preserved as in the last column.
+
+
+
+### Reconstruct the First Column
+
+Sort the characters in the last column alphabetically to obtain the first column of the matrix.
+This is possible because both columns contain the same set of characters, and sorting the last column's characters gives you the original order of characters as they appeared before the rotations were sorted.
+
+
+
+### Pair Successive Characters
+
+The next step involves pairing each character in the last column with the character in the same row of the first column.
+These pairs represent successive characters in the document, taken cyclically so that the last and first characters form a pair. This cyclical pairing is a key aspect of the BWT's ability to preserve character sequences from the original document.
+
+
+
+### Sort and Reconstruct Columns
+
+By sorting these pairs, you start to reconstruct the document one column at a time.
+Each iteration adds another character to the reconstructed sequence, progressively building up the sorted rotations matrix.
+This iterative sorting and pairing process continues until the entire document is reconstructed.
+
+
+
+### Identify the Original Text
+
+The row that ends with the special "end of file" character (e.g., `$` in our case) indicates the original document: **`banana$`**.
+
+??? note "Code"
+
+ ```python
+ def invert_burrows_wheeler(last_column):
+ # Initialize a list to hold tuples of (character, index) for sorting
+ char_tuples = [(char, i) for i, char in enumerate(last_column)]
+
+ # Sort the tuples by character to simulate the first column
+ first_column_tuples = sorted(char_tuples)
+
+ # Reconstruct the document using a table of indices
+ text_length = len(last_column)
+        # Start from the row whose last character is the end-of-string marker '$' (that row is the original string)
+ current_index = last_column.index('$')
+ original_text = [''] * text_length
+
+ for i in range(text_length):
+ char, next_index = first_column_tuples[current_index]
+ original_text[i] = char
+ current_index = next_index
+
+ # Return the reconstructed text as a string
+ return ''.join(original_text)
+ ```
+
+
+
+[^bwt-fmindex-langmead]: [Slides from Ben Langmead](https://www.cs.jhu.edu/~langmea/resources/lecture_notes/bwt_and_fm_index.pdf)
+[^bwt-report]: [A Block-sorting Lossless Data Compression Algorithm](https://www.hpl.hp.com/techreports/Compaq-DEC/SRC-RR-124.pdf)
diff --git a/content/appendices/algorithms/search/_index.md b/content/appendices/algorithms/search/_index.md
new file mode 100644
index 0000000..0f48fab
--- /dev/null
+++ b/content/appendices/algorithms/search/_index.md
@@ -0,0 +1,6 @@
+---
+title: Search
+type: docs
+toc: false
+---
+
diff --git a/content/appendices/algorithms/search/fm-index/_index.md b/content/appendices/algorithms/search/fm-index/_index.md
new file mode 100644
index 0000000..c047e99
--- /dev/null
+++ b/content/appendices/algorithms/search/fm-index/_index.md
@@ -0,0 +1,271 @@
+---
+title: FM-index
+type: docs
+toc: false
+---
+
+
+!!! warning "Preliminaries"
+
+ We assume you are familiar with the following material:
+
+ - [Burrows–Wheeler transform](../../compression/bwt/)
+
+In the realm of computer science, the ability to efficiently search through vast amounts of text is a cornerstone of numerous applications, from the intricate analysis of genetic sequences to the retrieval of information in large databases.
+The FM-index stands as a pivotal innovation in this domain, offering a sophisticated yet practical solution to the challenge of text searching.
+Developed by Paolo Ferragina and Giovanni Manzini in 2000, this data structure has revolutionized the way we approach text indexing and searching by combining compression with search efficiency.
+
+At its core, the FM-index is a compressed full-text substring index.
+It facilitates the searching of substrings within a larger text corpus with remarkable efficiency.
+The innovation of the FM-index lies not just in its search capabilities but also in its ability to compress the indexed text, thus conserving valuable storage space.
+
+## Last-to-first mapping
+
+The BWT is useful for compression because it tends to group identical characters into runs, and these runs are easier to compress.
+The "Last to First" (LF) mapping is a crucial part of the BWT, especially when it comes to the inverse transformation, where it helps in reconstructing the original string from the transformed string.
+For example, the BWT of `abracadabra$` is shown below.
+
+```text
+ I   F            L
+ 1   $ abracadabr a
+ 2   a $abracadab r
+ 3   a bra$abraca d
+ 4   a bracadabra $
+ 5   a cadabra$ab r
+ 6   a dabra$abra c
+ 7   b ra$abracad a
+ 8   b racadabra$ a
+ 9   c adabra$abr a
+10   d abra$abrac a
+11   r a$abracada b
+12   r acadabra$a b
+```
+
+- **I column:** Shows the index of the sorted rotations of the original string.
+- **F column:** The first column in the sorted list of all rotations of the original string. This column is important because it contains the characters of the original string sorted alphabetically.
+- **L column:** The last column in the sorted list of all rotations of the original string. This is the actual output of the BWT.
+
+The LF mapping is a way to navigate from a character in the last column (L) back to its corresponding character in the first column (F).
+This mapping is possible because the sorting step puts the cyclic permutations in lexicographic order, which means that identical characters appear in the same relative order in both the F and L columns.
+Thus, if you know the position of a character in the L column, you can find its original position in the F column.
+
+Given the sorted rotations and the BWT result, we proceed as follows.
+
+**1. Identify the L (last) and F (first) columns from your sorted rotations.**
+
+Based on our BWT above, we have
+
+- F column (first characters of each row, already sorted): `$aaaaabbcdrr`
+- L column (last characters of each row, the BWT of the string): `ard$rcaaaabb`
+
+**2. Count occurrences in L column up to each character to compute the mapping to F.**
+
+To compute the LF mapping, we need to count how many times each character appears in L up to a given point.
+This count tells us the rank of each character in L, which corresponds directly to its original position in F because both columns are essentially different permutations of the same string with the same character frequencies.
+
+First, let's number our F letters starting from zero and increasing by one each time a repeat of a previous letter occurs.
+For example, the first a will be a0, then the second time an a appears we will label it a1, etc.
+
+```text
+F             L
+$  abracadabr a
+a0 $abracadab r
+a1 bra$abraca d
+a2 bracadabra $
+a3 cadabra$ab r
+a4 dabra$abra c
+b0 ra$abracad a
+b1 racadabra$ a
+c0 adabra$abr a
+d0 abra$abrac a
+r0 a$abracada b
+r1 acadabra$a b
+```
+
+Now we will do our L letters.
+
+```text
+F             L
+$  abracadabr a0
+a0 $abracadab r0
+a1 bra$abraca d0
+a2 bracadabra $
+a3 cadabra$ab r1
+a4 dabra$abra c0
+b0 ra$abracad a1
+b1 racadabra$ a2
+c0 adabra$abr a3
+d0 abra$abrac a4
+r0 a$abracada b0
+r1 acadabra$a b1
+```
+
+The number attached to each letter is called its rank: the first occurrence of a letter has rank 0, the second rank 1, and so on.
+The highest rank for a letter therefore tells you how many times that letter occurs in the original string.
+In our case, "a" has the highest rank of 4, which means there are a total of five a's in our string.
+
+!!! note
+
+ In the context of the BWT, the rank of a character within the L column does not simply indicate how many times the character appears, but rather its sequential position among identical characters within that column. This is a crucial point because the BWT, by design, groups similar characters together due to the sorting of all cyclic permutations of the original string. The rank tells us not just about the quantity of occurrences but about the order of each occurrence within the transformed string.
+
+The power of this rank is that the F letter with a given rank and the L letter with the same rank are the same occurrence of that letter in the original string.
+Don't believe me?
+Look at the r0 in the L column; if we wrap around and continue the cyclical permutation until we hit \$ we see that it would be r0a\$.
+Now, find r0 in the F column and continue until we hit \$.
+We also get r0a\$!
+Go ahead and try this for other letters—it will work every time.
+
+How?
+Well, this specifically has to do with the fact that we add the `$` to the end of our string and make sure it is lexicographically smaller than any other letter.
+The right-context of a character is essentially the substring that follows it in a particular cyclic permutation of the original string.
+Since the BWT sorts these permutations lexicographically, characters are effectively grouped by their right-context in the L column.
+When we map a character from L back to F using its rank, we're leveraging the inherent organization of the BWT, where each character's position is intimately tied to its right-context.
+This mapping allows us to trace each character's journey through the sorted permutations, from its position in the L column (where it ends a particular permutation) back to its position in the F column (where it starts another, lexicographically earlier, permutation).
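+
+The rank bookkeeping above is easy to mechanize.
+Below is a minimal Python sketch (the helper names are ours, not from any particular library) that attaches ranks to the characters of the L and F columns and builds the LF mapping for the `abracadabra$` example.
+
+```python
+def rank_characters(column):
+    """Attach an occurrence rank to each character, e.g. a0, a1, ..."""
+    counts = {}
+    ranked = []
+    for char in column:
+        rank = counts.get(char, 0)
+        ranked.append((char, rank))
+        counts[char] = rank + 1
+    return ranked
+
+
+def lf_mapping(last_column):
+    """For each row, give the F row holding the same ranked character as its L entry."""
+    ranked_last = rank_characters(last_column)
+    ranked_first = rank_characters(sorted(last_column))
+    first_row = {pair: row for row, pair in enumerate(ranked_first)}
+    return [first_row[pair] for pair in ranked_last]
+
+
+bwt = "ard$rcaaaabb"         # L column from the table above
+print(rank_characters(bwt))  # [('a', 0), ('r', 0), ('d', 0), ('$', 0), ('r', 1), ...]
+print(lf_mapping(bwt))       # [1, 10, 9, 0, 11, 8, 2, 3, 4, 5, 6, 7]
+```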
+
+## Reversing the BWT
+
+You can also use the LF mapping to reverse the BWT and get the original string.
+
+```text
+F  L
+$  a0
+a0 r0
+a1 d0
+a2 $
+a3 r1
+a4 c0
+b0 a1
+b1 a2
+c0 a3
+d0 a4
+r0 b0
+r1 b1
+```
+
+You start from the row whose F entry is \$, move to the end of that row, and append the letter in the L column.
+
+**Original:** \$a0
+
+Move to the row that starts with a0 and then add the L-column letter.
+
+**Original:** \$a0r0
+
+Move to the row that starts with r0 and then add the L-column letter.
+
+**Original:** \$a0r0b0
+
+Repeat this process until you reach \$ in the L column.
+
+**Original:** \$a0r0b0a1d0a4c0a3r1b1a2
+
+Now, reverse this string
+
+**Reversed:** a2b1r1a3c0a4d0a1b0r0a0\$
+
+and drop the ranks.
+
+**Reversed:** abracadabra\$
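+
+The walk we just did by hand is a short loop in code.
+The sketch below (our own helper, following the procedure above) rebuilds the original string from the BWT by repeatedly following the LF mapping, starting at the row whose F entry is `$`.
+
+```python
+def reverse_bwt(last_column):
+    """Rebuild the original text by repeatedly applying the LF mapping."""
+    # Rank every occurrence in L (a0, a1, ...) and sort those pairs to get F.
+    counts = {}
+    ranked_last = []
+    for char in last_column:
+        ranked_last.append((char, counts.get(char, 0)))
+        counts[char] = counts.get(char, 0) + 1
+    first_row = {pair: row for row, pair in enumerate(sorted(ranked_last))}
+
+    pieces = ["$"]                     # the F entry of row 0 is always '$'
+    row = 0                            # start at the row that begins with '$'
+    for _ in range(len(last_column) - 1):
+        char, rank = ranked_last[row]  # the letter in this row's L column
+        pieces.append(char)
+        row = first_row[(char, rank)]  # jump to the F row with the same rank
+    return "".join(reversed(pieces))   # same reversal step as above
+
+
+print(reverse_bwt("ard$rcaaaabb"))  # abracadabra$
+```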
+
+## Searching
+
+This LF mapping is also super helpful in quickly searching for patterns.
+For example, let's search our BWT for the string "bra" and copy our BWT below.
+
+```text
+F             L
+$  abracadabr a0
+a0 $abracadab r0
+a1 bra$abraca d0
+a2 bracadabra $
+a3 cadabra$ab r1
+a4 dabra$abra c0
+b0 ra$abracad a1
+b1 racadabra$ a2
+c0 adabra$abr a3
+d0 abra$abrac a4
+r0 a$abracada b0
+r1 acadabra$a b1
+```
+
+Similar to reversing the BWT, we perform successive LF mappings, but we process our search string from its last character to its first (i.e., a, then r, then b).
+
+!!! note
+
+ We actually did this in the previous section by starting from the first row that begins with \$.
+
+First, we find all rows whose F column starts with "a" (marked with `<-` below).
+
+```text
+F             L
+$  abracadabr a0
+a0 $abracadab r0   <-
+a1 bra$abraca d0   <-
+a2 bracadabra $    <-
+a3 cadabra$ab r1   <-
+a4 dabra$abra c0   <-
+b0 ra$abracad a1
+b1 racadabra$ a2
+c0 adabra$abr a3
+d0 abra$abrac a4
+r0 a$abracada b0
+r1 acadabra$a b1
+```
+
+Now we eliminate all rows that do not end in "r", because the letter in the L column immediately precedes the letter in the F column in the original string.
+Thus, any row with "a" in F and "r" in L represents "ra" in the original string; the two surviving rows are marked below.
+
+```text
+F             L
+$  abracadabr a0
+a0 $abracadab r0   <-
+a1 bra$abraca d0
+a2 bracadabra $
+a3 cadabra$ab r1   <-
+a4 dabra$abra c0
+b0 ra$abracad a1
+b1 racadabra$ a2
+c0 adabra$abr a3
+d0 abra$abrac a4
+r0 a$abracada b0
+r1 acadabra$a b1
+```
+
+Now we find the rows with r0 and r1 in the F column.
+
+```text
+F             L
+$  abracadabr a0
+a0 $abracadab r0
+a1 bra$abraca d0
+a2 bracadabra $
+a3 cadabra$ab r1
+a4 dabra$abra c0
+b0 ra$abracad a1
+b1 racadabra$ a2
+c0 adabra$abr a3
+d0 abra$abrac a4
+r0 a$abracada b0   <-
+r1 acadabra$a b1   <-
+```
+
+Normally we would also need to eliminate any rows that do not have "b" in the L column, but in this example both remaining rows already end in "b".
+Following the L column of our valid rows, we see that our matches continue at b0 and b1 in the F column.
+
+```text
+F             L
+$  abracadabr a0
+a0 $abracadab r0
+a1 bra$abraca d0
+a2 bracadabra $
+a3 cadabra$ab r1
+a4 dabra$abra c0
+b0 ra$abracad a1   <-
+b1 racadabra$ a2   <-
+c0 adabra$abr a3
+d0 abra$abrac a4
+r0 a$abracada b0
+r1 acadabra$a b1
+```
+
+We have found the two rows that match our string.
+This may seem redundant since we can easily see the correct rows from the beginning, but this quickly becomes intractable when we have thousands and thousands of rows.
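+
+The row-elimination we just performed by hand is the core of FM-index backward search.
+The sketch below (our own illustrative code, not an optimized FM-index) processes the pattern from its last character to its first, keeping only the rows whose rotations start with the growing suffix; for `bra` against the `abracadabra$` BWT it reports the two matches found above.
+
+```python
+def backward_search(last_column, pattern):
+    """Count rows of the BWT matrix whose rotation begins with `pattern`."""
+    counts = {}
+    ranked_last = []
+    for char in last_column:
+        ranked_last.append((char, counts.get(char, 0)))
+        counts[char] = counts.get(char, 0) + 1
+    ranked_first = sorted(ranked_last)
+    first_row = {pair: row for row, pair in enumerate(ranked_first)}
+
+    # Start with every row whose F character is the last letter of the pattern.
+    rows = [row for row, (char, _) in enumerate(ranked_first) if char == pattern[-1]]
+    # Walk the remaining pattern letters from right to left.
+    for letter in reversed(pattern[:-1]):
+        # Keep rows whose L character (the preceding letter) matches,
+        # then hop to the F row holding that same ranked character.
+        rows = [first_row[ranked_last[row]] for row in rows
+                if ranked_last[row][0] == letter]
+    return len(rows)
+
+
+print(backward_search("ard$rcaaaabb", "bra"))  # 2
+```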
+
+## Rank array
+
+TODO: Introduce rank arrays for L and F and explain navigating them.
+
+## Checkpoints
+
+TODO: Introduce rank checkpoints and offsets
+
+
+
+[^cheng2018fmtree]: Cheng, H., Wu, M., & Xu, Y. (2018). FMtree: a fast locating algorithm of FM-indexes for genomic data. *Bioinformatics, 34*(3), 416-424. doi: [10.1093/bioinformatics/btx596](https://doi.org/10.1093/bioinformatics/btx596)
+[^wikipedia]: [FM-index Wikipedia](https://en.wikipedia.org/wiki/FM-index)
+[^curious-coding]: [Interactive demo from Curious Coding](https://curiouscoding.nl/notes/bwt/)
+[^alex-bowe]: [Blog post from Alex Bowe](https://www.alexbowe.com/fm-index/)
+[^bwt-fmindex-langmead]: [Slides from Ben Langmead](https://www.cs.jhu.edu/~langmea/resources/lecture_notes/bwt_and_fm_index.pdf)
+[^simpson2010efficient]: Simpson, J. T., & Durbin, R. (2010). Efficient construction of an assembly string graph using the FM-index. *Bioinformatics, 26*(12), i367-i373. doi: [10.1093/bioinformatics/btq217](https://doi.org/10.1093/bioinformatics/btq217)
diff --git a/content/appendices/file-types/_index.md b/content/appendices/file-types/_index.md
new file mode 100644
index 0000000..7258bc9
--- /dev/null
+++ b/content/appendices/file-types/_index.md
@@ -0,0 +1,4 @@
+---
+title: File types
+type: docs
+---
diff --git a/content/appendices/file-types/fasta.md b/content/appendices/file-types/fasta.md
new file mode 100644
index 0000000..7eca93a
--- /dev/null
+++ b/content/appendices/file-types/fasta.md
@@ -0,0 +1,51 @@
+---
+title: FASTA
+type: docs
+toc: false
+---
+
+
+In FASTA format the line before the nucleotide sequence, called the FASTA definition line, must begin with a greater-than sign (`>`), followed by a unique SeqID (sequence identifier).
+
+The SeqID must be unique for each nucleotide sequence and should not contain any spaces. Please limit the SeqID to 25 characters or less. The SeqID can only include letters, digits, hyphens (`-`), underscores (`_`), periods (`.`), colons (`:`), asterisks (`*`), and number signs (`#`).
+The sequence identifier will be replaced with an Accession number by the database staff when your submission is processed.
+
+```text
+>SeqABCD
+```
+
+Information about the source organism from which the sequence was obtained follows the SeqID and must be in the format `[modifier=text]`.
+Do not put spaces around the "=". At minimum, the scientific name of the organism should be included. Optional modifiers can be added to provide additional information. A complete list of available source modifiers and their format is available.
+
+```text
+>SeqABCD [organism=Mus musculus] [strain=C57BL/6]
+```
+
+The final optional component of the FASTA definition line is the sequence title, which will be used as the DEFINITION field in the flatfile. The title should contain a brief description of the sequence. There is a preferred format for nucleotide and protein titles. The provided title will be changed to the proper format by the database staff during processing.
+
+```text
+>SeqABCD [organism=Mus musculus] [strain=C57BL/6] Mus musculus neuropilin 1 (Nrp1) mRNA, complete cds.
+```
+
+Note in all cases, the FASTA definition line must not contain any hard returns. All information must be on a single line of text. If you have trouble importing your FASTA sequences, please double check that no returns were added to the FASTA definition line by your editing software.
+
+Examples of properly formatted FASTA definition lines for nucleotide sequences:
+
+```text
+>Seq1 [organism=Streptomyces lavendulae] [strain=456A] Streptomyces lavendulae strain 456A mitomycin radical oxidase (mcrA) gene, complete cds.
+```
+
+```text
+>ABCD [organism=Plasmodium falciparum] [isolate=ABCD] Plasmodium falciparum isolate ABCD merozoite surface protein 2 (msp2) gene, partial cds.
+```
+
+```text
+>DNA.new [organism=Homo sapiens] [chromosome=17] [map=17q21] [moltype=mRNA] Homo sapiens breast and ovarian cancer susceptibility protein (BRCA1) mRNA, complete cds.
+```
+
+The line after the FASTA definition line begins the nucleotide sequence.
+Unlike the FASTA definition line, the nucleotide sequence itself can contain returns.
+It is recommended that each line of sequence be no longer than 80 characters.
+Please only use IUPAC symbols within the nucleotide sequence. For sequences that are not contained within an alignment, do not use `?` or `-` characters.
+These will be stripped from the sequence.
+Use the IUPAC approved symbol `N` for ambiguous characters instead.
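+
+As a practical aside, these rules make FASTA straightforward to parse.
+The sketch below is our own minimal reader, not an official NCBI tool; for real work a library such as Biopython is the usual choice.
+
+```python
+def read_fasta(path):
+    """Return a list of (definition_line, sequence) tuples from a FASTA file."""
+    records = []
+    header, chunks = None, []
+    with open(path) as handle:
+        for line in handle:
+            line = line.strip()
+            if not line:
+                continue
+            if line.startswith(">"):
+                if header is not None:
+                    records.append((header, "".join(chunks)))
+                header, chunks = line[1:], []
+            else:
+                chunks.append(line)
+    if header is not None:
+        records.append((header, "".join(chunks)))
+    return records
+
+
+# Hypothetical usage with a file named example.fasta:
+# for definition, sequence in read_fasta("example.fasta"):
+#     print(definition, len(sequence))
+```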
diff --git a/content/appendices/file-types/fastq.md b/content/appendices/file-types/fastq.md
new file mode 100644
index 0000000..dc3c93e
--- /dev/null
+++ b/content/appendices/file-types/fastq.md
@@ -0,0 +1,137 @@
+---
+title: FASTQ
+type: docs
+toc: false
+---
+
+
+Results of Sanger sequencing are usually `fasta` files obtained from processing chromatograms.
+Most high-throughput sequencing (HTS) machines output `fastq` files, the current de facto standard in HTS.
+Like `fasta`, `fastq` files are simply text files, where each block of information is encoded as four lines:
+
+```text
+@read_identifier
+read_sequence
++ separator line
+base_qualities
+```
+
+For example, here you have 8 lines of a fastq file, corresponding to 2 sequences:
+
+```text
+@HWI-M01876:76:000000000-AF16W:1:1101:10853:1000 1:N:0:CGTGACAGAT
+NTGTACTTCATCCGAAACTCGTGCTCATCTCTGCTCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGTGAT
++
+#8ABCFGGGFCEDCFGGGGGGGFFCGEFGGGGGGFGGGGGGGGDEFGGGGGGGGGGGGGGGGGFFFEGGGGGGGGF
+@HWI-M01876:76:000000000-AF16W:1:1101:16471:1000 1:N:0:CGTGAACTTG
+NTTCCAGATATTCGATGCATGTGCCGCTCCTGTCGGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGTGAT
++
+#8BCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGG
+```
+
+## Q is for Quality
+
+Each base has a quality character associated with it, representing how confidently the machine identified (called) the base.
+The probability of error per base is given as a Phred score, $Q$, an integer derived from the quality character associated with the base.
+This error probability, $P$, is computed as
+
+$$
+P = 10^{-Q/10}.
+$$
+
+Useful reference values of Q include:
+
+- $Q = 10$ represents 90% accuracy (0.1 error)
+- $Q = 20$ represents 99% accuracy (0.01 error)
+- $Q = 30$ represents 99.9% accuracy (0.001 error)
+- $Q = 40$ represents 99.99% accuracy (0.0001 error)
+- $Q = 50$ represents 99.999% accuracy (0.00001 error)
+
+Although there is theoretically no limit, $Q$ usually goes up to around 40 in recent Illumina machines.
+
+### ASCII table
+
+To obtain this $Q$ value from the quality character of a base, we need to know that each character (such as `#`) has an associated ASCII decimal value.
+For example, `#` has a value of 35.
+
+```text
+Dec Char Dec Char Dec Char Dec Char
+--------- --------- --------- ----------
+ 0 NUL (null) 32 SPACE 64 @ 96 `
+ 1 SOH (start of heading) 33 ! 65 A 97 a
+ 2 STX (start of text) 34 " 66 B 98 b
+ 3 ETX (end of text) 35 # 67 C 99 c
+ 4 EOT (end of transmission) 36 $ 68 D 100 d
+ 5 ENQ (enquiry) 37 % 69 E 101 e
+ 6 ACK (acknowledge) 38 & 70 F 102 f
+ 7 BEL (bell) 39 ' 71 G 103 g
+ 8 BS (backspace) 40 ( 72 H 104 h
+ 9 TAB (horizontal tab) 41 ) 73 I 105 i
+ 10 LF (NL line feed, new line) 42 * 74 J 106 j
+ 11 VT (vertical tab) 43 + 75 K 107 k
+ 12 FF (NP form feed, new page) 44 , 76 L 108 l
+ 13 CR (carriage return) 45 - 77 M 109 m
+ 14 SO (shift out) 46 . 78 N 110 n
+ 15 SI (shift in) 47 / 79 O 111 o
+ 16 DLE (data link escape) 48 0 80 P 112 p
+ 17 DC1 (device control 1) 49 1 81 Q 113 q
+ 18 DC2 (device control 2) 50 2 82 R 114 r
+ 19 DC3 (device control 3) 51 3 83 S 115 s
+ 20 DC4 (device control 4) 52 4 84 T 116 t
+ 21 NAK (negative acknowledge) 53 5 85 U 117 u
+ 22 SYN (synchronous idle) 54 6 86 V 118 v
+ 23 ETB (end of trans. block) 55 7 87 W 119 w
+ 24 CAN (cancel) 56 8 88 X 120 x
+ 25 EM (end of medium) 57 9 89 Y 121 y
+ 26 SUB (substitute) 58 : 90 Z 122 z
+ 27 ESC (escape) 59 ; 91 [ 123 {
+ 28 FS (file separator) 60 < 92 \ 124 |
+ 29 GS (group separator) 61 = 93 ] 125 }
+ 30 RS (record separator) 62 > 94 ^ 126 ~
+ 31 US (unit separator) 63 ? 95 _ 127 DEL
+```
+
+The $Q$ value of a character is the decimal value corresponding to the entry of that character in the ASCII table, subtracted by 33.
+For example $Q$(`#`) = 35 – 33 = 2.
+
+!!! important "Why do we subtract 33?"
+
+ All ASCII decimal values lower than 33 are not visible characters.
+ For example, the `NUL` character is not really visible in a text file; same with `CAN` for cancel.
+ `!` has the lowest ASCII decimal value that is usable in a text file.
+
+## Computing error
+
+Looking at the first read of our fastq example,
+
+```text
+@HWI-M01876:76:000000000-AF16W:1:1101:10853:1000 1:N:0:CGTGACAGAT
+NTGTACTTCATCCGAAACTCGTGCTCATCTCTGCTCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGTGAT
++
+#8ABCFGGGFCEDCFGGGGGGGFFCGEFGGGGGGFGGGGGGGGDEFGGGGGGGGGGGGGGGGGFFFEGGGGGGGGF
+@HWI-M01876:76:000000000-AF16W:1:1101:16471:1000 1:N:0:CGTGAACTTG
+NTTCCAGATATTCGATGCATGTGCCGCTCCTGTCGGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGTGAT
++
+#8BCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGG
+```
+
+we can see it starts with `N` (unknown), with an associated quality character `#`.
+To know how confident the machine was in reading that base, we calculate:
+
+- $Q$ = 35 (ASCII decimal value of `#`) - 33 (ASCII decimal value of `!`) = 2
+- $P = 10^{-2/10}$ = 0.631 (63.1% probability of error)
+
+Given this probability of error, it is not surprising that the machine could not confidently say which base was in that position and therefore placed an `N` in that position.
+It is fairly common that in the first bases the machine is still calibrating, and sometimes there is less confidence in the called base.
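+
+If you want to do this conversion for a whole quality string rather than a single character, a few lines of Python are enough (assuming the Phred+33 encoding described above):
+
+```python
+def phred_scores(quality_string, offset=33):
+    """Convert a FASTQ quality string into (Q, error probability) pairs."""
+    return [(ord(char) - offset, 10 ** (-(ord(char) - offset) / 10))
+            for char in quality_string]
+
+
+# First five quality characters of the example read above.
+for q, p in phred_scores("#8ABC"):
+    print(q, round(p, 4))
+# prints: 2 0.631, 23 0.005, 32 0.0006, 33 0.0005, 34 0.0004
+```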
+
+Many sequencing machines can read both ends of a fragment.
+In this case, the machine will generate two paired `fastq` files, one with the forward reads and another with the reverse reads.
+You can find an example of this in the example fastq files `paired_end_example_1` (containing the forward reads) and `paired_end_example_2` (containing the reverse reads).
+These fastq files are paired because the reads for the same fragment are in the same order in the two files.
+For example, the first read in the forward fastq corresponds to the forward reading of the same fragment as the first read in the reverse fastq.
+
+## Acknowledgements
+
+Parts of this material were adapted with permission from the following sources:
+
+- [Entry Level Bioinformatics](https://gtpb.github.io/ELB18S/)
diff --git a/content/appendices/file-types/gene-annotation.md b/content/appendices/file-types/gene-annotation.md
new file mode 100644
index 0000000..453909b
--- /dev/null
+++ b/content/appendices/file-types/gene-annotation.md
@@ -0,0 +1,571 @@
+---
+title: Eukaryotic Genome Annotation Guide
+type: docs
+toc: false
+---
+
+
+Genome Workbench and table2asn (the replacement of tbl2asn) use a simple five-column tab-delimited table of feature locations and qualifiers in order to generate annotation.
+
+The format of this feature table allows different kinds of features (e.g. gene, coding region, tRNA, repeat_region) and qualifiers (e.g. /product, /note) to be indicated.
+The validator will check for errors such as internal stops in coding regions.
+
+## Prepare annotation table
+
+The features must be in a simple five-column tab-delimited table, called the feature table.
+The feature table specifies the location and type of each feature for table2asn (previously tbl2asn) or Genome Workbench to include in the GenBank submission that is created.
+The first line of the table contains the following basic information:
+
+```text
+>Features SeqID table_name
+```
+
+The `SeqID` must be the same as the sequence's `SeqID` in the FASTA file.
+The `table_name` is optional.
+Subsequent lines of the table list the features.
+Columns are separated by tabs.
+
+- Column 1: Start location of feature
+- Column 2: Stop location of feature
+- Column 3: Feature key
+- Column 4: Qualifier key
+- Column 5: Qualifier value
+
+[Figure 2](https://www.ncbi.nlm.nih.gov/genbank/eukaryotic_genome_submission_examples/#fig2) shows a sample feature table and illustrates a number of points about the feature table format.
+The GenBank flatfile corresponding to this table is shown in Figure 3.
+The allowed features and their qualifiers are listed in the Feature Table documentation.
+
+Features that are on the complementary strand, such as the genes `Ngs_3038` and `Ngs_11232` and their corresponding features shown in Figure 2, are indicated by reversing the interval locations.
+
+Please avoid unnecessary capitalization in all text entered in your table.
+
+Additional requirements, as well as suggestions for various types of annotation, are included in the following sections.
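+
+Because the feature table is just five tab-separated columns, it is straightforward to generate programmatically.
+Below is a minimal, illustrative Python sketch (the helper name and example values are ours, not part of table2asn or Genome Workbench) that writes a gene feature with a functional name and a `locus_tag` in this layout.
+
+```python
+def feature_table_lines(seqid, features):
+    """Yield lines of a five-column feature table for one sequence.
+
+    `features` is a list of (start, stop, key, qualifiers) tuples, where
+    `qualifiers` is a list of (qualifier_key, qualifier_value) pairs.
+    """
+    yield f">Features {seqid}"
+    for start, stop, key, qualifiers in features:
+        yield f"{start}\t{stop}\t{key}"
+        for qual_key, qual_value in qualifiers:
+            # Columns 1-3 are left empty on qualifier lines.
+            yield f"\t\t\t{qual_key}\t{qual_value}"
+
+
+# Hypothetical example: one gene with a functional name and a locus_tag.
+gene = (1, 1575, "gene", [("gene", "Abc5"), ("locus_tag", "KCS_0001")])
+for line in feature_table_lines("SeqABCD", [gene]):
+    print(line)
+```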
+
+## Gene features
+
+Gene features are always a single interval, and their location should cover the intervals of all the relevant features such as promoters and polyA binding sites.
+
+Gene names should follow the standard nomenclature rules of the particular organism.
+For example, mouse gene names begin with an uppercase letter, and the remaining letters are lowercase.
+
+Coding regions (CDS) and RNAs, such as tRNAs and rRNAs, must have a corresponding gene feature.
+However, other features such as `repeat_regions` and `misc_features` do not have a corresponding gene or `locus_tag`.
+
+## `locus_tag`
+
+All genes should be assigned a systematic gene identifier which should receive the `locus_tag` qualifier on the gene feature in the table. Genes may also have functional names as assigned in the scientific literature.
+In this example, `KCS_0001` is the systematic gene identifier, while `Abc5` is the functional gene name.
+
+**Table view of gene with both biological name and locus_tag:**
+
+```text
+1 1575 gene
+ gene Abc5
+ locus_tag KCS_0001
+```
+
+**Flatfile view:**
+
+```text
+gene 1..1575
+ /gene="Abc5"
+ /locus_tag="KCS_0001"
+```
+
+**Table view of gene with only locus_tag:**
+
+```text
+1 1575 gene
+ locus_tag KCS_0001
+```
+
+**Flatfile view:**
+
+```text
+gene 1..1575
+ /locus_tag="KCS_0001"
+```
+
+For consistency the same `locus_tag` prefix must be used throughout the entire genome.
+Therefore, all the chromosomes of a genome should have the same `locus_tag` prefix.
+
+To improve the use of `locus_tags` we are now requiring that all `locus_tag` prefixes be registered and that they be unique.
+We recommend having the BioProject registration process auto-assign a `locus_tag` prefix, as they are not meant to convey meaning.
+The `locus_tag` prefix should be 3-12 alphanumeric characters and the first character may not be a digit.
+The `locus_tag` prefix is followed by an underscore and then an alphanumeric identification number that is unique within the given genome.
+Other than the single underscore used to separate the prefix from the identification number, no special characters can be used in the `locus_tag`.
+
+The chromosome number can be embedded in the locus_tag, if desired, in the format `Prefix_#g#####`, where the first `#` is the chromosome number and `###` is the unique number of the gene.
+For example, `Ajs_4g00123` for a gene on chromosome 4.
+
+## `protein_id`
+
+All proteins in a WGS or complete genome must be assigned an identification number by the submitter.
+We use this number to track proteins when sequences are updated.
+This number is indicated in the table by the CDS qualifier `protein_id`, and should have the format `gnl|dbname|string`, where `dbname` is a version of your lab name that you think will be unique (e.g., `SmithUCSD`), and string is the unique protein `SeqID` assigned by the submitter.
+This identifier is saved with the record (in ASN.1 format), but it is not visible in the flatfile.
+We recommend using the `locus_tag` as the protein SeqID.
+In this example, the `protein_id` for `ABC5` is `gnl|SmithUCSD|KCS_0001`.
+
+Example:
+
+```text
+ <1 >1575 gene
+ gene Abc5
+ locus_tag KCS_0001
+ pseudogene unprocessed
+ 1 1575 CDS
+ product ABC5
+ protein_id gnl|SmithUCSD|KCS_0001
+```
+
+Since the `protein_id` is used for internal tracking in our database, it is important that the complete `protein_id` (dbname + SeqID) not be duplicated by a genome center.
+Thus, if your genome center is submitting more than one complete genome, please be sure to use unique `protein_id`'s for all of the genomes.
+
+The `protein_id` is also included as a qualifier on the corresponding mRNA feature, to allow the CDS and mRNA to be paired during processing.
+
+Note that when WGS submissions are processed, the `dbname` in the `protein_id` is automatically changed to `'WGS:XXXX'`, where `XXXX` is the project's accession number prefix.
+
+## `transcript_id`
+
+The `transcript_id` is included as a qualifier for both the CDS and its corresponding mRNA.
+It has the same format as the `protein_id`, gnl|dbname|identifier.
+Because each `transcript_id` and `protein_id` must be unique, we suggest adding `'mrna'` or `'t'` to the protein_id identifier as a simple way to create the corresponding (unique) `transcript_id`.
+However, you can use whatever naming convention you choose, as long as all of the identifiers are unique.
+
+```text
+63574 87173 gene
+ locus_tag Ngs_17131
+63574 63907 mRNA
+75690 75730
+84396 85536
+85598 85773
+85836 86109
+86173 86467
+86555 86670
+86731 87173
+ product hypothetical protein
+ protein_id gnl|ncbi|Ngs_17131
+ transcript_id gnl|ncbi|mrna.Ngs_17131
+84402 85536 CDS
+85598 85773
+85836 86109
+86173 86467
+86555 86670
+86731 86882
+ product hypothetical protein
+ protein_id gnl|ncbi|Ngs_17131
+ transcript_id gnl|ncbi|mrna.Ngs_17131
+```
+
+## CDS (coding region) features
+
+All CDS features must have a product qualifier (protein name).
+NCBI protein naming conventions are adopted from the International Protein Nomenclature Guidelines.
+
+Consistent nomenclature is indispensable for communication, literature searching and data retrieval.
+Many species-specific communities have established gene nomenclature committees that try to assign consistent and, if possible, meaningful gene symbols.
+Other scientific communities have established protein nomenclatures for a set of proteins based on sequence similarity and/or function.
+But there is no established organization involved in the standardization of protein names, nor are there any efforts to establish naming rules that are valid across the largest spectrum of species possible.
+
+Ambiguities regarding gene/protein names are a major problem in the literature and it is even worse in the sequence databases which tend to propagate the confusion.
+For this reason, we ask that you follow some basic guidelines in naming your proteins.
+The protein naming guidelines are based on the premise that a good and stable recommended name for a protein is a name that is as neutral as possible.
+
+## Partial coding regions
+
+To annotate a partial coding region, you should use the `"<"` or `">"` in your feature table to designate the feature as either 5' or 3' partial.
+The coding region should begin at the first nucleotide present in the sequence or exon, and you will indicate where the first complete codon begins in that coding region.
+
+Partial genes within a sequence should begin or end at consensus splice sites.
+Examples:
+
+In the first example below, the `"<"` designates this coding region as 5' partial and "codon_start 3" tells the software to start translation with the third nucleotide of the CDS.
+Note that if the codon_start is not specified, then the software assumes a codon_start of 1.
+The second coding region below is partial at the 3' end so `">"` is used to indicate a 3' partial feature. The third example is of a 3' partial coding region on the complementary or minus strand.
+
+```text
+<1 497 CDS
+ product transcription factor
+ note similar to Bacillus subtilis aldolase
+ codon_start 3
+ protein_id gnl|dbname|KCS_0001
+ transcript_id gnl|dbname|mrna.KCS_0001
+
+
+600 >1575 CDS
+ product actin-like protein
+ protein_id gnl|dbname|KCS_0002
+ transcript_id gnl|dbname|mrna.KCS_0002
+
+436 >1 CDS
+ product hypothetical protein
+ protein_id gnl|dbname|KCS_0056
+ transcript_id gnl|dbname|mrna.KCS_0056
+```
+
+Here are more examples of formatting partial CDS features.
+
+### mRNA features
+
+Include an mRNA feature for each translated CDS.
+Several things to note are:
+
+- Use the same product name for the mRNA and its corresponding CDS.
+- If there is no UTR information, then the mRNA's location will agree with its CDS's location, but the mRNA will be partial at its 5' and 3' ends.
+- Extend the gene feature to include the entire mRNA.
+- If the mRNA is partial, then make the gene partial.
+
+Examples:
+
+The first example is a complete CDS whose 5' and 3' UTRs are known.
+
+```text
+>Feature Cont54
+10400 12512 gene
+ locus_tag CCC_03116
+10400 10462 mRNA
+10533 10577
+10651 11098
+11182 11642
+11716 12512
+ product hypothetical protein
+ protein_id gnl|dbname|CCC_03116
+ transcript_id gnl|dbname|mrna.CCC_03116
+10450 10462 CDS
+10533 10577
+10651 11098
+11182 11642
+11716 12233
+ product hypothetical protein
+ protein_id gnl|dbname|CCC_03116
+ transcript_id gnl|dbname|mrna.CCC_03116
+```
+
+The second example is a CDS that is partial at the 5' end and lacks any 3' UTR information.
+
+```text
+>Feature Cont3
+<1 >497 gene
+ locus_tag CCC_111011
+<1 497 CDS
+ note similar to Bacillus subtilis aldolase
+ product aldolase-like protein
+ codon_start 3
+ protein_id gnl|dbname|CCC_111011
+ transcript_id gnl|dbname|mrna.CCC_111011
+<1 >497 mRNA
+ product aldolase-like protein
+ protein_id gnl|dbname|CCC_111011
+ transcript_id gnl|dbname|mrna.CCC_111011
+```
+
+## Gene fragments
+
+Sometimes a genome will have adjacent or nearby genes that seem to be only part of a protein. In many cases these indicate a possible problem with the sequence and/or annotation.
+A related issue is the presence of internal stop codons in the conceptual translation of a CDS that looks like it should be a real CDS.
+These problems may be due to a variety of reasons, including mutations or sequencing artifacts. They can be annotated in a number of ways:
+
+1. Annotate the gene with /pseudo to indicate that there is a problem with the gene.
+ Note that this qualifier does NOT mean that the gene is a pseudogene. (see point 2, below, if it is known that the gene IS a pseudogene)
+ If multiple gene fragments were present initially, then add a single gene feature which covers all of the potential coding regions and add the pseudo qualifier.
+ If known, a note qualifier may be added indicating why this gene is disrupted, for example:
+
+ ```text
+ 1 200 gene
+ gene Abc5
+ locus_tag KCS_0001
+ gene_desc alkaline phosphatase
+ pseudo
+ note nonfunctional due to frameshift
+ ```
+
+2. If you are sure that the disrupted or error-filled gene is a biological pseudogene, then add the pseudogene qualifier and the appropriate pseudogene type. For example:
+
+ ```text
+ 1 200 gene
+ gene Abc5
+ locus_tag KCS_0001
+ gene_desc alkaline phosphatase
+ pseudogene unprocessed
+ ```
+
+3. If the feature is just noting a similarity to genes in the database and is probably not translated, then it should be annotated as a misc_feature without a corresponding gene feature.
+
+ ```text
+ 1 200 misc_feature
+ note similar to Abc5
+ ```
+
+## Transpliced Genes
+
+Transpliced genes are the exception to the rule for annotating gene feature spans.
+Transpliced genes are similar to intron containing genes except the two pieces of the gene are found on different regions of the chromosome.
+These genes are transcribed as two or more separate RNA products that are transpliced into a single mRNA or tRNA.
+To annotate this using a table, enter the nucleotide spans so that the complementary (minus strand) spans are arranged from high to low and vice versa for the plus strand.
+
+```text
+36700 36618 gene
+86988 87064
+ locus_tag NEQ_t38
+ exception trans-splicing
+36631 36618 misc_feature
+ note sequence cleaved during processing of trans-spliced tRNAs
+36673 36635
+87030 87064 tRNA
+ product tRNA-Glu
+ exception trans-splicing
+ note this trans-spliced tRNA consists of two halves on mixed strands; it shares a 3' half with another tRNA
+```
+
+Flatfile view:
+
+```text
+ gene join(complement(36618..36700),86988..87064)
+ /locus_tag="NEQ_t38"
+ /trans_splicing
+ misc_feature complement(36618..36631)
+ /locus_tag="NEQ_t38"
+ /note="sequence cleaved during processing of trans-spliced tRNAs"
+ tRNA join(complement(36635..36673),87030..87064)
+ /locus_tag="NEQ_t38"
+ /product="tRNA-Glu"
+ /trans_splicing
+ /note="this trans-spliced tRNA consists of two halves on
+ mixed strands; it shares a 3' half with another tRNA"
+```
+
+## Split genes on two contigs
+
+Sometimes in incomplete genomes the ends of a gene may be on different contigs.
+When certain that the two pieces are part of the same gene, annotate these as separate genes with unique `locus_tags`, plus separate CDS/mRNAs with different `protein_id`'s and `transcript_id`'s.
+In addition, link the features together with notes that refer to the other part of the gene.
+However, do not create extremely short features, for example if one end is only the start methionine or only a few amino acids before the stop codon.
+
+```text
+>Feature Cont01.00111
+5000 >7500 gene
+ locus_tag KCS_2223A
+5000 5500 mRNA
+6000 >7200
+ product enolase
+ protein_id gnl|dbname|KCS_2223A
+ transcript_id gnl|dbname|mrna.KCS_2223A
+5488 5500 CDS
+6000 >7200
+ product enolase
+ protein_id gnl|dbname|KCS_2223A
+ transcript_id gnl|dbname|mrna.KCS_2223A
+ note 5' end; 3' end is gene KCS_2223B on contig Cont01.00224
+
+>Feature Cont01.00224
+<1 1000 gene
+ locus_tag KCS_2223B
+<100 1000 mRNA
+ product enolase
+ protein_id gnl|dbname|KCS_2223B
+ transcript_id gnl|dbname|mrna.KCS_2223B
+<100 876 CDS
+ product enolase
+ protein_id gnl|dbname|KCS_2223B
+ transcript_id gnl|dbname|mrna.KCS_2223B
+ note 3' end; 5' end is gene KCS_2223A on contig Cont01.00111
+```
+
+### Alternatively spliced genes
+
+In many cases a gene can be alternatively spliced, yielding alternative transcripts.
+These transcripts may differ in the coding region and produce different products, or they may differ in the non-translated 5' or 3' UTR and produce the same protein.
+To annotate alternatively spliced genes, include one mRNA and CDS for each transcript, and include only one gene over all of the features.
+Give the corresponding mRNA and CDS the same name, and include a note "alternatively spliced" on each.
+If there are multiple CDS with the same name, then add a note to each mRNA and CDS to refer to each other, eg "transcript variant A" and "encoded by transcript variant A" for one mRNA/CDS pair.
+If the CDS have different translations, then they should have different product names.
+Make sure that all the proteins have unique protein_id's.
+Example 1 (different products):
+
+```text
+>Feature Cont01.00055
+10 5000 gene
+ locus_tag CCC_04562
+10 500 mRNA
+722 1555
+2548 3901
+4400 5000
+ product enolase isoform A
+ note alternatively spliced
+ protein_id gnl|dbname|CCC_04562A
+ transcript_id gnl|dbname|mrna.CCC_04562A
+102 500 CDS
+722 1555
+2548 3901
+4400 4566
+ product enolase isoform A
+ note alternatively spliced
+ protein_id gnl|dbname|CCC_04562A
+ transcript_id gnl|dbname|mrna.CCC_04562A
+10 500 mRNA
+2548 3901
+4400 5000
+ product enolase isoform B
+ note alternatively spliced
+ protein_id gnl|dbname|CCC_04562B
+ transcript_id gnl|dbname|mrna.CCC_04562B
+102 500 CDS
+2548 3901
+4400 4566
+ product enolase isoform B
+ note alternatively spliced
+ protein_id gnl|dbname|CCC_04562B
+ transcript_id gnl|dbname|mrna.CCC_04562B
+```
+
+Example 2 (same product):
+
+```text
+>Feature Cont01.00056
+100 1000 gene
+ locus_tag CCC_03222
+100 333 mRNA
+444 678
+800 1000
+ product hypothetical protein
+ note transcript variant A; alternatively spliced
+ protein_id gnl|dbname|CCC_03222A
+ transcript_id gnl|dbname|mrna.CCC_03222A
+456 678 CDS
+800 865
+ product hypothetical protein
+ note encoded by transcript variant A; alternatively spliced
+ protein_id gnl|dbname|CCC_03222A
+ transcript_id gnl|dbname|mrna.CCC_03222A
+100 360 mRNA
+444 678
+800 1000
+ product hypothetical protein
+ note transcript variant B; alternatively spliced
+ protein_id gnl|dbname|CCC_03222B
+ transcript_id gnl|dbname|mrna.CCC_03222B
+456 678 CDS
+800 865
+ product hypothetical protein
+ note encoded by transcript variant B; alternatively spliced
+ protein_id gnl|dbname|CCC_03222B
+ transcript_id gnl|dbname|mrna.CCC_03222B
+```
+
+## Ribosomal RNA, tRNA and other RNA features
+
+RNA features (rRNA, tRNA, ncRNA) need a corresponding gene feature with a locus_tag qualifier.
+If the amino acid of a tRNA is unknown, use `tRNA-Xxx` as the product, as in the example.
+Many submitters like to label the tRNAs such as tRNA-Gly1, etc.
+If you wish to do this please include "tRNA-Gly1" as a note and not in /gene.
+The use of `/gene` is reserved for the actual biological gene symbol such as `"trnG"`.
+If a tRNA is a pseudogene, please use the `/pseudo` qualifier.
+
+Annotate ncRNAs that belong to one of the INSDC `ncRNA_class` values as an ncRNA feature, with the appropriate value in the required `/ncRNA_class` qualifier.
+Regions of an RNA should be annotated as a misc_feature (eg, leader sequences), or a misc_binding feature if they bind a known molecule (eg, riboswitches).
+If the RFAM identifier is known, it can be included as a `db_xref`.
+Some rRNA, tRNA and ncRNA examples:
+
+```text
+<1 400 gene
+ locus_tag KCS_00011
+<1 400 rRNA
+ product 16S ribosomal RNA
+488 560 gene
+ locus_tag KCS_00012
+488 560 tRNA
+ product tRNA-Lys
+570 601 gene
+ locus_tag KCS_00020
+ pseudo
+570 601 tRNA
+ product tRNA-Phe
+ pseudo
+700 780 gene
+ locus_tag KCS_00013
+700 780 tRNA
+ product tRNA-Xxx
+900 923 gene
+ locus_tag KCS_00014
+900 923 ncRNA
+ ncRNA_class miRNA
+ product mir-9c
+950 1000 gene
+ locus_tag KCS_00015
+950 1000 tmRNA
+ product tmRNA
+```
+
+Riboswitches used to be annotated using the misc_binding feature if the bound moiety was known, for example:
+
+```text
+1 100 misc_binding
+ note cobalamin riboswitch
+ bound_moiety adenosylcobalamin
+```
+
+Annotate riboswitches as regulatory features with the `regulatory_class` `'riboswitch'`:
+
+```text
+1 100 regulatory
+ regulatory_class riboswitch
+ note cobalamin riboswitch
+ bound_moiety adenosylcobalamin
+```
+
+If the bound moiety is unknown or if the sequence is a leader sequence, annotate as a `misc_feature`, for example:
+
+```text
+1 100 misc_feature
+ note yybP-ykoY element
+```
+
+`misc_feature` and `misc_binding` and regulatory features do not have an associated gene feature.
+If it is desired to tag these features with a `locus_tag`-like identifier, then include that value in the note, separated from other information by a semi-colon and space.
+
+## Evidence Qualifiers
+
+The International Nucleotide Sequence Database Collaboration (DDBJ, EMBL, and GenBank) has adopted a set of new qualifiers to describe the evidence for feature annotation in GenBank records. These are:
+
+`/experiment="text"` and `/inference="TYPE:text"`, where `TYPE` is from a select list and `text` is structured text.
+
+These qualifiers replace /evidence=experimental and /evidence=non-experimental, respectively, which are no longer supported.
+
+See more information about the Evidence Qualifiers.
+
+## Database cross references
+
+A variety of database cross references can be added to a feature. These appear as /db_xref on the features. This qualifier serves as a vehicle for linking of sequence records to other external databases. See the full list of db_xref databases.
+
+```text
+1 100 CDS
+ product RecA
+ protein_id gnl|center_name|Test_0001
+ db_xref InterPro:IPR000111
+```
+
+```text
+180 210 misc_feature
+ note yybP-ykoY element
+ db_xref RFAM:RF00080
+```
+
+## Gene Ontology
+
+GO (Gene Ontology) terms can be included in genomes to describe protein functionality. They can be indicated with the following qualifiers:
+
+```text
+1 100 CDS
+ product helicase
+ go_process chromatin assembly or disassembly|0006333||IEA
+ go_process antimicrobial humoral response|0019730|16163390|IMP
+ go_component nucleus|0005634|14668392|IDA
+ go_component chromatin|0000785||IEA
+ go_function ATP-dependent helicase activity|0008026||ISS
+ go_function nucleic acid binding|0003676||IEA
+ go_function ATP binding|0005524||IEA
+```
+
+The value field is separated by vertical bars '|' into a descriptive string, the GO identifier (leading zeroes are retained), and optionally a PubMed ID and one or more evidence codes. The evidence code is the fourth token, so include blank fields, as necessary (eg the last qualifier has no PubMed ID so the third field is blank).
+
+## Acknowledgements
+
+Some of this material has been adapted with permission from the following sources.
+
+- [NCBI](https://www.ncbi.nlm.nih.gov/genbank/eukaryotic_genome_submission_annotation)
diff --git a/content/appendices/file-types/gff.md b/content/appendices/file-types/gff.md
new file mode 100644
index 0000000..ad4b728
--- /dev/null
+++ b/content/appendices/file-types/gff.md
@@ -0,0 +1,168 @@
+---
+title: Generic Feature Format (GFF)
+type: docs
+toc: false
+---
+
+
+GFF is a standard file format for storing genomic features in a text file.
+
+## Examples
+
+```text
+##gff-version 3
+ctg123 . exon 1300 1500 . + . ID=exon00001
+ctg123 . exon 1050 1500 . + . ID=exon00002
+ctg123 . exon 3000 3902 . + . ID=exon00003
+ctg123 . exon 5000 5500 . + . ID=exon00004
+ctg123 . exon 7000 9000 . + . ID=exon00005
+```
+
+```text
+##gff-version 3
+#!gff-spec-version 1.21
+#!processor NCBI annotwriter
+#!genome-build PDT000047663.3
+#!genome-build-accession NCBI_Assembly:GCA_012827885.1
+#!annotation-date 02/29/2020 20:52:21
+#!annotation-source NCBI
+##sequence-region AATIZC010000001.1 1 652455
+##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=562
+AATIZC010000001.1 Genbank region 1 652455 . + . ID=AATIZC010000001.1:1..652455;Dbxref=taxon:562;collected-by=CDC;country=USA;gbkey=Src;isolation-source=stool;mol_type=genomic DNA;serovar=E. coli O157:H7;strain=2014C-4644
+AATIZC010000001.1 Genbank gene 138 713 . - . ID=gene-SR44_000001;Name=gmhB;gbkey=Gene;gene=gmhB;gene_biotype=protein_coding;locus_tag=SR44_000001
+AATIZC010000001.1 Protein Homology CDS 138 713 . - 0 ID=cds-EFL8326616.1;Parent=gene-SR44_000001;Dbxref=NCBI_GP:EFL8326616.1;Name=EFL8326616.1;gbkey=CDS;gene=gmhB;inference=COORDINATES: similar to AA sequence:RefSeq:NP_308229.1;locus_tag=SR44_000001;product=D-glycero-beta-D-manno-heptose 1%2C7-bisphosphate 7-phosphatase;protein_id=EFL8326616.1;transl_table=11
+AATIZC010000001.1 Genbank gene 901 1932 . + . ID=gene-SR44_000002;Name=metN;gbkey=Gene;gene=metN;gene_biotype=protein_coding;locus_tag=SR44_000002
+```
+
+## Format
+
+The GFF contains the following columns.
+
+### "seqid"
+
+The ID of the landmark used to establish the coordinate system for the current feature. IDs may contain any characters, but must escape any characters not in the set [a-zA-Z0-9.:^*$@!+_?-|].
+In particular, IDs may not contain unescaped whitespace and must not begin with an unescaped ">".
+
+To escape a character in this, or any of the other GFF3 fields, replace it with the percent sign followed by its hexadecimal representation.
+For example, ">" becomes "%3E". See URL Encoding (or: 'What are those "%20" codes in URLs?') for details.
+
+### "source"
+
+The source is a free text qualifier intended to describe the algorithm or operating procedure that generated this feature.
+Typically this is the name of a piece of software, such as "Genescan" or a database name, such as "Genbank."
+In effect, the source is used to extend the feature ontology by adding a qualifier to the type creating a new composite type that is a subclass of the type in the type column.
+It is not necessary to specify a source.
+If there is no source, put a "." (a period) in this field.
+
+### "type"
+
+The type of the feature (previously called the "method").
+This is constrained to be either: (a) a term from the "lite" sequence ontology, SOFA; or (b) a SOFA accession number.
+The latter alternative is distinguished using the syntax SO:000000.
+This field is required.
+
+### "start" and "end"
+
+The start and end of the feature, in 1-based integer coordinates, relative to the landmark given in column 1.
+Start is always less than or equal to end.
+
+For zero-length features, such as insertion sites, start equals end and the implied site is to the right of the indicated base in the direction of the landmark.
+These fields are required.
+
+### "score"
+
+The score of the feature, a floating point number.
+As in earlier versions of the format, the semantics of the score are ill-defined.
+It is strongly recommended that E-values be used for sequence similarity features, and that P-values be used for ab initio gene prediction features.
+If there is no score, put a "." (a period) in this field.
+
+### "strand"
+
+The strand of the feature. + for positive strand (relative to the landmark), - for minus strand, and . for features that are not stranded.
+In addition, ? can be used for features whose strandedness is relevant, but unknown.
+
+### "phase"
+
+For features of type "CDS", the phase indicates where the feature begins with reference to the reading frame.
+The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon.
+In other words, a phase of "0" indicates that the next codon begins at the first base of the region described by the current line, a phase of "1" indicates that the next codon begins at the second base of this region, and a phase of "2" indicates that the codon begins at the third base of this region.
+This is NOT to be confused with the frame, which is simply start modulo 3.
+If there is no phase, put a "." (a period) in this field.
+
+For forward strand features, phase is counted from the start field.
+For reverse strand features, phase is counted from the end field.
+
+The phase is required for all CDS features.
+
+### "attributes"
+
+A list of feature attributes in the format tag=value.
+Multiple tag=value pairs are separated by semicolons.
+URL escaping rules are used for tags or values containing the following characters: ",=;".
+Spaces are allowed in this field, but tabs must be replaced with the %09 URL escape. This field is not required.
+
+Column 9 tags have predefined meanings.
+
+#### ID
+
+Indicates the unique identifier of the feature.
+IDs must be unique within the scope of the GFF file.
+
+#### Name
+
+Display name for the feature. This is the name to be displayed to the user. Unlike IDs, there is no requirement that the Name be unique within the file.
+
+#### Alias
+
+A secondary name for the feature. It is suggested that this tag be used whenever a secondary identifier for the feature is needed, such as locus names and accession numbers. Unlike ID, there is no requirement that Alias be unique within the file.
+
+#### Parent
+
+Indicates the parent of the feature. A parent ID can be used to group exons into transcripts, transcripts into genes, and so forth. A feature may have multiple parents. Parent can *only* be used to indicate a partof relationship.
+
+#### Target
+
+Indicates the target of a nucleotide-to-nucleotide or protein-to-nucleotide alignment. The format of the value is "target_id start end [strand]", where strand is optional and may be "+" or "-". If the target_id contains spaces, they must be escaped as hex escape %20.
+
+#### Gap
+
+The alignment of the feature to the target if the two are not collinear (e.g. contain gaps).
+The alignment format is taken from the CIGAR format described in the Exonerate documentation.
+http://cvsweb.sanger.ac.uk/cgi-bin/cvsweb.cgi/exonerate?cvsroot=Ensembl.
+See the GFF3 specification for more information.
+
+#### Derives_from
+
+Used to disambiguate the relationship between one feature and another when the relationship is a temporal one rather than a purely structural "part of" one. This is needed for polycistronic genes. See the GFF3 specification for more information.
+
+#### Note
+
+A free text note.
+
+#### Dbxref
+
+A database cross reference. See the GFF3 specification for more information.
+
+#### Ontology_term
+
+A cross reference to an ontology term. See the GFF3 specification for more information.
+
+Multiple attributes of the same type are indicated by separating the values with the comma "," character, as in:
+
+```text
+Parent=AF2312,AB2812,abc-3
+```
+
+Note that attribute names are case sensitive.
+"Parent" is not the same as "parent".
+
+All attributes that begin with an uppercase letter are reserved for later use.
+Attributes that begin with a lowercase letter can be used freely by applications.
+You can stash any semi-structured data into the database by using one or more unreserved (lowercase) tags.
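+
+Putting the column and attribute rules together, a minimal GFF3 line parser fits in a few lines of Python.
+This is our own sketch for illustration; real pipelines typically rely on dedicated libraries such as gffutils.
+
+```python
+from urllib.parse import unquote
+
+GFF_COLUMNS = ["seqid", "source", "type", "start", "end",
+               "score", "strand", "phase", "attributes"]
+
+
+def parse_gff_line(line):
+    """Parse one (non-comment) GFF3 line into a dict of columns and attributes."""
+    fields = line.rstrip("\n").split("\t")
+    record = dict(zip(GFF_COLUMNS, fields))
+    record["start"], record["end"] = int(record["start"]), int(record["end"])
+    # Attributes are semicolon-separated tag=value pairs with URL escaping.
+    record["attributes"] = {
+        tag: unquote(value)
+        for tag, _, value in (pair.partition("=")
+                              for pair in record["attributes"].split(";") if pair)
+    }
+    return record
+
+
+line = "ctg123\t.\texon\t1300\t1500\t.\t+\t.\tID=exon00001"
+print(parse_gff_line(line)["attributes"]["ID"])  # exon00001
+```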
+
+## Acknowledgements
+
+Some of this material has been adapted with permission from the following sources.
+
+- [Formal GFF3 specification](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md)
+- [GMOD wiki](http://gmod.org/wiki/GFF3)
diff --git a/content/genomics/_index.md b/content/genomics/_index.md
new file mode 100644
index 0000000..3b3fcd8
--- /dev/null
+++ b/content/genomics/_index.md
@@ -0,0 +1,6 @@
+---
+title: Genomics
+type: docs
+toc: false
+---
+
diff --git a/content/genomics/annotation/_index.md b/content/genomics/annotation/_index.md
new file mode 100644
index 0000000..c520437
--- /dev/null
+++ b/content/genomics/annotation/_index.md
@@ -0,0 +1,10 @@
+---
+title: Gene annotation
+type: docs
+toc: false
+weight: 2
+---
+
+> [!CAUTION]
+>
+> This page is a work in progress and is subject to change at any moment.
diff --git a/content/genomics/assembly/_index.md b/content/genomics/assembly/_index.md
new file mode 100644
index 0000000..8e70109
--- /dev/null
+++ b/content/genomics/assembly/_index.md
@@ -0,0 +1,32 @@
+---
+title: Gene assembly
+type: docs
+toc: false
+weight: 1
+---
+
+
+In bioinformatics, genome assembly represents the process of combining many short DNA sequences to recreate the original chromosomes from which the DNA originated.
+Sequence assembly is one of the basic steps after performing DNA sequencing.
+The established genome assembly can be submitted to databases such as the [European Nucleotide Archive](https://www.ebi.ac.uk/ena/browser/), [NCBI Genome](https://www.ncbi.nlm.nih.gov/datasets/genome/), and [Ensembl Genomes](https://www.ensembl.org/).
+You can also browse these databases for genome assemblies deposited by other researchers.
+
+> ![](https://sandbox.dodona.be/en/activities/1535795563/description/6Vg7bgJkMY2_iKxn/media/fragment-assembly.png)
+>
+> Credit: [Dodona](https://dodona.be/en/exercises/1535795563/#)
+
+There are two different types of genome assembly: [de novo](./de-novo/) assembly and mapping to a [reference](./reference/) genome.
+[De novo](./de-novo/) assembly refers to the genome assembly of a novel genome from scratch without reference to genomic data.
+A reference genome or a [reference assembly](./reference/) is a digital nucleic acid sequence database that represents a species’ set of genes.
+Once the reference genome is available, with its aid, the genome assembly becomes much easier, quicker, and even more accurate.
+Therefore, unless necessary, researchers choose the method of [reference-based alignment](./reference/).
+Reference-based alignment has become the current standard in diagnostics.
+
+A high-quality, well-annotated genome assembly is increasingly becoming an essential tool for applied and basic research across many biological disciplines in the 21st century, one that can turn almost any organism into a model organism.
+Thus, securing complete and accurate reference genomes and annotations before undertaking post-genome studies such as genome-wide association studies, structural variation analyses, and methylation or histone-modification studies has become a cornerstone of modern genomics.
+However, early works have warned against relying on an existing reference genome for assembly, because the resulting assemblies may be biased toward errors and chromosomal rearrangements present in that reference.
+
+
+
+[^jung2020twelve]: Jung, H., Ventura, T., Chung, J. S., Kim, W. J., Nam, B. H., Kong, H. J., ... & Eyun, S. I. (2020). Twelve quick steps for genome assembly and annotation in the classroom. *PLoS computational biology, 16*(11), e1008325. doi: [10.1371/journal.pcbi.1008325](https://doi.org/10.1371/journal.pcbi.1008325)
+[^ee372]: [EE 372: Data Science for High-Throughput Sequencing](https://data-science-sequencing.github.io/)
diff --git a/content/genomics/assembly/concepts/_index.md b/content/genomics/assembly/concepts/_index.md
new file mode 100644
index 0000000..18f9c77
--- /dev/null
+++ b/content/genomics/assembly/concepts/_index.md
@@ -0,0 +1,8 @@
+---
+title: Concepts
+type: docs
+toc: false
+weight: 2
+---
+
+Here, you will find miscellaneous concepts that are relevant to genome assembly.
diff --git a/content/genomics/assembly/concepts/contigs/_index.md b/content/genomics/assembly/concepts/contigs/_index.md
new file mode 100644
index 0000000..08cb9e5
--- /dev/null
+++ b/content/genomics/assembly/concepts/contigs/_index.md
@@ -0,0 +1,24 @@
+---
+title: Contigs
+type: docs
+toc: false
+---
+
+
+
+In the context of genome assembly, a contig (derived from the word “contiguous”) is a set of DNA segments or sequences that overlap in a way that provides a contiguous representation of a genomic region.
+
+> ![](https://www.genome.gov/sites/default/files/media/images/tg/Contig.jpg)
+>
+> Credit: [National Human Genome Research Institute](https://www.genome.gov/genetics-glossary/Contig)
+
+The process of generating an assembly involves the isolation of genomic DNA from a biological sample and fragmentation of DNA into small pieces that are then sequenced individually.
+Once the sequences of these small pieces, called reads, are obtained, researchers assemble these like tiny pieces of a giant puzzle into progressively larger contiguous sequence pieces, which are called contigs.
+This approach is termed Whole Genome Shotgun (WGS) sequencing.
+
+Contigs are the first level in the hierarchy of a genomic assembly.
+The next step is to build scaffolds (supercontigs).
+To build a scaffold, researchers place several contigs in the correct order and orientation.
+
+Therefore, contigs can refer both to overlapping DNA sequences and to overlapping physical segments.
+They play a crucial role in reconstructing the original DNA sequence of a chromosome or a region of a chromosome.
diff --git a/content/genomics/assembly/concepts/coverage/_index.md b/content/genomics/assembly/concepts/coverage/_index.md
new file mode 100644
index 0000000..c13a6b7
--- /dev/null
+++ b/content/genomics/assembly/concepts/coverage/_index.md
@@ -0,0 +1,52 @@
+---
+title: Coverage
+type: docs
+toc: false
+---
+
+
+
+> [!CAUTION]
+>
+> This page is a work in progress and is subject to change at any moment.
+
+Starting with some notation, let
+
+- $G$ = Length of the genome,
+- $L$ = Read length,
+- $N$ = Number of reads.
+
+We assume that $L$ is fixed.
+We first derive a relationship between the above three values that would result in successful assembly.
+Since $L$ and $G$ are fixed with our choice of experiment and technology, we need to choose $N$ (i.e., "How much sequencing do I need to do?").
+Intuitively, the reads must cover the entire genome, and each base has to be covered by at least 1 read.
+Therefore $LN>G$ or $N>G/L$.
+In order to achieve this lower bound, we would need all $N$ reads to tile the genome perfectly without overlap, which is highly unlikely.
+
+It turns out that if we let ϵ represent the probability of not achieving full genome coverage, then
+
+$$
+N \geq \frac{G}{L} \ln \left( \frac{G}{\varepsilon} \right)
+\tag{1}
+$$
+
+If this condition is met, then the genome is fully covered with probability at least $1 - \varepsilon$.
+This result is more stringent than our previous bound due to the $\ln \left( G/\varepsilon \right)$ factor, which is greater than 1.
+
+In isolation, $N$ is not too informative.
+For a particular sequencing experiment, $N$=100 million reads could be large or small depending on the size of the genome and the length of each read.
+Because the reads are random, some bases will be covered more often than other bases.
+Therefore rather than using $N$, we are instead interested in the coverage depth, or the average coverage per base, which is described by
+
+$$
+c = \frac{NL}{G} \geq \ln \left( \frac{G}{\varepsilon} \right).
+\tag{2}
+$$
+
+As an example, if the genome of interest is about one billion base pairs long and we accept $\varepsilon = 0.01$, then we need a coverage depth of at least $c = \ln \left( 10^{9} / 0.01 \right) \approx 25.3$, i.e., roughly 25x coverage.
+Note that $L/G$ is quite small, and therefore the number of reads covering any given base can be approximated with a Poisson distribution with mean
+
+$$
+c= \frac{NL}{G}.
+\tag{3}
+$$
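+
+To make these formulas concrete, here is a minimal Python sketch that evaluates the bound for the example above; the read length of $L = 100$ bp is an assumption chosen purely for illustration.
+
+```python
+import math
+
+
+def coverage_requirements(G, L, epsilon):
+    """Coverage depth and read count suggested by the bound above."""
+    c = math.log(G / epsilon)  # required coverage depth, ln(G / epsilon)
+    N = c * G / L              # corresponding number of reads, N = c G / L
+    return c, N
+
+
+# Example values; L = 100 bp is an illustrative assumption.
+G, L, epsilon = 1e9, 100, 0.01
+c, N = coverage_requirements(G, L, epsilon)
+print(f"coverage depth c >= {c:.1f}x, reads N >= {N:.2e}")
+
+# With depth c, the number of reads covering a given base is roughly
+# Poisson(c), so a particular base is left uncovered with probability ~ e^(-c).
+print(f"P(base uncovered) ~ {math.exp(-c):.1e}")
+```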
diff --git a/content/genomics/assembly/concepts/scs/_index.md b/content/genomics/assembly/concepts/scs/_index.md
new file mode 100644
index 0000000..6e7c118
--- /dev/null
+++ b/content/genomics/assembly/concepts/scs/_index.md
@@ -0,0 +1,90 @@
+---
+title: Shortest common superstring
+type: docs
+toc: false
+---
+
+
+The Shortest Common Superstring (SCS) problem involves finding the shortest string that contains all given strings as substrings.
+In simpler terms, if you have a set of strings, the challenge is to construct the smallest possible string that includes each of these strings as a contiguous block, reusing shared characters wherever the strings overlap.
+
+In genome sequencing, DNA is broken into numerous small pieces that are then sequenced to produce reads.
+These reads are short sequences of nucleotides (`A`, `C`, `G`, `T`).
+The main challenge is assembling these reads in the correct order to reconstruct the genome's original sequence.
+This process is akin to solving the SCS problem, where each read is a string.
+The goal is to merge them into a single, continuous sequence that is as short as possible while still containing all the original sequences.
+
+The key to solving this problem lies in finding overlaps between the reads.
+By identifying how the end of one read overlaps with the beginning of another, it's possible to stitch these reads together to minimize redundancy, thus approaching the shortest common superstring.
+
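+One way to make this concrete is a small Python sketch that computes the length of the longest suffix of one read matching a prefix of another; the function name and the `min_length` parameter are illustrative choices rather than part of any particular assembler.
+
+```python
+def overlap(a: str, b: str, min_length: int = 1) -> int:
+    """Length of the longest suffix of `a` that matches a prefix of `b`."""
+    start = 0
+    while True:
+        # Jump to the next occurrence of b's first min_length characters in a.
+        start = a.find(b[:min_length], start)
+        if start == -1:
+            return 0
+        # If the rest of a from this position is a prefix of b, we have an overlap.
+        if b.startswith(a[start:]):
+            return len(a) - start
+        start += 1
+
+
+print(overlap("ACGTAC", "TACGTG"))  # 3, the shared TAC
+print(overlap("TACGTG", "TGAACG"))  # 2, the shared TG
+```
+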
+It's important to note that the SCS problem is NP-hard, meaning that no known algorithm can solve it efficiently for all possible input sets.
+In the context of genome assembly, this complexity is managed through various heuristic and approximation algorithms that seek to find a solution that is good enough, if not mathematically perfect.
+
+!!! example
+    Imagine the original DNA sequence we aim to reconstruct is: `ACGTACGTGAACG`.
+ If we are provided three sequencing reads:
+
+ 1. `5'- ACGTAC -3'`
+ 2. `5'- TACGTG -3'`
+ 3. `5'- TGAACG -3'`
+
+ Identify initial overlaps:
+
+ - The suffix of Read 1 (`5'- ___TAC -3'`) overlaps with the prefix of Read 2 (`5'- TAC___ -3'`) by three nucleotides (`TAC`).
+
+ ```text
+ 5'- A C G T A C -3'
+ | | |
+ 5'- T A C G T G -3'
+ ```
+ - The suffix of Read 2 (`5'- ____TG -3'`) overlaps with the prefix of Read 3 (`5'- TG____ -3'`) by two nucleotides (`TG`).
+
+ ```text
+ 5'- T A C G T G -3'
+ | |
+ 5'- T G A A C G -3'
+ ```
+
+ Merge based on largest overlaps:
+
+    - Merging Read 1 and Read 2 through their overlap gives us `5'- ACGTACGTG -3'`.
+    - Then, merging this combined sequence with Read 3 by aligning the `TG` overlap gives us the shortest complete sequence `5'- ACGTACGTGAACG -3'`.
+
+!!! warning
+
+ When we discuss an overlap between two sequences in the context of DNA sequencing or bioinformatics, we refer to the condition where the suffix of one sequence matches the prefix of another sequence.
+ This scenario facilitates the sequential alignment and assembly of fragments into a longer, continuous sequence.
+
+ If the suffix of one sequence matches the suffix of another, we do not typically describe this as an overlap in the context of sequencing assembly or similar applications.
+ This is because such a match does not provide a way to extend the sequence by combining the two sequences end-to-end.
+ Instead, it indicates that both sequences end in the same way but does not necessarily provide a direct means of linking one sequence to the beginning of another to form a longer chain.
+
+ **Overlap Case** (Suffix of $x$ matches Prefix of $y$).
+
+ $$
+ x = 5' [G, C, G, T, A, C] 3'
+ $$
+
+ $$
+ y = 5' [T, A, C, A, C] 3'
+ $$
+
+ Here, if the suffix of $x$ (e.g., $[T, A, C]$) matches the prefix of $y$ (e.g., $[T, A, C]$), we can align and connect these sequences to form a longer sequence because there is a continuity that allows for extension.
+
+ ```text
+ 5'- G C G T A C -3'
+ | | |
+ 5'- T A C A C -3'
+ ```
+
+ **Non-Overlap Case** (Suffix of $x$ matches Suffix of $y$).
+
+ If the suffix of $x$ (e.g., $[A, C]$) matches the suffix of $y$ (e.g., $[A, C]$), there is no direct way to extend the sequence by concatenating $x$ and $y$ because both sequences end in the same manner.
+
+ ```text
+ 5'- G C G T A C -3'
+ X X X | |
+ 5'- T A C A C -3'
+ ```
+
+ This scenario doesn't contribute to the assembly of a longer sequence from fragments.
diff --git a/content/genomics/assembly/concepts/unresolved/_index.md b/content/genomics/assembly/concepts/unresolved/_index.md
new file mode 100644
index 0000000..4b5d27e
--- /dev/null
+++ b/content/genomics/assembly/concepts/unresolved/_index.md
@@ -0,0 +1,18 @@
+---
+title: Unresolvable repeats
+type: docs
+toc: false
+---
+
+
+> [!CAUTION]
+>
+> This page is a work in progress and is subject to change at any moment.
+
+Unresolvable repeats pose a significant hurdle in genome assembly, as they create ambiguity in determining the correct order and arrangement of DNA sequences.
+The key strategy employed to handle unresolvable repeats involves essentially leaving them out of the assembly.
+This approach is crucial for ensuring the accuracy and reliability of the assembled genome.
+
+Unresolvable repeats have a profound effect on the assembly process, leading to fragmentation of the genome.
+When faced with an unresolvable repeat, the assembler breaks the assembly at the repeat boundaries, producing shorter contigs on either side of it.
+This fragmentation is a necessary step to navigate the ambiguity introduced by repeats and ensures a more accurate reconstruction of the genome.
diff --git a/content/genomics/assembly/de-novo/_index.md b/content/genomics/assembly/de-novo/_index.md
new file mode 100644
index 0000000..04ee533
--- /dev/null
+++ b/content/genomics/assembly/de-novo/_index.md
@@ -0,0 +1,24 @@
+---
+title: De novo
+type: docs
+toc: false
+weight: 3
+---
+
+
+TODO:
+
+## Tools
+
+| Name | Main feature |
+| ---- | ------------ |
+| [ABySS2](https://github.com/bcgsc/abyss) | An assembler intended for SGS PE and linked reads. |
+| [ALLPATHS-LG](https://www.msi.umn.edu/sw/allpaths-lg) | Uses a unipath graph built from k-mer paths to collapse repeats. |
+| [MEGAHIT](https://github.com/voutcn/megahit) | An ultrafast and memory-efficient assembler for SGS reads. |
+| [SOAPdenovo](https://github.com/aquaskyline/SOAPdenovo2) | De Bruijn graph assembler with an error correction stage. |
+| [Falcon/HGAP](https://pb-falcon.readthedocs.io/en/latest/about.html) | Diploid-aware mode including trim, correction, and consensus for PacBio reads. |
+| [CANU](https://canu.readthedocs.io/en/latest/) | A fork of the Celera Assembler, including trim, correction, and consensus for TGS reads. |
+| [SMARTdenovo](https://github.com/ruanjue/smartdenovo) | De novo assembler including all-vs.-all raw read alignments without an error correction stage for TGS reads. |
+| [MECAT](https://github.com/xiaochuanle/MECAT2) | Ultrafast mapping, error correction, and de novo assembly tools for single-molecule sequencing reads. |
+| [Flye](https://github.com/fenderglass/Flye) | A repeat graph mode including trim, correction, and consensus with polishing for TGS reads. |
+| [Shasta](https://github.com/paoloshasta/shasta) | A run-length representation of ONT reads. |
diff --git a/content/genomics/assembly/de-novo/de-bruijn/_index.md b/content/genomics/assembly/de-novo/de-bruijn/_index.md
new file mode 100644
index 0000000..a41ecb9
--- /dev/null
+++ b/content/genomics/assembly/de-novo/de-bruijn/_index.md
@@ -0,0 +1,14 @@
+---
+title: de Bruijn
+type: docs
+toc: false
+---
+
+
+
+De Bruijn graphs are a powerful and intuitive way to represent overlaps between sequences of symbols, and they have found significant applications in bioinformatics, particularly in the assembly of genomes from nucleotide sequences.
+Here, we will explore what de Bruijn graphs are, how they are constructed from nucleotide sequences (specifically focusing on k-mers), and why they are so useful in the context of genome assembly.
+
+
+
+[^ekim2021minimizer]: Ekim, B., Berger, B., & Chikhi, R. (2021). Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in minutes on a personal computer. *Cell systems, 12*(10), 958-968. doi: [10.1016/j.cels.2021.08.009](https://doi.org/10.1016/j.cels.2021.08.009)
diff --git a/content/genomics/assembly/de-novo/de-bruijn/graphs/_index.md b/content/genomics/assembly/de-novo/de-bruijn/graphs/_index.md
new file mode 100644
index 0000000..ea9aa06
--- /dev/null
+++ b/content/genomics/assembly/de-novo/de-bruijn/graphs/_index.md
@@ -0,0 +1,16 @@
+---
+title: De Bruijn graphs
+type: docs
+toc: false
+---
+
+
+
+TODO:
+
+- Memory efficiency for large genomes
+- Ability to handle high-throughput short read data
+- Implicit representation of repeats
+- Scalability and parallelization potential
+
+
diff --git a/content/genomics/assembly/de-novo/de-bruijn/graphs/concepts/index.md b/content/genomics/assembly/de-novo/de-bruijn/graphs/concepts/index.md
new file mode 100644
index 0000000..eca032f
--- /dev/null
+++ b/content/genomics/assembly/de-novo/de-bruijn/graphs/concepts/index.md
@@ -0,0 +1,18 @@
+---
+title: Concepts
+type: docs
+toc: false
+---
+
+
+
+> [!CAUTION]
+>
+> This page is a work in progress.
+
+- Mathematical background of de Bruijn sequences
+- K-mers: definition and significance in genomics
+- Graph theory basics: nodes, edges, and paths
+- Formal definition of a de Bruijn graph in the context of genome assembly
+
+
diff --git a/content/genomics/assembly/de-novo/de-bruijn/graphs/construction/_index.md b/content/genomics/assembly/de-novo/de-bruijn/graphs/construction/_index.md
new file mode 100644
index 0000000..3748e4e
--- /dev/null
+++ b/content/genomics/assembly/de-novo/de-bruijn/graphs/construction/_index.md
@@ -0,0 +1,24 @@
+---
+title: Construction
+type: docs
+toc: false
+---
+
+
+
+> [!CAUTION]
+>
+> This page is a work in progress.
+
+- Preprocessing of sequencing reads
+ - Quality filtering and trimming
+ - Error correction techniques
+- K-mer extraction from reads
+- Creating nodes and edges
+ - Nodes as (k-1)-mers
+ - Edges as k-mers
+- Handling of overlaps between k-mers
+- Directionality in the graph (dealing with double-stranded DNA)
+- Time and space complexity of graph construction
+
+
diff --git a/content/genomics/assembly/de-novo/de-bruijn/graphs/contig/_index.md b/content/genomics/assembly/de-novo/de-bruijn/graphs/contig/_index.md
new file mode 100644
index 0000000..cb4cac3
--- /dev/null
+++ b/content/genomics/assembly/de-novo/de-bruijn/graphs/contig/_index.md
@@ -0,0 +1,24 @@
+---
+title: Contig extraction
+type: docs
+toc: false
+---
+
+
+> [!CAUTION]
+>
+> This page is a work in progress.
+
+Contig extraction is a crucial step in the genome assembly process, where the assembler aims to reconstruct longer, contiguous sequences (contigs) from the complex structure of the de Bruijn graph.
+This process involves identifying paths through the graph that likely represent true genomic sequences, while avoiding artifacts introduced by sequencing errors, repeats, and other complexities.
+In this section, we'll explore the general principles and techniques used by most de Bruijn graph assemblers for contig extraction.
+
+## Basic Concepts
+
+Before diving into the specifics of contig extraction, it's important to review some fundamental concepts:
+
+- **De Bruijn Graph:** A directed graph where nodes represent (k-1)-mers and edges represent k-mers. The graph structure captures the overlaps between adjacent k-mers in the sequencing reads.
+- **Contig:** A contiguous sequence assembled from overlapping reads, representing a portion of the genome without gaps.
+- **Path:** A sequence of connected nodes and edges in the graph.
+- **Coverage:** The number of reads supporting a particular k-mer or edge in the graph.
+- **Branching:** Points in the graph where a node has multiple incoming or outgoing edges, often representing repeats or errors.
+
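+As a toy illustration of these definitions, the sketch below builds a small de Bruijn graph from a couple of reads; the reads and the choice of $k = 4$ are arbitrary, and real assemblers use far more compact data structures.
+
+```python
+from collections import defaultdict
+
+
+def de_bruijn_graph(reads, k):
+    """Map each (k-1)-mer node to the list of (k-1)-mer nodes it points to.
+
+    Every k-mer in a read contributes one edge: prefix (k-1)-mer -> suffix
+    (k-1)-mer.  Repeated entries in a list double as a crude edge coverage.
+    """
+    graph = defaultdict(list)
+    for read in reads:
+        for i in range(len(read) - k + 1):
+            kmer = read[i:i + k]
+            graph[kmer[:-1]].append(kmer[1:])
+    return graph
+
+
+reads = ["ACGTACGT", "GTACGTGA"]  # toy reads, chosen only for illustration
+for node, neighbors in sorted(de_bruijn_graph(reads, k=4).items()):
+    print(node, "->", neighbors)
+```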
+
diff --git a/content/genomics/assembly/de-novo/de-bruijn/graphs/contig/ambiguities/_index.md b/content/genomics/assembly/de-novo/de-bruijn/graphs/contig/ambiguities/_index.md
new file mode 100644
index 0000000..9897c1a
--- /dev/null
+++ b/content/genomics/assembly/de-novo/de-bruijn/graphs/contig/ambiguities/_index.md
@@ -0,0 +1,18 @@
+---
+title: Ambiguities
+type: docs
+toc: false
+---
+
+
+> [!CAUTION]
+>
+> This page is a work in progress.
+
+Real genomic data often leads to complex graph structures due to repeats, errors, and coverage variations.
+Assemblers employ various strategies to resolve these ambiguities:
+
+- **Coverage-based resolution:** Differences in coverage can help distinguish between true genomic paths and error-induced branches. Higher-coverage paths are generally preferred.
+- **Paired-end information:** Many assemblers use paired-end read information to resolve ambiguities. If two regions are connected by multiple paired-end reads, it suggests they should be part of the same contig.
+- **Bubble removal:** "Bubbles" in the graph (alternative paths between two nodes) often represent small variations or sequencing errors. Assemblers typically collapse these to a single path, often choosing the higher-coverage alternative.
+- **Repeat resolution:** Long repeats create complex branching structures. Some assemblers attempt to resolve these using coverage patterns, paired-end information, or by breaking contigs at repeat boundaries.
+
+
diff --git a/content/genomics/assembly/de-novo/de-bruijn/graphs/contig/paths/_index.md b/content/genomics/assembly/de-novo/de-bruijn/graphs/contig/paths/_index.md
new file mode 100644
index 0000000..e69614c
--- /dev/null
+++ b/content/genomics/assembly/de-novo/de-bruijn/graphs/contig/paths/_index.md
@@ -0,0 +1,19 @@
+---
+title: Simple paths
+type: docs
+toc: false
+---
+
+
+
+> [!CAUTION]
+>
+> This page is a work in progress.
+
+The first step in contig extraction is typically to identify simple, unbranched paths in the de Bruijn graph.
+These paths represent straightforward sections of the genome where there is no ambiguity in the assembly.
+
+- Linear Paths: Sequences of nodes where each internal node has exactly one incoming and one outgoing edge. These are the easiest to extract and often form the backbone of many contigs.
+- Guaranteed Contigs: In some cases, linear paths with high, consistent coverage can be immediately extracted as contigs, as they likely represent unique genomic regions.
+- Path Compression: Many assemblers perform path compression, where linear paths are collapsed into single nodes, simplifying the graph structure and reducing computational complexity for subsequent steps.
+
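+A minimal sketch of extracting such linear paths, assuming the graph is stored as a plain dictionary of adjacency lists (an illustrative structure, not any assembler's internal representation), could look like this:
+
+```python
+from collections import defaultdict
+
+
+def unbranched_paths(graph):
+    """Extract maximal linear (unbranched) paths from a directed graph.
+
+    `graph` maps each node to a list of successor nodes.  A node is "simple"
+    if it has exactly one incoming and one outgoing edge; paths are only
+    extended through simple nodes.  Coverage, tips, and cycles are ignored.
+    """
+    indeg = defaultdict(int)
+    for node, successors in graph.items():
+        for succ in successors:
+            indeg[succ] += 1
+
+    def is_simple(node):
+        return indeg[node] == 1 and len(graph.get(node, [])) == 1
+
+    paths = []
+    for node in graph:
+        if is_simple(node):
+            continue  # paths start at branching or terminal nodes
+        for succ in graph[node]:
+            path = [node, succ]
+            while is_simple(path[-1]):
+                path.append(graph[path[-1]][0])
+            paths.append(path)
+    return paths
+
+
+# Toy graph: AGT -> GTA -> TAC, which then branches into ACG and ACC.
+graph = {"AGT": ["GTA"], "GTA": ["TAC"], "TAC": ["ACG", "ACC"]}
+print(unbranched_paths(graph))
+```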
+
diff --git a/content/genomics/assembly/de-novo/de-bruijn/graphs/contig/quality/_index.md b/content/genomics/assembly/de-novo/de-bruijn/graphs/contig/quality/_index.md
new file mode 100644
index 0000000..a333d42
--- /dev/null
+++ b/content/genomics/assembly/de-novo/de-bruijn/graphs/contig/quality/_index.md
@@ -0,0 +1,18 @@
+---
+title: Quality assessment
+type: docs
+toc: false
+---
+
+
+> [!CAUTION]
+>
+> This page is a work in progress.
+
+After initial contig extraction, many assemblers perform additional steps to improve contig quality:
+
+- **Error correction:** Examine the extracted contigs for remaining sequencing errors, often using the consensus of multiple reads.
+- **Contig breaking:** Identify potential misassemblies within contigs, often by looking for sudden drops in coverage or conflicting paired-end information.
+- **Contig extension:** Attempt to extend contigs at their ends, possibly using lower-confidence graph traversals or additional sequencing data.
+- **Contig merging:** Identify and merge contigs that likely overlap but weren't connected in the initial extraction due to graph complexities.
+
+
diff --git a/content/genomics/assembly/de-novo/de-bruijn/graphs/contig/traversal/_index.md b/content/genomics/assembly/de-novo/de-bruijn/graphs/contig/traversal/_index.md
new file mode 100644
index 0000000..66516cd
--- /dev/null
+++ b/content/genomics/assembly/de-novo/de-bruijn/graphs/contig/traversal/_index.md
@@ -0,0 +1,15 @@
+---
+title: Graph traversal
+type: docs
+toc: false
+---
+
+
+
+The core of contig extraction often involves sophisticated graph traversal algorithms.
+These algorithms aim to find paths through the graph that likely represent true genomic sequences.
+
+- [Greedy](./greedy/)
+- [Depth-first search](./dfs/)
+- [Breadth-first search](./bfs/)
+- [Maximum flow](./max-flow/)
diff --git a/content/genomics/assembly/de-novo/de-bruijn/graphs/contig/traversal/bfs/_index.md b/content/genomics/assembly/de-novo/de-bruijn/graphs/contig/traversal/bfs/_index.md
new file mode 100644
index 0000000..111b347
--- /dev/null
+++ b/content/genomics/assembly/de-novo/de-bruijn/graphs/contig/traversal/bfs/_index.md
@@ -0,0 +1,14 @@
+---
+title: Breadth-first search
+type: docs
+toc: false
+---
+
+
+> [!CAUTION]
+>
+> This page is a work in progress.
+
+Breadth-first search explores all candidate paths in parallel, level by level.
+This can be more thorough than a greedy or depth-first strategy but is computationally intensive for large graphs.
+
+
diff --git a/content/genomics/assembly/de-novo/de-bruijn/graphs/contig/traversal/dfs/_index.md b/content/genomics/assembly/de-novo/de-bruijn/graphs/contig/traversal/dfs/_index.md
new file mode 100644
index 0000000..9705d87
--- /dev/null
+++ b/content/genomics/assembly/de-novo/de-bruijn/graphs/contig/traversal/dfs/_index.md
@@ -0,0 +1,16 @@
+---
+title: Depth-first search
+type: docs
+toc: false
+---
+
+
+
+
+> [!CAUTION]
+>
+> This page is a work in progress.
+
+Depth-first search explores each path deeply before backtracking.
+This can be effective for finding long contigs but may struggle with complex repeat structures.
+
+
diff --git a/content/genomics/assembly/de-novo/de-bruijn/graphs/contig/traversal/greedy/_index.md b/content/genomics/assembly/de-novo/de-bruijn/graphs/contig/traversal/greedy/_index.md
new file mode 100644
index 0000000..925445f
--- /dev/null
+++ b/content/genomics/assembly/de-novo/de-bruijn/graphs/contig/traversal/greedy/_index.md
@@ -0,0 +1,15 @@
+---
+title: Greedy traversal
+type: docs
+toc: false
+---
+
+
+
+> [!CAUTION]
+>
+> This page is a work in progress.
+
+At each branching point, a greedy traversal chooses the highest-coverage edge.
+This simple approach can be effective but may lead to misassemblies in complex regions.
+
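+A minimal sketch of that rule, assuming the graph is stored as adjacency lists plus a per-edge coverage dictionary (both hypothetical structures used only for illustration):
+
+```python
+def greedy_next(node, graph, coverage):
+    """Pick the successor reached through the highest-coverage outgoing edge."""
+    successors = graph.get(node, [])
+    if not successors:
+        return None
+    return max(successors, key=lambda succ: coverage[(node, succ)])
+
+
+# Toy branching point: ATG has two outgoing edges with different coverage.
+graph = {"ATG": ["TGC", "TGA"]}
+coverage = {("ATG", "TGC"): 12, ("ATG", "TGA"): 3}
+print(greedy_next("ATG", graph, coverage))  # TGC
+```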
+
diff --git a/content/genomics/assembly/de-novo/de-bruijn/graphs/contig/traversal/max-flow/_index.md b/content/genomics/assembly/de-novo/de-bruijn/graphs/contig/traversal/max-flow/_index.md
new file mode 100644
index 0000000..a97e9f3
--- /dev/null
+++ b/content/genomics/assembly/de-novo/de-bruijn/graphs/contig/traversal/max-flow/_index.md
@@ -0,0 +1,15 @@
+---
+title: Maximum flow
+type: docs
+toc: false
+---
+
+
+
+> [!CAUTION]
+>
+> This page is a work in progress.
+
+Maximum-flow formulations can be used to find multiple paths through the graph simultaneously, potentially resolving some complex repeat structures.
+
+
diff --git a/content/genomics/assembly/de-novo/de-bruijn/graphs/errors/_index.md b/content/genomics/assembly/de-novo/de-bruijn/graphs/errors/_index.md
new file mode 100644
index 0000000..3280b26
--- /dev/null
+++ b/content/genomics/assembly/de-novo/de-bruijn/graphs/errors/_index.md
@@ -0,0 +1,20 @@
+---
+title: Errors
+type: docs
+toc: false
+---
+
+
+
+
+TODO:
+
+- Impact of sequencing errors on graph structure
+ - Tips
+ - Bubbles
+ - Chimeric connections
+- Error correction strategies in the graph
+ - Coverage-based methods
+ - Topological approaches
+
+
diff --git a/content/genomics/assembly/de-novo/de-bruijn/graphs/repeats/_index.md b/content/genomics/assembly/de-novo/de-bruijn/graphs/repeats/_index.md
new file mode 100644
index 0000000..da68f55
--- /dev/null
+++ b/content/genomics/assembly/de-novo/de-bruijn/graphs/repeats/_index.md
@@ -0,0 +1,16 @@
+---
+title: Repeats
+type: docs
+toc: false
+---
+
+
+
+TODO:
+
+- Repeat identification and resolution
+ - Branching patterns in the graph
+ - Use of paired-end information
+ - Coverage-based repeat detection
+
+
diff --git a/content/genomics/assembly/de-novo/de-bruijn/graphs/simplify/_index.md b/content/genomics/assembly/de-novo/de-bruijn/graphs/simplify/_index.md
new file mode 100644
index 0000000..ed64130
--- /dev/null
+++ b/content/genomics/assembly/de-novo/de-bruijn/graphs/simplify/_index.md
@@ -0,0 +1,16 @@
+---
+title: Simplification
+type: docs
+toc: false
+---
+
+
+
+TODO:
+
+- Strategies for simplifying the graph
+ - Tip clipping
+ - Bubble popping
+ - Path compression
+
+
diff --git a/content/genomics/assembly/de-novo/greedy/_index.md b/content/genomics/assembly/de-novo/greedy/_index.md
new file mode 100644
index 0000000..952069f
--- /dev/null
+++ b/content/genomics/assembly/de-novo/greedy/_index.md
@@ -0,0 +1,215 @@
+---
+title: Greedy algorithms
+type: docs
+toc: false
+---
+
+
+
+Greedy algorithms are a class of algorithms in computer science that make the most optimal choice at each step as they work towards finding the overall optimal solution to a problem.
+The "greediness" of these algorithms comes from their approach to make the locally optimal choice at each step with the hope that these local solutions will lead to a globally optimal solution.
+Greedy algorithms are quite powerful in solving a wide range of problems, including those in computational biology, due to their efficiency and simplicity.
+
+## Overview
+
+The de novo assembly process using a greedy algorithm typically consists of several vital steps.
+First, pairwise distances between reads are calculated, assessing the degree of overlap between sequences.
+This information is then utilized to cluster reads with the greatest overlap, forming groups of potentially contiguous genomic regions.
+Subsequently, reads are assembled within each cluster into longer contiguous sequences, or contigs, based on their overlapping regions.
+This assembly step is crucial for piecing together the fragmented genomic information in the original short reads.
+
+The process is iterative, as the algorithm refines the assembly by repeatedly calculating distances, clustering, and assembling reads.
+However, by nature, greedy algorithms may settle on locally optimal solutions and struggle to reach a global optimum.
+Additionally, repetitive sequences in the genome pose a challenge, as they can lead to ambiguities and errors in the assembly.
+
+## Preliminaries
+
+Let's assume that we have some string, $x$, that has $l$ symbols from the set $\{A, C, G, T\}$.
+For example,
+
+$$
+x = [A, A, C, T, G, C, G].
+$$
+
+A substring of $x$ is a continuous interval of the symbols in $x$, and it is denoted as
+
+$$
+x[i:j] = (x[i], x[i + 1], \ldots, x[j - 1]).
+$$
+
+This looks very similar to slicing Python lists!
+For example,
+
+```python
+dna_seq = ["A", "A", "C", "T", "G", "C", "G"]
+```
+
+if we wanted the `CTG` substring we would use
+
+```python
+dna_seq[2:5]
+```
+
+!!! info
+
+    We are going to assume that $i$ starts at zero and that the end index $j$ is excluded, to be consistent with Python indexing.
+ Most mathematical formulations start at 1.
+
+### Prefix
+
+When the first index of the substring is `0`, we give this a special name called the prefix.
+For example, `dna_seq[0:3]` and `dna_seq[0:2]` would be valid prefixes; the only thing that is changing is the length of the substring.
+We will refer to this as $x_l$.
+
+### Suffix
+
+A suffix is when we slice a portion from the end; for example, `dna_seq[2:l]` would give us `["C", "T", "G", "C", "G"]` if we assume that `l` is the total length of `dna_seq`.
+We will refer to this as $x^l$.
+
+One way to keep track of whether $x_l$ or $x^l$ is the prefix or suffix substring is to imagine our DNA sequence written vertically from beginning (bottom) to end (top).
+Thus, $x_l$ would be near the bottom of $x$, so that is the prefix substring.
+$x^l$ would be near the top of $x$, so that is the suffix substring.
+
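+In Python slicing terms, continuing the `dna_seq` example from above:
+
+```python
+dna_seq = ["A", "A", "C", "T", "G", "C", "G"]
+l = len(dna_seq)
+
+prefix = dna_seq[0:3]  # a prefix x_l: ["A", "A", "C"]
+suffix = dna_seq[2:l]  # a suffix x^l: ["C", "T", "G", "C", "G"]
+```
+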
+## Overlap
+
+In the context of DNA sequences, an overlap occurs when the suffix of one sequence ($x^l$) is identical to the prefix of another sequence ($y_l$).
+This means that the end portion of the first sequence ($x$) matches the starting portion of the second sequence ($y$).
+We can arbitrarily select each sequence to be $x$ or $y$, so an overlap could also be $x_l$ and $y^l$.
+
+## Greedy algorithm
+
+Tarhio and Ukkonen[^tarhio1988greedy] developed an algorithm for solving the [shortest common superstring problem](../../#shortest-common-superstring).
+They state that the "greedy" approximation is to:
+
+> find and remove two strings in $R$ which have the longest mutual overlap amongst all possible pairs in $R$.
+> Then form the overlapped string from the removed two strings and replace it back in $R$.
+> Repeat this until there is only one string in $R$ or no two strings have a nonempty overlap.
+
+The greedy algorithm for the shortest common superstring problem operates under a simple yet effective heuristic: repeatedly merge the pair of strings with the maximum overlap until no more merges are possible.
+Here's a step-by-step breakdown:
+
+1. **Initialization:** Begin with a set of strings $R$.
+2. **Find the Longest Overlap:** At each step, examine all possible pairs of strings in $R$ to identify the two strings with the longest mutual overlap. The overlap between two strings $a$ and $b$ is the longest suffix of $a$ that matches a prefix of $b$ or vice versa.
+3. **Merge Strings:** Once the pair with the longest mutual overlap is found, merge these two strings into a single string. This merging involves combining the two strings into one, where the overlap is not duplicated but instead used to connect the strings.
+4. **Repeat or Terminate:** Replace the original two strings in $R$ with the newly formed string. If $R$ now contains only one string, or if no pairs of strings have a nonempty overlap, the process terminates. Otherwise, return to step 2 and repeat.
+5. **Result:** The algorithm ends with a single string that represents the shortest common superstring of the original set $R$, according to the greedy criterion of maximizing overlaps at each step.
+
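+The steps above can be written as a short Python sketch; this is a naive illustration rather than a production assembler, and it breaks ties by simply keeping the first best pair found.
+
+```python
+from itertools import permutations
+
+
+def overlap(a: str, b: str) -> int:
+    """Length of the longest suffix of `a` that matches a prefix of `b`."""
+    for length in range(min(len(a), len(b)), 0, -1):
+        if a.endswith(b[:length]):
+            return length
+    return 0
+
+
+def greedy_scs(strings):
+    """Greedy shortest-common-superstring approximation."""
+    strings = list(strings)
+    while len(strings) > 1:
+        best_len, best_pair = 0, None
+        for a, b in permutations(strings, 2):
+            length = overlap(a, b)
+            if length > best_len:          # first-found tie-breaking
+                best_len, best_pair = length, (a, b)
+        if best_pair is None:              # no overlaps left: concatenate the rest
+            return "".join(strings)
+        a, b = best_pair
+        strings.remove(a)
+        strings.remove(b)
+        strings.append(a + b[best_len:])   # merge without duplicating the overlap
+    return strings[0]
+
+
+print(greedy_scs(["AGT", "GTAC", "ACCA", "CAG", "GAC"]))
+```
+
+Because ties are broken arbitrarily here, the sketch can return a different (and sometimes shorter or longer) superstring than a hand-worked run of the same algorithm, which is exactly the issue discussed next.
+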
+### Breaking ties
+
+What if you encounter two or more pairs of strings with overlaps of the same length that also result in superstrings of the same size when merged?
+The choice between these overlaps can be somewhat arbitrary from the algorithm's perspective.
+The greedy algorithm prioritizes maximizing the overlap length at each step to approximate the shortest common superstring.
+Still, it doesn't inherently prioritize among equally maximal overlaps based on any further criteria.
+
+In cases where two overlaps are of the same amount and merging them results in superstrings of the same length, you can consider the following approaches:
+
+- **First Found, First Merged:** Simply merge the first pair you identify with the maximal overlap.
+ This approach is straightforward and keeps the algorithm moving but doesn't attempt to optimize beyond the overlap length.
+- **Lexicographical Order:** Some implementations might choose between equivalent overlaps based on lexicographical order or another deterministic rule.
+ This can help ensure consistency in results, especially when the algorithm is run multiple times or in parallel.
+- **Random Selection:** Randomly selecting among the pairs with maximal overlaps is another approach.
+ However, it introduces nondeterminism into the algorithm, potentially leading to different run results.
+- **Consider Future Overlap Potential:** While not a part of the basic greedy algorithm, one could anticipate which merge might lead to longer overlaps in subsequent steps.
+ However, this approach significantly complicates the algorithm, moving away from its greedy nature and requiring lookahead, which may not be feasible or efficient in practice.
+
+In the context of the greedy algorithm for sequence assembly or similar applications, the specific choice among equally good overlaps might not significantly impact the overall goal of constructing a superstring efficiently.
+The primary concern is creating a superstring that includes all sequences rather than finding the optimal superstring, which is a more complex problem.
+
+## Example
+
+Suppose we have the following set of strings $R$ = {`AGT`, `GTAC`, `ACCA`, `CAG`, `GAC`}.
+At each step, we find and merge the pair of strings with the longest suffix-to-prefix overlap.
+
+First, let's identify all possible overlaps.
+Overlaps should be considered based on the suffix of one string matching the prefix of another.
+
+- `GTAC` + `ACCA`: Overlap is `AC` and would merge into `GTACCA`.
+- `AGT` + `GTAC`: Overlap is `GT` and would merge into `AGTAC`.
+- `CAG` + `GAC`: Overlap is `G` and would merge into `CAGAC`.
+- `GAC` + `CAG`: Overlap is `C` and would merge into `GACAG`.
+- `ACCA` + `AGT`: Overlap is `A` and would merge into `ACCAGT`.
+
+Two of these pairs have the same overlap size of two nucleotides.
+The first one, `GTAC` + `ACCA`, would result in a longer merged sequence, so we choose it.
+Now we have {`GTACCA`, `AGT`, `CAG`, `GAC`}.
+Possible overlaps would be:
+
+- `AGT` + `GTACCA`: Overlap is `GT` and would merge to `AGTACCA`.
+- `GTACCA` + `CAG`: Overlap is `CA` and would merge to `GTACCAG`.
+- `GTACCA` + `AGT`: Overlap is `A` and would merge to `GTACCAGT`.
+- `CAG` + `GTACCA`: Overlap is `G` and would merge to `CAGTACCA`.
+- `CAG` + `GAC`: Overlap is `G` and would merge to `CAGAC`.
+- `GAC` + `CAG`: Overlap is `C` and would merge to `GACAG`.
+
+Merging `AGT` + `GTACCA` and merging `GTACCA` + `CAG` both have an overlap of two nucleotides, and the resulting merged sequences have the same length.
+Which do you choose?
+It depends on the type of greedy algorithm you are using, but picking one at random is often the simplest.
+Let's see what happens based on our choice.
+
+**A. Merge `AGT` + `GTACCA`**
+
+{`AGTACCA`, `CAG`, `GAC`}
+
+- `CAG` + `AGTACCA`: Overlap is `AG` and would merge to `CAGTACCA`.
+- `AGTACCA` + `CAG`: Overlap is `CA` and would merge to `AGTACCAG`.
+- `CAG` + `GAC`: Overlap is `G` and would merge to `CAGAC`.
+- `GAC` + `CAG`: Overlap is `C` and would merge to `GACAG`.
+
+**B. Merge `GTACCA` + `CAG`**
+
+{`GTACCAG`, `AGT`, `GAC`}
+
+- `AGT` + `GTACCAG`: Overlap is `GT` and would merge to `AGTACCAG`.
+- `GTACCAG` + `AGT`: Overlap is `AG` and would merge to `GTACCAGT`.
+- `GTACCAG` + `GAC`: Overlap is `G` and would merge to `GTACCAGAC`.
+
+Making choice **A** would provide more possible merges than **B**.
+Also, the top choices result in three different possible sequences: `CAGTACCA`, `AGTACCAG`, and `GTACCAGT`.
+
+**A1. Merge `CAG` + `AGTACCA`**
+
+{`CAGTACCA`, `GAC`}
+
+- `GAC` + `CAGTACCA`: Overlap is `C` and would merge to `GACAGTACCA`.
+
+**A2. Merge `AGTACCA` + `CAG`**
+
+{`AGTACCAG`, `GAC`}
+
+- `AGTACCAG` + `GAC`: Overlap is `G` and would merge to `AGTACCAGAC`.
+
+**B1. Merge `AGT` + `GTACCAG`**
+
+{`AGTACCAG`, `GAC`}
+
+- `AGTACCAG` + `GAC`: Overlap is `G` and would merge to `AGTACCAGAC`.
+
+**B2. Merge `GTACCAG` + `AGT`**
+
+{`GTACCAGT`, `GAC`}
+
+- No merge possible.
+
+We see that if we merge `GTACCAG` + `AGT` (**B2**), there is no way to combine the last string `GAC`; thus, this is not a valid move.
+If we merged `AGT` + `GTACCAG` (**B1**) we would get `AGTACCAGAC`, the same as move **A2** with `AGTACCA` + `CAG`.
+However, **A1** gives us a completely different option of `GACAGTACCA`.
+
+Which one is correct?
+They are both equally valid as we followed all algorithm rules, and the length of the final string is 10 for both.
+
+## Characteristics and Limitations
+
+The greedy algorithm is efficient in terms of computation, especially compared to exhaustive search methods. It's suitable for practical applications where exact solutions are not necessary.
+
+While the algorithm is fast, it does not guarantee an optimal solution. The shortest common superstring it finds is an approximation of the true shortest superstring.
+
+Despite its limitations, this algorithm is widely used in bioinformatics for tasks like genome assembly, where sequences are merged based on overlaps to reconstruct the original genome.
+
+The greedy algorithm's simplicity and efficiency make it a powerful tool in scenarios where an approximate solution is sufficient and time is of the essence.
+Its application to problems like genome assembly underscores its utility in handling complex, real-world problems where exact solutions are impractical to compute.
+
+
+
+[^huson2001greedy]: Huson, D. H., Reinert, K., & Myers, E. (2001, April). The greedy path-merging algorithm for sequence assembly. In Proceedings of the fifth annual international conference on Computational biology (pp. 157-163). doi: [10.1145/369133.369190](https://doi.org/10.1145/369133.369190)
+[^shomorony2016information]: Shomorony, I., Kim, S. H., Courtade, T. A., & Tse, D. N. (2016). Information-optimal genome assembly via sparse read-overlap graphs. *Bioinformatics, 32*(17), i494-i502. doi: [10.1093/bioinformatics/btw450](https://doi.org/10.1093/bioinformatics/btw450)
+[^bresler2013optimal]: Bresler, G., Bresler, M. A., & Tse, D. (2013, April). Optimal assembly for high throughput shotgun sequencing. In BMC bioinformatics (Vol. 14, No. 5, pp. 1-13). BioMed Central. doi: [10.1186/1471-2105-14-S5-S18](https://doi.org/10.1186/1471-2105-14-S5-S18)
+[^tarhio1988greedy]: Tarhio, J., & Ukkonen, E. (1988). A greedy approximation algorithm for constructing shortest common superstrings. *Theoretical computer science, 57*(1), 131-145. doi: [10.1016/0304-3975(88)90167-3](https://doi.org/10.1016/0304-3975(88)90167-3)
diff --git a/content/genomics/assembly/de-novo/overlap-layout-consensus/_index.md b/content/genomics/assembly/de-novo/overlap-layout-consensus/_index.md
new file mode 100644
index 0000000..f9a92bf
--- /dev/null
+++ b/content/genomics/assembly/de-novo/overlap-layout-consensus/_index.md
@@ -0,0 +1,25 @@
+---
+title: Overlap-layout-consensus
+type: docs
+toc: false
+---
+
+
+
+The Overlap-Layout-Consensus (OLC) method is a fundamental technique used in genome assembly, which is essentially a three-step process.
+
+1. **Overlap**: The initial step involves identifying overlaps among all the reads.
+ This is akin to finding common sections among different fragments of a puzzle.
+2. **Layout**: Once overlaps are established, the OLC algorithm arranges all the reads in a specific order that best represents their overlaps, forming a graph.
+3. **Consensus**: The final step involves deriving a consensus sequence from the multiple sequence alignments (MSA).
+ This consensus sequence is a representation of the most likely arrangement of the reads.
+
+It's important to note that the OLC approach is more suitable for low-coverage long reads, whereas other methods like the de Bruijn graph (DBG) approach are more suitable for high-coverage short reads, especially for large genome assembly.
+The choice of method depends on the specific requirements and constraints of the genome assembly project.
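+
+As a toy illustration of the consensus step, the sketch below takes reads that have already been laid out and padded with `-` and calls the most common base at each column; real OLC assemblers weight bases by quality scores and handle insertions and deletions far more carefully.
+
+```python
+from collections import Counter
+
+
+def consensus(aligned_reads):
+    """Majority-vote consensus over a gap-padded multiple sequence alignment."""
+    length = max(len(read) for read in aligned_reads)
+    padded = [read.ljust(length, "-") for read in aligned_reads]
+    result = []
+    for column in zip(*padded):
+        counts = Counter(base for base in column if base != "-")
+        result.append(counts.most_common(1)[0][0] if counts else "-")
+    return "".join(result)
+
+
+# Toy layout: reads already placed at their offsets by the layout step.
+layout = [
+    "ACGTAC---",
+    "---TACGTG",
+    "-----CGTG",
+]
+print(consensus(layout))  # ACGTACGTG
+```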
+
+
+
+[^kellisassembly]: [Libre texts](https://bio.libretexts.org/Bookshelves/Computational_Biology/Book%3A_Computational_Biology_-_Genomes_Networks_and_Evolution_%28Kellis_et_al.%29/05%3A_Genome_Assembly_and_Whole-Genome_Alignment/5.02%3A_Genome_Assembly_I-_Overlap-Layout-Consensus_Approach)
+[^severin2023bioinformatics]: [Introduction to Genome Assembly - Bioinformatics Workbook](https://bioinformaticsworkbook.org/dataAnalysis/GenomeAssembly/Intro_GenomeAssembly.html)
+[^wikiassembly]: [wikipedia.org/wiki/Sequence_assembly](https://en.wikipedia.org/wiki/Sequence_assembly)
+[^jung2020twelve]: Jung, H., Ventura, T., Chung, J. S., Kim, W. J., Nam, B. H., Kong, H. J., ... & Eyun, S. I. (2020). Twelve quick steps for genome assembly and annotation in the classroom. *PLoS computational biology, 16*(11), e1008325. doi: [10.1371/journal.pcbi.1008325](https://doi.org/10.1371/journal.pcbi.1008325)
diff --git a/content/genomics/assembly/qc/_index.md b/content/genomics/assembly/qc/_index.md
new file mode 100644
index 0000000..be39991
--- /dev/null
+++ b/content/genomics/assembly/qc/_index.md
@@ -0,0 +1,17 @@
+---
+title: Quality control
+type: docs
+toc: false
+weight: 1
+---
+
+
+
+HTS machines read thousands or millions of sequences in parallel.
+As you can imagine, this usually generates large `fastq` files, with millions of lines.
+Manually inspecting the quality of each read is out of the question.
+Specialized software has been developed to provide quality measures for `fastq` files generated by HTS machines.
+
+[FastQC](./fastqc/) is a popular program to generate quality reports on `fastq` data.
+In fact, this is usually the first thing you should do once you receive a new dataset.
+[FastQC](./fastqc/) reports provide a series of plots that allow the user to assess the overall quality of their raw data and detect potential biases and problems.
diff --git a/content/genomics/assembly/qc/fastp/.pages b/content/genomics/assembly/qc/fastp/.pages
new file mode 100644
index 0000000..44b9d3a
--- /dev/null
+++ b/content/genomics/assembly/qc/fastp/.pages
@@ -0,0 +1,2 @@
+nav:
+ - fastp: index.md
diff --git a/content/genomics/assembly/qc/fastp/fastp.html b/content/genomics/assembly/qc/fastp/fastp.html
new file mode 100644
index 0000000..dd82f36
--- /dev/null
+++ b/content/genomics/assembly/qc/fastp/fastp.html
@@ -0,0 +1,5054 @@
+fastp report at 2024-09-03 00:57:30
+
+
+
+
+
+
This estimation is based on paired-end overlap analysis, and there are 16.557590% reads found not overlapped. The nonoverlapped read pairs may have insert size <30 or >472, or contain too much sequencing errors to be detected as overlapped.
+
diff --git a/content/genomics/assembly/qc/fastp/index.md b/content/genomics/assembly/qc/fastp/index.md
new file mode 100644
index 0000000..ffffb70
--- /dev/null
+++ b/content/genomics/assembly/qc/fastp/index.md
@@ -0,0 +1,11 @@
+---
+title: fastp
+type: docs
+toc: false
+---
+
+TODO:
+
+**Code:** [github.com/OpenGene/fastp](https://github.com/OpenGene/fastp)
+
+**Reference:** Chen, S., Zhou, Y., Chen, Y., & Gu, J. (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. *Bioinformatics, 34*(17), i884-i890. DOI: [10.1093/bioinformatics/bty560](https://doi.org/10.1093/bioinformatics/bty560)
+
+Example analysis
diff --git a/content/genomics/assembly/qc/fastqc/_index.md b/content/genomics/assembly/qc/fastqc/_index.md
new file mode 100644
index 0000000..b9c0419
--- /dev/null
+++ b/content/genomics/assembly/qc/fastqc/_index.md
@@ -0,0 +1,16 @@
+---
+title: FastQC
+type: docs
+toc: false
+---
+
+
+
+TODO:
+
+See example forward and reverse FastQC reports.
+These reports contain a bunch of useful information, some of which we cover here.
+
+## Additional resources
+
+- [Galaxy training on quality control](https://gxy.io/GTN:T00239).
diff --git a/content/genomics/assembly/qc/fastqc/adapter.md b/content/genomics/assembly/qc/fastqc/adapter.md
new file mode 100644
index 0000000..bff446b
--- /dev/null
+++ b/content/genomics/assembly/qc/fastqc/adapter.md
@@ -0,0 +1,19 @@
+---
+title: Adapter content
+type: docs
+toc: false
+---
+
+
+
+The plot shows the cumulative percentage of reads with the different adapter sequences at each position.
+Once an adapter sequence is seen in a read it is counted as being present right through to the end of the read so the percentage increases with the read length.
+FastQC can detect some adapters by default (e.g. Illumina, Nextera), for others we could provide a contaminants file as an input to the FastQC tool.
+
+Ideally Illumina sequence data should not have any adapter sequence present.
+But with long reads, some of the library inserts are shorter than the read length resulting in read-through to the adapter at the 3’ end of the read.
+
+{{< tabs items="Forward,Reverse" >}}
+ {{< tab >}}![](img/adapter-forward.png){{< /tab >}}
+ {{< tab >}}![](img/adapter-reverse.png){{< /tab >}}
+{{< /tabs >}}
\ No newline at end of file
diff --git a/content/genomics/assembly/qc/fastqc/base-content.md b/content/genomics/assembly/qc/fastqc/base-content.md
new file mode 100644
index 0000000..f0a8428
--- /dev/null
+++ b/content/genomics/assembly/qc/fastqc/base-content.md
@@ -0,0 +1,38 @@
+---
+title: Per base sequence content
+type: docs
+toc: false
+---
+
+
+
+
+“Per Base Sequence Content” plots the percentage of each of the four nucleotides (T, C, A, G) at each position across all reads in the input sequence file.
+As for the per base sequence quality, the x-axis is non-uniform.
+
+> ![](img/per-base-seq.png)
+>
+
+In a random library we would expect that there would be little to no difference between the four bases.
+The proportion of each of the four bases should remain relatively constant over the length of the read with `%A=%T` and `%G=%C`, and the lines in this plot should run parallel with each other.
+
+## Biases
+
+It’s worth noting that some library types will always produce biased sequence composition, normally at the start of the read.
+
+=== "RNA-seq"
+
+    Libraries produced by priming using random hexamers (including nearly all RNA-Seq libraries), and those which were fragmented using transposases, will contain an intrinsic bias in the positions at which reads start (the first 10-12 bases).
+    This bias does not involve a specific sequence, but instead provides enrichment of a number of different K-mers at the 5’ end of the reads.
+    Whilst this is a true technical bias, it isn’t something which can be corrected by trimming, and in most cases doesn’t seem to adversely affect the downstream analysis.
+    It will, however, produce a warning or error in this module.
+
+ !!! quote "Figure"
+
+
+=== "ChIP-seq"
+ ChIP-seq data can also encounter read start sequence biases in this plot if fragmenting with transposases. With bisulphite converted data, e.g. HiC data, a separation of G from C and A from T is expected:
+
+ !!! quote "Figure"
+
diff --git a/content/genomics/assembly/qc/fastqc/base-n.md b/content/genomics/assembly/qc/fastqc/base-n.md
new file mode 100644
index 0000000..d50a523
--- /dev/null
+++ b/content/genomics/assembly/qc/fastqc/base-n.md
@@ -0,0 +1,17 @@
+---
+title: Per base N content
+type: docs
+toc: false
+---
+
+
+
+If a sequencer is unable to make a base call with sufficient confidence, it will write an `N` instead of a conventional base call.
+This plot displays the percentage of base calls at each position or bin for which an `N` was called.
+
+> ![](img/per-base-n.png)
+>
+
+It’s not unusual to see a very low proportion of `N`s appearing in a sequence, especially near the end of a sequence.
+But this curve should never rise noticeably above zero.
+If it does, this indicates a problem occurred during the sequencing run.
diff --git a/content/genomics/assembly/qc/fastqc/base-quality.md b/content/genomics/assembly/qc/fastqc/base-quality.md
new file mode 100644
index 0000000..7fb1ccd
--- /dev/null
+++ b/content/genomics/assembly/qc/fastqc/base-quality.md
@@ -0,0 +1,85 @@
+---
+title: Per base sequence quality
+type: docs
+toc: false
+---
+
+
+
+
+With FastQC we can use the per base sequence quality plot to check the base quality of the reads.
+On the x-axis is the base position in the read.
+In this example, the sample contains reads that are up to 251 bp long.
+
+=== "Forward"
+ !!! quote "Figure"
+
+=== "Reverse"
+ !!! quote "Figure"
+
+
+!!! important
+ The x-axis is not always uniform.
+
+ When you have long reads, some binning is applied to keep things compact.
+ We can see that in our sample.
+    It starts out with individual bases for positions 1-10.
+    After that, bases are binned into windows a certain number of bases wide.
+ Data binning means grouping and is a data pre-processing technique used to reduce the effects of minor observation errors.
+ The number of base positions binned together depends on the length of the read.
+ With reads >50bp, the latter part of the plot will report aggregate statistics for 5bp windows.
+ Shorter reads will have smaller windows and longer reads larger windows.
+ Binning can be removed when running FastQC by setting the parameter “Disable grouping of bases for reads >50bp” to Yes.
+
+For each position, a boxplot is drawn with:
+
+- the median value, represented by the central red line
+- the inter-quartile range (25-75%), represented by the yellow box
+- the 10% and 90% values in the upper and lower whiskers
+- the mean quality, represented by the blue line
+
+The y-axis shows the quality scores. The higher the score, the better the base call.
+The background of the graph divides the y-axis into very good quality scores, scores of reasonable quality, and reads of poor quality.
+
+It is normal with all Illumina sequencers for the median quality score to start out lower over the first 5-7 bases and to then rise.
+The quality of reads on most platforms will drop at the end of the read.
+This is often due to signal decay or phasing during the sequencing run.
+Recent developments in sequencing chemistry have improved this somewhat, but reads are now longer than ever.
+
+## Signal decay
+
+The fluorescent signal intensity decays with each cycle of the sequencing process.
+Due to the degrading fluorophores, a proportion of the strands in the cluster are not being elongated.
+The proportion of the signal being emitted continues to decrease with each cycle, leading to a decrease in quality scores at the 3’ end of the read.
+
+## Phasing
+
+The signal starts to blur as the number of cycles increases because the cluster loses synchronicity.
+As the cycles progress, some strands get random failures of nucleotides to incorporate due to:
+
+- Incomplete removal of the 3’ terminators and fluorophores
+- Incorporation of nucleotides without effective 3’ terminators
+
+This leads to a decrease in quality scores at the 3’ end of the read.
+
+## Overclustering
+
+Sequencing facilities can overcluster the flow cells.
+It results in small distances between clusters and an overlap in the signals.
+Two clusters can be interpreted as a single cluster with mixed fluorescent signals being detected, decreasing signal purity.
+It generates lower quality scores across the entire read.
+
+## Instrumentation breakdown
+
+Some issues can occasionally happen with the sequencing instruments during a run.
+Any sudden drop in quality or a large percentage of low quality reads across the read could indicate a problem at the facility.
+Some examples of such issues are shown below.
+
+> ![](https://training.galaxyproject.org/training-material/topics/sequence-analysis/images/quality-control/per_base_sequence_quality_manifold_burst.png)
+>
+
+> ![](https://training.galaxyproject.org/training-material/topics/sequence-analysis/images/quality-control/per_base_sequence_quality_cycle_loss.png)
+>
+
+> ![](https://training.galaxyproject.org/training-material/topics/sequence-analysis/images/quality-control/per_base_sequence_quality_read2_failure.png)
+>
diff --git a/content/genomics/assembly/qc/fastqc/duplication.md b/content/genomics/assembly/qc/fastqc/duplication.md
new file mode 100644
index 0000000..4d3930e
--- /dev/null
+++ b/content/genomics/assembly/qc/fastqc/duplication.md
@@ -0,0 +1,30 @@
+---
+title: Sequence duplication levels
+type: docs
+toc: false
+---
+
+
+The graph shows the percentage of reads of a given sequence in the file which are present a given number of times in the file:
+
+> ![](img/seq-duplication.png)
+>
+
+In a diverse library most sequences will occur only once in the final set.
+A low level of duplication may indicate a very high level of coverage of the target sequence, but a high level of duplication is more likely to indicate some kind of enrichment bias.
+Two sources of duplicate reads can be found:
+
+- PCR duplication in which library fragments have been over-represented due to biased PCR enrichment
+ It is a concern because PCR duplicates misrepresent the true proportion of sequences in the input.
+- Truly over-represented sequences such as very abundant transcripts in an RNA-Seq library or in amplicon data (like this sample)
+ It is an expected case and not of concern because it does faithfully represent the input.
+
+FastQC counts the degree of duplication for every sequence in a library and creates a plot showing the relative number of sequences with different degrees of duplication.
+
+For whole genome shotgun data it is expected that nearly 100% of your reads will be unique (appearing only 1 time in the sequence data).
+Most sequences should fall into the far left of the plot.
+This indicates a highly diverse library that was not over sequenced.
+If the sequencing depth is extremely high (e.g. > 100x the size of the genome) some inevitable sequence duplication can appear: there are in theory only a finite number of completely unique sequence reads which can be obtained from any given input DNA sample.
+
+More specific enrichments of subsets, or the presence of low complexity contaminants will tend to produce spikes towards the right of the plot.
+These high duplication peaks will most often appear in the blue trace as they make up a high proportion of the original library, but usually disappear in the red trace as they make up an insignificant proportion of the deduplicated set.
+If peaks persist in the red trace then this suggests that there are a large number of different highly duplicated sequences which might indicate either a contaminant set or a very severe technical duplication.
+
+This is usually the case for RNA sequencing, where there are some very highly abundant transcripts and some that are lowly abundant.
+It is expected that duplicate reads will be observed for high abundance transcripts.
diff --git a/content/genomics/assembly/qc/fastqc/gc.md b/content/genomics/assembly/qc/fastqc/gc.md
new file mode 100644
index 0000000..f7d2fd8
--- /dev/null
+++ b/content/genomics/assembly/qc/fastqc/gc.md
@@ -0,0 +1,22 @@
+---
+title: Per sequence GC content
+type: docs
+toc: false
+---
+
+
+
+
+This plot displays the number of reads vs. percentage of bases G and C per read.
+It is compared to a theoretical distribution assuming a uniform GC content for all reads, expected for whole genome shotgun sequencing, where the central peak corresponds to the overall GC content of the underlying genome.
+Since the GC content of the genome is not known, the modal GC content is calculated from the observed data and used to build a reference distribution.
+
+> ![](img/seq-gc.png)
+>
+
+An unusually-shaped distribution could indicate a contaminated library or some other kind of biased subset.
+A shifted normal distribution indicates some systematic bias, which is independent of base position.
+If there is a systematic bias which creates a shifted normal distribution then this won’t be flagged as an error by the module since it doesn’t know what your genome’s GC content should be.
+
+But there are also other situations in which an unusually-shaped distribution may occur.
+For example, with RNA sequencing there may be a greater or lesser distribution of mean GC content among transcripts causing the observed plot to be wider or narrower than an ideal normal distribution.
diff --git a/content/genomics/assembly/qc/fastqc/img/adapter-forward.png b/content/genomics/assembly/qc/fastqc/img/adapter-forward.png
new file mode 100644
index 0000000..d7831d4
Binary files /dev/null and b/content/genomics/assembly/qc/fastqc/img/adapter-forward.png differ
diff --git a/content/genomics/assembly/qc/fastqc/img/adapter-reverse.png b/content/genomics/assembly/qc/fastqc/img/adapter-reverse.png
new file mode 100644
index 0000000..e965624
Binary files /dev/null and b/content/genomics/assembly/qc/fastqc/img/adapter-reverse.png differ
diff --git a/content/genomics/assembly/qc/fastqc/img/per-base-n.png b/content/genomics/assembly/qc/fastqc/img/per-base-n.png
new file mode 100644
index 0000000..0d5972e
Binary files /dev/null and b/content/genomics/assembly/qc/fastqc/img/per-base-n.png differ
diff --git a/content/genomics/assembly/qc/fastqc/img/per-base-quality-forward.png b/content/genomics/assembly/qc/fastqc/img/per-base-quality-forward.png
new file mode 100644
index 0000000..a01c23b
Binary files /dev/null and b/content/genomics/assembly/qc/fastqc/img/per-base-quality-forward.png differ
diff --git a/content/genomics/assembly/qc/fastqc/img/per-base-quality-reverse.png b/content/genomics/assembly/qc/fastqc/img/per-base-quality-reverse.png
new file mode 100644
index 0000000..63e4f21
Binary files /dev/null and b/content/genomics/assembly/qc/fastqc/img/per-base-quality-reverse.png differ
diff --git a/content/genomics/assembly/qc/fastqc/img/per-base-seq.png b/content/genomics/assembly/qc/fastqc/img/per-base-seq.png
new file mode 100644
index 0000000..b5a188b
Binary files /dev/null and b/content/genomics/assembly/qc/fastqc/img/per-base-seq.png differ
diff --git a/content/genomics/assembly/qc/fastqc/img/seq-duplication.png b/content/genomics/assembly/qc/fastqc/img/seq-duplication.png
new file mode 100644
index 0000000..2ffadb8
Binary files /dev/null and b/content/genomics/assembly/qc/fastqc/img/seq-duplication.png differ
diff --git a/content/genomics/assembly/qc/fastqc/img/seq-gc.png b/content/genomics/assembly/qc/fastqc/img/seq-gc.png
new file mode 100644
index 0000000..87608c2
Binary files /dev/null and b/content/genomics/assembly/qc/fastqc/img/seq-gc.png differ
diff --git a/content/genomics/assembly/tools/_index.md b/content/genomics/assembly/tools/_index.md
new file mode 100644
index 0000000..965e8ba
--- /dev/null
+++ b/content/genomics/assembly/tools/_index.md
@@ -0,0 +1,8 @@
+---
+title: Tools
+type: docs
+toc: false
+weight: 4
+---
+
+
diff --git a/content/genomics/assembly/tools/canu/_index.md b/content/genomics/assembly/tools/canu/_index.md
new file mode 100644
index 0000000..47e8549
--- /dev/null
+++ b/content/genomics/assembly/tools/canu/_index.md
@@ -0,0 +1,13 @@
+---
+title: Canu
+type: docs
+toc: false
+---
+
+
+
+> [!CAUTION]
+>
+> This page is a work in progress.
+
+
diff --git a/content/genomics/assembly/tools/flye/_index.md b/content/genomics/assembly/tools/flye/_index.md
new file mode 100644
index 0000000..a03eb16
--- /dev/null
+++ b/content/genomics/assembly/tools/flye/_index.md
@@ -0,0 +1,13 @@
+---
+title: Flye
+type: docs
+toc: false
+---
+
+
+
+> [!CAUTION]
+>
+> This page is a work in progress.
+
+
diff --git a/content/genomics/assembly/tools/masurca/_index.md b/content/genomics/assembly/tools/masurca/_index.md
new file mode 100644
index 0000000..591b42d
--- /dev/null
+++ b/content/genomics/assembly/tools/masurca/_index.md
@@ -0,0 +1,12 @@
+---
+title: MaSurCA
+type: docs
+toc: false
+---
+
+
+> [!CAUTION]
+>
+> This page is a work in progress.
+
+
diff --git a/content/genomics/assembly/tools/miniasm/_index.md b/content/genomics/assembly/tools/miniasm/_index.md
new file mode 100644
index 0000000..dd005b8
--- /dev/null
+++ b/content/genomics/assembly/tools/miniasm/_index.md
@@ -0,0 +1,13 @@
+---
+title: Miniasm
+type: docs
+toc: false
+---
+
+
+
+> [!CAUTION]
+>
+> This page is a work in progress.
+
+
diff --git a/content/genomics/assembly/tools/spades/_index.md b/content/genomics/assembly/tools/spades/_index.md
new file mode 100644
index 0000000..4d78690
--- /dev/null
+++ b/content/genomics/assembly/tools/spades/_index.md
@@ -0,0 +1,39 @@
+---
+title: SPAdes
+type: docs
+toc: false
+---
+
+
+
+
+TODO:
+
+A. Background and Development
+ 1. Origin and creators
+ - Developed at St. Petersburg Academic University
+ - Led by Pavel Pevzner's team
+ 2. Initial release and purpose
+ - First released in 2012
+ - Designed to address challenges in single-cell and short-read sequencing
+B. Key Features and Innovations
+ 1. Versatility
+ - Suitable for both single-cell and standard (multicell) assembly
+ - Adaptable to various sequencing technologies
+ 2. Core algorithmic innovations
+ - Multisized de Bruijn graph approach
+ - Advanced error correction tailored for different data types
+ - Paired assembly graph for improved use of read-pair information
+ 3. Specialized capabilities
+ - Effective handling of uneven coverage (crucial for single-cell data)
+ - Hybrid assembly combining short and long reads
+ - Dedicated modes for metagenome, transcriptome, and plasmid assembly
+C. Impact and Adoption
+ 1. Widespread use in genomics research
+ 2. Continuous development and community support
+
+
+
+
+[^bankevich2012spades]: Bankevich, A., Nurk, S., Antipov, D., Gurevich, A. A., Dvorkin, M., Kulikov, A. S., ... & Pevzner, P. A. (2012). SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of computational biology, 19(5), 455-477. DOI: [10.1089/cmb.2012.0021](https://doi.org/10.1089/cmb.2012.0021)
+[^prjibelski2014exspander]: Prjibelski, A. D., Vasilinetc, I., Bankevich, A., Gurevich, A., Krivosheeva, T., Nurk, S., ... & Pevzner, P. A. (2014). ExSPAnder: a universal repeat resolver for DNA fragment assembly. Bioinformatics, 30(12), i293-i301. DOI: [10.1093/bioinformatics/btu266](https://doi.org/10.1093/bioinformatics/btu266)
diff --git a/content/genomics/assembly/tools/spades/construction/_index.md b/content/genomics/assembly/tools/spades/construction/_index.md
new file mode 100644
index 0000000..be2d81a
--- /dev/null
+++ b/content/genomics/assembly/tools/spades/construction/_index.md
@@ -0,0 +1,15 @@
+---
+title: Construction
+type: docs
+toc: false
+---
+
+
+TODO:
+
+Initial graph building
+Removal of erroneous edges (tips, bubbles)
+Dealing with chimeric connections
+Graph simplification techniques
+
+
diff --git a/content/genomics/assembly/tools/spades/contig/_index.md b/content/genomics/assembly/tools/spades/contig/_index.md
new file mode 100644
index 0000000..6572082
--- /dev/null
+++ b/content/genomics/assembly/tools/spades/contig/_index.md
@@ -0,0 +1,14 @@
+---
+title: Contig generation
+type: docs
+toc: false
+---
+
+
+
+TODO:
+
+Path traversal in the assembly graph
+Criteria for selecting high-quality contigs
+
+
diff --git a/content/genomics/assembly/tools/spades/error/_index.md b/content/genomics/assembly/tools/spades/error/_index.md
new file mode 100644
index 0000000..1d0f4f8
--- /dev/null
+++ b/content/genomics/assembly/tools/spades/error/_index.md
@@ -0,0 +1,15 @@
+---
+title: Error correction
+type: docs
+toc: false
+---
+
+
+
+TODO:
+
+BayesHammer algorithm for Illumina reads
+Adaptations for single-cell data
+Integration of error correction and assembly
+
+
diff --git a/content/genomics/assembly/tools/spades/multisized/_index.md b/content/genomics/assembly/tools/spades/multisized/_index.md
new file mode 100644
index 0000000..61df0aa
--- /dev/null
+++ b/content/genomics/assembly/tools/spades/multisized/_index.md
@@ -0,0 +1,14 @@
+---
+title: Multisized graphs
+type: docs
+toc: false
+---
+
+
+TODO:
+
+Concept of iterative graph construction
+Advantages over single k-mer size methods
+How SPAdes selects and uses multiple k-mer sizes
+
+
diff --git a/content/genomics/assembly/tools/spades/paired/_index.md b/content/genomics/assembly/tools/spades/paired/_index.md
new file mode 100644
index 0000000..a932022
--- /dev/null
+++ b/content/genomics/assembly/tools/spades/paired/_index.md
@@ -0,0 +1,17 @@
+---
+title: Paired assembly graphs
+type: docs
+toc: false
+---
+
+
+
+TODO:
+
+Concept and purpose
+k-bimer adjustment for accurate distance estimates
+How it improves repeat resolution
+
+
+
+[^medvedev2011paired]: Medvedev, P., Pham, S., Chaisson, M., Tesler, G., & Pevzner, P. (2011). Paired de bruijn graphs: a novel approach for incorporating mate pair information into genome assemblers. Journal of Computational Biology, 18(11), 1625-1634. DOI: [10.1089/cmb.2011.0151](https://doi.org/10.1089/cmb.2011.0151)
diff --git a/content/genomics/assembly/tools/spades/pipeline/_index.md b/content/genomics/assembly/tools/spades/pipeline/_index.md
new file mode 100644
index 0000000..38d938d
--- /dev/null
+++ b/content/genomics/assembly/tools/spades/pipeline/_index.md
@@ -0,0 +1,16 @@
+---
+title: Pipeline
+type: docs
+toc: false
+---
+
+
+TODO:
+
+Input preprocessing
+Error correction
+Assembly graph construction
+Contig generation
+Scaffolding
+
+
diff --git a/content/genomics/assembly/tools/spades/scaffold/.pages b/content/genomics/assembly/tools/spades/scaffold/.pages
new file mode 100644
index 0000000..989aa62
--- /dev/null
+++ b/content/genomics/assembly/tools/spades/scaffold/.pages
@@ -0,0 +1,2 @@
+nav:
+ - Scaffolding: index.md
diff --git a/content/genomics/assembly/tools/spades/scaffold/index.md b/content/genomics/assembly/tools/spades/scaffold/index.md
new file mode 100644
index 0000000..a078193
--- /dev/null
+++ b/content/genomics/assembly/tools/spades/scaffold/index.md
@@ -0,0 +1,8 @@
+# Scaffolding
+
+TODO:
+
+Use of paired-end and mate-pair information
+Gap closure techniques
+
+
diff --git a/content/sequencing/_index.md b/content/sequencing/_index.md
new file mode 100644
index 0000000..b9fd4c3
--- /dev/null
+++ b/content/sequencing/_index.md
@@ -0,0 +1,116 @@
+---
+title: Sequencing
+type: docs
+---
+
+
+Sequencing in biology refers to determining the precise order of subunits within a biological macromolecule.
+This fundamental concept underpins much of modern molecular biology and has far-reaching implications for our understanding of life at its most basic level.
+To fully grasp the significance of sequencing, we must first explore the nature of subunits in biological molecules, the importance of their order, and how this concept relates to more familiar ideas.
+
+## Subunits in Biological Context
+
+In biological sequencing, "subunits" are the basic building blocks that compose larger, more complex molecules.
+These subunits are typically small, repeating units that link in long chains to form macromolecules.
+The nature of these subunits varies depending on the type of molecule being sequenced.
+
+In DNA (deoxyribonucleic acid), the subunits are nucleotides.
+Each nucleotide consists of a deoxyribose sugar, a phosphate group, and one of four nitrogenous bases: adenine (A), guanine (G), cytosine (C), or thymine (T).
+
+> ![](https://openstax.org/apps/archive/20240725.163813/resources/aa0504395647ce5483e8e341b4f6247630f3c262)
+>
+> Credit: [OpenStax Anatomy and Physiology](https://openstax.org/books/anatomy-and-physiology/pages/3-3-the-nucleus-and-dna-replication)
+
+RNA (ribonucleic acid) also uses nucleotides as subunits, but with a slight variation: it contains ribose sugar instead of deoxyribose, and uracil (U) replaces thymine.
+
+> ![](https://www.thoughtco.com/thmb/Nz_jYNWfDCmSdPVKfWeu9ubLJWw=/1500x0/filters:no_upscale():max_bytes(150000):strip_icc():format(webp)/dna-versus-rna-608191_sketch_Final-54acdd8f8af04c73817e8811c32905fa.png)
+>
+> Credit: [ThoughtCo / Hilary Allison](https://www.thoughtco.com/dna-versus-rna-608191)
+
+Proteins, on the other hand, are composed of amino acids as their subunits.
+Twenty standard amino acids combine in various sequences to form the vast array of proteins found in living organisms.
+These amino acids, linked by peptide bonds, create the diverse structures and functions observed in proteins.
+
+The concept of subunits is crucial because it allows us to break down complex biological molecules into manageable, discrete units that can be identified and analyzed systematically. This reductionist approach has been instrumental in advancing our understanding of molecular biology and genetics.
+
+## The Importance of Order in Biological Molecules
+
+The order of subunits in biological molecules is paramount, as it directly determines their structure, function, and properties.
+This principle is exemplified in numerous ways across different types of biological molecules.
+
+In DNA and RNA, the sequence of nucleotides encodes genetic information.
+This information directs the synthesis of proteins and regulates gene expression, ultimately influencing an organism's traits and functions.
+The specific arrangement of nucleotides in a gene determines which amino acids will be incorporated into a protein and in what order.
+
+> ![](https://upload.wikimedia.org/wikipedia/commons/thumb/7/70/Aminoacids_table.svg/1024px-Aminoacids_table.svg.png)
+>
+> Credit: [Mouagip](https://commons.wikimedia.org/wiki/File:Aminoacids_table.svg)
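+
+To make the codon-to-amino-acid mapping concrete, here is a minimal Python sketch that translates a short coding sequence.
+Only a handful of codons from the standard genetic code are included, purely for illustration; a real translation table covers all 64 codons.
+
+```python
+# Minimal sketch: translating a short open reading frame into amino acids.
+# Only a few codons from the standard genetic code are listed here for
+# illustration; a complete table has all 64 codons.
+CODON_TABLE = {
+    "ATG": "M",  # methionine (start)
+    "TTT": "F",  # phenylalanine
+    "GGC": "G",  # glycine
+    "AAA": "K",  # lysine
+    "TAA": "*",  # stop
+}
+
+def translate(dna: str) -> str:
+    """Translate a DNA coding sequence codon by codon until a stop codon."""
+    protein = []
+    for i in range(0, len(dna) - 2, 3):
+        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # "X" marks a codon not in the toy table
+        if aa == "*":
+            break
+        protein.append(aa)
+    return "".join(protein)
+
+print(translate("ATGTTTGGCAAATAA"))  # -> "MFGK"
+```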
+
+The sequence of amino acids, known as the primary structure, dictates how proteins fold into their three-dimensional shape.
+This shape, in turn, determines the protein's function, whether it's catalyzing chemical reactions, providing structural support, or facilitating cell signaling.
+The intricate folding patterns of proteins, guided by the primary sequence, give rise to the incredible diversity of protein functions observed in living systems.
+
+Even small changes in the order of subunits can have profound effects.
+A single nucleotide change in DNA, known as a point mutation, can alter the amino acid sequence of a protein, potentially leading to genetic disorders or evolutionary adaptations.
+This sensitivity to sequence changes underscores the precision required in biological systems and the potential consequences of alterations.
+
+The order of subunits also plays a crucial role in molecular recognition and interaction.
+Many biological processes rely on the specific matching of complementary sequences.
+DNA replication, for instance, depends on the pairing of complementary bases.
+Enzymes recognize and bind to specific sequences on their substrate molecules.
+The immune system identifies foreign entities based on the sequence of amino acids in their proteins.
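+
+As a small illustration of complementary base pairing, the sketch below computes the complement of a short DNA strand, the sequence a polymerase would synthesize against it.
+It is purely illustrative; the sequence used is made up.
+
+```python
+# Minimal sketch: complementary base pairing (A-T, G-C) applied to a DNA strand.
+COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
+
+def reverse_complement(strand: str) -> str:
+    """Return the complementary strand, read 5' to 3'."""
+    return "".join(COMPLEMENT[base] for base in reversed(strand))
+
+print(reverse_complement("ATGCGT"))  # -> "ACGCAT"
+```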
+
+Understanding the order of subunits thus provides invaluable insights into the function, evolution, and potential manipulation of biological systems.
+It allows us to decipher the language of life at its most fundamental level.
+
+## Sequencing as Decoding a Message or Following a Recipe
+
+To better grasp the concept of sequencing, we can draw an analogy to more familiar processes: decoding a message or following a recipe.
+These comparisons provide accessible frameworks for understanding the complex process of biological sequencing.
+
+Consider a written message.
+It consists of letters arranged in a specific order to convey meaning.
+Similarly, biological molecules contain subunits arranged in a particular sequence that encodes information.
+Sequencing is akin to reading this molecular message, revealing the instructions encoded within the biological molecule.
+Just as changing the order of letters in a word can alter its meaning or render it nonsensical, changes in the sequence of biological subunits can dramatically affect the molecule's function.
+
+The sequence of subunits in a biological molecule can also be compared to a recipe's list of ingredients and steps.
+In cooking, the specific ingredients and the order in which they are combined determine the final dish.
+Similarly, in biology, the type and order of subunits determine the final structure and function of the molecule.
+A slight change in the recipe, such as adding an ingredient too early or too late or using the wrong amount, can significantly alter the outcome.
+In the same way, small changes in biological sequences can have large effects on the resulting molecules and, by extension, the organism itself.
+
+This analogy also helps illustrate the complexity and precision involved in biological processes.
+Some recipes are simple, with few ingredients and steps, while others are intricate and demanding.
+Similarly, biological sequences can range from relatively short and straightforward to extremely long and complex.
+The human genome, for instance, contains approximately 3 billion nucleotides: imagine a recipe book with 3 billion steps!
+This analogy underscores the precision required in biological processes and in our methods of studying them.
+
+By framing sequencing in these familiar terms, we can better appreciate its fundamental importance in deciphering the complex language of life, from the basic building blocks to the intricate systems that govern living organisms.
+The ability to "read" these molecular recipes and messages has revolutionized our understanding of biology and continues to drive advances in fields ranging from medicine to ecology.
+
+## Sequencing as a Fundamental Tool in Molecular Biology
+
+One of the primary contributions of sequencing has been in elucidating the molecular basis of life.
+By revealing the sequences of DNA, RNA, and proteins, we have gained unprecedented insights into how genetic information is stored, transmitted, and expressed in living organisms.
+
+Sequencing has been instrumental in uncovering genotype-phenotype relationships.
+Researchers can identify specific genetic variations associated with particular phenotypes by comparing the genetic sequences of individuals with different traits or conditions.
+This has profound implications for our understanding of genetic diseases and complex traits like height or susceptibility to certain conditions.
+For instance, genome-wide association studies (GWAS) have allowed researchers to identify genetic loci associated with diseases ranging from cancer to cardiovascular disorders.
+
+Moreover, sequencing has revolutionized our understanding of evolutionary relationships between organisms.
+By comparing the genetic sequences of different species, we can reconstruct evolutionary histories and understand the processes of speciation and adaptation.
+This has given rise to the field of molecular phylogenetics, which has sometimes confirmed and sometimes challenged our previous understanding of evolutionary relationships based on morphological characteristics.
+
+## Applications Across Biological Disciplines
+
+Sequencing's impact extends across various biological disciplines, each benefiting from the wealth of molecular data it provides.
+
+In genetics and genomics, sequencing is the cornerstone of research.
+It allows for the identification of genes, the study of gene regulation, and the analysis of entire genomes.
+Genomic sequencing has revealed the complexity of gene structure and function, uncovering phenomena such as alternative splicing and epigenetic modifications.
+
+In molecular biology and biochemistry, sequencing enables the study of molecular structures and interactions in unprecedented detail.
+It has been crucial in understanding protein structure-function relationships, enzyme mechanisms, and the intricacies of cellular signaling pathways.
+
+Sequencing technologies have transformed evolutionary biology and phylogenetics.
+The ability to compare genetic sequences across species has provided a molecular clock for dating evolutionary events and revealed instances of horizontal gene transfer, challenging the traditional view of the Tree of Life.
+
+## Impact on Research Methodologies
+
+Sequencing data often serves as the starting point in hypothesis generation and testing.
+Researchers can use sequence information to predict gene functions, protein interactions, or evolutionary relationships, which can then be tested experimentally.
+This has led to a more targeted and efficient approach to biological research.
+
+The advent of high-throughput sequencing has ushered in the era of "omics": genomics, transcriptomics, proteomics, and more.
+This has shifted biology towards more data-driven discovery.
+Instead of studying individual genes or proteins in isolation, researchers can now analyze entire systems simultaneously.
+This holistic approach has led to discoveries that would have been impossible with traditional methods, such as identifying complex gene regulatory networks or characterizing the human microbiome.
+
+## Technological Advances in Sequencing
+
+The journey from manual to automated methods marked the first major leap in sequencing technology.
+Early sequencing methods, such as Maxam-Gilbert sequencing and manual Sanger sequencing, were labor-intensive and time-consuming.
+The development of automated Sanger sequencing in the 1980s dramatically increased the speed and efficiency of sequencing, making larger-scale projects like the Human Genome Project feasible.
+
+The next revolution came with the development of high-throughput sequencing technologies, known as next-generation sequencing (NGS).
+These methods, including Illumina sequencing and Ion Torrent sequencing, allow for massively parallel sequencing of millions of DNA fragments simultaneously.
+This has reduced the time and cost of sequencing by orders of magnitude, making large-scale sequencing projects routine in many laboratories.
+
+Most recently, single-molecule sequencing approaches, such as Pacific Biosciences' SMRT sequencing and Oxford Nanopore's nanopore sequencing, have further pushed the boundaries.
+These technologies can sequence individual DNA molecules in real time, offering advantages such as longer read lengths and the ability to detect DNA modifications directly.
diff --git a/content/sequencing/dna/_index.md b/content/sequencing/dna/_index.md
new file mode 100644
index 0000000..145615a
--- /dev/null
+++ b/content/sequencing/dna/_index.md
@@ -0,0 +1,76 @@
+---
+title: DNA sequencing
+type: docs
+weight: 1
+---
+
+
+DNA sequencing is a laboratory technique that determines the order of nucleotides, or bases, in a DNA molecule.
+The sequence of these bases—adenine (A), guanine (G), cytosine (C), or thymine (T)—encodes the biological information cells use to develop and operate.
+Sequencing DNA is crucial to understanding the function of genes and other parts of the genome.
+
+Several different methods are available for DNA sequencing, each with its own characteristics.
+The most common methods include:
+
+- First generation: [Sanger](./first-gen/sanger/) and [Maxam-Gilbert](./first-gen/maxam-gilbert/);
+- Second generation: [Illumina](./second-gen/illumina/) and [Ion Torrent](./second-gen/ion-torrent/);
+- Third generation: [Nanopore](./third-gen/nanopore/) and single-molecule real-time (SMRT).
+
+The development of additional methods represents an active area of genomics research.
+These methods are used in many areas of biology and other sciences, such as medicine, forensics, and anthropology.
+DNA sequencing has become a key technology in molecular biology.
+It allows researchers to identify changes in genes and noncoding DNA, associations with diseases and phenotypes, and potential drug targets.
+
+## Challenges
+
+Every genome sequencing, assembly, and annotation project is different due to each subject genome’s distinctive properties.
+There are four fundamental aspects that must be considered when embarking on a new genome project: the genome size, levels of ploidy and heterozygosity, GC content, and complexity.
+These will directly affect the overall quality and cost of genome sequencing, assembly, and annotation.
+
+### How big is the genome?
+
+The genome size will significantly influence the data that must be ordered and analyzed.
+To assemble a genome, the first step is to secure a sufficient number of sequence reads and depth of coverage before proceeding with ordering the sequence data.
+To get an idea of a genome's size and complexity, publicly available databases for approximate genome sizes are accessible for [fungi](http://www.zbi.ee/fungal-genomesize), [animals](http://www.genomesize.com), and [plants](http://data.kew.org/cvalues).
+If information on a target species is unavailable from a public database, selecting a closely related species is a practical option.
+
+Alternatively, two widely used methods, flow cytometry and k-mer frequency distribution analysis, can provide reliable genome size estimates as well as predictions of repeat content and heterozygosity rates.
+Flow cytometry is a fast, easy, and accurate method for simultaneous multiparametric measurement of nuclear DNA content, including ploidy level, using isolated nuclei stained with a fluorescent dye.
+K-mer frequency distribution analysis, which examines the pseudo-normal (Poisson-like) distribution around the mean coverage in a histogram of k-mer counts, is a powerful and straightforward way to use raw Illumina shotgun reads to infer genome size, preprocess data for de Bruijn graph assembly (for example, tuning runtime parameters of analysis tools), detect repeats, estimate sequencing coverage, measure sequencing error rates, and assess heterozygosity.
+When designing a genome sequencing project, it is highly recommended to use both flow cytometry, the gold standard for genome size measurement, and k-mer methods, because no single sequence-based method performs well for all species and sequence-based estimates tend to underestimate genome size.
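+
+As a rough sketch of the k-mer idea (not a substitute for dedicated tools such as Jellyfish or GenomeScope), genome size can be approximated from a k-mer count histogram by dividing the total number of k-mer observations by the coverage at the main peak.
+The histogram values below are invented purely for illustration.
+
+```python
+# Rough sketch: estimating genome size from a k-mer count histogram.
+# `histogram[m]` is the number of distinct k-mers observed exactly m times;
+# the values below are made up for illustration.
+histogram = {1: 9_000_000, 2: 150_000, 20: 400_000, 21: 900_000,
+             22: 1_400_000, 23: 950_000, 24: 420_000}
+
+# Ignore very low multiplicities, which are dominated by sequencing errors.
+filtered = {m: n for m, n in histogram.items() if m >= 5}
+
+# The peak of the filtered histogram approximates the mean k-mer coverage.
+peak_coverage = max(filtered, key=filtered.get)
+
+# Total k-mer observations divided by the coverage peak approximates genome size.
+total_kmers = sum(m * n for m, n in filtered.items())
+genome_size = total_kmers / peak_coverage
+
+print(f"Estimated k-mer coverage: {peak_coverage}x")
+print(f"Estimated genome size: {genome_size / 1e6:.1f} Mb")
+```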
+
+### Levels of ploidy and heterozygosity
+
+Is it a diploid, polyploid, or highly heterozygous hybrid species?
+It is better to use a single individual and sequence a haploid, highly inbred diploid organism, or isogenic line because this will minimize potential heterozygosity problems for genome assembly.
+While most genome assemblers run in haploid mode (some offer a diploid-aware mode) and collapse allelic differences into one consensus sequence, complex polyploid or less inbred diploid genomes carry many more alleles, which will likely result in a more fragmented assembly or create uncertainties about the contigs’ homology.
+As a result, polyploid and highly repetitive genomes may require 50% to 100% more sequence data than their diploid counterparts.
+
+### GC content
+
+Is there high/low GC content in a genomic region?
+Extremely low or high GC content in a genomic region is mainly known to cause problems for second-generation sequencing (SGS) technologies (also called short-read sequencing, which refers primarily to Illumina sequencing), resulting in low or no coverage in those regions.
+While this can be compensated for by increasing the coverage, we recommend using third-generation sequencing (TGS) technologies (PacBio and ONT) that do not exhibit this bias.
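+
+A simple way to spot such regions is to scan a sequence in fixed windows and report extreme GC fractions; the window size, thresholds, and toy sequence below are arbitrary choices for illustration.
+
+```python
+# Minimal sketch: flagging windows with extreme GC content in a sequence.
+def gc_content(seq: str) -> float:
+    """Fraction of G and C bases in a sequence."""
+    return (seq.count("G") + seq.count("C")) / len(seq)
+
+def extreme_gc_windows(seq: str, window: int = 100, low: float = 0.25, high: float = 0.75):
+    """Yield (start, gc) for windows whose GC fraction is below `low` or above `high`."""
+    for start in range(0, len(seq) - window + 1, window):
+        gc = gc_content(seq[start:start + window])
+        if gc < low or gc > high:
+            yield start, gc
+
+sequence = "GC" * 60 + "AT" * 90 + "GCAT" * 25   # toy sequence with a GC-rich start
+for start, gc in extreme_gc_windows(sequence):
+    print(f"window starting at {start}: GC = {gc:.2f}")
+```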
+
+### Complexity
+
+How many repetitive sequences (or transposable elements) will likely be present in the genome?
+The amount and distribution of repetitive sequences, potentially occurring at different locations in the genome, can hugely influence genome assembly results simply because reads from these different repeats are very similar.
+The assemblers’ algorithms cannot distinguish them effectively.
+This may eventually lead to misassembly and misannotation.
+This is particularly true for SGS reads and assemblies, and a high repeat content will often lead to a fragmented assembly because the assemblers cannot effectively determine the correct assembly of these regions and stop extending the contigs at the border of the repeats.
+To resolve the assembly of repeats (or if the subject genome has a high repeat content), using TGS reads that are sufficiently long to include the unique sequences flanking the repeats is an effective strategy.
+Thus, understanding the target genome and generating sufficient sequence data/read coverage is a crucial starting point in a genome assembly and annotation project.
+
+## Acknowledgements
+
+Adapted with permission from the following materials:
+
+- [genome.gov](https://www.genome.gov/)
+
+
+
+[^ismali2022bioinformatics]: Ismail, H. D. (2022). Bioinformatics: a practical guide to NCBI databases and sequence alignments. Chapman and Hall/CRC.
+[^giani2020long]: Giani, A. M., Gallo, G. R., Gianfranceschi, L., & Formenti, G. (2020). Long walk to genomics: History and current approaches to genome sequencing and assembly. *Computational and Structural Biotechnology Journal*, 18, 9-19. doi: [10.1016/j.csbj.2019.11.002](https://doi.org/10.1016/j.csbj.2019.11.002)
+[^jung2020twelve]: Jung, H., Ventura, T., Chung, J. S., Kim, W. J., Nam, B. H., Kong, H. J., ... & Eyun, S. I. (2020). Twelve quick steps for genome assembly and annotation in the classroom. *PLoS computational biology, 16*(11), e1008325. doi: [10.1371/journal.pcbi.1008325](https://doi.org/10.1371/journal.pcbi.1008325)
diff --git a/content/sequencing/dna/first-gen/_index.md b/content/sequencing/dna/first-gen/_index.md
new file mode 100644
index 0000000..e9eb40d
--- /dev/null
+++ b/content/sequencing/dna/first-gen/_index.md
@@ -0,0 +1,10 @@
+---
+title: First generation
+type: docs
+weight: 3
+---
+
+First-generation sequencing technologies were developed and released in the late 1970s.
+
+- [Sanger](./sanger/)
+- [Maxam-Gilbert](./maxam-gilbert/)
diff --git a/content/sequencing/dna/first-gen/maxam-gilbert/index.md b/content/sequencing/dna/first-gen/maxam-gilbert/index.md
new file mode 100644
index 0000000..e4a13d2
--- /dev/null
+++ b/content/sequencing/dna/first-gen/maxam-gilbert/index.md
@@ -0,0 +1,9 @@
+---
+title: Maxam-Gilbert
+type: docs
+---
+
+
+TODO:
+
+
diff --git a/content/sequencing/dna/first-gen/sanger/_index.md b/content/sequencing/dna/first-gen/sanger/_index.md
new file mode 100644
index 0000000..0baa154
--- /dev/null
+++ b/content/sequencing/dna/first-gen/sanger/_index.md
@@ -0,0 +1,24 @@
+---
+title: Sanger
+type: docs
+---
+
+
+Sanger sequencing, also known as the "chain termination method," is a DNA sequencing technique that was developed by the English biochemist Frederick Sanger and his colleagues in 1977 [^sanger1977dna].
+This method is designed for determining the sequence of nucleotide bases in a piece of DNA (commonly less than 1,000 bp in length) with 99.99% base accuracy [^shendure2008next].
+The resulting sequence reads are then used to assemble larger DNA fragments and, eventually, entire chromosomes.
+
+First, feel free to watch this video to get an overview of how Sanger sequencing works.
+
+
+
+
+
+[^sanger1977dna]: Sanger, F., Nicklen, S., & Coulson, A. R. (1977). DNA sequencing with chain-terminating inhibitors. *Proceedings of the national academy of sciences, 74*(12), 5463-5467. doi: [10.1073/pnas.74.12.5463](https://doi.org/10.1073/pnas.74.12.5463)
+[^shendure2008next]: Shendure, J., & Ji, H. (2008). Next-generation DNA sequencing. *Nature biotechnology, 26*(10), 1135-1145. doi: [10.1038/nbt1486](https://doi.org/10.1038/nbt1486)
+
+[^shuhaib2023mastering]: Al-Shuhaib, M. B. S., & Hashim, H. O. (2023). Mastering DNA chromatogram analysis in Sanger sequencing for reliable clinical analysis. *J. Genet. Eng. Biotechnol., 21*(1), 115. doi: [10.1186/s43141-023-00587-6](https://doi.org/10.1186/s43141-023-00587-6)
+[^lopez2021sanger]: Arteche-López, A., Ávila-Fernández, A., Romero, R., Riveiro-Álvarez, R., López-Martínez, M. A., Giménez-Pardo, A., ... & Ayuso, C. (2021). Sanger sequencing is no longer always necessary based on a single-center validation of 1109 NGS variants in 825 clinical exomes. *Scientific reports, 11*(1), 5697. doi: [10.1038/s41598-021-85182-w](https://doi.org/10.1038/s41598-021-85182-w)
+[^cario2020sanger]: De Cario, R., Kura, A., Suraci, S., Magi, A., Volta, A., Marcucci, R., ... & Sticchi, E. (2020). Sanger validation of high-throughput sequencing in genetic diagnosis: Still the best practice?. *Frontiers in genetics, 11*, 592588. doi: [10.3389/fgene.2020.592588](https://doi.org/10.3389/fgene.2020.592588)
+[^giani2020long]: Giani, A. M., Gallo, G. R., Gianfranceschi, L., & Formenti, G. (2020). Long walk to genomics: History and current approaches to genome sequencing and assembly. *Computational and Structural Biotechnology Journal*, 18, 9-19. doi: [10.1016/j.csbj.2019.11.002](https://doi.org/10.1016/j.csbj.2019.11.002)
+[^crossley2020guidelines]: Crossley, B. M., Bai, J., Glaser, A., Maes, R., Porter, E., Killian, M. L., ... & Toohey-Kurth, K. (2020). Guidelines for Sanger sequencing and molecular assay monitoring. *Journal of Veterinary Diagnostic Investigation, 32*(6), 767-775. doi: [10.1177/1040638720905833](https://doi.org/10.1177/1040638720905833)
diff --git a/content/sequencing/dna/first-gen/sanger/chromatogram/_index.md b/content/sequencing/dna/first-gen/sanger/chromatogram/_index.md
new file mode 100644
index 0000000..982a31b
--- /dev/null
+++ b/content/sequencing/dna/first-gen/sanger/chromatogram/_index.md
@@ -0,0 +1,62 @@
+---
+title: Chromatogram
+type: docs
+---
+
+
+A chromatogram represents the migration of labeled sequencing products via capillary electrophoresis.
+Fluorescence is detected at the end of the capillary, and signal intensity from four color channels, each representing a DNA base, is plotted on the y-axis relative to time on the x-axis.
+
+> ![](https://media.labxchange.org/xblocks/lb-LabXchange-22c08d85-html-1/Sanger%20Sequencing%20Figures_Nicole-14.1624866693317-6ba38bd8669aa942be7db5f53a49c621.png)
+>
+> An example of a sequencing chromatogram. Each DNA nucleotide (adenine, cytosine, guanine, and thymine) has its own color on the chromatogram for visualization.
+
+Most trace viewers show a processed version of the chromatogram; analysis software manipulates raw data from the sequencer to facilitate base calling.
+The start of the processed trace, assigned position 1, is the software's best guess of the first peak.
+
+### Trace start
+
+The first 20 to 40 bases are typically not well resolved.
+Very short sequencing products do not migrate predictably during capillary electrophoresis, and the analysis software has difficulty assigning bases within this region, causing Ns to appear in the sequence.
+To avoid critical data falling in this region, we recommend designing primers that bind at least 60 bp, preferably 100 bp, away from critical bases.
+Note that the sequence of the primer will not be seen in the results, as bases within the primer are not labeled during the extension reaction.
+
+> ![](https://www.azenta.com/sites/default/files/web-media-library/blog/analyzing-sanger-sequencing-data/fig_02_start-trace.png)
+>
+
+### Trace middle
+
+Most sequencing protocols are optimized to provide the best peak resolution between roughly 100 and 500 bases. Peaks should be sharp and well-spaced in this range, and the base calling is most reliable.
+
+> ![](https://www.azenta.com/sites/default/files/web-media-library/blog/analyzing-sanger-sequencing-data/fig_03_middle-trace.png)
+>
+
+### Trace end
+
+Toward the end of the trace, expect peaks to be less defined and lower in intensity.
+The base calling will also be less reliable.
+Due to the nature of in vitro polymerization, larger sequencing products are generated less efficiently than their shorter counterparts.
+Thus, the larger products are fewer in number and produce a weaker signal.
+Also, with any electrophoresis method, it becomes increasingly difficult to resolve a single-base difference as DNA fragments become larger.
+For example, the difference in molecular weight between 100 bp and 101 bp is 1%, whereas it's just 0.1% between 1,000 bp and 1,001 bp.
+
+> ![](https://www.azenta.com/sites/default/files/web-media-library/blog/analyzing-sanger-sequencing-data/fig_04_end-trace.png)
+>
+
+### Dye blobs
+
+Broad C and T peaks may be observed around position 80.
+Known as "dye blobs," these peaks represent aggregates of unincorporated dye terminators.
+Although most cleanup protocols are effective at removing leftover nucleotides after completion of the sequencing reaction, no method is 100% effective.
+Dye blobs are more frequently observed in inefficient sequencing reactions, which have a higher fraction of unincorporated nucleotides and a lower signal-to-noise ratio.
+Note that the sequence within this region can often be determined by manual inspection of the chromatogram, even if the analysis software is unable to assign bases (i.e., Ns appear in the sequence).
+If you need to sequence a key base, such as a SNP, we recommend using primers that bind at least 100 bp away from it so that it does not fall within the dye blob region.
+
+> ![](https://www.azenta.com/sites/default/files/web-media-library/blog/analyzing-sanger-sequencing-data/fig_07_dye-blobs.png)
+>
+
+### Assigning bases
+
+Once the sequencer collects the raw data, it's processed and analyzed by base-calling software.
+The four dyes used in Sanger sequencing have slightly different mobility properties due, in part, to their unequal molecular weights.
+Therefore, the relative position of the peaks must be slightly adjusted to compensate for these differences (see figure below).
+After this correction, the algorithm identifies peaks and assigns bases.
+
+> ![](https://www.azenta.com/sites/default/files/web-media-library/blog/analyzing-sanger-sequencing-data/fig_09_raw.png)
+>
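+
+The core of this logic can be sketched in a few lines: locate peaks in the combined signal, then call the base whose channel is strongest at each peak.
+The example below uses synthetic trace data rather than a real chromatogram file, and it ignores the mobility correction and noise handling performed by real base callers.
+
+```python
+# Toy sketch of base calling: at each detected peak, call the base whose
+# fluorescence channel has the highest intensity. Real base callers also
+# correct for dye mobility, uneven peak spacing, and noise.
+import numpy as np
+from scipy.signal import find_peaks
+
+rng = np.random.default_rng(0)
+true_sequence = "GATTACA"
+bases = "ACGT"
+n_points = 70 * len(true_sequence)
+
+# Build four synthetic channels with one Gaussian peak per base.
+traces = np.zeros((4, n_points))
+x = np.arange(n_points)
+for i, base in enumerate(true_sequence):
+    center = 35 + 70 * i
+    traces[bases.index(base)] += np.exp(-0.5 * ((x - center) / 8.0) ** 2)
+traces += 0.02 * rng.random(traces.shape)           # a little background noise
+
+# Find peak positions in the summed signal, then call the strongest channel.
+summed = traces.sum(axis=0)
+peaks, _ = find_peaks(summed, height=0.5, distance=30)
+called = "".join(bases[traces[:, p].argmax()] for p in peaks)
+
+print(called)  # expected to reproduce "GATTACA"
+```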
+
+
+
+[^sanger1977dna]: Sanger, F., Nicklen, S., & Coulson, A. R. (1977). DNA sequencing with chain-terminating inhibitors. *Proceedings of the national academy of sciences, 74*(12), 5463-5467. doi: [10.1073/pnas.74.12.5463](https://doi.org/10.1073/pnas.74.12.5463)
+[^shendure2008next]: Shendure, J., & Ji, H. (2008). Next-generation DNA sequencing. *Nature biotechnology, 26*(10), 1135-1145. doi: [10.1038/nbt1486](https://doi.org/10.1038/nbt1486)
+
+[^shuhaib2023mastering]: Al-Shuhaib, M. B. S., & Hashim, H. O. (2023). Mastering DNA chromatogram analysis in Sanger sequencing for reliable clinical analysis. *J. Genet. Eng. Biotechnol., 21*(1), 115. doi: [10.1186/s43141-023-00587-6](https://doi.org/10.1186/s43141-023-00587-6)
+[^lopez2021sanger]: Arteche-López, A., Ávila-Fernández, A., Romero, R., Riveiro-Álvarez, R., López-Martínez, M. A., Giménez-Pardo, A., ... & Ayuso, C. (2021). Sanger sequencing is no longer always necessary based on a single-center validation of 1109 NGS variants in 825 clinical exomes. *Scientific reports, 11*(1), 5697. doi: [10.1038/s41598-021-85182-w](https://doi.org/10.1038/s41598-021-85182-w)
+[^cario2020sanger]: De Cario, R., Kura, A., Suraci, S., Magi, A., Volta, A., Marcucci, R., ... & Sticchi, E. (2020). Sanger validation of high-throughput sequencing in genetic diagnosis: Still the best practice?. *Frontiers in genetics, 11*, 592588. doi: [10.3389/fgene.2020.592588](https://doi.org/10.3389/fgene.2020.592588)
+[^giani2020long]: Giani, A. M., Gallo, G. R., Gianfranceschi, L., & Formenti, G. (2020). Long walk to genomics: History and current approaches to genome sequencing and assembly. *Computational and Structural Biotechnology Journal*, 18, 9-19. doi: [10.1016/j.csbj.2019.11.002](https://doi.org/10.1016/j.csbj.2019.11.002)
+[^crossley2020guidelines]: Crossley, B. M., Bai, J., Glaser, A., Maes, R., Porter, E., Killian, M. L., ... & Toohey-Kurth, K. (2020). Guidelines for Sanger sequencing and molecular assay monitoring. *Journal of Veterinary Diagnostic Investigation, 32*(6), 767-775. doi: [10.1177/1040638720905833](https://doi.org/10.1177/1040638720905833)
diff --git a/content/sequencing/dna/first-gen/sanger/principles/_index.md b/content/sequencing/dna/first-gen/sanger/principles/_index.md
new file mode 100644
index 0000000..738b24f
--- /dev/null
+++ b/content/sequencing/dna/first-gen/sanger/principles/_index.md
@@ -0,0 +1,126 @@
+---
+title: Controlled DNA synthesis
+type: docs
+---
+
+
+The development of Sanger sequencing was grounded in a deep understanding of DNA structure and replication mechanisms.
+Frederick Sanger and his colleagues at the University of Cambridge approached the challenge of determining DNA sequences by considering how the natural process of DNA replication could be manipulated and observed.
+
+## Reading DNA's Code
+
+At the core of Sanger's work was a fundamental question: How can we determine the exact sequence of nucleotides in a DNA molecule?
+This was not a trivial problem, as DNA molecules can be extremely long and are composed of just four types of nucleotides (A, T, G, C) repeated in a specific order.
+To address this challenge, Sanger and his colleagues drew upon several key principles and concepts related to DNA structure and replication.
+
+Understanding how DNA naturally replicates was crucial to Sanger's approach.
+DNA polymerase enzymes read a template strand and create a complementary strand by adding nucleotides one by one.
+This process is highly accurate and processive, meaning it can continue for long stretches of DNA without dissociating from the template.
+The specific pairing of nucleotides (A with T, G with C) ensures the accuracy of replication and provides a mechanism for "reading" one strand by creating its complement.
+This complementarity is fundamental to both natural DNA replication and Sanger sequencing.
+
+Several challenges had to be overcome to develop a reliable sequencing method.
+Genomic DNA can be millions of base pairs long, so devising a method to read such long sequences was a significant challenge.
+Additionally, the four DNA nucleotides are very similar in structure and chemical properties, making it difficult to distinguish between them directly.
+The speed and processivity of DNA polymerase in natural systems also posed a challenge, as replication occurs too quickly and continuously to allow for easy observation of each nucleotide addition.
+Furthermore, in natural systems, DNA replication continues until the end of the template is reached or specific termination signals are encountered.
+
+## Terminating DNA replication
+
+Sanger hypothesized that if DNA replication could be selectively terminated at specific points, the resulting fragments would reveal the sequence of the template DNA.
+Here's how this idea works in practice:
+
+1. **Template Preparation**: Start with a single-stranded DNA template whose sequence you want to determine.
+2. **Primer Attachment**: Attach a short, known DNA sequence (primer) to the template.
+ This provides a starting point for DNA synthesis.
+3. **Controlled Synthesis**: Allow DNA polymerase to begin synthesizing the complementary strand, but include in the reaction a small amount of modified nucleotides that, when incorporated, stop further synthesis.
+4. **Fragment Generation**: As synthesis proceeds, these modified nucleotides are occasionally incorporated instead of the normal nucleotides, causing the synthesis to stop at different points along the template.
+5. **Fragment Analysis**: By separating these fragments by size and determining which modified nucleotide caused each termination, you can deduce the sequence of the original template.
+
+You can see a brief animation of this controlled synthesis idea below.
+
+> This model simplifies the complex process of DNA replication, omitting enzymes like DNA polymerase and other cellular factors. It aims to visualize the basic concept of complementary base pairing and the directionality of DNA synthesis.
+>
+>
+>
+> [Full screen](/p5/chain-term/chain-termination.html)
+
+The key to this method is the generation of a set of DNA fragments that differ in length by single nucleotides.
+Each fragment length corresponds to the position of a specific nucleotide in the sequence.
+By determining the length of each fragment and which nucleotide it ends with, you can reconstruct the sequence of the original DNA template.
+This approach transformed the problem of reading a long, complex DNA molecule into a more manageable task of analyzing a collection of shorter DNA fragments.
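+
+A toy simulation may make this concrete: assign each position of a short new strand to the "lane" of its terminating base, then read the lanes from the shortest fragment to the longest, just as a gel is read from bottom to top.
+This is an illustrative sketch, not a model of the real chemistry; the example strand is made up.
+
+```python
+# Toy sketch of reading a chain-termination experiment.
+# Each reaction (lane) contains fragments ending in one specific base; the
+# fragment length tells us the position of that base in the new strand.
+new_strand = "TACGGTAC"                      # strand synthesized from the template
+
+# Simulate the four reactions: lane -> set of fragment lengths ending in that base.
+lanes = {base: set() for base in "ACGT"}
+for position, base in enumerate(new_strand, start=1):
+    lanes[base].add(position)
+
+# Read the "gel" from shortest to longest fragment: at each length, exactly one
+# lane has a band, and that lane names the base at that position.
+read = ""
+for length in range(1, len(new_strand) + 1):
+    read += next(base for base, lengths in lanes.items() if length in lengths)
+
+print(read)                                   # -> "TACGGTAC"
+```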
+
+## Nucleotide detection
+
+At the time of his groundbreaking work, the tools available for molecular detection were limited, with radioactivity being the primary means of tracking biological molecules.
+Sanger leveraged this technology by incorporating radioactively labeled, terminating nucleotides, often referred to as "hot" nucleotides, into his sequencing reactions.
+However, the limitation of this approach was that radioactive labeling could not distinguish between different nucleotides; it could only indicate the presence of a labeled molecule.
+
+To overcome this challenge, Sanger designed a protocol that required four separate reactions, one for each nucleotide (A, T, G, and C).
+In each reaction, only one type of terminating nucleotide was radioactively labeled.
+This clever workaround allowed researchers to determine which nucleotide was present at each position in the sequence by comparing the results from all four reactions.
+While effective, this method was labor-intensive and time-consuming, requiring multiple reactions and careful analysis to piece together the complete DNA sequence.
+
+Modern Sanger sequencing techniques have moved away from radioactive labeling in favor of fluorescence-based detection.
+This shift has dramatically simplified and accelerated the sequencing process.
+Instead of using radioactive nucleotides and running four separate reactions, contemporary methods employ fluorescently labeled dideoxynucleotides that emit light at different wavelengths.
+Each of the four nucleotides is tagged with a distinct fluorescent dye, allowing them to be distinguished from one another in a single reaction.
+
+This advancement has not only increased the efficiency of Sanger sequencing but has also paved the way for its automation, significantly boosting the speed and scale at which DNA sequences can be determined.
+
+## Fragment separation
+
+The separation of DNA fragments is a crucial step in Sanger sequencing, as it allows for the determination of fragment lengths and, consequently, the DNA sequence.
+This process has undergone significant changes since the method's inception, reflecting advancements in technology and the push for higher throughput and automation.
+
+In the original Sanger sequencing method, DNA fragment separation was achieved through a labor-intensive process called slab gel electrophoresis.
+This technique involved pouring a polyacrylamide gel between two glass plates, creating a thin, flat gel "slab."
+The DNA samples from the four separate sequencing reactions (one for each nucleotide) were loaded into wells at one end of the gel.
+An electric field was then applied, causing the negatively charged DNA fragments to migrate through the gel towards the positive electrode.
+Smaller fragments moved more quickly through the gel matrix, while larger fragments were more impeded, resulting in separation based on size.
+
+> This image shows a classic DNA sequencing gel obtained through Sanger sequencing and visualized by autoradiography.
+>
+> ![](https://upload.wikimedia.org/wikipedia/commons/c/cb/Sequencing.jpg){ height=600 }
+>
+> The gel contains four lanes, labeled A, T, G, and C, corresponding to the four DNA nucleotides. Each lane represents a separate sequencing reaction containing DNA fragments that terminate with the respective nucleotide.
+> To read the sequence:
+>
+> 1. Start at the bottom of the gel, where the shortest fragments are located.
+> 2. Move upwards, reading the sequence by identifying which lane (A, T, G, or C) contains a band at each position.
+> 3. The sequence is read from bottom to top, corresponding to the 5' to 3' direction of the DNA strand.
+>
+> For example, if the bottom-most band is in the T lane, followed by bands in the A, C, and G lanes moving upwards, the beginning of the sequence would be TACG.
+> This method allows researchers to determine the precise order of nucleotides in a DNA molecule.
+
+The early gels were typically quite large, often 40-60 cm in length, to provide sufficient resolution for separating fragments that differed by just one nucleotide.
+After electrophoresis, which could take several hours, the gel would be dried and exposed to X-ray film to visualize the radioactively labeled DNA bands.
+Researchers would then manually read the sequence from the resulting autoradiograph, a time-consuming and error-prone process.
+Despite these challenges, this method was groundbreaking, allowing scientists to read sequences of up to several hundred base pairs in a single run.
+
+As Sanger sequencing evolved, significant improvements were made to the fragment separation process.
+The introduction of automated sequencers in the late 1980s and early 1990s marked a turning point.
+These machines utilized capillary electrophoresis, a technique that replaced the large slab gels with narrow glass capillaries filled with a polymer solution.
+This shift brought several advantages:
+
+1. **Increased speed**: Capillary electrophoresis could be performed much more quickly than slab gel electrophoresis, reducing run times from hours to minutes.
+2. **Improved resolution**: The narrow capillaries allowed for better heat dissipation, enabling the use of higher voltages and resulting in sharper band separation.
+3. **Automation**: Capillary systems could be easily automated, allowing for continuous operation and higher throughput.
+4. **Real-time detection**: By incorporating laser-induced fluorescence detection, these systems could detect and record the fluorescent signal from each DNA fragment as it passed a detection window, eliminating the need for post-electrophoresis processing.
+
+
+
+Modern Sanger sequencing machines typically use arrays of 8, 16, or even 96 capillaries running in parallel, dramatically increasing the number of samples that can be processed simultaneously.
+The polymer solutions used in these capillaries have also been optimized for better separation and faster run times.
+
+[^sanger1977dna]: Sanger, F., Nicklen, S., & Coulson, A. R. (1977). DNA sequencing with chain-terminating inhibitors. *Proceedings of the national academy of sciences, 74*(12), 5463-5467. doi: [10.1073/pnas.74.12.5463](https://doi.org/10.1073/pnas.74.12.5463)
+[^shendure2008next]: Shendure, J., & Ji, H. (2008). Next-generation DNA sequencing. *Nature biotechnology, 26*(10), 1135-1145. doi: [10.1038/nbt1486](https://doi.org/10.1038/nbt1486)
+
+[^shuhaib2023mastering]: Al-Shuhaib, M. B. S., & Hashim, H. O. (2023). Mastering DNA chromatogram analysis in Sanger sequencing for reliable clinical analysis. *J. Genet. Eng. Biotechnol., 21*(1), 115. doi: [10.1186/s43141-023-00587-6](https://doi.org/10.1186/s43141-023-00587-6)
+[^lopez2021sanger]: Arteche-López, A., Ávila-Fernández, A., Romero, R., Riveiro-Álvarez, R., López-Martínez, M. A., Giménez-Pardo, A., ... & Ayuso, C. (2021). Sanger sequencing is no longer always necessary based on a single-center validation of 1109 NGS variants in 825 clinical exomes. *Scientific reports, 11*(1), 5697. doi: [10.1038/s41598-021-85182-w](https://doi.org/10.1038/s41598-021-85182-w)
+[^cario2020sanger]: De Cario, R., Kura, A., Suraci, S., Magi, A., Volta, A., Marcucci, R., ... & Sticchi, E. (2020). Sanger validation of high-throughput sequencing in genetic diagnosis: Still the best practice?. *Frontiers in genetics, 11*, 592588. doi: [10.3389/fgene.2020.592588](https://doi.org/10.3389/fgene.2020.592588)
+[^giani2020long]: Giani, A. M., Gallo, G. R., Gianfranceschi, L., & Formenti, G. (2020). Long walk to genomics: History and current approaches to genome sequencing and assembly. *Computational and Structural Biotechnology Journal*, 18, 9-19. doi: [10.1016/j.csbj.2019.11.002](https://doi.org/10.1016/j.csbj.2019.11.002)
+[^crossley2020guidelines]: Crossley, B. M., Bai, J., Glaser, A., Maes, R., Porter, E., Killian, M. L., ... & Toohey-Kurth, K. (2020). Guidelines for Sanger sequencing and molecular assay monitoring. *Journal of Veterinary Diagnostic Investigation, 32*(6), 767-775. doi: [10.1177/1040638720905833](https://doi.org/10.1177/1040638720905833)
diff --git a/content/sequencing/dna/first-gen/sanger/procedure/_index.md b/content/sequencing/dna/first-gen/sanger/procedure/_index.md
new file mode 100644
index 0000000..fd5ac0a
--- /dev/null
+++ b/content/sequencing/dna/first-gen/sanger/procedure/_index.md
@@ -0,0 +1,69 @@
+---
+title: Procedure overview
+type: docs
+---
+
+
+## Dideoxynucleotides
+
+The use of dideoxynucleotides (ddNTPs) as chain terminators was a critical insight in the development of the method.
+Unlike standard dNTPs, ddNTPs lack the 3' hydroxyl group; without it, a ddNTP cannot form a phosphodiester bond with the next nucleotide, so its incorporation terminates the growing chain.
+
+> ![](https://www.onlinebiologynotes.com/wp-content/uploads/2017/07/ddNTPs-1.jpg)
+>
+
+Each ddNTP terminates the chain at positions corresponding to its base: ddATP terminates at A sites, ddCTP at C sites, ddGTP at G sites, and ddTTP at T sites.
+
+## PCR
+
+The sequencing begins by dividing the DNA sample into four separate reactions, each containing all four standard deoxynucleotides (dATP, dGTP, dCTP, and dTTP) and the DNA polymerase enzyme.
+In these reactions, only one type of dideoxynucleotide (ddATP, ddGTP, ddCTP, or ddTTP) is added to each reaction alongside the regular deoxynucleotides.
+
+As the DNA polymerase extends the DNA chain during each sequencing reaction, termination occurs at different positions depending on which dideoxynucleotide is present.
+This results in the production of a series of DNA fragments of varying lengths in each reaction.
+
+The termination at specific bases generates a unique pattern of fragments in each reaction.
+These fragments represent the particular nucleotide positions in the original DNA sequence.
+Using different dideoxynucleotides in separate reactions allows the researcher to obtain information about the sequence at each position along the DNA template.
+
+> ![](https://www.sigmaaldrich.com/deepweb/assets/sigmaaldrich/marketing/global/images/technical-documents/protocols/genomics/sequencing/sanger-sequencing_steps_process_diagram/sanger-sequencing_steps_process_diagram.png)
+>
+> Credit: [Millipore Sigma](https://www.sigmaaldrich.com/US/en/technical-documents/protocol/genomics/sequencing/sanger-sequencing)
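+
+The ratio of ddNTPs to dNTPs controls how long fragments grow before termination.
+As a rough back-of-the-envelope sketch (assuming every incorporation independently has some probability p of being a terminator, and ignoring sequence composition and polymerase kinetics), fragment lengths follow a geometric distribution with a mean of about 1/p:
+
+```python
+# Rough sketch: how the ddNTP fraction shapes fragment lengths.
+# Assumes every incorporation independently has probability p of being a
+# chain-terminating ddNTP, so lengths are geometrically distributed.
+import random
+
+random.seed(1)
+
+def terminated_length(p: float, max_len: int = 5000) -> int:
+    """Length at which synthesis stops for a single template molecule."""
+    for length in range(1, max_len + 1):
+        if random.random() < p:
+            return length
+    return max_len
+
+for p in (0.01, 0.002):
+    lengths = [terminated_length(p) for _ in range(10_000)]
+    print(f"ddNTP fraction {p}: mean fragment length ~ {sum(lengths) / len(lengths):.0f}"
+          f" (theory ~ {1 / p:.0f})")
+```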
+
+## Fluorescence
+
+The addition of fluorescent tags to dideoxynucleotides (ddNTPs) facilitates the detection and determination of DNA sequences.
+Each ddNTP (ddATP, ddGTP, ddCTP, or ddTTP) is labeled with a distinct fluorescent tag.
+
+> ![](https://upload.wikimedia.org/wikipedia/commons/0/09/%D0%A4%D0%BB%D1%83%D0%BE%D1%80%D0%B5%D1%81%D1%86%D0%B5%D0%BD%D1%82%D0%BD%D1%8B%D0%B5_%D0%B4%D0%B8%D0%B4%D0%B5%D0%B7%D0%BE%D0%BA%D1%81%D0%B8%D1%82%D0%B5%D1%80%D0%BC%D0%B8%D0%BD%D0%B0%D1%82%D0%BE%D1%80%D1%8B_%D0%A1%D0%B0%D0%BD%D0%B3%D0%B5%D1%80.svg){ align=left width=600 }
+>
+
+This labeling allows the simultaneous sequencing of DNA fragments from four separate reactions.
+After the DNA fragments are generated and separated by size through gel electrophoresis, adding fluorescent tags enables researchers to visualize and distinguish the terminated fragments based on their specific ddNTP.
+Automated sequencing machines can detect the color-coded fragments, providing a faster and more accurate means of determining the DNA sequence.
+This fluorescence-based approach enhances sensitivity and precision and reduces ambiguity in reading sequences.
+The use of fluorescently labeled ddNTPs has become a standard practice in Sanger sequencing, particularly in high-throughput sequencing projects, contributing to the efficiency and automation of the sequencing process.
+
+## Gel electrophoresis detection
+
+TODO:
+
+## From Theory to Practice
+
+Translating these ideas into a working method required overcoming numerous technical challenges, such as:
+
+- Finding the right balance of dNTPs and ddNTPs
+- Developing efficient separation techniques
+- Creating sensitive detection methods (initially using radioactive labels, later fluorescent tags)
+
+The elegance of Sanger sequencing lies in how it leverages fundamental principles of DNA structure and replication to solve the complex problem of determining nucleotide sequences. This method, born from careful reasoning and innovative thinking, revolutionized molecular biology and laid the groundwork for the genomic era.
+
+[^sanger1977dna]: Sanger, F., Nicklen, S., & Coulson, A. R. (1977). DNA sequencing with chain-terminating inhibitors. *Proceedings of the national academy of sciences, 74*(12), 5463-5467. doi: [10.1073/pnas.74.12.5463](https://doi.org/10.1073/pnas.74.12.5463)
+[^shendure2008next]: Shendure, J., & Ji, H. (2008). Next-generation DNA sequencing. *Nature biotechnology, 26*(10), 1135-1145. doi: [10.1038/nbt1486](https://doi.org/10.1038/nbt1486)
+
+[^shuhaib2023mastering]: Al-Shuhaib, M. B. S., & Hashim, H. O. (2023). Mastering DNA chromatogram analysis in Sanger sequencing for reliable clinical analysis. *J. Genet. Eng. Biotechnol., 21*(1), 115. doi: [10.1186/s43141-023-00587-6](https://doi.org/10.1186/s43141-023-00587-6)
+[^lopez2021sanger]: Arteche-López, A., Ávila-Fernández, A., Romero, R., Riveiro-Álvarez, R., López-Martínez, M. A., Giménez-Pardo, A., ... & Ayuso, C. (2021). Sanger sequencing is no longer always necessary based on a single-center validation of 1109 NGS variants in 825 clinical exomes. *Scientific reports, 11*(1), 5697. doi: [10.1038/s41598-021-85182-w](https://doi.org/10.1038/s41598-021-85182-w)
+[^cario2020sanger]: De Cario, R., Kura, A., Suraci, S., Magi, A., Volta, A., Marcucci, R., ... & Sticchi, E. (2020). Sanger validation of high-throughput sequencing in genetic diagnosis: Still the best practice?. *Frontiers in genetics, 11*, 592588. doi: [10.3389/fgene.2020.592588](https://doi.org/10.3389/fgene.2020.592588)
+[^giani2020long]: Giani, A. M., Gallo, G. R., Gianfranceschi, L., & Formenti, G. (2020). Long walk to genomics: History and current approaches to genome sequencing and assembly. *Computational and Structural Biotechnology Journal*, 18, 9-19. doi: [10.1016/j.csbj.2019.11.002](https://doi.org/10.1016/j.csbj.2019.11.002)
+[^crossley2020guidelines]: Crossley, B. M., Bai, J., Glaser, A., Maes, R., Porter, E., Killian, M. L., ... & Toohey-Kurth, K. (2020). Guidelines for Sanger sequencing and molecular assay monitoring. *Journal of Veterinary Diagnostic Investigation, 32*(6), 767-775. doi: [10.1177/1040638720905833](https://doi.org/10.1177/1040638720905833)
diff --git a/content/sequencing/dna/pcr/_index.md b/content/sequencing/dna/pcr/_index.md
new file mode 100644
index 0000000..0d7437d
--- /dev/null
+++ b/content/sequencing/dna/pcr/_index.md
@@ -0,0 +1,86 @@
+---
+title: Polymerase chain reaction
+type: docs
+weight: 2
+---
+
+The polymerase chain reaction (PCR) is a laboratory nucleic acid amplification technique that uses repeated cycles of denaturation and renaturation to copy short segments of deoxyribonucleic acid (DNA), or of ribonucleic acid (RNA) after reverse transcription, using a heat-stable DNA polymerase isolated from Thermus aquaticus and known as Taq polymerase. [^lorenz2012polymerase]
+PCR was introduced by Kary Mullis and colleagues in 1985, work for which Mullis later received the Nobel Prize in Chemistry.
+It has become a foundational tool in the biomolecular sciences because of its ability to amplify and detect specific segments of DNA.
+
+PCR selectively amplifies a minuscule segment of DNA in a test tube. [^markham1993polymerase]
+Thermostability is the ability to resist irreversible changes in chemical and physical properties at extreme temperatures.
+Because PCR cycles repeatedly through denaturation and renaturation, the heat-stable Taq polymerase is preferred, as it allows DNA synthesis to continue through the repeated high-temperature steps.
+Owing to its high sensitivity, PCR has become the procedure of choice for diagnosing a wide array of bacterial and viral infections and for screening genetic diseases, and it serves as the gold-standard test for numerous sample types.
+
+Polymerase chain reaction procedures begin with the collection of a small sample of DNA in a test tube.
+PCR consists of three major phases: denaturation, hybridization/annealing, and elongation/amplification.
+
+## Denaturation
+
+During the denaturation phase, the DNA is heated to 95 °C to dissociate the hydrogen bonds between complementary base pairs of the double-stranded DNA.
+
+## Annealing
+
+Immediately following denaturation, annealing occurs: the denatured DNA is cooled to a temperature ranging from 37 to 72 °C, allowing the hydrogen bonds to re-form.
+Annealing occurs best at temperatures between 55 and 72 °C.
+
+The specific temperature is determined based on the physical and chemical properties of the specific primers used in the solution.
+Primers are 20-25 nucleotides in length.
+Annealing allows for the primers to bind to the single-stranded DNA at their respective complementary sites beginning at the 3’ end of the DNA template.
+Subsequently, the binding of the primers to their complementary sites on single-stranded DNA generates two double-stranded molecules.
+
+### Standard primers
+
+M13-tailed primers are used to simplify the workflow when sequencing PCR products, and they reduce the loss of unresolvable bases at the 5' end.
+When the PCR primers contain M13 tails on their 5' ends, the M13 sequence is incorporated into the amplicons.
+This enables the use of sequencing master mixes containing the universal M13 forward or M13 reverse primers.
+The sequence for the M13 forward and reverse primers is as follows:
+
+- M13 forward primer sequence: `5′ TGTAAAACGACGGCCAGT 3′`
+- M13 reverse primer sequence: `5′ CAGGAAACAGCTATGACC 3′`
+
+Each primer is designed against a known sequence of the template strand, and the M13 tail attached to the primer's 5' end then provides a universal binding site for the M13 sequencing primer.
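+
+As a rough computational aside (not part of the original protocol), a primer's melting temperature can be estimated with the Wallace rule, Tm ≈ 2 °C × (A+T) + 4 °C × (G+C), which in turn informs the choice of annealing temperature. The sketch below applies this rule to the M13 primers listed above; the function name and the simplifications involved are ours.
+
+```python
+def wallace_tm(primer: str) -> int:
+    """Estimate a primer's melting temperature (deg C) with the Wallace rule:
+    Tm ~ 2*(A+T) + 4*(G+C). A rough guide for short (~14-25 nt) primers only."""
+    seq = primer.upper()
+    at = seq.count("A") + seq.count("T")
+    gc = seq.count("G") + seq.count("C")
+    return 2 * at + 4 * gc
+
+# M13 universal primers (5' -> 3'), as listed above.
+m13_forward = "TGTAAAACGACGGCCAGT"
+m13_reverse = "CAGGAAACAGCTATGACC"
+
+for name, seq in [("M13 forward", m13_forward), ("M13 reverse", m13_reverse)]:
+    print(f"{name}: {len(seq)} nt, estimated Tm ~ {wallace_tm(seq)} C")
+```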
+
+## Elongation
+
+Finally, the reaction temperature is raised to 75-80 °C, the range best suited for enzyme-driven DNA replication, to ensure full DNA polymerase activity.
+
+DNA polymerase requires a double-stranded region, provided by the primer annealed to the template, in order to initiate replication.
+Thereafter, DNA polymerase synthesizes new DNA in the 5' to 3' direction, producing strands complementary to the template strands.
+This procedure is repeated several times via a thermal cycler.
+A thermal cycler is a device that controls the time and temperature of each cycle and its respective steps.
+This ultimately yields many copies of the target DNA in the tube.
+
+> This model simplifies the complex process of DNA replication, omitting enzymes like DNA polymerase and other cellular factors. It aims to visualize the basic concept of complementary base pairing and the directionality of DNA synthesis.
+>
+>
+>
+>
+> [Full screen](/p5/dna-elongation/dna-elongation.html)
+>
+> This interactive animation demonstrates a simplified model of DNA replication, specifically the elongation phase:
+>
+> - **Template Strand**: The top strand represents the 3' to 5' template DNA strand, serving as the blueprint for replication.
+> - **Growing Strand**: The bottom strand shows the newly synthesized DNA, growing in the 5' to 3' direction.
+> - **Floating Nucleotides**: Colorful shapes (circles and squares) represent free nucleotides (A, T, C, G) moving randomly in solution, simulating Brownian motion.
+> - **Base Pairing**: When a complementary nucleotide approaches the growing strand's end, it attaches, extending the new DNA strand.
+> This process follows the base-pairing rule (A with T, C with G).
+> - **Speed Control**: Use the slider in the top left corner to adjust the simulation speed.
+> Higher values accelerate the movement and replication process.
+> - **Restart Button**: Click 'Restart' to begin a new replication cycle with a fresh template strand.
+
+## Cycling
+
+After 30-40 cycles, amplification eventually tapers off as reagents become limiting and other factors accumulate, such as pyrophosphate molecules, excessive self-annealing of products, and PCR inhibitors present in the sample.
+Several inhibitors can interfere with the proper functioning of PCR; the most common are proteinase K, phenol, and ethylenediaminetetraacetic acid (EDTA).
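+
+To put these cycle numbers in perspective, the sketch below uses a simplified, hypothetical model in which the copy number grows as (1 + efficiency)^cycles; letting the per-cycle efficiency decay mimics the plateau described above. The decay rate is an arbitrary illustrative value, not a measured quantity.
+
+```python
+def pcr_copies(initial_copies: float, cycles: int, efficiency: float = 1.0) -> float:
+    """Idealized copy number after a given number of PCR cycles.
+    'efficiency' is the fraction of templates duplicated each cycle (1.0 = perfect doubling)."""
+    return initial_copies * (1 + efficiency) ** cycles
+
+# With perfect doubling, a single template yields ~10^9 copies after 30 cycles.
+print(f"Ideal 30 cycles: {pcr_copies(1, 30):.2e} copies")
+
+# A crude plateau model: efficiency decays as reagents are consumed and inhibitors accumulate.
+copies, efficiency = 1.0, 1.0
+for cycle in range(1, 41):
+    copies *= 1 + efficiency
+    efficiency *= 0.90  # hypothetical per-cycle decay
+    if cycle % 10 == 0:
+        print(f"cycle {cycle:2d}: ~{copies:.2e} copies (efficiency now {efficiency:.2f})")
+```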
+
+## Acknowledgements
+
+Material is reused with permission from StatPearls Publishing LLC under the [CC BY-NC-ND 4.0](http://creativecommons.org/licenses/by-nc-nd/4.0/) license.
+
+
+
+[^lorenz2012polymerase]: Lorenz, T. C. (2012). Polymerase chain reaction: basic protocol plus troubleshooting and optimization strategies. JoVE (Journal of Visualized Experiments), (63), e3998.
+[^markham1993polymerase]: Markham, A. F. (1993). The polymerase chain reaction: a tool for molecular medicine. BMJ: British Medical Journal, 306(6875), 441.
diff --git a/content/sequencing/dna/sample/.pages b/content/sequencing/dna/sample/.pages
new file mode 100644
index 0000000..9b87e6e
--- /dev/null
+++ b/content/sequencing/dna/sample/.pages
@@ -0,0 +1,9 @@
+nav:
+ - index.md
+ - lysis
+ - purification
+ - quantification
+ - fragmentation
+ - ligation
+ - amplification
+ - qc
diff --git a/content/sequencing/dna/sample/_index.md b/content/sequencing/dna/sample/_index.md
new file mode 100644
index 0000000..6433c75
--- /dev/null
+++ b/content/sequencing/dna/sample/_index.md
@@ -0,0 +1,17 @@
+---
+title: Sample preparation protocol
+type: docs
+weight: 1
+---
+
+
+The journey from a cellular solution to sequencing-ready DNA involves a series of critical steps, each designed to isolate, purify, and prepare the genetic material for analysis.
+By meticulously following these steps, researchers can prepare high-quality DNA samples that are optimized for sequencing.
+
+- [Cell lysis](./lysis/)
+- [DNA purification](./purification/)
+- [DNA quantification](./quantification/)
+- [DNA fragmentation](./fragmentation/)
+- [Adapter ligation](./ligation/)
+- [Library amplification](./amplification/)
+- [Quality control](./qc/)
diff --git a/content/sequencing/dna/sample/amplification/_index.md b/content/sequencing/dna/sample/amplification/_index.md
new file mode 100644
index 0000000..d45bb69
--- /dev/null
+++ b/content/sequencing/dna/sample/amplification/_index.md
@@ -0,0 +1,17 @@
+---
+title: Library amplification
+type: docs
+weight: 6
+---
+
+
+
+> [!CAUTION]
+>
+> This page is a work in progress and is subject to change at any moment.
+
+Many sequencing protocols require amplification of the adapted DNA fragments:
+
+- **PCR amplification**: Selectively amplifies fragments with correctly ligated adapters.
+- **Emulsion PCR**: Used in some platforms to amplify DNA on beads in isolated microdroplets.
+- **Bridge amplification**: A solid-phase amplification method used in some next-generation sequencing platforms.
diff --git a/content/sequencing/dna/sample/fragmentation/_index.md b/content/sequencing/dna/sample/fragmentation/_index.md
new file mode 100644
index 0000000..28bb963
--- /dev/null
+++ b/content/sequencing/dna/sample/fragmentation/_index.md
@@ -0,0 +1,28 @@
+---
+title: DNA fragmentation
+type: docs
+weight: 4
+---
+
+
+
+
+> [!CAUTION]
+>
+> This page is a work in progress and is subject to change at any moment.
+
+Most sequencing platforms require DNA fragments of specific sizes:
+
+- **Enzymatic fragmentation**: Employs restriction enzymes or non-specific endonucleases to cleave DNA at defined sites or randomly.
+- **Mechanical shearing**: Includes methods like sonication or hydrodynamic shearing, which produce random fragments.
+- **Tagmentation**: A more recent method that combines fragmentation and adapter ligation in a single step using transposase enzymes.
+
+The choice of fragmentation method depends on the sequencing platform and the desired fragment size distribution.
+
+## Size Selection
+
+To ensure optimal sequencing performance, DNA fragments must be within a specific size range:
+
+- **Gel electrophoresis**: Fragments of the desired size are excised from agarose gels.
+- **Bead-based methods**: Magnetic beads with size-selective binding properties are used to isolate fragments of interest.
+- **Microfluidic systems**: Automated platforms that can precisely select fragments based on size.
diff --git a/content/sequencing/dna/sample/ligation/index.md b/content/sequencing/dna/sample/ligation/index.md
new file mode 100644
index 0000000..a4f3648
--- /dev/null
+++ b/content/sequencing/dna/sample/ligation/index.md
@@ -0,0 +1,18 @@
+---
+title: Adapter ligation
+type: docs
+weight: 5
+---
+
+
+
+
+> [!CAUTION]
+>
+> This page is a work in progress and is subject to change at any moment.
+
+These steps prepare the DNA fragments for sequencing:
+
+- **End repair**: Converts overhangs to blunt ends and ensures 5' phosphorylation for efficient adapter ligation.
+- **A-tailing**: Addition of a single adenine nucleotide to the 3' ends of blunt-ended fragments (for some platforms).
+- **Adapter ligation**: Attachment of platform-specific adapters to the DNA fragments, enabling amplification and sequencing.
diff --git a/content/sequencing/dna/sample/lysis/_index.md b/content/sequencing/dna/sample/lysis/_index.md
new file mode 100644
index 0000000..46ae585
--- /dev/null
+++ b/content/sequencing/dna/sample/lysis/_index.md
@@ -0,0 +1,162 @@
+---
+title: Cell lysis
+type: docs
+weight: 1
+---
+
+
+
+The initial step in DNA extraction is cell lysis, which involves breaking open the cells to release their contents, including the DNA.
+The choice of lysis method depends on the sample type, with considerations for cell wall composition and the fragility of the genetic material.
+
+> [!CAUTION]
+>
+> This page is a work in progress and is subject to change at any moment.
+
+## Chemical lysis
+
+Chemical lysis is a fundamental technique in DNA extraction protocols, employing various chemical agents to disrupt cell membranes and release intracellular contents, including nucleic acids.
+This method is widely used due to its efficiency, reproducibility, and applicability to a broad range of cell types.
+
+Chemical lysis operates by destabilizing the lipid bilayer of cell membranes and denaturing proteins.
+This process compromises the structural integrity of the cell, leading to the release of cellular components, including DNA, RNA, and proteins, into the surrounding solution.
+
+### Detergents
+
+In the field of molecular biology, particularly in DNA extraction procedures, detergents play a crucial role in cell lysis.
+Understanding how these molecules function is essential for grasping the fundamental principles of sample preparation in DNA sequencing and other molecular techniques.
+
+Detergents are remarkable molecules with a unique structure that makes them invaluable in cell lysis protocols.
+At their core, detergents are amphipathic, meaning they possess both hydrophilic (water-loving) and hydrophobic (water-fearing) regions.
+This dual nature is key to their function.
+
+- The **hydrophilic head** of a detergent molecule interacts favorably with water.
+- The **hydrophobic tail**, typically a hydrocarbon chain, avoids interactions with water.
+
+> ![](https://pubs.acs.org/cms/10.1021/acs.chemrev.7b00570/asset/images/large/cr-2017-005709_0001.jpeg)
+>
+> DOI: [10.1021/acs.chemrev.7b00570](https://doi.org/10.1021/acs.chemrev.7b00570)
+
+When introduced to a cellular environment, detergents begin their work by integrating themselves into the lipid bilayer of the cell membrane.
+The hydrophobic tails of the detergent molecules mingle with the fatty acid chains of the membrane lipids, while their hydrophilic heads remain exposed to the aqueous environment both inside and outside the cell.
+As more detergent molecules incorporate into the membrane, they initiate a process of micelle formation.
+
+> ![](https://www.cusabio.com/statics/images/TP-list-img02-1.png)
+>
+> DOI: [10.1007/978-3-319-35072-1_2](https://doi.org/10.1007/978-3-319-35072-1_2)
+
+Micelles are spherical structures formed when the hydrophobic tails of detergent molecules cluster together to avoid water, with the hydrophilic heads facing outward.
+In the context of cell lysis, these micelles begin to include not just detergent molecules but also components of the cell membrane.
+This process marks the beginning of membrane solubilization.
+
+Once the detergent concentration exceeds a threshold known as the critical micelle concentration (CMC), the membrane's structural integrity begins to fail.
+The lipids and proteins that once formed the cell's protective barrier are now incorporated into detergent micelles.
+This solubilization of the membrane effectively creates holes in the cell's outer layer, leading to the release of cellular contents, including the all-important DNA, into the surrounding solution.
+
+However, the action of detergents isn't limited to membrane disruption.
+Many detergents, particularly ionic ones like sodium dodecyl sulfate (SDS), also interact with and denature proteins.
+This process occurs as the hydrophobic tails of detergent molecules interact with the hydrophobic regions of proteins, which are typically buried in the protein's interior.
+These interactions cause proteins to unfold, exposing more of their hydrophobic regions.
+The result is often a loss of protein structure and function, which can be beneficial in DNA extraction as it helps to separate DNA from its associated proteins.
+
+The efficiency of detergent-mediated cell lysis depends on several factors.
+The concentration of the detergent is crucial and must be above the CMC for effective lysis.
+Temperature also plays a role, with higher temperatures generally increasing lysis efficiency, although excessive heat can risk DNA degradation.
+The ionic strength of the solution and its pH can affect how detergents interact with cellular components, thereby influencing lysis effectiveness.
+Additionally, different cell types, with their varying membrane compositions, may respond differently to detergent lysis, necessitating optimized protocols for different sample types.
+Understanding these mechanisms allows researchers to fine-tune their lysis protocols.
+For instance, when working with mammalian cells, which are generally more susceptible to detergent lysis, a gentler approach might be used.
+In contrast, bacteria, especially gram-positive species with their robust cell walls, might require stronger detergents or a combination of detergent and enzymatic treatment for effective lysis.
+
+=== "Sodium Dodecyl Sulfate (SDS)"
+
+ An anionic detergent that is particularly effective in disrupting cell membranes and denaturing proteins.
+
+ SDS works by:
+
+ - Binding to and denaturing proteins, disrupting protein-lipid interactions.
+ - Solubilizing membrane lipids, leading to membrane disintegration.
+ - Typical concentrations range from 0.1% to 1% (w/v).
+
+=== "Triton X-100"
+
+ A non-ionic detergent that is gentler than SDS and often used for isolating membrane-bound proteins.
+
+ - It can solubilize membranes without significantly denaturing proteins.
+ - Typically used at concentrations of 0.1% to 2% (v/v).
+
+=== "CHAPS"
+
+ A zwitterionic detergent that is effective for membrane protein solubilization.
+
+ - It maintains protein activity better than SDS.
+ - Commonly used at concentrations of 0.5% to 2% (w/v).
+
+### Chaotropic Agents
+
+Chaotropic agents disrupt the hydrogen bonding network between water molecules, affecting the stability of other molecules in the solution.
+
+=== "Guanidinium Thiocyanate"
+
+ A potent protein denaturant and cell lysis agent.
+
+ - It rapidly solubilizes cellular components and inactivates nucleases.
+ - Typically used at high concentrations (4-6 M).
+ - Often combined with phenol in the guanidinium thiocyanate-phenol-chloroform extraction method.
+
+=== "Urea"
+
+ Another chaotropic agent used for cell lysis and protein denaturation.
+
+ - Less potent than guanidinium thiocyanate but still effective.
+ - Typically used at concentrations of 6-8 M.
+
+### Considerations
+
+1. **Cell Type Specificity**: Different cell types may require different lysis conditions. For example:
+ - Mammalian cells are generally more susceptible to detergent lysis than bacteria.
+ - Gram-positive bacteria, with their thick peptidoglycan layer, may require stronger lysis conditions or additional enzymatic treatment.
+2. **pH and Ionic Strength**: The efficiency of chemical lysis can be influenced by pH and salt concentration. Optimizing these parameters can improve lysis efficiency and DNA yield.
+3. **Temperature**: Some lysis protocols may require elevated temperatures (50-70°C) to increase the efficiency of membrane disruption and protein denaturation.
+4. **Downstream Applications**: The choice of lysis method should consider potential interference with subsequent steps in the DNA preparation process. For instance:
+ - SDS can inhibit some enzymatic reactions and may need to be removed before certain downstream applications.
+ - Chaotropic agents may need to be diluted or removed before DNA amplification or sequencing.
+5. **DNA Integrity**: While chemical lysis is generally less harsh than mechanical methods, prolonged exposure to harsh chemicals or elevated temperatures can lead to DNA degradation. Protocols should be optimized to minimize this risk.
+6. **Combinatorial Approaches**: Chemical lysis is often combined with other methods for enhanced efficiency:
+ - Enzymatic treatment (e.g., lysozyme for bacterial cells) followed by detergent lysis.
+ - Mild detergent treatment combined with mechanical disruption for tough-to-lyse samples.
+
+By carefully selecting and optimizing chemical lysis conditions, researchers can efficiently disrupt cell membranes and release DNA, setting the stage for subsequent purification steps in the DNA sequencing sample preparation workflow.
+
+## Enzymatic lysis
+
+Employs enzymes like lysozyme to break down cell walls, particularly effective for bacterial samples.
+
+## Mechanical disruption
+
+Involves physical methods such as sonication, bead-beating, or freeze-thaw cycles to rupture cell membranes.
diff --git a/content/sequencing/dna/sample/purification/index.md b/content/sequencing/dna/sample/purification/index.md
new file mode 100644
index 0000000..b50ca20
--- /dev/null
+++ b/content/sequencing/dna/sample/purification/index.md
@@ -0,0 +1,30 @@
+---
+title: DNA purification
+type: docs
+weight: 2
+---
+
+
+
+Following cell lysis, the released DNA must be separated from other cellular components and purified.
+The choice of detergent and lysis conditions can impact these downstream processes, influencing the purity and quality of the final DNA sample.
+
+> [!CAUTION]
+>
+> This page is a work in progress and is subject to change at any moment.
+
+## Protein removal
+
+Proteinase K digestion is commonly used to degrade proteins, including nucleases that could damage the DNA.
+
+## RNA elimination
+
+RNase treatment may be employed to remove RNA, which can interfere with downstream applications.
+
+## Organic extraction
+
+Phenol-chloroform extraction is a classical method for separating nucleic acids from proteins and lipids.
+
+## Solid-phase extraction
+
+Silica-based methods, often in column format, allow DNA to bind selectively under specific pH and salt conditions.
diff --git a/content/sequencing/dna/sample/qc/_index.md b/content/sequencing/dna/sample/qc/_index.md
new file mode 100644
index 0000000..6de5c51
--- /dev/null
+++ b/content/sequencing/dna/sample/qc/_index.md
@@ -0,0 +1,17 @@
+---
+title: Quality control
+type: docs
+weight: 6
+---
+
+
+
+> [!CAUTION]
+>
+> This page is a work in progress and is subject to change at any moment.
+
+Final assessment of the sequencing library is crucial:
+
+- **qPCR**: Provides accurate quantification of amplifiable fragments.
+- **Bioanalyzer or TapeStation analysis**: Assesses fragment size distribution and library concentration.
+- **Sequencing QC**: A small-scale sequencing run may be performed to validate library quality before full-scale sequencing.
diff --git a/content/sequencing/dna/sample/quantification/_index.md b/content/sequencing/dna/sample/quantification/_index.md
new file mode 100644
index 0000000..a4c798d
--- /dev/null
+++ b/content/sequencing/dna/sample/quantification/_index.md
@@ -0,0 +1,16 @@
+---
+title: DNA quantification
+type: docs
+weight: 3
+---
+
+
+> [!CAUTION]
+>
+> This page is a work in progress and is subject to change at any moment.
+
+Accurate quantification and quality assessment of the extracted DNA are crucial:
+
+- **Spectrophotometric methods**: UV absorbance at 260 nm is used to quantify DNA, with the 260/280 nm ratio indicating purity.
+- **Fluorometric assays**: Dye-based methods (e.g., PicoGreen) offer higher sensitivity and specificity for double-stranded DNA.
+- **Gel electrophoresis**: Provides information on DNA integrity and can detect contamination by RNA or degraded DNA.
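+
+The spectrophotometric readout above can be turned into numbers with two standard conversions: roughly 50 ng/µL of double-stranded DNA per 1.0 absorbance unit at 260 nm, and a 260/280 ratio near 1.8 for protein-free DNA. The sketch below is a minimal illustration of those rules of thumb; the thresholds and function names are illustrative rather than taken from any particular instrument or kit.
+
+```python
+def dsdna_concentration(a260: float, dilution_factor: float = 1.0) -> float:
+    """Estimate dsDNA concentration (ng/uL) from absorbance at 260 nm,
+    using the common approximation of 50 ng/uL per 1.0 A260 unit."""
+    return a260 * 50.0 * dilution_factor
+
+def purity_comment(a260: float, a280: float) -> str:
+    """Interpret the 260/280 ratio (~1.8 is typically quoted for pure dsDNA)."""
+    ratio = a260 / a280
+    if ratio < 1.7:
+        return f"260/280 = {ratio:.2f}: possible protein or phenol carryover"
+    if ratio > 2.0:
+        return f"260/280 = {ratio:.2f}: possible RNA contamination"
+    return f"260/280 = {ratio:.2f}: acceptable purity"
+
+print(f"~{dsdna_concentration(0.75, dilution_factor=10):.0f} ng/uL")  # ~375 ng/uL
+print(purity_comment(0.75, 0.40))
+```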
diff --git a/content/sequencing/dna/second-gen/_index.md b/content/sequencing/dna/second-gen/_index.md
new file mode 100644
index 0000000..989ffb2
--- /dev/null
+++ b/content/sequencing/dna/second-gen/_index.md
@@ -0,0 +1,9 @@
+---
+title: Second generation
+type: docs
+weight: 4
+---
+
+
+Second-generation sequencing technologies, or next-generation sequencing (NGS), are DNA sequencing methods that emerged in the mid to late 1990s and were implemented in commercial DNA sequencers by 2000.
+These methods use miniaturized and parallelized platforms to sequence 1 million to 43 billion short reads (50 to 400 bases each) per instrument run.
diff --git a/content/sequencing/dna/second-gen/illumina/_index.md b/content/sequencing/dna/second-gen/illumina/_index.md
new file mode 100644
index 0000000..c05244a
--- /dev/null
+++ b/content/sequencing/dna/second-gen/illumina/_index.md
@@ -0,0 +1,192 @@
+---
+title: Illumina
+type: docs
+---
+
+
+Illumina sequencing, arguably the most widely used sequencing technology, is based on the concept of "sequencing by synthesis".
+Sequencing by synthesis technology uses a polymerase or ligase enzyme to incorporate nucleotides with a fluorescent tag, which are then identified to determine the DNA sequence.
+
+
+
+## Genomic library
+
+### DNA fragmentation
+
+The initial phase in NGS library preparation for Illumina systems involves breaking down DNA into the desired size range, typically 300–600 bp, depending on the intended application.
+Traditionally, two methods are used for DNA fragmentation: mechanical shearing and enzymatic digestion.
+
+#### Mechanical shearing
+
+Mechanical shearing is the more prevalent of the two methods because it fragments DNA without sequence bias and yields consistent fragment sizes.
+In contrast, enzymatic digestion requires lower DNA input and provides a more streamlined library preparation workflow.
+Mechanical shearing involves the disruption of phosphodiester linkages in DNA molecules by applying shear force.
+Standard methods include high-power, unfocused sonication, nebulization, and focused high-frequency acoustic shearing.
+
+**Sonication** is the simplest method.
+It employs a sonicator (probe- or waterbath-based) to emit low-frequency acoustic waves for shearing.
+While probe-based sonication delivers more focused energy, the samples are exposed in an open container, posing a contamination risk. Waterbath-based sonication keeps samples within a closed system but often requires higher energy.
+Optimization is crucial to achieving desired fragment lengths, and resting periods are necessary between sonication cycles to prevent overheating, leading to a longer workflow.
+
+**Nebulization** involves using compressed gas to create shear force, forcing a nucleic acid solution through a small hole in a nebulizer.
+The level of fragmentation can be controlled by gas pressure but may result in high sample loss.
+
+The **focused acoustic** method employs high-frequency ultrasonic waves to shear DNA with minimal sample loss, low contamination risk, and better control over uniform fragmentation.
+However, its usage is limited due to the need for specialized equipment and associated costs.
+
+#### Enzymatic digestion
+
+**Enzymatic digestion** is an effective alternative to mechanical shearing. It employs endonucleases and nicking enzymes to cleave DNA strands.
+Enzymes with less specificity or enzyme cocktails are used to mitigate sequence bias. Enzymatic digestion requires lower DNA input and enables automation, streamlining the workflow, minimizing sample loss, reducing contamination risks, and decreasing hands-on time.
+
+**Transposon-based fragmentation** offers an alternative to mechanical shearing and enzymatic digestion.
+This approach, using transposons, simultaneously fragments and tags DNA templates, generating blunt DNA fragments with transposed sequences at both ends.
+Adapters (and indexes) are added via adapter-addition PCR, circumventing traditional workflow steps like DNA fragmentation, end conversion, and adapter ligation.
+
+### End repair
+
+Following the initial fragmentation, DNA samples undergo a repair process known as end repair.
+DNA fragments generated through mechanical shearing or enzymatic digestion exhibit a combination of 5′ and 3′ protruding ends that necessitate repair or conversion for subsequent ligation with adapters.
+The critical steps involved in this process include making the termini blunt, phosphorylating them, and adenylating them.
+
+To accomplish this:
+
+- 5′ overhangs are filled in using the 5′→3′ polymerase activity of enzymes like T4 DNA polymerase or Klenow fragment.
+- 3′ overhangs are eliminated by the 3′→5′ exonuclease activity of enzymes such as T4 DNA polymerase.
+- The 5′ ends of the blunted DNA fragments are phosphorylated for efficient ligation in subsequent steps, employing enzymes like T4 polynucleotide kinase.
+- The 3′ ends of the blunted DNA fragments undergo adenylation (A tailing), a requirement for T–A ligation with Illumina adapters. Enzymes such as Klenow fragment (exo–) or Taq DNA polymerase facilitate this step.
+
+> ![](https://www.thermofisher.com/us/en/home/life-science/cloning/cloning-learning-center/invitrogen-school-of-molecular-biology/next-generation-sequencing/dna-sequencing-preparation-illumina/jcr:content/MainParsys/textimage_6889/image.img.320.medium.jpg/1693927921286.jpg)
+>
+> Credit: [ThermoFisher](https://www.thermofisher.com/us/en/home/life-science/cloning/cloning-learning-center/invitrogen-school-of-molecular-biology/next-generation-sequencing/dna-sequencing-preparation-illumina.html)
+
+While the end conversion process involves several enzymatic steps, some commercially available kits are designed to streamline these reactions in a single tube, saving time and minimizing sample loss.
+
+### Adapter ligation
+
+Adapters are pairs of short, annealed DNA oligonucleotides that aid in the amplification and sequencing processes.
+Sets of annealed adapters are attached to both ends of the library fragments.
+This allows the fragments to be captured by the oligos on the flow cell and identified during sequencing. In the library preparation phase, an excess of adapters relative to the sample DNA drives the ligation reaction to completion.
+Efficient ligation is crucial for transforming DNA fragments into sequences, influencing the conversion rate and yield of the libraries.
+These library fragments, enclosed by adapters, are sometimes called inserts.
+
+During the formation of adapter duplexes, two oligos, P5 and P7, are annealed.
+The P5 and P7 adapters derive their names from their binding sites on the flow cell oligos.
+Adapters have noncomplementary ends to prevent self-ligation, forming a Y shape after annealing.
+However, this Y shape is lost if library amplification follows.
+
+> ![](https://www.thermofisher.com/us/en/home/life-science/cloning/cloning-learning-center/invitrogen-school-of-molecular-biology/next-generation-sequencing/dna-sequencing-preparation-illumina/jcr:content/MainParsys/textimage_6e54/image.img.320.medium.jpg/1693927921338.jpg)
+>
+> Credit: [ThermoFisher](https://www.thermofisher.com/us/en/home/life-science/cloning/cloning-learning-center/invitrogen-school-of-molecular-biology/next-generation-sequencing/dna-sequencing-preparation-illumina.html)
+
+Examining the library adapters in more detail, they typically consist of 50–60 nucleotides and encompass features such as:
+
+- Sites binding to P5 or P7 oligos on the flow cells and sequencing primers.
+- Index sequences, usually 6–8 nucleotides, distinguishing one sample from another, enabling multiplexing.
+- Additional T at the 3′ end of the P5 adapter to prevent adapter dimer formation.
+- Phosphate at the 5′ end of the P7 adapter for ligation with the 3′ end of library fragments.
+
+> ![](https://www.thermofisher.com/us/en/home/life-science/cloning/cloning-learning-center/invitrogen-school-of-molecular-biology/next-generation-sequencing/dna-sequencing-preparation-illumina/jcr:content/MainParsys/textimage_2b88/image.img.320.medium.jpg/1693927921369.jpg)
+>
+> Credit: [ThermoFisher](https://www.thermofisher.com/us/en/home/life-science/cloning/cloning-learning-center/invitrogen-school-of-molecular-biology/next-generation-sequencing/dna-sequencing-preparation-illumina.html)
+
+The issue of index hopping, which occurs when sequencing multiple libraries together, is addressed through two main strategies:
+
+- Using unique dual indexes (UDIs) instead of combinatorial dual indexes (CDIs) ensures that index sequences are assigned uniquely to each library in the pool before sequencing.
+- Reducing the number of unligated adapters in the samples minimizes index hopping, with PCR-free libraries being more susceptible due to fewer cleanup steps usually performed.
+
+> ![](https://www.thermofisher.com/us/en/home/life-science/cloning/cloning-learning-center/invitrogen-school-of-molecular-biology/next-generation-sequencing/dna-sequencing-preparation-illumina/jcr:content/MainParsys/textimage_d5c7/image.img.320.medium.jpg/1693927921436.jpg)
+>
+> Credit: [ThermoFisher](https://www.thermofisher.com/us/en/home/life-science/cloning/cloning-learning-center/invitrogen-school-of-molecular-biology/next-generation-sequencing/dna-sequencing-preparation-illumina.html)
+
+These strategies help mitigate the challenges posed by index hopping during sequencing, particularly in the context of data analysis and the accurate assignment of sequencing data to respective samples.
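+
+On the computational side, demultiplexing assigns each read to a sample by comparing the sequenced index against the expected index sequences, usually tolerating a small number of mismatches; well-separated (ideally unique dual) indexes keep misassignment from index hopping low. The sketch below is a toy version of that matching step; the index sequences and mismatch threshold are invented for the example.
+
+```python
+def hamming(a: str, b: str) -> int:
+    """Count mismatching positions between two equal-length sequences."""
+    return sum(x != y for x, y in zip(a, b))
+
+def assign_sample(read_index: str, sample_indexes: dict, max_mismatches: int = 1) -> str:
+    """Return the sample whose expected index is closest to the observed index,
+    or 'undetermined' if none is within the mismatch tolerance."""
+    best_sample, best_distance = "undetermined", max_mismatches + 1
+    for sample, index in sample_indexes.items():
+        distance = hamming(read_index, index)
+        if distance < best_distance:
+            best_sample, best_distance = sample, distance
+    return best_sample
+
+# Hypothetical 8-nt indexes for two pooled libraries.
+indexes = {"sample_A": "ACGTACGT", "sample_B": "TTGCAAGC"}
+print(assign_sample("ACGTACGA", indexes))  # one mismatch -> sample_A
+print(assign_sample("GGGGGGGG", indexes))  # no close match -> undetermined
+```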
+
+### Amplification
+
+Methods for preparing DNA libraries can be classified as either PCR-free or PCR-based, depending on the necessity for amplification.
+Adhering to protocols that result in diverse and representative libraries of input samples across various quantities is crucial to ensuring the production of high-quality data.
+
+#### PCR-free
+
+PCR-free library preparation is typically preferred when creating libraries that cover sequences with high GC or AT content, as PCR amplification may contribute to GC bias.
+This approach helps ensure library diversity.
+It's important to note that bias can still be introduced during cluster generation and sequencing chemistry, even with PCR-free library preparation methods.
+
+Compared to PCR-based methods, PCR-free libraries generally require more starting material.
+However, there have been advancements in reducing these requirements.
+This can be challenging in situations involving limited or precious samples and highly degraded nucleic acids.
+It can also be more challenging to accurately assess library quality and quantity for PCR-free libraries than for PCR-amplified libraries.
+
+However, the improved representation and balanced coverage provided by PCR-free libraries make them appealing for various applications, including:
+
+- Studies involving population-scale genomics and the molecular basis of diseases.
+- Investigations of promoters and regulatory regions of the genome, which often have high GC or AT content.
+- Whole-genome sequencing analysis and variant calling for single-nucleotide polymorphisms (SNPs) and small insertions or deletions (indels).
+
+#### PCR-based
+
+The PCR-based approach is typical for constructing Next-Generation Sequencing (NGS) libraries due to its ability to accommodate lower sample input and selectively amplify inserts with adapters at both ends.
+However, PCR may introduce bias towards GC content, presenting challenges in subsequent data analysis.
+For instance, GC bias can impede de novo genome assembly and the discovery of single-nucleotide polymorphisms (SNPs).
+
+Several factors contribute to GC bias, and it is crucial to consider the following elements to achieve a well-balanced library coverage:
+
+- The choice of PCR enzyme and master mix.
+- The number of PCR cycles performed, along with cycling conditions.
+- The presence of PCR additives or enhancers in the reaction.
+
+An elevation in the number of PCR cycles when using a specific PCR enzyme or master mix typically results in increased GC bias.
+As a general guideline, run the minimum number of cycles (e.g., 4–8) that yield sufficient library quantities for sequencing.
+
+Reducing the number of PCR cycles mitigates GC bias, diminishes PCR duplicates, and enhances library complexity.
+PCR duplicates refer to sequencing reads originating from two or more PCR amplicons of the same DNA molecule.
+Although bioinformatics tools exist to detect and eliminate PCR duplicates during data analysis, minimizing their occurrence is essential for efficiently utilizing the flow cell in sequencing.
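+
+Conceptually, duplicate marking treats reads that align to the same reference position and strand as probable PCR copies of one original molecule and keeps only one of them; production tools such as Picard MarkDuplicates or samtools markdup refine this idea using mate positions, orientation, and base qualities. The sketch below is a deliberately simplified, coordinate-only version of that logic with a made-up tuple layout.
+
+```python
+def remove_coordinate_duplicates(alignments):
+    """Toy duplicate removal: keep the first read seen at each
+    (reference, position, strand) key and drop later ones.
+    Each alignment is a (read_name, reference, position, strand) tuple."""
+    kept, seen = [], set()
+    for name, ref, pos, strand in alignments:
+        key = (ref, pos, strand)
+        if key not in seen:
+            seen.add(key)
+            kept.append(name)
+    return kept
+
+reads = [
+    ("read1", "chr1", 10_000, "+"),
+    ("read2", "chr1", 10_000, "+"),  # same start and strand -> likely PCR duplicate
+    ("read3", "chr1", 10_250, "-"),
+]
+print(remove_coordinate_duplicates(reads))  # ['read1', 'read3']
+```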
+
+Additionally, other PCR artifacts, such as amplification bias (attributed to PCR stochasticity), nucleotide errors (resulting from enzyme fidelity), and PCR chimeras (stemming from the enzyme's template switching), can compromise library quality and complexity.
+
+## Flow cell loading
+
+The flow cell loading step is a critical preparatory process in Illumina sequencing that attaches DNA or RNA fragments to the sequencing flow cell surface and amplifies them into clusters ready for sequencing.
+
+First, the glass flow cell surface is chemically treated to create lanes of tiny wells, each containing oligos complementary to the platform-specific adapter sequences added to the fragment ends.
+The fragments to be sequenced, whether genomic DNA or cDNA from an RNA preparation, are diluted to an optimal concentration and hydraulically loaded into the flow cell lanes along with necessary sequencing reagents.
+The adapter-modified fragments then hybridize and bind to their complementary oligos on the flow cell surface.
+
+## Clonal amplification
+
+The Illumina platform employs solid-phase amplification, wherein each DNA fragment in the library initially binds to primers on the sequencing chip, also known as the flow cell, through adapters.
+Through a series of amplification reactions called bridge amplification, each fragment forms a cluster of identical molecules known as clonal clusters.
+Each cluster corresponds to a single primary library molecule.
+It is important to note that when clonal amplification is performed on a patterned flow cell with predefined arrays, an exclusion amplification (ExAmp) chemistry is employed.
+ExAmp technology involves the immediate amplification of a DNA fragment after binding to the primer on the patterned flow cell, preventing other DNA fragments from forming a polyclonal cluster.
+
+> ![](https://www.thermofisher.com/us/en/home/life-science/cloning/cloning-learning-center/invitrogen-school-of-molecular-biology/next-generation-sequencing/illumina-workflow/jcr:content/MainParsys/textimage_1d51/image.img.320.low.jpg/1693405678648.jpg)
+>
+> Credit: [ThermoFisher](https://www.thermofisher.com/us/en/home/life-science/cloning/cloning-learning-center/invitrogen-school-of-molecular-biology/next-generation-sequencing/illumina-workflow.html)
+
+It is essential to distinguish this clonal amplification process from library amplification, which aims to increase the library input before loading it onto a flow cell.
+
+## Sequencing by synthesis
+
+Following clonal amplification, the subsequent step is sequencing by synthesis (SBS).
+This method detects nucleotides incorporated by a DNA polymerase into the complementary DNA strand of clonal clusters on a base-by-base basis.
+
+The sequencing technology by Illumina employs fluorescent dye–labeled dNTPs featuring a reversible terminator to capture fluorescent signals in each cycle, utilizing a process known as cyclic reversible termination.
+In each cycle, the DNA polymerase incorporates only one of the four fluorescent dNTPs based on complementarity, and then unbound dNTPs are removed.
+Images of the clusters are taken after the incorporation of each nucleotide, and the emission wavelength and fluorescence intensity of the incorporated nucleotide are analyzed to identify the base incorporated in each cluster during that cycle.
+Following imaging, the fluorescent dye and terminator are cleaved and released, and the next cycle of incorporation, imaging, and deprotection begins.
+As each base is sequenced individually in each cycle, this process is iterated for "n" cycles to achieve a read length of "n" bases.
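+
+To make the cycle-by-cycle logic concrete, the toy sketch below calls one base per cycle for a single cluster by picking the fluorescence channel with the highest intensity. The four-channel layout and intensity values are purely illustrative; real instruments vary in channel chemistry and apply extensive signal correction before base calling.
+
+```python
+# Hypothetical per-cycle intensities for one cluster in a four-channel chemistry.
+cycles = [
+    {"A": 0.9, "C": 0.1, "G": 0.2, "T": 0.1},
+    {"A": 0.1, "C": 0.8, "G": 0.1, "T": 0.2},
+    {"A": 0.2, "C": 0.1, "G": 0.9, "T": 0.1},
+    {"A": 0.1, "C": 0.2, "G": 0.1, "T": 0.7},
+]
+
+read = ""
+for cycle, intensities in enumerate(cycles, start=1):
+    base = max(intensities, key=intensities.get)  # brightest channel = called base
+    read += base
+    print(f"cycle {cycle}: called {base}")
+
+print(f"{len(read)} cycles -> read of length {len(read)}: {read}")
+```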
+
+> ![](https://atdbio.com/assets/book/reversible-terminator-sequencing.svg){ align=left width=700 }
+>
+> Credit: [atdbio](https://atdbio.com/nucleic-acids-book/Next-generation-sequencing#Reversible-terminator-sequencing-Illumina)
+
+> ![](https://www.thermofisher.com/us/en/home/life-science/cloning/cloning-learning-center/invitrogen-school-of-molecular-biology/next-generation-sequencing/illumina-workflow/jcr:content/MainParsys/textimage_544/image.img.320.low.jpg/1693405678683.jpg){ align=left width=700 }
+>
+> Credit: [ThermoFisher](https://www.thermofisher.com/us/en/home/life-science/cloning/cloning-learning-center/invitrogen-school-of-molecular-biology/next-generation-sequencing/illumina-workflow.html)
+
+
+
+[^pereira2020bioinformatics]: Pereira, R., Oliveira, J., & Sousa, M. (2020). Bioinformatics and computational tools for next-generation sequencing analysis in clinical genetics. *Journal of clinical medicine, 9*(1), 132. doi: [10.3390/jcm9010132](https://doi.org/10.3390/jcm9010132)
+[^hess2020library]: Hess, J. F., Kohl, T. A., Kotrová, M., Rönsch, K., Paprotka, T., Mohr, V., ... & Paust, N. (2020). Library preparation for next generation sequencing: A review of automation strategies. *Biotechnology advances, 41*, 107537. doi: [10.1016/j.biotechadv.2020.107537](https://doi.org/10.1016/j.biotechadv.2020.107537)
+[^thermofisher]: [ThermoFisher](https://www.thermofisher.com/us/en/home/life-science/cloning/cloning-learning-center/invitrogen-school-of-molecular-biology/next-generation-sequencing/dna-sequencing-preparation-illumina.html)
diff --git a/content/sequencing/dna/second-gen/ion-torrent/_index.md b/content/sequencing/dna/second-gen/ion-torrent/_index.md
new file mode 100644
index 0000000..071286a
--- /dev/null
+++ b/content/sequencing/dna/second-gen/ion-torrent/_index.md
@@ -0,0 +1,9 @@
+---
+title: Ion torrent
+type: docs
+---
+
+
+TODO:
+
+
diff --git a/content/sequencing/dna/third-gen/_index.md b/content/sequencing/dna/third-gen/_index.md
new file mode 100644
index 0000000..a0e912a
--- /dev/null
+++ b/content/sequencing/dna/third-gen/_index.md
@@ -0,0 +1,17 @@
+---
+title: Third generation
+type: docs
+weight: 5
+---
+
+
+The next era of DNA sequencing technology, termed third-generation sequencing (TGS), has gained a place in biology as a way to study genomes, transcriptomes, and metagenomes at an unprecedented resolution.
+These technologies work by single-molecule sequencing and provide:
+
+- Long reads with no amplification.
+- Direct detection of epigenetic modifications on native DNA.
+- Direct sequencing through regions of the genome inaccessible or difficult to analyze by short-read platforms.
+- Uniform coverage of the genome as they are not as sensitive to GC content as short-read platforms.
+
+There are several companies currently at the heart of third-generation sequencing technology development, namely [Pacific Biosciences](https://www.pacb.com/), [Oxford Nanopore Technology](https://nanoporetech.com/), and [Quantapore](https://quantapore.com/).
+These companies are taking fundamentally different approaches to sequencing single DNA molecules.
diff --git a/content/sequencing/dna/third-gen/nanopore/_index.md b/content/sequencing/dna/third-gen/nanopore/_index.md
new file mode 100644
index 0000000..192458c
--- /dev/null
+++ b/content/sequencing/dna/third-gen/nanopore/_index.md
@@ -0,0 +1,9 @@
+---
+title: Nanopore
+type: docs
+---
+
+
+TODO:
+
+
diff --git a/content/sequencing/rna/_index.md b/content/sequencing/rna/_index.md
new file mode 100644
index 0000000..e90791a
--- /dev/null
+++ b/content/sequencing/rna/_index.md
@@ -0,0 +1,8 @@
+---
+title: RNA sequencing
+type: docs
+toc: false
+weight: 2
+---
+
+TODO:
diff --git a/content/sequencing/rna/library/_index.md b/content/sequencing/rna/library/_index.md
new file mode 100644
index 0000000..8b1e41c
--- /dev/null
+++ b/content/sequencing/rna/library/_index.md
@@ -0,0 +1,82 @@
+---
+title: RNA-seq library
+type: docs
+---
+
+
+The purpose of an RNA-seq library is to enable the comprehensive analysis of an organism's transcriptome through high-throughput sequencing.
+We must collect, process, and prepare RNA molecules for downstream sequencing.
+
+## Avoiding RNA degradation
+
+Working with RNA presents unique challenges due to its inherently unstable nature and the widespread presence of ribonucleases (RNases) in cells, tissues, and environmental surfaces, which can rapidly degrade RNA molecules.[^mp-bio-rna-extract]
+These enzymes are remarkably resilient and can significantly compromise the integrity of RNA samples if not adequately managed.
+Therefore, maintaining an RNase-free environment and handling practices is crucial in RNA extraction and downstream applications to ensure the quality and reliability of experimental results.
+Here are several key strategies to prevent RNA degradation:
+
+1. **Decontaminate Work Surfaces and Equipment.**
+ Before beginning any RNA work, thoroughly clean all work surfaces, pipettors, and equipment with 70% ethanol and 3% hydrogen peroxide.
+ This step is vital for removing potential RNase contamination.
+ Products specifically designed for neutralizing RNases, such as RNase Erase decontamination solution, can be highly effective.
+2. **Use Personal Protective Equipment (PPE).**
+ Wearing gloves is essential to protect your RNA samples from RNases that might be present on your skin.
+ Changing gloves frequently is advisable, especially after touching potentially contaminated surfaces or materials.
+3. **Employ RNase-Free Consumables.**
+ Ensure all reagents, tubes, pipette tips, and other consumables are RNase-free.
+ Many suppliers provide certified RNase-free products, reducing the risk of introducing RNases into your samples.
+4. **Temperature Control.**
+ RNA is less stable at higher temperatures, so keeping samples cold throughout the extraction process is crucial.
+ Use ice or a cold block to keep samples and reagents chilled unless a specific step in your protocol requires room temperature or elevated temperatures.
+5. **Efficient Workflow.**
+ Plan and organize your workflow before starting the extraction process.
+ Having all necessary reagents, equipment, and consumables ready and within reach can minimize the time your RNA samples are exposed to potential RNase activity and reduce the risk of degradation.
+
+By following these guidelines, you can significantly reduce the risk of RNA degradation and improve the success of your RNA extraction and subsequent analyses. Remember, meticulous attention to detail and strict adherence to RNase-free techniques are vital to preserving the integrity of your RNA samples.
+
+## RNA extraction
+
+The process of RNA extraction can be carried out through several approaches, including organic extraction with agents such as TRIzol, the use of RNA extraction kits featuring filter-based spin columns and lysis buffers, or the employment of magnetic particle methods.
+A DNase treatment is applied to remove any residual DNA from the RNA sample, a crucial step that ensures the RNA-seq library is free of DNA contaminants and supports accurate sequencing outcomes.
+Maintaining the RNA in an environment free from RNase is imperative to preserve its integrity.
+
+### Selection of RNA species
+
+Before embarking on RNA-Seq library construction, selecting an optimal library preparation protocol is pivotal.
+This protocol should either enrich or deplete the "total" RNA sample for specific RNA species.
+The total RNA pool comprises ribosomal RNA (rRNA), precursor messenger RNA (pre-mRNA), mRNA, and various noncoding RNA (ncRNA) classes.
+In most cell types, rRNA dominates, often representing over 95% of all cellular RNA.
+Without removing rRNA before library construction, these transcripts would dominate sequencing efforts.
+This dominance would significantly reduce the sequencing depth and limit the detectability of less abundant RNA species.
+Hence, the effective elimination of rRNA is essential for accurate transcriptome profiling.
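+
+A quick back-of-the-envelope calculation (with illustrative numbers only) shows the stakes: if rRNA makes up 95% of the input and is not removed, only a small fraction of reads report on the rest of the transcriptome, whereas an efficient depletion step recovers most of the sequencing depth.
+
+```python
+total_reads = 30_000_000       # hypothetical sequencing depth for one library
+rrna_fraction = 0.95           # approximate share of rRNA in total cellular RNA
+depletion_efficiency = 0.99    # illustrative fraction of rRNA removed before sequencing
+
+# Without depletion, reads are drawn from the original RNA pool.
+informative_without = total_reads * (1 - rrna_fraction)
+
+# With depletion, the pool composition shifts before the same number of reads is sequenced.
+rrna_left = rrna_fraction * (1 - depletion_efficiency)
+informative_with = total_reads * (1 - rrna_fraction) / (rrna_left + (1 - rrna_fraction))
+
+print(f"Informative reads without depletion: ~{informative_without:,.0f}")  # ~1.5 million
+print(f"Informative reads with depletion:    ~{informative_with:,.0f}")     # ~25 million
+```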
+
+Many protocols concentrate on enriching mRNA molecules by targeting the 3′ poly-A tail of mRNA molecules.
+This targeting is achieved using poly-T oligos attached to a substrate, such as magnetic beads.
+An alternative strategy involves the selective depletion of rRNA using commercially available kits, like RiboMinus (Life Technologies) or RiboZero (Epicentre).
+This approach is particularly advantageous for the accurate quantification of noncoding RNA species, which may lack poly-A tails and would therefore be omitted from poly-A-enriched libraries.
+These strategies ensure a more comprehensive and precise transcriptome analysis by focusing sequencing resources on the RNA species of interest.
+
+## Complementary DNA
+
+The conversion of RNA into complementary DNA (cDNA) is fundamental for several reasons that bolster the integrity and utility of the sequencing process.
+
+RNA molecules are naturally less stable and more susceptible to degradation than DNA.
+This inherent instability could potentially compromise the sequencing outcomes.
+The transformation of RNA into cDNA addresses this issue by significantly enhancing the library's stability and durability, thereby ensuring the reliability of downstream processing and sequencing operations.
+
+The vast majority of high-throughput sequencing platforms are primarily designed for DNA analysis.
+By converting RNA to cDNA, the samples become compatible with these technologies.
+This compatibility facilitates a comprehensive transcriptome analysis, enabling researchers to delve deeper into genetic expressions and variations.
+
+The conversion process involves reverse transcription, where RNA is transformed into cDNA, followed by an amplification stage.
+These steps are essential, particularly for samples with low input or those derived from single cells, as they ensure that even transcripts present in minimal amounts are adequately represented for sequencing.
+This amplification is crucial for detecting and quantifying transcripts, playing a pivotal role in the success of sequencing efforts.
+
+
+
+[^sarantopoulou2019comparative]: Sarantopoulou, D., Tang, S. Y., Ricciotti, E., Lahens, N. F., Lekkas, D., Schug, J., ... & Grant, G. R. (2019). Comparative evaluation of RNA-Seq library preparation methods for strand-specificity and low input. Scientific reports, 9(1), 13477. doi: [10.1038/s41598-019-49889-1](https://doi.org/10.1038/s41598-019-49889-1)
+[^berge2019rna]: Van den Berge, K., Hembach, K. M., Soneson, C., Tiberi, S., Clement, L., Love, M. I., ... & Robinson, M. D. (2019). RNA sequencing data: hitchhiker's guide to expression analysis. *Annual Review of Biomedical Data Science, 2*, 139-173. doi: [10.1146/annurev-biodatasci-072018-021255](https://doi.org/10.1146/annurev-biodatasci-072018-021255)
+[^hrdlickova2017rna]: Hrdlickova, R., Toloue, M., & Tian, B. (2017). RNA‐Seq methods for transcriptome analysis. Wiley Interdisciplinary Reviews: RNA, 8(1), e1364. doi: [10.1002/wrna.1364](https://doi.org/10.1002/wrna.1364)
+[^han2015advanced]: Han, Y., Gao, S., Muegge, K., Zhang, W., & Zhou, B. (2015). Advanced applications of RNA sequencing and challenges. Bioinformatics and biology insights, 9, BBI-S28991. doi: [10.4137/BBI.S28991](https://doi.org/10.4137/BBI.S28991)
+[^kukurba2015rna]: Kukurba, K. R., & Montgomery, S. B. (2015). RNA sequencing and analysis. Cold Spring Harbor Protocols, 2015(11), pdb-top084970. doi: [10.1101/pdb.top084970](https://doi.org/10.1101/pdb.top084970)
+[^robles2012efficient]: Robles, J. A., Qureshi, S. E., Stephen, S. J., Wilson, S. R., Burden, C. J., & Taylor, J. M. (2012). Efficient experimental design and analysis strategies for the detection of differential expression using RNA-Sequencing. BMC genomics, 13(1), 1-14. doi: [10.1186/1471-2164-13-484](https://doi.org/10.1186/1471-2164-13-484)
+[^wang2009rna]: Wang, Z., Gerstein, M., & Snyder, M. (2009). RNA-Seq: a revolutionary tool for transcriptomics. Nature reviews genetics, 10(1), 57-63. doi: [10.1038/nrg2484](https://doi.org/10.1038/nrg2484)
diff --git a/content/transcriptomics/_index.md b/content/transcriptomics/_index.md
new file mode 100644
index 0000000..494c4fd
--- /dev/null
+++ b/content/transcriptomics/_index.md
@@ -0,0 +1,4 @@
+---
+title: Transcriptomics
+type: docs
+---
diff --git a/content/transcriptomics/ge/_index.md b/content/transcriptomics/ge/_index.md
new file mode 100644
index 0000000..a54bd82
--- /dev/null
+++ b/content/transcriptomics/ge/_index.md
@@ -0,0 +1,6 @@
+---
+title: Gene expression
+type: docs
+weight: 3
+---
+
diff --git a/content/transcriptomics/mapping/_index.md b/content/transcriptomics/mapping/_index.md
new file mode 100644
index 0000000..5e83bc2
--- /dev/null
+++ b/content/transcriptomics/mapping/_index.md
@@ -0,0 +1,56 @@
+---
+title: Read mapping
+type: docs
+weight: 1
+---
+
+
+Splicing is a crucial process in gene expression where introns (non-coding regions) are removed from pre-mRNA transcripts, and exons (coding regions) are joined together.
+
+> ![](https://www.cshl.edu/wp-content/uploads/2018/08/exons_introns_splicing.jpg){ align=left width=600 }
+>
+> Credit: [Cold Spring Harbor Laboratory](https://www.cshl.edu/predicting-how-splicing-errors-impact-disease-risk/)
+
+This process allows a single gene to produce multiple proteins through alternative splicing, significantly contributing to transcript diversity and the complexity of proteomes in eukaryotic organisms.
+By selectively including or excluding RNA segments, cells can generate various proteins from a limited number of genes, playing a fundamental role in cellular functions and organism development.
+
+The necessity of splice-aware alignment tools for RNA-seq (RNA sequencing) data analysis stems from the challenge of accurately mapping short RNA-seq reads to a reference genome.
+These reads may span exon-exon junctions, where the genetic sequence is discontinuous due to the removal of introns during splicing.
+[Traditional alignment tools](../../alignment/), designed for contiguous genomic sequences, may fail to correctly align reads across exon-exon junctions, leading to inaccurate gene expression quantification and misunderstanding of the transcriptome's complexity.
+Splice-aware aligners are designed to handle these complexities by identifying potential splicing events and mapping reads spanning exon-exon junctions.
+This enables precise quantification of gene expression levels, identification of alternative splicing events, and a deeper understanding of the transcriptome's dynamic nature, essential for studies on gene regulation, disease mechanisms, and therapeutic strategies.
+
+## Challenges
+
+### Splice junctions
+
+The primary challenge in splice-aware alignment is the identification of splice junctions.
+Unlike other genomic regions, exon-exon junctions (the points where two exons are joined together after introns are removed) are not contiguous in the genome.
+This non-contiguity makes it challenging to align reads that span these junctions because the alignment tools must infer the splicing events that connect the exons.
+This inference process is computationally intensive because of:
+
+- The vast diversity of potential splicing events, with thousands of possible combinations in which exons can be joined.
+- The need for high accuracy in distinguishing true splice junctions from sequencing errors or variations.
+- The computational complexity of aligning short reads to a reference genome while considering the potential for multiple splicing options.
+
+Several computational tools and algorithms have been developed to tackle the challenge of identifying splice junctions.
+These tools employ strategies such as using known splice junctions from databases, predicting junctions based on sequence motifs, or employing machine learning techniques to infer splicing events from RNA-seq data.
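+
+The toy example below (synthetic sequences, no real aligner) illustrates the core difficulty: a read that spans an exon-exon junction has no contiguous match in the genome, yet each of its halves does, separated by the intervening intron. Splice-aware aligners effectively search for exactly this kind of split match.
+
+```python
+exon1 = "ATGGCCAAG"
+intron = "GTAAGTTTTTTTTAG"   # introns commonly begin with GT and end with AG
+exon2 = "GCTTGGTAA"
+
+genome = exon1 + intron + exon2   # genomic sequence retains the intron
+transcript = exon1 + exon2        # spliced mRNA joins the exons
+
+read = transcript[5:13]           # "CAAGGCTT", spans the exon-exon junction
+
+print(read in genome)             # False: no contiguous match in the genome
+print(read in transcript)         # True: matches the spliced transcript
+
+# A splice-aware strategy effectively aligns the two halves on either side of the junction.
+left, right = read[:4], read[4:]
+print(left in exon1, right in exon2)   # True True
+```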
+
+### Alternative splicing
+
+The vast potential for alternative splicing in eukaryotic organisms means that there can be many possible transcripts for a given gene, each potentially producing different protein variants.
+Accurately mapping reads to these variants requires sophisticated computational approaches that can handle the ambiguity and complexity of the transcriptome.
+
+### Genomic Variability and Errors
+
+Complicating matters further, genomic variability among individuals and sequencing errors can mimic splicing events or obscure real splice junctions, making the task of accurately identifying and mapping reads to the correct genomic locations even more challenging.
+
+### Short reads
+
+Short reads are small sequences of DNA, typically ranging from 50 to 300 base pairs in length, produced by NGS technologies. Despite their critical role in genomic analysis, these short sequences pose significant challenges for accurate genome alignment.
+Due to their limited length, short reads may not contain enough unique information to unambiguously map them back to their original location in the genome.
+This is akin to finding the exact location of a specific sentence in a large book by only having a few words from that sentence.
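+
+A small sketch makes this ambiguity concrete: the shorter a read, the more often its sequence occurs elsewhere, so it cannot be placed uniquely. The random reference below only illustrates the length effect; real genomes are repetitive, so even longer reads can remain ambiguous.
+
+```python
+import random
+
+random.seed(0)
+reference = "".join(random.choice("ACGT") for _ in range(100_000))  # toy reference
+
+def count_exact_matches(reference: str, read: str) -> int:
+    """Count every position at which 'read' matches the reference exactly."""
+    count, start = 0, reference.find(read)
+    while start != -1:
+        count += 1
+        start = reference.find(read, start + 1)
+    return count
+
+true_position = 50_000
+for read_length in (6, 10, 20, 50):
+    read = reference[true_position:true_position + read_length]
+    hits = count_exact_matches(reference, read)
+    print(f"{read_length:3d} bp read: {hits} exact match(es) in the reference")
+```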
+
+
+
+[^su2023identification]: Su, T., Hollas, M. A., Fellers, R. T., & Kelleher, N. L. (2023). Identification of splice variants and isoforms in transcriptomics and proteomics. *Annual review of biomedical data science, 6*, 357-376. doi: [10.1146/annurev-biodatasci-020722-044021](https://doi.org/10.1146/annurev-biodatasci-020722-044021)
diff --git a/content/transcriptomics/mapping/bowtie/_index.md b/content/transcriptomics/mapping/bowtie/_index.md
new file mode 100644
index 0000000..522729a
--- /dev/null
+++ b/content/transcriptomics/mapping/bowtie/_index.md
@@ -0,0 +1,15 @@
+---
+title: Bowtie
+type: docs
+---
+
+
+
+
+TODO:
+
+
+
+[^bowtie_github]: [github.com/BenLangmead/bowtie](https://github.com/BenLangmead/bowtie)
+[^langmead2012fast]: Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. *Nature methods, 9*(4), 357-359. doi: [10.1038/nmeth.1923](https://doi.org/10.1038/nmeth.1923)
+[^langmead2009ultrafast]: Langmead, B., Trapnell, C., Pop, M., & Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology, 10(3), 1-10. doi: [10.1186/gb-2009-10-3-r25](https://doi.org/10.1186/gb-2009-10-3-r25)
diff --git a/content/transcriptomics/mapping/bwa/_index.md b/content/transcriptomics/mapping/bwa/_index.md
new file mode 100644
index 0000000..4fe41ca
--- /dev/null
+++ b/content/transcriptomics/mapping/bwa/_index.md
@@ -0,0 +1,19 @@
+---
+title: Burrows-Wheeler Aligner
+type: docs
+---
+
+
+
+!!! warning "Preliminaries"
+
+ We assume you are familiar with the following material:
+
+ - [Burrows–Wheeler transform](../../../appendices/algorithms/compression/bwt/)
+
+TODO:
+
+
+
+[^li2010fast]: Li, H., & Durbin, R. (2010). Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics, 26(5), 589-595. doi: [10.1093/bioinformatics/btp698](https://doi.org/10.1093/bioinformatics/btp698)
+[^li2009fast]: Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14), 1754-1760. doi: [10.1093/bioinformatics/btp324](https://doi.org/10.1093/bioinformatics/btp324)
diff --git a/content/transcriptomics/mapping/hisat/_index.md b/content/transcriptomics/mapping/hisat/_index.md
new file mode 100644
index 0000000..ab1f333
--- /dev/null
+++ b/content/transcriptomics/mapping/hisat/_index.md
@@ -0,0 +1,14 @@
+---
+title: HISAT
+type: docs
+---
+
+
+
+
+TODO:
+
+
+
+[^kim2019graph]: Kim, D., Paggi, J. M., Park, C., Bennett, C., & Salzberg, S. L. (2019). Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature biotechnology, 37(8), 907-915. doi: [10.1038/s41587-019-0201-4](https://doi.org/10.1038/s41587-019-0201-4)
+[^kim2015hisat]: Kim, D., Langmead, B., & Salzberg, S. L. (2015). HISAT: a fast spliced aligner with low memory requirements. Nature methods, 12(4), 357-360. doi: [10.1038/nmeth.3317](https://doi.org/10.1038/nmeth.3317)
diff --git a/content/transcriptomics/mapping/minimap/_index.md b/content/transcriptomics/mapping/minimap/_index.md
new file mode 100644
index 0000000..51af597
--- /dev/null
+++ b/content/transcriptomics/mapping/minimap/_index.md
@@ -0,0 +1,13 @@
+---
+title: Minimap2
+type: docs
+---
+
+
+
+TODO:
+
+
+
+[^minimap2-code]: [github.com/lh3/minimap2](https://github.com/lh3/minimap2)
+[^li2018minimap2]: Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18), 3094-3100. doi: [10.1093/bioinformatics/bty191](https://doi.org/10.1093/bioinformatics/bty191)
diff --git a/content/transcriptomics/mapping/rsem/_index.md b/content/transcriptomics/mapping/rsem/_index.md
new file mode 100644
index 0000000..1606a1a
--- /dev/null
+++ b/content/transcriptomics/mapping/rsem/_index.md
@@ -0,0 +1,11 @@
+---
+title: RSEM
+type: docs
+---
+
+
+TODO:
+
+
+
+[^li2011rsem]: Li, B., & Dewey, C. N. (2011). RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. *BMC bioinformatics, 12*, 1-16. doi: [10.1186/1471-2105-12-323](https://doi.org/10.1186/1471-2105-12-323)
diff --git a/content/transcriptomics/mapping/soap/_index.md b/content/transcriptomics/mapping/soap/_index.md
new file mode 100644
index 0000000..5c2a612
--- /dev/null
+++ b/content/transcriptomics/mapping/soap/_index.md
@@ -0,0 +1,14 @@
+---
+title: SOAP
+type: docs
+---
+
+
+
+
+TODO:
+
+
+
+[^li2009soap2]: Li, R., Yu, C., Li, Y., Lam, T. W., Yiu, S. M., Kristiansen, K., & Wang, J. (2009). SOAP2: an improved ultrafast tool for short read alignment. *Bioinformatics, 25*(15), 1966-1967. doi: [10.1093/bioinformatics/btp336](https://doi.org/10.1093/bioinformatics/btp336)
+[^li2008mapping]: Li, H., Ruan, J., & Durbin, R. (2008). Mapping short DNA sequencing reads and calling variants using mapping quality scores. *Genome research, 18*(11), 1851-1858. doi: [10.1101/gr.078212.108](https://doi.org/10.1101/gr.078212.108)
diff --git a/content/transcriptomics/mapping/star/_index.md b/content/transcriptomics/mapping/star/_index.md
new file mode 100644
index 0000000..09518a6
--- /dev/null
+++ b/content/transcriptomics/mapping/star/_index.md
@@ -0,0 +1,12 @@
+---
+title: Spliced Transcripts Alignment to a Reference
+type: docs
+---
+
+
+
+TODO:
+
+
+
+[^dobin2013star]: Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., ... & Gingeras, T. R. (2013). STAR: ultrafast universal RNA-seq aligner. *Bioinformatics, 29*(1), 15-21. doi: [10.1093/bioinformatics/bts635](https://doi.org/10.1093/bioinformatics/bts635)
diff --git a/content/transcriptomics/quant/_index.md b/content/transcriptomics/quant/_index.md
new file mode 100644
index 0000000..baa4ff4
--- /dev/null
+++ b/content/transcriptomics/quant/_index.md
@@ -0,0 +1,27 @@
+---
+title: Quantification
+type: docs
+weight: 2
+---
+
+
+RNA quantification is an essential step beyond [read mapping][read-mapping] in RNA sequencing (RNA-seq) studies.
+While [read mapping][read-mapping] is a foundational step in RNA-seq analysis, it primarily locates the origin of each read within the genome.
+Mapping alone does not provide a complete picture of gene expression levels or the functional implications of the sequencing data.
+Quantifying RNA transcripts fills this gap and serves several purposes that enhance the value and applicability of the data, outlined below.
+
+1. **Quantitative Insights into Gene Expression**: RNA quantification translates the raw counts of mapped reads into meaningful expression levels for each gene or transcript.
+ This allows researchers to quantitatively compare gene expression across different samples, conditions, or time points.
+ Quantification is essential for identifying differentially expressed genes, indicating biological responses to diseases, treatments, or environmental changes.
+2. **Normalization of Data**: RNA quantification involves normalization steps that correct for various biases and differences in sequencing depth across samples.
+ This is crucial for making accurate comparisons of gene expression levels between samples.
+ Without normalization, differences in read counts could reflect variations in sequencing effort rather than true differences in gene expression.
+3. **Functional Analysis and Interpretation**: Quantified expression levels enable functional analyses such as pathway enrichment and gene ontology analyses.
+ These analyses help understand the biological processes, cellular components, and molecular functions associated with differentially expressed genes, providing insights into the underlying biology of the study.
+4. **Discovery of Novel Transcripts and Isoforms**: RNA quantification can help identify and quantify known genes, novel transcripts, and alternative splicing events.
+ This expands our understanding of the transcriptome complexity and how alternative splicing contributes to gene regulation and cellular diversity.
+5. **Comparison Across Studies and Conditions**: Standardized quantification methods allow for comparing gene expression data across different studies, labs, and experimental conditions.
+ This is critical for replicating findings, validating hypotheses, and integrating data from multiple sources for meta-analyses.
+
+[read-mapping]: ../mapping/
diff --git a/content/transcriptomics/quant/kallisto/_index.md b/content/transcriptomics/quant/kallisto/_index.md
new file mode 100644
index 0000000..5e06d68
--- /dev/null
+++ b/content/transcriptomics/quant/kallisto/_index.md
@@ -0,0 +1,9 @@
+---
+title: Kallisto
+type: docs
+---
+
+
+
+
+[^bray2016near]: Bray, N. L., Pimentel, H., Melsted, P., & Pachter, L. (2016). Near-optimal probabilistic RNA-seq quantification. *Nature biotechnology, 34*(5), 525-527. doi: [10.1038/nbt.3519](https://doi.org/10.1038/nbt.3519)
diff --git a/content/transcriptomics/quant/problem/_index.md b/content/transcriptomics/quant/problem/_index.md
new file mode 100644
index 0000000..c8fa6e7
--- /dev/null
+++ b/content/transcriptomics/quant/problem/_index.md
@@ -0,0 +1,111 @@
+---
+title: Problem formulation
+type: docs
+---
+
+
+Formulating the problem of RNA quantification in the context of gene expression analysis involves determining the abundance of transcripts in a biological sample.
+This process is pivotal for understanding gene regulation, cellular responses, and the molecular underpinnings of disease.
+The challenge lies in accurately measuring the relative quantities of thousands of different RNA molecules from a complex mixture derived from sequencing experiments.
+
+The foundational concept in addressing this challenge is to consider the transcriptome as a collection of unique transcripts, each representing a gene or a variant of a gene being expressed in the sample.
+The goal is to quantify the fraction of the total RNA pool that each transcript represents, which informs us about the gene's expression level.
+However, direct quantification is complicated by several factors, including the variable length of transcripts and the technical biases introduced during the sequencing process.
+
+## True transcriptome
+
+First, we define the true transcriptome ($T$) as the complete set of transcripts present in the sample being studied.
+$T$ encompasses the exact composition of RNA molecules in the sample, including all different isoforms (variations of genes) that exist.
+We mathematically define $T$ as
+
+$$
+T = \left\{ \left( t_1, \ldots, t_M \right), \left( c_1, \ldots, c_M \right) \right\},
+$$
+
+where $t_i$ is the nucleotide sequence of transcript $i$, $c_i$ is the number of copies of that transcript in the sample, and $M$ is the total number of distinct transcripts.
+
+!!! example
+
+ | $i$ | $t_i$ | $c_i$ |
+ | --- | ----- | ----- |
+ | 1 | AATCGGCT | 3 |
+ | 2 | AATCGGCC | 2 |
+ | 3 | TTCGGCAA | 1 |
+
+Essentially, $T$ pairs every transcript sequence in the sample with its copy number.
+We could also write
+
+$$
+T = \left\{ \left( t_1, c_1 \right), \left( t_2, c_2 \right), \ldots, \left( t_M, c_M \right) \right\}.
+$$
+
+These quantities are purely theoretical and represent absolute expression.
+Remember, we are usually interested in changes in expression (e.g., unchanged, upregulated, downregulated) rather than the absolute number of copies of each isoform.
+Relative measures derived from $T$ serve this purpose; one example is the nucleotide fraction.
+
+## Nucleotide fraction
+
+The nucleotide fraction, denoted as $\eta_i$, is defined as the proportion of nucleotides in the sample originating from a copy of transcript $i$.
+This quantity is computed based on the number of copies of the transcript, $c_i$, and its effective length, $\widetilde{l_i}$.
+The formula for calculating the nucleotide fraction is given by:
+
+$$
+\eta_i = \frac{c_i \cdot \widetilde{l_i}}{\sum_{j=1}^{M} c_j \cdot \widetilde{l_j}}
+$$
+
+This fraction is the probability that a sequenced fragment originates from some position on transcript $i$; weighting each transcript by its effective length accounts for the fact that longer transcripts contribute more fragments to the sample.
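+
+As a minimal sketch (not tied to any particular tool), the nucleotide fraction can be computed directly from copy numbers and effective lengths; the transcript names and values below are hypothetical.
+
+```python
+# Minimal sketch of the nucleotide fraction (eta_i), assuming copy numbers
+# and effective lengths are already known. All values are hypothetical.
+copies = {"t1": 3, "t2": 2, "t3": 1}                    # c_i
+effective_length = {"t1": 6.0, "t2": 6.0, "t3": 6.0}    # effective length of t_i
+
+total = sum(copies[t] * effective_length[t] for t in copies)
+eta = {t: copies[t] * effective_length[t] / total for t in copies}
+
+print(eta)  # {'t1': 0.5, 't2': 0.333..., 't3': 0.166...}; fractions sum to 1
+```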
+
+## Effective transcript length
+
+The difference between length and effective length in the context of transcript abundance estimation from RNA-seq data involves how these measurements account for the practicalities of sequencing and data analysis.
+
+**Length** refers to the actual length of the transcript in nucleotides.
+This is a straightforward measurement that does not consider any factors related to how the transcript might be sequenced or analyzed.
+
+**Effective length**, on the other hand, is adjusted to account for the empirical distribution of fragment lengths obtained during sequencing.
+This adjustment is necessary because the range of fragment sizes that can be sampled from a transcript is limited, particularly near the ends of the transcript.
+The effective length therefore accounts for the fact that not all regions of a transcript are equally likely to be sampled, especially for longer transcripts or those sequenced with certain techniques that generate fragments of specific lengths.
+
+We define effective length of transcript $t_i$ as
+
+$$
+\widetilde{l_i} = l_i - \mu_d^{l_i}
+$$
+
+where $\mu_d^{l_i}$ is the mean of the truncated empirical fragment length distribution, $d$.
+
+The mean of the truncated empirical fragment length distribution, $\mu_d^{l_i}$, is computed from the empirical fragment length distribution $d$ after truncating it to the fragment lengths that are actually observable given the length of transcript $i$.
+This mean is a crucial parameter for determining the effective length of a transcript, which in turn enters quantification metrics such as the nucleotide and transcript fractions.
+
+It is calculated as
+
+$$
+\mu_{d}^{l_i} = \frac{\sum_{j=1}^{l_i} j \cdot Pr\{X = j\}}{\sum_{k=1}^{l_i} Pr\{X = k\}}
+$$
+
+where:
+
+- $l_i$ is the length of transcript $i$.
+- $Pr\{X = j\}$ is the probability of drawing a fragment of length $j$ under the empirical fragment length distribution $d$.
+- The numerator sums the product of each fragment length $j$ and its probability of being drawn, for all possible fragment lengths up to the length of the transcript $l_i$.
+- The denominator sums the probabilities of drawing a fragment of any length $k$ up to the length of the transcript, effectively normalizing the mean to account for the fact that not all fragment lengths are equally likely.
+
+This mean reflects the average fragment length expected to be observed from transcript $i$, taking into account the distribution of fragment lengths as well as the physical constraints imposed by the transcript's length. This parameter is essential for accurately estimating the effective length of transcripts, which is used in the calculation of transcript abundance metrics in RNA sequencing data analysis.
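+
+A minimal sketch of this calculation, assuming a small hypothetical fragment length distribution $d$:
+
+```python
+# Minimal sketch of the effective length, assuming an empirical fragment
+# length distribution d given as {fragment length: probability}.
+# The distribution below is hypothetical.
+d = {150: 0.2, 200: 0.5, 250: 0.3}   # Pr{X = j}
+
+def effective_length(l_i: int) -> float:
+    # Truncate d to fragment lengths that can fit within the transcript.
+    observable = {j: p for j, p in d.items() if j <= l_i}
+    if not observable:
+        return float(l_i)  # no observable fragment; fall back to the raw length
+    mu = sum(j * p for j, p in observable.items()) / sum(observable.values())
+    return l_i - mu
+
+print(effective_length(1000))  # all fragments fit: 1000 - 205 = 795.0
+print(effective_length(180))   # only 150 nt fragments fit: 180 - 150 = 30.0
+```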
+
+## Transcript fraction
+
+The transcript fraction, denoted $\tau_i$, is obtained by dividing the nucleotide fraction $\eta_i$ by the effective transcript length $\widetilde{l_i}$ and renormalizing:
+
+$$
+\tau_i = \frac{\eta_i / \widetilde{l_i}}{\sum_{j=1}^{M} \eta_j / \widetilde{l_j}}
+$$
+
+This quantity is the fraction of transcript molecules in the sample that are copies of transcript $i$.
+These values can be used to directly compute measures of relative transcript abundance, such as transcripts per million (TPM), providing a normalized measure that facilitates comparison across samples or experiments.
+
+Once you have the transcript fraction ($\tau_i$), the TPM for a transcript $t_i$ is calculated as:
+
+$$
+\text{TPM}_i = \tau_i \times 10^6
+$$
+
+This formula adjusts the transcript fraction to a scale of 1 million to facilitate comparison across samples or conditions.
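+
+A minimal sketch tying these quantities together, using hypothetical nucleotide fractions and effective lengths:
+
+```python
+# Minimal sketch: transcript fractions (tau_i) and TPM from nucleotide
+# fractions and effective lengths. All values are hypothetical.
+eta = {"t1": 0.5, "t2": 0.3, "t3": 0.2}
+effective_length = {"t1": 795.0, "t2": 450.0, "t3": 30.0}
+
+denom = sum(eta[t] / effective_length[t] for t in eta)
+tau = {t: (eta[t] / effective_length[t]) / denom for t in eta}
+tpm = {t: tau[t] * 1e6 for t in tau}
+
+print(tau)  # transcript fractions sum to 1
+print(tpm)  # TPM values sum to 1,000,000
+```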
diff --git a/content/transcriptomics/quant/salmon/_index.md b/content/transcriptomics/quant/salmon/_index.md
new file mode 100644
index 0000000..adab44f
--- /dev/null
+++ b/content/transcriptomics/quant/salmon/_index.md
@@ -0,0 +1,130 @@
+---
+title: Salmon
+type: docs
+---
+
+
+!!! warning "Preliminaries"
+
+ We assume you are familiar with the following material:
+
+    - [probability mass functions](https://stats.crumblearn.org/dist/discrete/prob-mass/)
+    - maximum-likelihood estimation
+ - [Problem statement](../problem/)
+
+Salmon is a method for quantifying transcript abundance from RNA-seq data while accounting for various biases that affect the accuracy of abundance estimates, notably fragment GC-content bias.
+It features a dual-phase parallel inference algorithm and bias models to correct for these biases, significantly improving the sensitivity of differential expression analysis.
+The method involves lightweight mapping, an online phase for initial expression level estimation and model parameterization, and an offline phase for refining these estimates.
+Salmon's approach is demonstrated to outperform existing methods in terms of accuracy, leveraging models for sequence-specific bias and fragment GC bias, among others.
+
+## Objectives
+
+Salmon offers two options for inferring the relative quantity of transcripts.
+
+### Maximum-likelihood
+
+The maximum-likelihood objective, the default method, aims to find the set of transcript abundance estimates that maximizes the likelihood of the observed RNA sequencing data under the model assumed by Salmon.
+This approach is grounded in the principle of maximum-likelihood estimation (MLE), a statistical method for estimating the parameters of a model given observed data.
+The equation can be written as:
+
+$$
+Pr \left\{ F | \pmb{\eta}, \pmb{Z}, T \right\} = \prod_{j=1}^N Pr \left\{ f_j | \pmb{\eta}, \pmb{Z}, T \right\}
+$$
+
+where:
+
+- $\text{Pr} \left\{ F | \pmb{\eta}, \pmb{Z}, T \right\}$ is the probability of observing the entire set of fragments $F$ given the model parameters.
+- $F$ is the set of all sequenced fragments obtained from RNA-Seq.
+- $\pmb{\eta}$ represents the vector of parameters related to transcript abundance, such as the fraction of each transcript present in the sample.
+- $\pmb{Z}$ is a binary matrix indicating the relationship between fragments and transcripts (i.e., which fragment came from which transcript).
+- $T$ is the set of all transcripts in the reference transcriptome.
+- $N$ is the total number of fragments.
+- $f_j$ is an individual fragment.
+- The product $\prod_{j=1}^N$ indicates that the probability of observing all fragments is the product of the probabilities of observing each individual fragment, assuming each fragment is observed independently.
+
+At the heart of RNA-seq analysis is the goal of determining the composition of an RNA sample: how much of each transcript (a piece of RNA corresponding to a gene or part of a gene) is present.
+The equation models the probability, or likelihood, of observing the particular set of sequenced RNA fragments given a known set of transcripts (\(T\)) in the reference transcriptome (the complete set of transcripts in a cell, organ, or organism) and the model parameters:
+
+- **Transcript abundances (\( \pmb{\eta} \))**: This vector quantifies the relative amount of each transcript in the sample. Essentially, it is a measure of gene expression level, indicating how active certain genes are in the sample being studied. The accuracy of \( \pmb{\eta} \) directly influences the model's ability to reflect the true composition of the RNA sample.
+- **Assignments (\( \pmb{Z} \))**: The binary matrix \( \pmb{Z} \) indicates the origin of each sequenced fragment (\(f_j\)), that is, which transcript it is derived from. Correctly assigning fragments to their source transcripts is crucial for accurately estimating transcript abundances, because the distribution of fragments across transcripts carries the key information about how abundant each transcript is.
+
+The likelihood of observing the set of fragments \(F\) is fundamentally tied to how well the parameters \( \pmb{\eta} \) and the assignments \( \pmb{Z} \) align with the actual sequenced data.
+If the estimated abundances and fragment-transcript assignments closely match the real data, the likelihood will be high.
+This indicates a good fit between the model and the observed data, suggesting that the model parameters accurately capture the true transcript abundances in the sample.
+
+In practice, maximizing this likelihood through computational methods allows researchers to infer the most probable transcript abundances in the sample, given the observed RNA fragments.
+This process is iterative, refining \( \pmb{\eta} \) and \( \pmb{Z} \) to improve the fit between the model predictions and the actual data.
+The ultimate goal is to achieve a model that can accurately reflect the transcriptome's composition, providing insights into gene expression patterns that are critical for understanding biological processes and diseases.
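+
+To make the objective concrete, below is a highly simplified sketch (not Salmon's actual implementation): a log-likelihood in which the fragment-to-transcript assignments are marginalized, so each fragment's probability is a mixture over the transcripts it is compatible with, weighted by hypothetical abundances.
+
+```python
+import math
+
+# Simplified sketch of the likelihood objective (not Salmon's implementation).
+# eta: hypothetical nucleotide fractions; each fragment is represented by the
+# conditional probabilities Pr{f_j | t_i} of the transcripts it aligns to.
+eta = {"t1": 0.6, "t2": 0.3, "t3": 0.1}
+fragments = [
+    {"t1": 0.9, "t2": 0.1},   # f_1 aligns well to t1, weakly to t2
+    {"t2": 1.0},              # f_2 aligns only to t2
+    {"t1": 0.5, "t3": 0.5},   # f_3 is ambiguous between t1 and t3
+]
+
+def log_likelihood(eta, fragments):
+    ll = 0.0
+    for frag in fragments:
+        # Marginalize over the transcripts the fragment could originate from.
+        ll += math.log(sum(eta[t] * p for t, p in frag.items()))
+    return ll
+
+print(log_likelihood(eta, fragments))  # higher values indicate a better fit
+```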
+
+### Bayesian
+
+The Bayesian objective revolves around inferring the posterior distribution of transcript abundances, conditioned on the observed sequencing data and the known transcriptome.[^patro2017salmon]
+This contrasts with directly seeking maximum-likelihood point estimates, offering a probabilistic view of transcript abundances that incorporates prior knowledge and the uncertainty inherent in the data.
+
+## Bias corrections
+
+The model implemented by Salmon accounts for several types of biases that commonly affect RNA-seq data.
+If left uncorrected, these biases can lead to inaccurate quantification of transcript abundances.
+The main biases addressed by Salmon's model are described below.
+
+### Sequence-Specific Biases
+
+Sequence-specific biases arise from the sequencing process itself, where certain sequences are preferentially selected or amplified due to their nucleotide composition. This bias is particularly evident at the 5′ and 3′ ends of sequenced fragments. Salmon models these biases by learning sequence-specific bias at both ends of the fragments, which helps in correcting for the preferential selection of sequences based on their nucleotide composition.
+
+### Fragment-Level GC Bias
+
+The GC content of a fragment can significantly affect its likelihood of being sequenced, with fragments of extreme GC content being underrepresented in the sequencing data. Salmon addresses this by learning a model of fragment-level GC bias, which corrects for the preferential sequencing of fragments based on their GC content. This model considers the observed distribution of sequenced fragments for different GC content levels and corrects the abundance estimates accordingly.
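+
+As a conceptual sketch only (Salmon's actual bias model is more involved and conditions on the current abundance estimates), per-bin GC correction weights can be pictured as the ratio between the GC-content distribution of observed fragments and the distribution expected in the absence of bias; the fragment sequences below are hypothetical.
+
+```python
+import collections
+
+def gc_fraction(seq: str) -> float:
+    return (seq.count("G") + seq.count("C")) / len(seq)
+
+def gc_bias_weights(observed_frags, expected_frags, bins=10):
+    # Histogram GC content into bins and compare observed vs expected.
+    obs = collections.Counter(int(gc_fraction(s) * (bins - 1)) for s in observed_frags)
+    exp = collections.Counter(int(gc_fraction(s) * (bins - 1)) for s in expected_frags)
+    n_obs, n_exp = sum(obs.values()), sum(exp.values())
+    weights = {}
+    for b in range(bins):
+        o = obs.get(b, 0) / n_obs
+        e = exp.get(b, 0) / n_exp
+        weights[b] = o / e if e > 0 else 1.0   # >1 means an over-represented GC bin
+    return weights
+
+# Hypothetical observed and expected fragment sequences.
+print(gc_bias_weights(["ATGC", "GGCC", "GCGC"], ["ATGC", "ATAT", "GCGC"]))
+```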
+
+### Strand-Specific Protocols
+
+Some RNA-seq protocols are strand-specific, meaning that they can distinguish between the sense and antisense strands of RNA. This specificity introduces bias in the data that needs to be corrected for accurate quantification. Salmon accounts for the directionality imposed by strand-specific protocols in its bias models.
+
+### Fragment Length Distribution
+
+The distribution of fragment lengths can also bias abundance estimates. Shorter fragments might be overrepresented in the sequencing data due to the selection process during library preparation. Salmon corrects for this by considering the empirical distribution of fragment lengths and adjusting the effective length of transcripts accordingly. This correction ensures that the quantification accounts for the likelihood of sampling each possible fragment that a transcript can produce.
+
+### Incorporation of Bias Models into Effective Length
+
+After learning the bias models, Salmon re-estimates the effective length of each transcript to take into account the biases. This effective length now reflects not only the physical length of the transcript but also the likelihood of sampling each possible fragment given the learned biases. This approach, first introduced by Roberts et al., allows Salmon to correct for biases in a comprehensive manner, leading to more accurate abundance estimates.
+
+## Online inference
+
+The online inference phase of Salmon's algorithm is designed to tackle the challenge of estimating transcript abundances from RNA-sequencing data.
+This phase employs a variant of stochastic collapsed variational Bayesian inference to optimize a collapsed variational objective function.
+The inference procedure operates as a streaming algorithm, updating estimated read counts after processing small groups of observations (mini-batches).
+These updates are done asynchronously and in parallel, aiming to make efficient use of computational resources.
+
+During the online phase, Salmon estimates initial expression levels, auxiliary parameters, and foreground bias models.
+It also constructs equivalence classes over the input fragments, which serve as a highly reduced representation of the sequencing experiment.
+This strategy helps manage the complexity of the data by grouping together fragments that provide similar information about transcript abundances.
+
+The key aspects of the online phase include:
+
+1. **Streaming Inference**: The algorithm processes data in mini-batches, allowing for efficient and parallel processing. This approach helps Salmon to quickly adjust its estimates of transcript abundances as more data are processed.
+2. **Variational Bayesian Inference**: By employing a form of variational Bayesian inference, Salmon approximates the posterior distribution of transcript abundances. This statistical framework allows for the incorporation of prior knowledge and the estimation of uncertainty in the abundance estimates.
+3. **Equivalence Classes**: Salmon groups together sequenced fragments into equivalence classes based on their compatibility with the same set of transcripts. This reduces the computational complexity of the inference process and enables more efficient optimization.
+4. **Bias Models**: The online phase involves the estimation of models to correct for known biases in the sequencing data. These include sequence-specific biases and biases related to the sequencing process itself. By correcting for these biases, Salmon aims to produce more accurate estimates of transcript abundances.
+5. **Adaptation to Data**: The algorithm dynamically updates its estimates and models based on the data observed in each mini-batch. This adaptive approach allows Salmon to refine its predictions as more information becomes available.
+
+In summary, the online inference phase of Salmon is a sophisticated computational strategy designed to accurately estimate transcript abundances from RNA-seq data. By leveraging variational Bayesian inference and efficient data processing techniques, Salmon addresses the challenges of bias correction and data complexity, ultimately aiming to provide accurate and reliable estimates of transcript abundances in a computationally efficient manner.
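+
+A highly simplified sketch of the streaming idea (stochastic-EM-style updates over mini-batches, rather than Salmon's actual collapsed variational Bayesian procedure); the fragment compatibilities and counts below are hypothetical:
+
+```python
+# Streaming (online) abundance updates over mini-batches: a simplified
+# sketch, not Salmon's implementation. Each fragment is represented by the
+# conditional probabilities of the transcripts it is compatible with.
+counts = {"t1": 1.0, "t2": 1.0, "t3": 1.0}   # pseudo-counts acting as a prior
+
+def minibatch_update(batch, counts):
+    total = sum(counts.values())
+    eta = {t: c / total for t, c in counts.items()}   # current abundance estimates
+    for frag in batch:
+        # Split the fragment across compatible transcripts (responsibilities).
+        norm = sum(eta[t] * p for t, p in frag.items())
+        for t, p in frag.items():
+            counts[t] += eta[t] * p / norm
+    return counts
+
+stream = [
+    [{"t1": 0.9, "t2": 0.1}, {"t2": 1.0}],   # mini-batch 1
+    [{"t1": 0.5, "t3": 0.5}, {"t3": 1.0}],   # mini-batch 2
+]
+for batch in stream:
+    counts = minibatch_update(batch, counts)
+
+total = sum(counts.values())
+print({t: c / total for t, c in counts.items()})   # rough abundance estimates
+```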
+
+## Offline inference
+
+Offline inference is a computational phase that takes place after the initial, or "online," analysis of the RNA sequencing data.
+It leverages the data and preliminary estimates obtained during the online phase to refine the transcript abundance estimates, essentially polishing the results for greater accuracy.
+
+- **Rich Equivalence Classes**: In the offline phase, Salmon utilizes "rich equivalence classes" constructed during the online phase. These classes group fragments (reads or parts of reads from sequencing) that are likely to come from the same set of transcripts, thereby reducing computational complexity and focusing efforts on distinguishing between transcripts that share many fragments in common.
+- **Expectation-Maximization (EM) Algorithm**: The core of the offline inference phase is an optimization process using the EM algorithm. This algorithm iteratively improves the estimates of how many fragments come from each transcript, effectively fine-tuning the abundance measurements. It does this by maximizing the likelihood of observing the given data under the model, adjusting transcript abundance estimates to fit the observed data better.
+- **Variational Bayes Optimization**: Optionally, Salmon can perform variational Bayesian (VB) optimization instead of standard EM updates. This approach involves approximating the true posterior distribution of transcript abundances with a simpler distribution, then iteratively updating this approximation to make it as close as possible to the true posterior. This method is particularly useful for managing computational complexity and uncertainty.
+- **Convergence Criterion**: The offline phase continues iterating until the changes in the estimates of transcript abundance fall below a pre-defined threshold, indicating that further iterations are unlikely to significantly alter the results. This criterion ensures that the algorithm stops when it has effectively converged on a stable solution.
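+
+As a rough illustration of the EM idea over equivalence classes (a sketch under simplified assumptions, not Salmon's implementation), each class pairs a set of compatible transcripts with the number of fragments assigned to it, and abundances are updated until they stop changing; the class counts and effective lengths below are hypothetical:
+
+```python
+# Rough sketch of EM over equivalence classes (not Salmon's implementation).
+eq_classes = [
+    ({"t1", "t2"}, 120),   # 120 fragments compatible with both t1 and t2
+    ({"t2"}, 80),
+    ({"t1", "t3"}, 40),
+]
+eff_len = {"t1": 800.0, "t2": 500.0, "t3": 300.0}
+N = sum(n for _, n in eq_classes)
+
+eta = {t: 1.0 / len(eff_len) for t in eff_len}   # uniform starting point
+for _ in range(1000):
+    expected = {t: 0.0 for t in eta}
+    for ts, n in eq_classes:
+        denom = sum(eta[t] for t in ts)
+        for t in ts:
+            expected[t] += n * eta[t] / denom    # E-step: split class counts
+    new_eta = {t: expected[t] / N for t in eta}  # M-step: renormalize
+    converged = max(abs(new_eta[t] - eta[t]) for t in eta) < 1e-8
+    eta = new_eta
+    if converged:
+        break
+
+# Convert nucleotide fractions into transcript fractions (tau).
+denom = sum(eta[t] / eff_len[t] for t in eta)
+tau = {t: (eta[t] / eff_len[t]) / denom for t in eta}
+print(tau)
+```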
+
+## Posterior sampling
+
+Posterior sampling is a statistical procedure used in the offline phase of Salmon to estimate the distribution, rather than just a point estimate, of transcript abundances from RNA sequencing data.
+This allows the uncertainty in the abundance estimates to be quantified.
+
+- **Gibbs Sampling**: This is a Markov Chain Monte Carlo (MCMC) algorithm used to generate a sequence of samples from the posterior distribution of transcript abundances. Salmon's implementation of Gibbs sampling iteratively samples transcript abundances given the fragment assignments (i.e., which transcripts the sequenced fragments are likely to come from) and then reassigns fragments to transcripts based on these sampled abundances. This process helps in understanding the variability and confidence in the abundance estimates.
+- **Bootstrap Sampling**: An alternative method to Gibbs sampling, bootstrap sampling involves generating multiple resampled datasets from the original sequencing data by sampling with replacement. For each resampled dataset, the offline inference procedure is rerun to produce new estimates of transcript abundances. This method is used to assess the stability and reliability of the abundance estimates by observing how they vary across the resampled datasets.
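+
+A rough sketch of the bootstrap idea (not Salmon's implementation): resample the equivalence-class counts with replacement and rerun the estimator on each replicate, then summarize how the estimates vary; the classes below are hypothetical and mirror the EM sketch above.
+
+```python
+import random
+
+# Hypothetical equivalence classes: (compatible transcripts, fragment count).
+eq_classes = [({"t1", "t2"}, 120), ({"t2"}, 80), ({"t1", "t3"}, 40)]
+
+def bootstrap_replicate(eq_classes, rng):
+    # Resample N fragments with replacement, proportionally to class counts.
+    classes = [ts for ts, _ in eq_classes]
+    weights = [n for _, n in eq_classes]
+    N = sum(weights)
+    draws = rng.choices(range(len(classes)), weights=weights, k=N)
+    resampled = [0] * len(classes)
+    for i in draws:
+        resampled[i] += 1
+    return [(classes[i], resampled[i]) for i in range(len(classes))]
+
+rng = random.Random(0)
+replicates = [bootstrap_replicate(eq_classes, rng) for _ in range(100)]
+# Rerunning the EM sketch above on each replicate yields a distribution of
+# abundance estimates per transcript, from which variances or intervals
+# can be summarized.
+print(replicates[0])
+```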
+
+
+
+[^salmon-github]: [Salmon repository on GitHub](https://github.com/COMBINE-lab/salmon)
+[^patro2017salmon]: Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. *Nature methods, 14*(4), 417-419. doi: [10.1038/nmeth.4197](https://doi.org/10.1038/nmeth.4197)
+[^li2010rna]: Li, B., Ruotti, V., Stewart, R. M., Thomson, J. A., & Dewey, C. N. (2010). RNA-Seq gene expression estimation with read mapping uncertainty. *Bioinformatics, 26*(4), 493-500. doi: [10.1093/bioinformatics/btp692](https://doi.org/10.1093/bioinformatics/btp692)
diff --git a/content/transcriptomics/single/_index.md b/content/transcriptomics/single/_index.md
new file mode 100644
index 0000000..9430f55
--- /dev/null
+++ b/content/transcriptomics/single/_index.md
@@ -0,0 +1,5 @@
+---
+title: Single cell
+type: docs
+weight: 4
+---
diff --git a/content/transcriptomics/spatial/_index.md b/content/transcriptomics/spatial/_index.md
new file mode 100644
index 0000000..d3982ce
--- /dev/null
+++ b/content/transcriptomics/spatial/_index.md
@@ -0,0 +1,5 @@
+---
+title: Spatial
+type: docs
+weight: 5
+---
diff --git a/hugo.yaml b/hugo.yaml
index 499c75b..9a03010 100644
--- a/hugo.yaml
+++ b/hugo.yaml
@@ -1,5 +1,5 @@
# Hugo configuration file
-title: handbook
+title: Omics
baseURL: https://handbook.oasci.org/
languageCode: en-us
@@ -15,6 +15,12 @@ markup:
goldmark:
renderer:
unsafe: true
+ extensions:
+ passthrough:
+ delimiters:
+ block: [['\[', '\]'], ['$$', '$$']]
+ inline: [['\(', '\)']]
+ enable: true
# enable hextra syntax highlight
highlight:
@@ -22,9 +28,21 @@ markup:
menu:
main:
- - name: People
- pageRef: /people
+ - name: Sequencing
+ pageRef: /sequencing
weight: 1
+ - name: Genomics
+ pageRef: /genomics
+ weight: 2
+ - name: Alignment
+ pageRef: /alignment
+ weight: 3
+ - name: Transcriptomics
+ pageRef: /transcriptomics
+ weight: 4
+ - name: Appendices
+ pageRef: /appendices
+ weight: 5
- name: Search
weight: 5
params:
@@ -41,10 +59,9 @@ params:
displayLogo: true
logo:
path: img/logo.svg
- width: 90%
page:
- width: normal
+ width: wide
theme:
default: system
diff --git a/layouts/partials/scripts.html b/layouts/partials/scripts.html
new file mode 100644
index 0000000..3d0bed1
--- /dev/null
+++ b/layouts/partials/scripts.html
@@ -0,0 +1,109 @@
+{{- $jsTheme := resources.Get "js/theme.js" | resources.ExecuteAsTemplate "theme.js" . -}}
+{{- $jsMenu := resources.Get "js/menu.js" -}}
+{{- $jsTabs := resources.Get "js/tabs.js" -}}
+{{- $jsLang := resources.Get "js/lang.js" -}}
+{{- $jsCodeCopy := resources.Get "js/code-copy.js" -}}
+{{- $jsFileTree := resources.Get "js/filetree.js" -}}
+{{- $jsSidebar := resources.Get "js/sidebar.js" -}}
+{{- $jsBackToTop := resources.Get "js/back-to-top.js" -}}
+
+{{- $scripts := slice $jsTheme $jsMenu $jsCodeCopy $jsTabs $jsLang $jsFileTree $jsSidebar $jsBackToTop | resources.Concat "js/main.js" -}}
+{{- if hugo.IsProduction -}}
+ {{- $scripts = $scripts | minify | fingerprint -}}
+{{- end -}}
+
+
+
+{{/* Search */}}
+{{- if (site.Params.search.enable | default true) -}}
+ {{- $searchType := site.Params.search.type | default "flexsearch" -}}
+ {{- if eq $searchType "flexsearch" -}}
+ {{- $jsSearchScript := printf "%s.search.js" .Language.Lang -}}
+ {{- $jsSearch := resources.Get "js/flexsearch.js" | resources.ExecuteAsTemplate $jsSearchScript . -}}
+ {{- if hugo.IsProduction -}}
+ {{- $jsSearch = $jsSearch | minify | fingerprint -}}
+ {{- end -}}
+ {{- $flexSearchJS := resources.Get "lib/flexsearch/flexsearch.bundle.min.js" | fingerprint -}}
+
+
+ {{- else -}}
+ {{- warnf `search type "%s" is not supported` $searchType -}}
+ {{- end -}}
+{{- end -}}
+
+{{/* Mermaid */}}
+{{/* FIXME: need to investigate .Page.Store hasMermaid is set for homepage */}}
+{{- if and (.Page.Store.Get "hasMermaid") (not .Page.IsHome) -}}
+ {{- $mermaidJS := resources.Get "lib/mermaid/mermaid.min.js" | fingerprint -}}
+
+
+{{- end -}}
+
+
+
+
+
+
+
+
+{{/* KaTex */}}
+{{- $katexCSS := resources.Get "lib/katex/katex.min.css" | fingerprint -}}
+{{- $katexJS := resources.Get "lib/katex/katex.min.js" | fingerprint -}}
+{{- $mhchemJS := resources.Get "lib/katex/mhchem.min.js" | fingerprint -}}
+{{- $katexAutoRenderJS := resources.Get "lib/katex/auto-render.min.js" | fingerprint -}}
+
+
+
+
+{{ $katexFonts := resources.Match "lib/katex/fonts/*" }}
+{{- range $katexFonts -}}
+{{ .Publish }}
+{{- end -}}
+
diff --git a/static/fastqc/fastqc_forward_initial.html b/static/fastqc/fastqc_forward_initial.html
new file mode 100644
index 0000000..f0edffb
--- /dev/null
+++ b/static/fastqc/fastqc_forward_initial.html
@@ -0,0 +1,187 @@
+forward.gz FastQC Report