diff --git a/modules/README.md b/modules/README.md new file mode 100644 index 0000000..a76abcf --- /dev/null +++ b/modules/README.md @@ -0,0 +1,10 @@ +# Module directory +Directory for placing the course module files - these should be markdown or PDF documents + +They include the presentations and practical manuals for the module. + +Converting from markdown to PDF can be performed using pandoc. Here is a tutorial and system for that: + +[Converting with Pandoc](https://github.com/WCSCourses/format_convert) + +There is an example markdown file - module_base.md diff --git a/modules/encapsulated.md b/modules/encapsulated.md new file mode 100644 index 0000000..cc85a0d --- /dev/null +++ b/modules/encapsulated.md @@ -0,0 +1,446 @@ +# Genomics and Clinical Microbiology 2024 + +14 - 19 January 2024 + +Wellcome Genome Campus, Hinxton UK + +## Topic: Bioinformatics + +**Instructors: Dr. Keith Jolley, Dr. Made Krisna, Dr. Kasia Parfitt and Prof. Martin Maiden** + +## Background + +This module aims to introduce participants to bioinformatic analysis of molecular and genomic data. + +## Learning outcomes + +* Gain familiarity with the Linux operating system and command line; +* Understand the concepts of sequence alignment and tree building; +* Perform a bacterial genome assembly using the command line; +* Explore and analyse genomic datasets on PubMLST. + +## Practical 1: Introduction to the Linux command line and BLAST + +In this practical we will be introducing the Linux command line. Most bioinformatics tools run on Linux and so a familiarity with its use can be very helpful. To provide a real-world example of its power we will use the stand-alone BLAST package to create a local database that can be queried to identify a sequence. In this case we will construct a database of 16S sequences and use this to try to identify the species using sequences from four bacterial samples. + +Commands are typed into a console window. 
This is accessed in different ways on different versions of Linux, but there should be either a menu entry or icon for ‘console’ or ‘terminal’. For the Linux system used on the course this is highlighted below: + +![](images/encapsulated/linux1.jpg) + +We will be using a number of commands to navigate the Linux file system. Commands that appear in a box should be typed directly in to the console window, e.g. + +``` +ls * +``` + +**Linux is case-sensitive – it is important that commands are typed exactly as shown, including spaces.** + +### Navigation +In this section we will gain a familiarity with the Linux commands used to navigate the file system and interact with files and directories. Type everything that appears in the boxes. + +#### List files in your home directory. + +List non-hidden files in your current directory: +``` +ls +``` +List all files in your current directory: +``` +ls -a +``` +#### Change current directory to the root of the file system and explore the directory structure. + +To go to the root (top-level) of the filesystem: +``` +cd / +``` +List non-hidden files in your current directory with extended information (file size, dates, owner). You can combine command attributes, e.g. (-a: show all; -l: show long information; -h: show human-readable file sizes): +``` +ls -lh +``` +Show the current directory (print working directory): +``` +pwd +``` +Change current directory to /home and list the contents: +``` +cd /home +ls -l +``` +Change to the parent directory and list the contents: +``` +cd .. +ls -l +``` +Go back to your home directory, list the contents, and check where you are in the file system hierarchical structure: +``` +cd ~ +ls +pwd +``` +### Creating a BLAST database +Make sure that you are in your home directory: +``` +cd +pwd +``` +We are now going to create a directory where we will generate the BLAST database. To do this, we use the ``mkdir`` (make directory) command. 
+``` +mkdir blast +``` +To enter a directory, we use the ``cd`` command (change directory). Enter the newly created blast directory. +``` +cd blast +``` +Now we will obtain reference 16S sequences. These are available from the SILVA website as a compressed file. The download URL has been shortened by tinyurl to https://tinyurl.com/y7dh2htc. You can download files using the ``wget`` command. + +``` +wget https://tinyurl.com/y7dh2htc --output-document=16S_ref.fasta.gz +``` + + We can see the size of this file by using the ``ls`` command with the -l (long) and -h (human-readable) options. +``` +ls -lh +``` +You should see something like the following: +> ``-rw-rw-r-- 1 keith keith 230M Dec 11 2017 16S_ref.fasta.gz`` + +The file size is 230MB. + +Before we can use it, we need to uncompress it using the ``gunzip`` command. This is quite a large file and will therefore take a few seconds to uncompress. +``` +gunzip 16S_ref.fasta.gz +``` +When the file has finished uncompressing you will see that the file size is now much bigger and the .gz suffix has been removed from the filename. +``` +ls -lh +``` + +> ``-rw-rw-r-- 1 keith keith 1.1G Dec 11 2017 16S_ref.fasta`` + +The file size is now 1.1GB. + +You can look at the contents of this file using the ``less`` command. Scroll through the file using the space key, go back using the ‘b’ key and exit by pressing ‘q’. The file is in FASTA format, with taxonomic information in the header line for each sequence. +``` +less 16S_ref.fasta +``` +We can use this file to create a BLAST database against which we can query the specimen sequences. Creating the database will take about 2-3 minutes. +``` +makeblastdb -in 16S_ref.fasta -dbtype nucl -out 16Sdb +``` +Now if you list the contents of the directory, you should see 3 new files: 16Sdb.nhr, 16Sdb.nin and 16Sdb.nsq. These together comprise the BLAST database. + +16S sequences have been generated for 4 specimens. You can download these from the course GitHub repository: 
+``` +wget https://github.com/WCSCourses/GCM24/raw/main/course_data/BLAST/16S.tar +``` +This file is an uncompressed tar archive containing individual files for each specimen. Extract the files using the tar command. +``` +tar xvf 16S.tar +``` +If you now list the directory, you should see FASTA files for specimens 1-4. + +Query each sequence against the newly-created BLAST database. As these are nucleotide sequences and the BLAST database has been created using nucleotide sequences, we should use the ``blastn`` command. +``` +blastn -db 16Sdb -query specimen1.fas -out specimen1.out +``` +View the output using the ``less`` command: +``` +less specimen1.out +``` +Scroll down until you come to the first alignment. Alignments are shown in score order so the better alignments are shown first. Can you identify the species? + +Do the same for the other specimen files. You may wish to limit the number of results +``` +blastn -db 16Sdb -query specimen2.fas -out specimen2.out -num_alignments 5 +``` + +## Practical 2: Constructing phylogenetic trees + +### Introduction to phylogenetic analysis +For these exercises we will be using the program MEGA (Molecular Evolutionary Genetics Analysis). This is a free application that will run on Windows, MacOS or Linux and is available from . The exercise will be run as an interactive session with a talk. Please stop and wait for the next part of the talk when you reach a stop sign: + +![](images/encapsulated/stop-sign.png) + +You have been provided with a dataset that consists of the sequences of a gene for a collection of diverse *Mycobacteria* spp. isolates spanning the known diversity of the genus. You can find these sequences in https://github.com/WCSCourses/GCM24/tree/main/course_data/phylogenetics. Download the sequence data into your account to make it easy to drag-and-drop the files. 
To do this, open a terminal window and type (or copy and paste): + +``` +mkdir Desktop/phylogenetics +cd Desktop/phylogenetics +wget https://github.com/WCSCourses/GCM24/raw/main/course_data/phylogenetics/mycobacteria.fas +wget https://github.com/WCSCourses/GCM24/raw/main/course_data/phylogenetics/mycobacteria+C_diphtheriae.fas +``` + +You should now see a new folder on your desktop called 'phylogenetics' that contains two files. You will be able to drag-and-drop from here into the MEGA window later. + +![](images/encapsulated/linux_phylo.png) + +We will be looking at the *rplB* gene that encodes one of the proteins that make up the ribosome, the essential protein factory of the cell. Some ribosomal protein genes are involved in antibiotic resistance within *Mycobacteria*, and may therefore be under diversifying selection, but this gene is not. With its essential role we should, therefore, expect it to be relatively conserved within a species, making it a good candidate for phylogenetic analysis. + +### Aligning and formatting data +The sequences have been provided in FASTA format. This is the simplest and probably most common format for sequence data. Each sequence within a FASTA file consists of a header line beginning with a ‘>’ character followed by the sequence identifier and optional comments separated by a ‘|’ character. The sequence itself appears on the following line(s) and continues until either the next header line (beginning with a ‘>’) or the end of the file, e.g. + +``` +>seq_1 +TTTGATACTGTTGCCGAAGGTTTGGGCGAAATTCGCGATTTATTGCGCCGTTATCATCAT +GTCAGCCATGAGTTGGAAAATGGTTCGAGTGAGGCCTTATTGAAAGA +>seq_2 +TTTGATACCGTTGCCGAAGGTTTGGGTGAAATTCGCGATTTATTGCGCCGTTACCACCGC +GTCGGCCATGAGTTGGAAAACGGTTCGGGTGAGGCTTTGTTGAAAGA +>seq_3 +TTTGATACCGTTGCCGAAGGTTTGGGTAAAATTCGCGATTTATTGCGCCGTTACCACCGC +GTCGGTCATGAGTTGGAAAACGGTTCGGGTGAGGCTTTGTTGAAAGA +``` +MEGA can read FASTA files. Sequences need to be aligned before they can be used to generate phylogenetic trees. 
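
To make the FASTA layout described above concrete, here is a minimal Python sketch that reads records from FASTA-formatted text. It is only an illustration: the function name `read_fasta` and the shortened example sequences are assumptions made for this sketch, not part of MEGA or the course data. It treats the text after '>' up to the first '|' as the identifier and joins the sequence lines that follow.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from an open FASTA file or file-like object."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            # Identifier is the text after '>' up to the first '|' (comments follow).
            identifier = line[1:].split("|")[0].strip()
            chunks = []
        elif identifier is not None:
            chunks.append(line)  # sequence may span several lines
    if identifier is not None:
        yield identifier, "".join(chunks)


# Shortened, made-up records in the same layout as the example above.
example = StringIO(""">seq_1
TTTGATACTG
TTGCC
>seq_2|optional comment
TTTGATACCG
""")

records = list(read_fasta(example))
for name, sequence in records:
    print(name, len(sequence))
```

To run the same function on a downloaded file, open it with `open("mycobacteria.fas")` and pass the file object in place of `example`.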
+ +Run MEGA and open the mycobacteria.fas file now found in the phylogenetics folder on your desktop. + +![](images/encapsulated/linux_phylo2.jpg) + +You can either click the ‘Data’ button and then select the file, or simply drag-and-drop the file in to the main interface. + +![](images/encapsulated/mega1.png) + +A dialog box will ask you whether to ‘Analyze or Align file’ – select ‘Align’. + +![](images/encapsulated/mega2.png) + +The sequences will be loaded in to the ‘Alignment explorer’ window. + +![](images/encapsulated/mega3.png) + +Select all the sequences by clicking ``Edit .. Select All``. + +![](images/encapsulated/mega4.png) + +Click the Alignment tab and select Align by MUSCLE (codons): + +![](images/encapsulated/mega5.png) + +Aligning by codon takes account of the fact that these are protein-encoding sequences so the alignment is performed by first translating the codons to amino acids, performing the alignment, and then finally replacing the amino acids with the original codons. This will produce a more robust alignment for coding data than aligning individual nucleotides. + +Alignment options will appear. Leave the default values and click ‘OK’. + +![](images/encapsulated/mega6.png) + +You will be asked whether you would like to remove gaps before alignment. Click ‘Yes’: + +![](images/encapsulated/mega7.png) + +You will then get a warning that there are stop codons found in the translated sequences. Click 'Ignore' since these are at the ends of the gene and do not affect the alignment. + +![](images/encapsulated/mega8.png) + +![](images/encapsulated/stop-sign.png) + +### Distances +You can now perform phylogenetic analysis. Click on the ‘Data’ tab and select ‘Phylogenetic Analysis’: + +![](images/encapsulated/mega9.png) + +The data are protein encoding, so answer ‘Yes’ when asked: + +![](images/encapsulated/mega10.png) + +The aligned data are available in the main MEGA window. 
+ +In order to reconstruct a tree from sequence data we need to calculate the genetic distance between each pair of sequences. There are different ways of doing this, employing different evolutionary models. The simplest is the p-distance, which is the number of differences between two aligned sequences divided by the number of positions compared. Select ‘Distances’ in the main MEGA window and then select ‘Compute Pairwise Distances…’ in the dropdown menu. + +![](images/encapsulated/mega11.png) + +You may be asked if you want to use the currently active file. Say 'yes'. + +![](images/encapsulated/mega12.png) + +An ‘Analysis Preferences’ dialog will be displayed. Select ‘p-distance’ in the Model/Method section leaving all other options at their default. + +![](images/encapsulated/mega13.png) + +Click ‘OK’. A distance matrix will be calculated and displayed. The distance between sequences 1 (*M. abscessus*) and 2 (*M. gilvum*) is 0.1182795699. This corresponds to 99 nucleotide differences in a total shared length of 837 bases, i.e. (99/837). + +![](images/encapsulated/mega14.png) + +If you re-calculate distances using a different model, you will see that the values are slightly different. Recalculate using the Jukes-Cantor model. The distance is now 0.1287197008. This difference is due to a multiple-hit correction: the model allows for the possibility that a site has changed more than once, and converts the p-distance p into a corrected distance d = -(3/4) × ln(1 - (4/3) × p). + +![](images/encapsulated/stop-sign.png) + +### Tree building +Now you can generate a Neighbor-Joining tree by selecting ‘Phylogeny’ in the main MEGA window and then ‘Construct/Test Neighbor-Joining Tree’. + +![](images/encapsulated/mega15.png) + +Accept the default options and click ‘OK’. + +![](images/encapsulated/mega16.png) + +A ‘Tree Explorer’ window will open. + +![](images/encapsulated/mega17.png) + +The rectangular tree can be misleading because by default it will root at the midpoint whereas the root may not be known. A radiation tree is often a better way to draw an unrooted tree because no assumption of the root is implied. 
You can show a radiation tree by clicking on the tree icon and selecting ‘Radiation’. + +![](images/encapsulated/mega18.png) +![](images/encapsulated/mega19.png) + +Switching off the labels will make the tree clearer. You can do this by unchecking the 'Taxon names' checkbox. + +Often we will want to explicitly root a tree using an outgroup: one or more taxa that we know to be more distantly related to the other members of the tree than those members are to each other. A second dataset has been provided that contains the same sequences with the addition of the *rplB* sequence from a *Corynebacterium diphtheriae* isolate. +Load the mycobacteria+C_diphtheriae.fas file, align it and generate a Neighbor-Joining tree. + +![](images/encapsulated/mega20.png) + +Now we can see where the root should be, as *Corynebacterium diphtheriae* is the most distant node in the tree. We can explicitly root the tree using this node by selecting the branch from *Corynebacterium diphtheriae* to the other isolates, right-clicking and selecting ‘Root’. + +![](images/encapsulated/mega21.png) + +![](images/encapsulated/stop-sign.png) + +### Bootstrap tests +Bootstrapping is a way of testing the reliability of an inferred tree. It works by repeatedly resampling the columns of the alignment at random, with replacement, to create new datasets of the same length, and testing whether the topology of the trees generated from these resampled datasets changes. If it does not then there is a strong signal supporting the topology and we can be more confident of the groupings. The test provides a percentage value for each branch of the tree: the proportion of resampled trees in which that grouping appears. + +From the MEGA main window, select Phylogeny and Neighbor-Joining tree again. This time, in the section marked ‘Phylogeny Test’, select Test of Phylogeny ‘Bootstrap method’ leaving other options at their default settings. + +![](images/encapsulated/mega22.png) + +The bootstrap values will be displayed on each branch of the tree. 
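
The resampling idea behind the bootstrap test can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not what MEGA does internally: the function names are hypothetical, the alignment is the first 60 bases of the three example FASTA sequences shown earlier, and "which pair of sequences is closest by p-distance" stands in for rebuilding a full tree on each replicate.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Build one pseudo-alignment by sampling alignment columns with replacement."""
    length = min(len(seq) for seq in alignment.values())
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def p_distance(a, b):
    """Proportion of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def grouping_support(alignment, first, second, replicates=200, seed=1):
    """Percentage of replicates in which `first` and `second` are the closest pair."""
    rng = random.Random(seed)
    names = list(alignment)
    hits = 0
    for _ in range(replicates):
        rep = bootstrap_replicate(alignment, rng)
        pairs = [(p_distance(rep[a], rep[b]), {a, b})
                 for i, a in enumerate(names) for b in names[i + 1:]]
        closest = min(pairs, key=lambda item: item[0])[1]
        hits += closest == {first, second}
    return 100 * hits / replicates


# First 60 bases of the three example sequences from the FASTA section.
alignment = {
    "seq_1": "TTTGATACTGTTGCCGAAGGTTTGGGCGAAATTCGCGATTTATTGCGCCGTTATCATCAT",
    "seq_2": "TTTGATACCGTTGCCGAAGGTTTGGGTGAAATTCGCGATTTATTGCGCCGTTACCACCGC",
    "seq_3": "TTTGATACCGTTGCCGAAGGTTTGGGTAAAATTCGCGATTTATTGCGCCGTTACCACCGC",
}

print(grouping_support(alignment, "seq_2", "seq_3"))
```

A real bootstrap test rebuilds the whole tree for every replicate and reports, for each branch, the percentage of replicate trees that contain it; the closest-pair check here is only a stand-in to keep the sketch short.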
+ +![](images/encapsulated/mega23.png) + +## Practical 3: Genome assembly + +### Introduction +In this practical you will use the Linux command line to run a bacterial genome assembly using the program SPAdes. The objectives are to: + +* become familiar with the bioinformatics steps involved in assembly and in checking the quality of the assembly; +* understand what software is used and how it is parameterized or configured. + +We will start with the process of assembling contigs from raw reads. A contig (short for contiguous sequence) is the consensus sequence that is formed when sequence reads (usually from fastq files) are ‘stitched together’ to form large regions of the genome. With short reads, repetitive sequences usually prevent complete closed genomes from being produced, so the end result is usually a set of smaller pieces of contiguous DNA that together make up most of the genome. + +### Stages in the assembly process + +* Raw fastq QC assessment +* Fastq trimming +* Trimmed fastq QC assessment +* Assembly +* Assembly QC + +### Introduction to "Next Generation Sequencing" +Next generation or high throughput sequencing involves massively parallel sequencing of small fragments of DNA that have been generated from an original nucleic acid source. This can be genomic DNA, PCR amplicons or cDNA generated from RNA. In this tutorial you will be working with sequence data that has been generated from whole genome DNA extracts. In this case the genomic DNA is fragmented and then Illumina adapter sequences are ligated so that common sequencing primers can be used to prime from the fragments and generate sequence. The fragmentation and ligation can either be done in one step (e.g. with the enzymatic Nextera system that uses a process called tagmentation) or in two steps, as in the TruSeq process where the DNA is sheared physically and the adapters are ligated subsequently. 
The Illumina sequencing technology works best when the fragment size (also known as insert size) is 300-500 bp. With paired-end sequencing both ends of the fragment are sequenced but these reads (one in the forward and one in the reverse direction) may not meet. If the insert is short enough that the two reads overlap, they can be merged, and for assembly they should be. + +At the end of the sequencing process, when paired-end sequencing is used, each sample sequenced will produce a pair of fastq files: one file containing all the read 1s from the fragments and a second file containing all the read 2s. The order of the reads in these files is the same so that the R1 and R2 for each fragment are synchronised. However, since the fragments are generated randomly during library preparation from whole genomes, the reads are not in any order in relation to the genome. Their order can only be established by giving each read context through de novo assembly or by aligning (usually known as mapping) to a close reference genome. + +![](images/encapsulated/assembly1.png) + +### FASTQ format +A FASTQ file normally has four lines per sequence fragment. + +* Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line). +* Line 2 contains the sequence as letters that represent the nucleotides. +* Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again. +* Line 4 encodes the quality values for the sequence in Line 2, and contains the same number of symbols as letters in the sequence. + +For example, a FASTQ file containing a single sequence might look like this: + +``` +@SEQ_ID +GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT ++ +!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65 +``` +The quality metrics are encoded as shown below (taken from https://en.wikipedia.org/wiki/FASTQ_format#Encoding). 
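
As a rough illustration of how the quality line in the record above maps to numbers, the sketch below assumes the Phred+33 encoding, in which each quality score is the character's ASCII code minus 33, and a score Q corresponds to an error probability of 10^(-Q/10). The function names are made up for this example.

```python
def phred33_scores(quality_line):
    """Convert a Phred+33 encoded quality string into a list of integer scores."""
    return [ord(character) - 33 for character in quality_line]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


# Quality line (Line 4) from the single-record FASTQ example above.
quality = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
scores = phred33_scores(quality)

print(scores[:5])                      # scores for the first five bases
print(min(scores), max(scores))        # lowest and highest score in the read
print(error_probability(max(scores)))  # error probability of the best base call
```

The same conversion underlies the per-base quality plots produced by QC tools; only the offset (33 here) differs between encodings.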
Most sequence data is now in Phred+33 format where quality typically ranges from 0 to 41 and is encoded as ! to J. + +![](images/encapsulated/assembly2.png) + +Open a terminal window and make sure that you are in your home directory. Create a directory for the assembly. + +``` +cd ~ +mkdir assembly +``` +The FASTQ files containing the short-read data can be found in a tar archive on the GitHub site for this course. Download this to the new assembly directory: +``` +cd assembly +wget https://github.com/WCSCourses/GCM24/raw/main/course_data/assembly/WTAC.fastq.tar +``` +Extract the short-read FASTQ files from the tar archive using the `tar` command and then list the contents of the directory: +``` +tar xvf WTAC.fastq.tar +ls +``` +You should now see 3 files in the directory - the original tar file and the extracted forward and reverse compressed FASTQ files (WTAC_1.fastq.gz and WTAC_2.fastq.gz). + +### FASTQ file QC assessment +We should first assess the quality of our FASTQ data using the fastqc tool. To check the forward reads enter: +``` +fastqc WTAC_1.fastq.gz +``` +This will generate an HTML output file. You can use a web browser to visualize this. +``` +firefox WTAC_1_fastqc.html +``` +In the web browser you should see the following output: + +![](images/encapsulated/assembly3.png) + +If you click on ‘Per base sequence quality’ you will see that the quality of this fastq file is very high. A quality score above 25 is usually considered good. + +A key step in the preparation of nucleic acid for NGS is fragmentation and there are different methods by which nucleic acids can be shortened prior to library construction. A description of these can be found at https://tinyurl.com/yadb6dvd. + +Enzymatic methods use nucleases or transposases to cleave DNA into small pieces. Clicking on the adapter content link will identify any contamination arising from the Nextera transposase sequence. 
+ +![](images/encapsulated/assembly4.png) + +You will notice that this sample does not contain any contamination. This is because the nucleic acid in this example was mechanically sheared. + +The figure below provides an example of what contamination with the Nextera transposase sequence would look like. Its presence suggests that the sequencing inserts may be small in some of the sequence data. In this case the sequencing has read all the way through from one side of the insert to the other and into the adapter sequence on the far side of the insert. Since this is not sequence from the originating DNA it should be removed. + +![](images/encapsulated/assembly5.png) + +### Assembly +Finally, we come to assembly. For this we will use the SPAdes tool. We need to tell the assembler which files are the forward and reverse FASTQ files; a range of k-mer sizes to use (SPAdes works through these in turn and builds on the result of each to produce the final assembly); and that it should only run the assembler and not use error correction. In addition, as the virtual machines we are running this on are limited in memory, we can limit the memory used to 8 GB (the job will fail if it tries to use more, but at least the machine won't crash!). We also need to tell SPAdes where to save the output, so we will get it to create a new directory called 'output'. Type the following - it will take a few minutes to run! 
+``` +spades -1 WTAC_1.fastq.gz -2 WTAC_2.fastq.gz -k 21,33,43,53,63,75 --memory 8 -o output --only-assembler +``` +While you are waiting you may want to read: +* the original SPAdes publication: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3342519 +* the SPAdes manual: https://github.com/ablab/spades/blob/spades_3.15.4/README.md +* and, if you are really keen, this primer about bacterial WGS: http://cmr.asm.org/content/29/4/881.full.pdf + +Once SPAdes has finished running, navigate to the output directory and list the contents: +``` +cd output +ls +``` +You will see a number of files including: + +* scaffolds.fasta - contains the resulting scaffolds in fasta format (contigs joined using paired-end read information, with the gaps between them padded with Ns; see this article for more details: https://genome.jgi.doe.gov/help/scaffolds.jsf) +* contigs.fasta - contains the resulting contigs in fasta format + +### Assembly QC assessment +Now you will assess the quality of the assembly using Quast. + +Type the following to run Quast on the contigs.fasta file: + +``` +quast.py -o quast_output contigs.fasta +``` +This will create a new directory called 'quast_output' inside the current (SPAdes output) directory. + +There are a number of different output files in this directory. The easiest one to view is the HTML report file. Enter the Quast output directory and load the report into Firefox: + +``` +cd quast_output +firefox report.html +``` +You will see some basic stats about the assembly. + +![](images/encapsulated/assembly6.png) + +The critical figures to look at are: + +* \# (number) of contigs: the fewer the better +* Total length: this should be approximately the size of the genome expected for the species (which is ~2.2Mbp for *Neisseria meningitidis*) +* N50: the larger the better. This is the length of the contig at which, working down from the longest contig, the running total reaches half of the total assembly length (see https://en.wikipedia.org/wiki/N50,_L50,_and_related_statistics for an explanation) + +Well done - you have completed a genome assembly and simple QC! 
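
As a footnote to the Quast figures listed above, the N50 statistic is simple enough to compute by hand. The sketch below is a minimal Python illustration rather than Quast's own code: the function name `n50` and the contig lengths are made up for the example. It sorts the contig lengths from longest to shortest and returns the length at which the running total first reaches half of the total assembly length.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running_total = 0
    for length in ordered:
        running_total += length
        # The first contig that takes the running total to half the assembly is the N50 contig.
        if running_total >= half_total:
            return length
    raise ValueError("no contig lengths supplied")


# Hypothetical contig lengths (in kbp), not taken from the course assembly.
contig_lengths = [80, 70, 50, 40, 30, 20]
print(n50(contig_lengths))
```

With these made-up lengths the total is 290, so the running total passes the halfway mark (145) at the second contig. An assembly broken into many small contigs would give a much smaller value, which is why a larger N50 is read as a sign of a more contiguous assembly.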
+ +> The assembly protocol was modified from a tutorial written by Anthony Underwood. \ No newline at end of file diff --git a/modules/images/encapsulated/assembly1.png b/modules/images/encapsulated/assembly1.png new file mode 100644 index 0000000..eed5624 Binary files /dev/null and b/modules/images/encapsulated/assembly1.png differ diff --git a/modules/images/encapsulated/assembly2.png b/modules/images/encapsulated/assembly2.png new file mode 100644 index 0000000..455f488 Binary files /dev/null and b/modules/images/encapsulated/assembly2.png differ diff --git a/modules/images/encapsulated/assembly3.png b/modules/images/encapsulated/assembly3.png new file mode 100644 index 0000000..4fe9a3d Binary files /dev/null and b/modules/images/encapsulated/assembly3.png differ diff --git a/modules/images/encapsulated/assembly4.png b/modules/images/encapsulated/assembly4.png new file mode 100644 index 0000000..0a179f6 Binary files /dev/null and b/modules/images/encapsulated/assembly4.png differ diff --git a/modules/images/encapsulated/assembly5.png b/modules/images/encapsulated/assembly5.png new file mode 100644 index 0000000..b0c5c53 Binary files /dev/null and b/modules/images/encapsulated/assembly5.png differ diff --git a/modules/images/encapsulated/assembly6.png b/modules/images/encapsulated/assembly6.png new file mode 100644 index 0000000..69775de Binary files /dev/null and b/modules/images/encapsulated/assembly6.png differ diff --git a/modules/images/encapsulated/linux1.jpg b/modules/images/encapsulated/linux1.jpg new file mode 100644 index 0000000..37252d3 Binary files /dev/null and b/modules/images/encapsulated/linux1.jpg differ diff --git a/modules/images/encapsulated/linux_phylo.jpg b/modules/images/encapsulated/linux_phylo.jpg new file mode 100644 index 0000000..4f0040f Binary files /dev/null and b/modules/images/encapsulated/linux_phylo.jpg differ diff --git a/modules/images/encapsulated/linux_phylo.png b/modules/images/encapsulated/linux_phylo.png new file mode 
100644 index 0000000..28b28ef Binary files /dev/null and b/modules/images/encapsulated/linux_phylo.png differ diff --git a/modules/images/encapsulated/linux_phylo2.jpg b/modules/images/encapsulated/linux_phylo2.jpg new file mode 100644 index 0000000..22d8a38 Binary files /dev/null and b/modules/images/encapsulated/linux_phylo2.jpg differ diff --git a/modules/images/encapsulated/mega1.png b/modules/images/encapsulated/mega1.png new file mode 100644 index 0000000..b70364b Binary files /dev/null and b/modules/images/encapsulated/mega1.png differ diff --git a/modules/images/encapsulated/mega10.png b/modules/images/encapsulated/mega10.png new file mode 100644 index 0000000..4f5a56c Binary files /dev/null and b/modules/images/encapsulated/mega10.png differ diff --git a/modules/images/encapsulated/mega11.png b/modules/images/encapsulated/mega11.png new file mode 100644 index 0000000..e042444 Binary files /dev/null and b/modules/images/encapsulated/mega11.png differ diff --git a/modules/images/encapsulated/mega12.png b/modules/images/encapsulated/mega12.png new file mode 100644 index 0000000..f2f645f Binary files /dev/null and b/modules/images/encapsulated/mega12.png differ diff --git a/modules/images/encapsulated/mega13.png b/modules/images/encapsulated/mega13.png new file mode 100644 index 0000000..cfa052e Binary files /dev/null and b/modules/images/encapsulated/mega13.png differ diff --git a/modules/images/encapsulated/mega14.png b/modules/images/encapsulated/mega14.png new file mode 100644 index 0000000..83e402e Binary files /dev/null and b/modules/images/encapsulated/mega14.png differ diff --git a/modules/images/encapsulated/mega15.png b/modules/images/encapsulated/mega15.png new file mode 100644 index 0000000..2d443f6 Binary files /dev/null and b/modules/images/encapsulated/mega15.png differ diff --git a/modules/images/encapsulated/mega16.png b/modules/images/encapsulated/mega16.png new file mode 100644 index 0000000..1382b1c Binary files /dev/null and 
b/modules/images/encapsulated/mega16.png differ diff --git a/modules/images/encapsulated/mega17.png b/modules/images/encapsulated/mega17.png new file mode 100644 index 0000000..d0e360a Binary files /dev/null and b/modules/images/encapsulated/mega17.png differ diff --git a/modules/images/encapsulated/mega18.png b/modules/images/encapsulated/mega18.png new file mode 100644 index 0000000..b047da1 Binary files /dev/null and b/modules/images/encapsulated/mega18.png differ diff --git a/modules/images/encapsulated/mega19.png b/modules/images/encapsulated/mega19.png new file mode 100644 index 0000000..8387eed Binary files /dev/null and b/modules/images/encapsulated/mega19.png differ diff --git a/modules/images/encapsulated/mega2.png b/modules/images/encapsulated/mega2.png new file mode 100644 index 0000000..978596d Binary files /dev/null and b/modules/images/encapsulated/mega2.png differ diff --git a/modules/images/encapsulated/mega20.png b/modules/images/encapsulated/mega20.png new file mode 100644 index 0000000..4023a8e Binary files /dev/null and b/modules/images/encapsulated/mega20.png differ diff --git a/modules/images/encapsulated/mega21.png b/modules/images/encapsulated/mega21.png new file mode 100644 index 0000000..e1b3149 Binary files /dev/null and b/modules/images/encapsulated/mega21.png differ diff --git a/modules/images/encapsulated/mega22.png b/modules/images/encapsulated/mega22.png new file mode 100644 index 0000000..eba30ab Binary files /dev/null and b/modules/images/encapsulated/mega22.png differ diff --git a/modules/images/encapsulated/mega23.png b/modules/images/encapsulated/mega23.png new file mode 100644 index 0000000..1e3b6c5 Binary files /dev/null and b/modules/images/encapsulated/mega23.png differ diff --git a/modules/images/encapsulated/mega3.png b/modules/images/encapsulated/mega3.png new file mode 100644 index 0000000..d1c918d Binary files /dev/null and b/modules/images/encapsulated/mega3.png differ diff --git 
a/modules/images/encapsulated/mega4.png b/modules/images/encapsulated/mega4.png new file mode 100644 index 0000000..22935f9 Binary files /dev/null and b/modules/images/encapsulated/mega4.png differ diff --git a/modules/images/encapsulated/mega5.png b/modules/images/encapsulated/mega5.png new file mode 100644 index 0000000..3ffa641 Binary files /dev/null and b/modules/images/encapsulated/mega5.png differ diff --git a/modules/images/encapsulated/mega6.png b/modules/images/encapsulated/mega6.png new file mode 100644 index 0000000..56e6add Binary files /dev/null and b/modules/images/encapsulated/mega6.png differ diff --git a/modules/images/encapsulated/mega7.png b/modules/images/encapsulated/mega7.png new file mode 100644 index 0000000..23cf337 Binary files /dev/null and b/modules/images/encapsulated/mega7.png differ diff --git a/modules/images/encapsulated/mega8.png b/modules/images/encapsulated/mega8.png new file mode 100644 index 0000000..f553227 Binary files /dev/null and b/modules/images/encapsulated/mega8.png differ diff --git a/modules/images/encapsulated/mega9.png b/modules/images/encapsulated/mega9.png new file mode 100644 index 0000000..42d35be Binary files /dev/null and b/modules/images/encapsulated/mega9.png differ diff --git a/modules/images/encapsulated/pubmlst1.png b/modules/images/encapsulated/pubmlst1.png new file mode 100644 index 0000000..4525b60 Binary files /dev/null and b/modules/images/encapsulated/pubmlst1.png differ diff --git a/modules/images/encapsulated/pubmlst10.png b/modules/images/encapsulated/pubmlst10.png new file mode 100644 index 0000000..8595c0d Binary files /dev/null and b/modules/images/encapsulated/pubmlst10.png differ diff --git a/modules/images/encapsulated/pubmlst11.png b/modules/images/encapsulated/pubmlst11.png new file mode 100644 index 0000000..f52f8eb Binary files /dev/null and b/modules/images/encapsulated/pubmlst11.png differ diff --git a/modules/images/encapsulated/pubmlst12.png 
b/modules/images/encapsulated/pubmlst12.png new file mode 100644 index 0000000..daffb89 Binary files /dev/null and b/modules/images/encapsulated/pubmlst12.png differ diff --git a/modules/images/encapsulated/pubmlst13.png b/modules/images/encapsulated/pubmlst13.png new file mode 100644 index 0000000..fbcaf0d Binary files /dev/null and b/modules/images/encapsulated/pubmlst13.png differ diff --git a/modules/images/encapsulated/pubmlst14.png b/modules/images/encapsulated/pubmlst14.png new file mode 100644 index 0000000..98455d3 Binary files /dev/null and b/modules/images/encapsulated/pubmlst14.png differ diff --git a/modules/images/encapsulated/pubmlst15.png b/modules/images/encapsulated/pubmlst15.png new file mode 100644 index 0000000..b546e27 Binary files /dev/null and b/modules/images/encapsulated/pubmlst15.png differ diff --git a/modules/images/encapsulated/pubmlst16.png b/modules/images/encapsulated/pubmlst16.png new file mode 100644 index 0000000..328a6a9 Binary files /dev/null and b/modules/images/encapsulated/pubmlst16.png differ diff --git a/modules/images/encapsulated/pubmlst17.png b/modules/images/encapsulated/pubmlst17.png new file mode 100644 index 0000000..add29fd Binary files /dev/null and b/modules/images/encapsulated/pubmlst17.png differ diff --git a/modules/images/encapsulated/pubmlst18.png b/modules/images/encapsulated/pubmlst18.png new file mode 100644 index 0000000..6e9863a Binary files /dev/null and b/modules/images/encapsulated/pubmlst18.png differ diff --git a/modules/images/encapsulated/pubmlst19.png b/modules/images/encapsulated/pubmlst19.png new file mode 100644 index 0000000..be5d220 Binary files /dev/null and b/modules/images/encapsulated/pubmlst19.png differ diff --git a/modules/images/encapsulated/pubmlst2.png b/modules/images/encapsulated/pubmlst2.png new file mode 100644 index 0000000..6e9145b Binary files /dev/null and b/modules/images/encapsulated/pubmlst2.png differ diff --git a/modules/images/encapsulated/pubmlst20.png 
b/modules/images/encapsulated/pubmlst20.png new file mode 100644 index 0000000..775a23e Binary files /dev/null and b/modules/images/encapsulated/pubmlst20.png differ diff --git a/modules/images/encapsulated/pubmlst21.png b/modules/images/encapsulated/pubmlst21.png new file mode 100644 index 0000000..9bd9fad Binary files /dev/null and b/modules/images/encapsulated/pubmlst21.png differ diff --git a/modules/images/encapsulated/pubmlst22.png b/modules/images/encapsulated/pubmlst22.png new file mode 100644 index 0000000..09b6a3d Binary files /dev/null and b/modules/images/encapsulated/pubmlst22.png differ diff --git a/modules/images/encapsulated/pubmlst23.png b/modules/images/encapsulated/pubmlst23.png new file mode 100644 index 0000000..ce7d2eb Binary files /dev/null and b/modules/images/encapsulated/pubmlst23.png differ diff --git a/modules/images/encapsulated/pubmlst24.png b/modules/images/encapsulated/pubmlst24.png new file mode 100644 index 0000000..7efaba2 Binary files /dev/null and b/modules/images/encapsulated/pubmlst24.png differ diff --git a/modules/images/encapsulated/pubmlst25.png b/modules/images/encapsulated/pubmlst25.png new file mode 100644 index 0000000..e68beab Binary files /dev/null and b/modules/images/encapsulated/pubmlst25.png differ diff --git a/modules/images/encapsulated/pubmlst26.png b/modules/images/encapsulated/pubmlst26.png new file mode 100644 index 0000000..4b73c9d Binary files /dev/null and b/modules/images/encapsulated/pubmlst26.png differ diff --git a/modules/images/encapsulated/pubmlst27.png b/modules/images/encapsulated/pubmlst27.png new file mode 100644 index 0000000..bc8014b Binary files /dev/null and b/modules/images/encapsulated/pubmlst27.png differ diff --git a/modules/images/encapsulated/pubmlst28.png b/modules/images/encapsulated/pubmlst28.png new file mode 100644 index 0000000..4525b60 Binary files /dev/null and b/modules/images/encapsulated/pubmlst28.png differ diff --git a/modules/images/encapsulated/pubmlst29.png 
b/modules/images/encapsulated/pubmlst29.png new file mode 100644 index 0000000..6e9145b Binary files /dev/null and b/modules/images/encapsulated/pubmlst29.png differ diff --git a/modules/images/encapsulated/pubmlst3.png b/modules/images/encapsulated/pubmlst3.png new file mode 100644 index 0000000..94ccbdf Binary files /dev/null and b/modules/images/encapsulated/pubmlst3.png differ diff --git a/modules/images/encapsulated/pubmlst30.png b/modules/images/encapsulated/pubmlst30.png new file mode 100644 index 0000000..6e6a716 Binary files /dev/null and b/modules/images/encapsulated/pubmlst30.png differ diff --git a/modules/images/encapsulated/pubmlst31.png b/modules/images/encapsulated/pubmlst31.png new file mode 100644 index 0000000..eb57695 Binary files /dev/null and b/modules/images/encapsulated/pubmlst31.png differ diff --git a/modules/images/encapsulated/pubmlst32.png b/modules/images/encapsulated/pubmlst32.png new file mode 100644 index 0000000..b727e2b Binary files /dev/null and b/modules/images/encapsulated/pubmlst32.png differ diff --git a/modules/images/encapsulated/pubmlst33.png b/modules/images/encapsulated/pubmlst33.png new file mode 100644 index 0000000..8d9d8f8 Binary files /dev/null and b/modules/images/encapsulated/pubmlst33.png differ diff --git a/modules/images/encapsulated/pubmlst34.png b/modules/images/encapsulated/pubmlst34.png new file mode 100644 index 0000000..c2ad1b7 Binary files /dev/null and b/modules/images/encapsulated/pubmlst34.png differ diff --git a/modules/images/encapsulated/pubmlst35.png b/modules/images/encapsulated/pubmlst35.png new file mode 100644 index 0000000..1acec7a Binary files /dev/null and b/modules/images/encapsulated/pubmlst35.png differ diff --git a/modules/images/encapsulated/pubmlst36.png b/modules/images/encapsulated/pubmlst36.png new file mode 100644 index 0000000..f1d5c3f Binary files /dev/null and b/modules/images/encapsulated/pubmlst36.png differ diff --git a/modules/images/encapsulated/pubmlst37.png 
b/modules/images/encapsulated/pubmlst37.png new file mode 100644 index 0000000..f8d26dd Binary files /dev/null and b/modules/images/encapsulated/pubmlst37.png differ diff --git a/modules/images/encapsulated/pubmlst38.png b/modules/images/encapsulated/pubmlst38.png new file mode 100644 index 0000000..b2d0c10 Binary files /dev/null and b/modules/images/encapsulated/pubmlst38.png differ diff --git a/modules/images/encapsulated/pubmlst39.png b/modules/images/encapsulated/pubmlst39.png new file mode 100644 index 0000000..a18646d Binary files /dev/null and b/modules/images/encapsulated/pubmlst39.png differ diff --git a/modules/images/encapsulated/pubmlst4.png b/modules/images/encapsulated/pubmlst4.png new file mode 100644 index 0000000..a81df63 Binary files /dev/null and b/modules/images/encapsulated/pubmlst4.png differ diff --git a/modules/images/encapsulated/pubmlst40.png b/modules/images/encapsulated/pubmlst40.png new file mode 100644 index 0000000..6b6a606 Binary files /dev/null and b/modules/images/encapsulated/pubmlst40.png differ diff --git a/modules/images/encapsulated/pubmlst41.png b/modules/images/encapsulated/pubmlst41.png new file mode 100644 index 0000000..c07ab95 Binary files /dev/null and b/modules/images/encapsulated/pubmlst41.png differ diff --git a/modules/images/encapsulated/pubmlst42.png b/modules/images/encapsulated/pubmlst42.png new file mode 100644 index 0000000..b6d1d1d Binary files /dev/null and b/modules/images/encapsulated/pubmlst42.png differ diff --git a/modules/images/encapsulated/pubmlst43.png b/modules/images/encapsulated/pubmlst43.png new file mode 100644 index 0000000..996b861 Binary files /dev/null and b/modules/images/encapsulated/pubmlst43.png differ diff --git a/modules/images/encapsulated/pubmlst44.png b/modules/images/encapsulated/pubmlst44.png new file mode 100644 index 0000000..d031327 Binary files /dev/null and b/modules/images/encapsulated/pubmlst44.png differ diff --git a/modules/images/encapsulated/pubmlst45.png 
b/modules/images/encapsulated/pubmlst45.png new file mode 100644 index 0000000..c65833d Binary files /dev/null and b/modules/images/encapsulated/pubmlst45.png differ diff --git a/modules/images/encapsulated/pubmlst46.png b/modules/images/encapsulated/pubmlst46.png new file mode 100644 index 0000000..c5906e3 Binary files /dev/null and b/modules/images/encapsulated/pubmlst46.png differ diff --git a/modules/images/encapsulated/pubmlst47.png b/modules/images/encapsulated/pubmlst47.png new file mode 100644 index 0000000..af20d99 Binary files /dev/null and b/modules/images/encapsulated/pubmlst47.png differ diff --git a/modules/images/encapsulated/pubmlst48.png b/modules/images/encapsulated/pubmlst48.png new file mode 100644 index 0000000..4e1bb8c Binary files /dev/null and b/modules/images/encapsulated/pubmlst48.png differ diff --git a/modules/images/encapsulated/pubmlst49.png b/modules/images/encapsulated/pubmlst49.png new file mode 100644 index 0000000..28fd4de Binary files /dev/null and b/modules/images/encapsulated/pubmlst49.png differ diff --git a/modules/images/encapsulated/pubmlst5.png b/modules/images/encapsulated/pubmlst5.png new file mode 100644 index 0000000..4525b60 Binary files /dev/null and b/modules/images/encapsulated/pubmlst5.png differ diff --git a/modules/images/encapsulated/pubmlst50.png b/modules/images/encapsulated/pubmlst50.png new file mode 100644 index 0000000..7fb10cb Binary files /dev/null and b/modules/images/encapsulated/pubmlst50.png differ diff --git a/modules/images/encapsulated/pubmlst51.png b/modules/images/encapsulated/pubmlst51.png new file mode 100644 index 0000000..70a5d88 Binary files /dev/null and b/modules/images/encapsulated/pubmlst51.png differ diff --git a/modules/images/encapsulated/pubmlst52.png b/modules/images/encapsulated/pubmlst52.png new file mode 100644 index 0000000..adf060e Binary files /dev/null and b/modules/images/encapsulated/pubmlst52.png differ diff --git a/modules/images/encapsulated/pubmlst53.png 
b/modules/images/encapsulated/pubmlst53.png new file mode 100644 index 0000000..ab6156c Binary files /dev/null and b/modules/images/encapsulated/pubmlst53.png differ diff --git a/modules/images/encapsulated/pubmlst54.png b/modules/images/encapsulated/pubmlst54.png new file mode 100644 index 0000000..1501669 Binary files /dev/null and b/modules/images/encapsulated/pubmlst54.png differ diff --git a/modules/images/encapsulated/pubmlst55.png b/modules/images/encapsulated/pubmlst55.png new file mode 100644 index 0000000..e91f268 Binary files /dev/null and b/modules/images/encapsulated/pubmlst55.png differ diff --git a/modules/images/encapsulated/pubmlst56.png b/modules/images/encapsulated/pubmlst56.png new file mode 100644 index 0000000..2324c4a Binary files /dev/null and b/modules/images/encapsulated/pubmlst56.png differ diff --git a/modules/images/encapsulated/pubmlst6.png b/modules/images/encapsulated/pubmlst6.png new file mode 100644 index 0000000..ed37aad Binary files /dev/null and b/modules/images/encapsulated/pubmlst6.png differ diff --git a/modules/images/encapsulated/pubmlst7.png b/modules/images/encapsulated/pubmlst7.png new file mode 100644 index 0000000..a775723 Binary files /dev/null and b/modules/images/encapsulated/pubmlst7.png differ diff --git a/modules/images/encapsulated/pubmlst8.png b/modules/images/encapsulated/pubmlst8.png new file mode 100644 index 0000000..ca1d6bb Binary files /dev/null and b/modules/images/encapsulated/pubmlst8.png differ diff --git a/modules/images/encapsulated/pubmlst9.png b/modules/images/encapsulated/pubmlst9.png new file mode 100644 index 0000000..eae67c0 Binary files /dev/null and b/modules/images/encapsulated/pubmlst9.png differ diff --git a/modules/images/encapsulated/stop-sign.png b/modules/images/encapsulated/stop-sign.png new file mode 100644 index 0000000..f72f193 Binary files /dev/null and b/modules/images/encapsulated/stop-sign.png differ diff --git a/modules/pubmlst.md b/modules/pubmlst.md new file mode 100644 
index 0000000..346e49d --- /dev/null +++ b/modules/pubmlst.md @@ -0,0 +1,286 @@ +# Genomics and Clinical Microbiology 2024 + +14 - 19 January 2024 + +Wellcome Genome Campus, Hinxton UK + +## Topic: Using PubMLST + +**Instructors: Dr. Keith Jolley, Dr. Made Krisna, Dr. Kasia Parfitt and Prof. Martin Maiden** + +### Identifying alleles and sequence types + +#### Determining allele identifier from a sequence +The allele numbers of specific sequences can be determined using the PubMLST typing databases. You may have sequences for MLST loci, surface antigens such as *porA*, or antibiotic resistance genes such as *dhpS* (sulphonamide resistance). Sequence querying works on any length of sequence, including whole genome data, so you can upload contig assemblies to the website for sequence determination of individual loci. + +1. From the PubMLST *Neisseria* front page (https://pubmlst.org/neisseria/), navigate to typing database: + +![](images/encapsulated/pubmlst1.png) + +2. Click the ‘Query a sequence’ … ‘Single sequence’ query link: + + ![](images/encapsulated/pubmlst2.png) + + +3. Select the locus from the drop-down box and paste your sequence into the form. Press ‘Submit’. + +![](images/encapsulated/pubmlst3.png) + +If the sequence has been previously identified, the website will display the corresponding allele/variant identifier: + +![](images/encapsulated/pubmlst4.png) + +#### Determining ST from MLST allelic profile +If you have been provided with MLST allelic profile results, you can look up the sequence type (ST) and clonal complex information as follows: + +1. From the PubMLST Neisseria front page (https://pubmlst.org/neisseria/), navigate to the typing database: + +![](images/encapsulated/pubmlst5.png) + +2. Click the ‘Search for allelic profiles’ … ‘by allelic profile’ link: + +![](images/encapsulated/pubmlst6.png) + +3. Enter the allelic profile into the web form and press ‘Search’: + +![](images/encapsulated/pubmlst7.png) + +4. 
The ST and clonal complex (if defined) will then be displayed: + +![](images/encapsulated/pubmlst8.png) + +### Querying PubMLST isolate databases +Select the isolate database from a species page on pubmlst.org (e.g. https://pubmlst.org/neisseria/): + +![](images/encapsulated/pubmlst9.png) + +Expand the search menu (click the + next to ‘Search’), and then select ‘Search database’: + +![](images/encapsulated/pubmlst10.png) + +#### Searching provenance data +The standard query form initially provides a means to query provenance information, e.g. isolate name, country and year of isolation etc. To search for all isolates from Africa, select ‘continent’ in the field dropdown box, and enter ‘Africa’ as the value. Click ‘Search’: + +![](images/encapsulated/pubmlst11.png) + +You can build up more complex queries by adding further search terms. Additional query boxes can be added to the form by clicking the ‘+’ button: + +![](images/encapsulated/pubmlst12.png) + +For example, you can combine the previous query with year of isolation, e.g. to select records for isolates isolated from the year 2000 onwards, select ‘year’ in the newly appeared dropdown field box, choose ‘>=’ as the modifier, and enter ‘2000’ as the field value. Click ‘Search’: + +![](images/encapsulated/pubmlst13.png) + +#### Modifying query interface +Sometimes you will want to query by more than just provenance fields. Additional search criteria can be added to the form by clicking the ‘Modify form options’ tab on the top-right hand side of the page. 
This has an image of a wrench: + +![](images/encapsulated/pubmlst14.png) + +This displays a box that allows you to add additional query types: + +![](images/encapsulated/pubmlst15.png) + +To allow searching by ST or by allele designations, click the ‘Show’ button next to ‘Allele designations/scheme field values’: + +![](images/encapsulated/pubmlst16.png) + +Then close the modification box by clicking the [X] in the top left of the box or by clicking the ‘Modify form options’ tab again. + +#### Searching by MLST alleles/STs +With the allele designation/scheme field query type displayed (see previous section), you can now search by ST, clonal complex or allele designation. For example, to search the entire database for ST-11 isolates, select ‘ST (MLST)’ from the field list and enter ‘11’ as the value. Click ‘Search’: + +![](images/encapsulated/pubmlst17.png) + +As before, additional query terms can be added by clicking the ‘+’ button. These query terms will be combined with any provenance field queries, e.g. all ST-11 isolates from Africa from the year 2000 onwards: + +![](images/encapsulated/pubmlst18.png) + +#### Filtering queries +Clicking the ‘Modify form options’ tab allows you to display various query filters by clicking the ‘Show’ button next to ‘Filters’: + +![](images/encapsulated/pubmlst19.png) + +The filters that appear include ‘Publication’, ‘MLST profile’ completion and ‘clonal complex’. Other filters can be added by selecting them from the dropdown list and clicking ‘Add filter’.
Any filter used will be combined with queries entered in other areas of the form: + +![](images/encapsulated/pubmlst20.png) + +For example, to show ST-11 clonal complex African isolates from 2000 onwards, combine the query as below: + +![](images/encapsulated/pubmlst21.png) + +#### Analysing returned datasets +At the bottom of any page of results you will find a large number of buttons that take you to analysis functions that use the results of your query: + +![](images/encapsulated/pubmlst22.png) + +For example, you can break down the results by provenance field by clicking the ‘Fields’ button: + +![](images/encapsulated/pubmlst23.png) + +A series of maps and charts will be displayed. You can move between charts by selecting different fields in the dropdown box. + +![](images/encapsulated/pubmlst24.png) + +Data for these can additionally be exported in text or Excel formats. +You can also break one field down against another using the ‘Two Field’ breakdown: + +![](images/encapsulated/pubmlst25.png) + +This allows you to combine any two fields (provenance, allele designation, ST etc.). For example, country vs capsule group: + +![](images/encapsulated/pubmlst26.png) + +This will display a table of combinations. + +![](images/encapsulated/pubmlst27.png) + +### Whole genome analysis using the BIGSdb platform in PubMLST +A separate database instance has been set up for each participant to use when uploading and analysing genome data. This is an isolate database with all Neisseria loci defined as currently used on PubMLST. It also contains genome data for contemporaneous isolates from the same geographical location as the outbreak. You will have curator access to this database in order to upload, scan, tag and analyse sequence data for this practical. +You will be told the web address to use to connect to the database, which varies slightly by user number.
+ +#### Extracting typing data +You have been provided with FASTA files of the sequence contigs assembled from Illumina short-read data using the Velvet assembler. For full analysis you can upload these sequences to the database and associate them with an isolate record; we will do this later in the practical. You can, however, quickly extract standard typing information (ST, clonal complex, PorA and FetA variants) by querying these sequences against the sequence definition database. We will do this for just one of the isolates as a demonstration. + +1. From the PubMLST Neisseria front page, navigate to the typing database: + +![](images/encapsulated/pubmlst28.png) + +2. Click the ‘Query a sequence’ … ‘Single sequence’ query link: + +![](images/encapsulated/pubmlst29.png) + +3. You can either copy and paste the contigs in FASTA format or choose to upload the file. We will do the latter. Select the contig file for NZ98/254 by dragging and dropping it on to the ‘Upload FASTA’ area, and select ‘MLST’ in the locus/scheme dropdown box. Then click ‘Submit’. + +![](images/encapsulated/pubmlst30.png) + +4. The ST should be displayed in the output section. Repeat, selecting ‘Finetyping antigens’ to get the PorA and FetA results. + +![](images/encapsulated/pubmlst31.png) + +#### Uploading isolate records to the database +While the rapid extraction of typing data is useful for a quick look, more detailed analysis requires loading an isolate record and associated sequence data into the database. + +The first stage of this process is creating isolate records for each of the genomes we want to analyse. This can be done through the curation interface either record-by-record, which is fine if you only have one isolate to add, or more usually using a batch upload method. The batch upload can be prepared in Excel, or any other spreadsheet package, and then copied and pasted into the batch add web form. + +1. Select your user number on https://pubmlst.org/training/. + +![](images/encapsulated/pubmlst32.png) + +2.
Log in using the account details that you have been provided with: + +![](images/encapsulated/pubmlst33.png) + +3. Access the curator interface by clicking the red user icon in the top right of the screen. + +![](images/encapsulated/pubmlst34.png) + +4. You should now have reached the database curator’s page. Click the ‘Batch Add’ isolates link. + +![](images/encapsulated/pubmlst35.png) + +This takes you to a page that allows you to paste in the prepared batch data. If you were to prepare this data yourself, there is a link that provides an Excel template for you. + +![](images/encapsulated/pubmlst36.png) + +However, an upload file has been prepared for you (available in Excel [isolate_records.xlsx]). + +5. Open the isolate_upload.xlsx file in a spreadsheet program, then copy and paste the entire set of data, including headings into the web form. Do not worry if the formatting appears to be messed up. + +![](images/encapsulated/pubmlst37.png) + +6. Click ‘Submit’. The data will be checked for formatting and any problems will be highlighted (there should be no errors as the file has been prepared for you). Finally click the ‘Import data’ button. + +![](images/encapsulated/pubmlst38.png) + +#### Uploading genome data to isolate records +Now that isolate records have been created, the genome data can be uploaded and associated with the appropriate record. + +1. Either, i) immediately after uploading the isolate records, click the DNA icon: + +![](images/encapsulated/pubmlst39.png) + +or ii) from the curator index page, click the ‘Batch Add’ sequences link: + +![](images/encapsulated/pubmlst40.png) + +2. Now we need to tell the system which assembly file belongs to which record. Open the file ‘genome_upload.xlsx’. This contains data in 2 columns: the isolate name, and the assembly filename. Paste the spreadsheet contents into the web form and select ‘isolate’ as the identifying field name. Click ‘Submit’. 
+ +![](images/encapsulated/pubmlst41.png) + +Now drag and drop the files in to the drop file zone. + +![](images/encapsulated/pubmlst42.png) + +The files will be uploaded to a staging area. + +![](images/encapsulated/pubmlst43.png) + +3. Click ‘Validate’ to check that these files are valid FASTA format. The files will be checked and a table will be displayed showing the total sequence size and number of contigs found. Click ‘Upload validated contigs’. + +![](images/encapsulated/pubmlst44.png) + +4. A confirmation screen will be displayed after a few seconds. + +![](images/encapsulated/pubmlst45.png) + +#### Scanning and tagging the typing loci +Blasting the genomes against all known alleles of a particular locus is a process known as ‘scanning’. Marking these identified alleles in the database is termed ‘tagging’. + +1. From the curator’s index page, click ‘Sequence tags .. scan’: + +![](images/encapsulated/pubmlst46.png) + +2. Select a few of your isolates from the isolate selection list and the ‘MLST’ and ‘Finetyping antigens’ checkboxes from the schemes list. Select ‘Scan selected loci together’ and leave other controls at their default settings and click ‘Scan’: + +![](images/encapsulated/pubmlst47.png) + +Scanning takes a few seconds per genome (or a couple of minutes if all loci are selected). The scan will only look at loci which do not already have an allele designation recorded in the database. The PorA VR and FetA VR loci are defined by peptide sequence - these take slightly longer to scan since the server has to perform a TBLASTX query to determine these. + +3. When scanning is complete, you should see a list of identified allelic matches and ticked checkboxes. These indicate that you will tag these alleles in the database. Click the ‘Tag alleles/sequences’ button: + +![](images/encapsulated/pubmlst48.png) + +The database should update and display a confirmation: + +![](images/encapsulated/pubmlst49.png) + +4. 
Normally, scanning is an automated process that happens in the background. You can repeat scanning and tagging for all the other isolates yourself, or you can wait, in which case this will be performed automatically by the autotagger script. The autotagger scans all loci that have been defined in the database. + +#### Genome-wide comparison +The information you have extracted so far could have been obtained using conventional MLST and antigen gene sequencing. With the genomes available, however, we can now look at relationships between isolates at a much higher resolution. + +The Genome Comparator tool can be used to compare isolates using any set of loci or against a complete annotated genome. + +Make sure you are in the standard query interface (not the curator interface). Either go back to the training front page (https://pubmlst.org/training) and select your user number again, or from the curator interface click the user icon in the top right corner: + +![](images/encapsulated/pubmlst50.png) + +First, we will use ribosomal MLST (rMLST). + +1. From the database contents page (not the curator’s interface), click the ‘Search database’ link. + +![](images/encapsulated/pubmlst51.png) + +2. Perform a search to return the isolate records that you have uploaded – your records should all have id numbers less than 20 (other records that already existed in the database and which are contemporaneous isolates have higher id numbers), e.g. search for id <= 20. + +![](images/encapsulated/pubmlst52.png) + +3. At the bottom of the results table, click the ‘Genome Comparator’ button. + +![](images/encapsulated/pubmlst53.png) + +4. Your isolates will be selected. Choose ‘Ribosomal MLST’ from the recommended schemes list. The database contains other contemporaneous isolates which may help you in your analysis, but for now, limit the analysis to your records. Click ‘Submit’: + +![](images/encapsulated/pubmlst54.png) + +5.
The analysis will be submitted to the job queue and should normally run within a few seconds (it may be a bit longer if the server is busy). + +![](images/encapsulated/pubmlst55.png) + +When the analysis completes, a table showing the alleles at each of the rMLST loci will be displayed along with a NeighborNet network. + +6. Now we will analyse the isolates using the cgMLST scheme. This scheme consists of 1422 loci and provides higher resolution than the rMLST scheme. Repeat the Genome Comparator analysis but this time select the cgMLST scheme (*N. meningitidis* cgMLST v2). + +![](images/encapsulated/pubmlst56.png) + +Another tool that you may find very useful is GrapeTree. This generates minimum-spanning trees and allows you to colour isolate nodes with any metadata fields that you wish. Look at the online documentation for further details (https://bigsdb.readthedocs.io/en/latest/data_analysis/grapetree.html).
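
To make the allele table from Genome Comparator, and the trees drawn by GrapeTree, a little more concrete, the sketch below shows one simple way a pairwise distance between isolates could be derived from allele designations: count the loci at which two isolates carry different alleles. This is only an illustration and not the code used by BIGSdb or GrapeTree. The isolate names, locus names and allele numbers are made up, and skipping loci that lack a designation in either isolate is an assumption about how missing data might be handled.

```python
from itertools import combinations

# Hypothetical allele designations: isolate name -> {locus: allele number}.
# None marks a locus with no allele designation for that isolate.
profiles = {
    "isolate_A": {"locus_1": 1, "locus_2": 4, "locus_3": 7, "locus_4": 2},
    "isolate_B": {"locus_1": 1, "locus_2": 4, "locus_3": 9, "locus_4": 2},
    "isolate_C": {"locus_1": 3, "locus_2": 5, "locus_3": 9, "locus_4": None},
}


def allelic_distance(a, b):
    """Count loci designated in both isolates where the allele numbers differ."""
    shared = [
        locus for locus in a
        if a[locus] is not None and b.get(locus) is not None
    ]
    return sum(1 for locus in shared if a[locus] != b[locus])


def distance_matrix(profiles):
    """Build a symmetric matrix of pairwise allelic distances."""
    names = sorted(profiles)
    matrix = {name: {name: 0} for name in names}
    for x, y in combinations(names, 2):
        d = allelic_distance(profiles[x], profiles[y])
        matrix[x][y] = d
        matrix[y][x] = d
    return matrix


matrix = distance_matrix(profiles)
# Print one row per isolate, columns in the same sorted order.
for name in sorted(matrix):
    print(name, [matrix[name][other] for other in sorted(matrix)])
```

With more loci in the scheme (for example cgMLST rather than rMLST) there are more positions at which isolates can differ, which is why the larger scheme gives higher resolution.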