STAMPS 2018 - intro to shotgun metagenomes, and assembly


Outline, in brief

  1. Introduction and welcome!
  2. Some terminology:
    • "metagenomics" == "shotgun metagenomics"
    • "16S sequencing" is a type of "amplicon sequencing" or "targeted sequencing"
  3. Shotgun sequencing - an intro
  4. Shotgun sequencing of metagenomes - why? why not?
    • Cons vs amplicon:
      • lower coverage / more expensive (good? bad? what are the tradeoffs?)
      • much more computationally challenging to analyze
    • Pros vs amplicon:
      • different bias (no primers)
      • virus/phage can be detected
      • function can be detected
      • recover (putative) genomes
  5. What can we do with shotgun sequencing?
    • do taxonomic analysis directly on the reads - (Tuesday!)
    • search the reads for genes of interest (function, taxonomy)
    • assemble the reads into contigs, longer stretches of DNA - (next few days)
      • annotate contigs with taxonomy, function
      • (note distinction between de novo assembly and reference-based assembly)
      • cluster contigs together to extract genome bins, aka "metagenome-assembled genomes" (MAGs) - (Monday!)
      • compare contig abundances across samples to look for differentially abundant sequences
    • (see Mike Lee's diagram of All the Tools!)
  6. Important notes on assembly:
    • assembly squashes abundance :(
    • assembly ignores complicated regions :(
    • assembly is surprisingly accurate and (when using megahit, at least) computationally tractable :)
  7. Important notes on quantifying assembled contigs:
  8. Some open computational research questions:
    • right now, assembly-based approaches simply... discard some proportion of the data. we should figure out a better way.
    • what is the value of long reads in shotgun metagenomics?

Question 1!

Let's assemble something!

First, log in to the class server and activate the necessary software environment:

export PATH="/class/stamps-software/miniconda3/bin:$PATH"

Now, download some mystery data:

mkdir ~/mystery
cd ~/mystery
curl -o sample.R1.fq.gz -L
curl -o sample.R2.fq.gz -L
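
Before assembling, it can be worth checking that both downloads are intact and hold the same number of reads. The snippet below is a minimal sketch rather than part of the original walkthrough: `count_reads` is a hypothetical helper name, and it assumes standard four-line FASTQ records (no wrapped sequence lines).

```shell
# Count reads in a gzipped FASTQ file.
# Assumption: every record is exactly 4 lines (header, sequence, '+', qualities).
count_reads() {
    gzip -dc "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# Paired-end files should report the same number of reads.
for f in sample.R1.fq.gz sample.R2.fq.gz; do
    if [ -f "$f" ]; then
        echo "$f: $(count_reads "$f") reads"
    else
        echo "$f: not found" >&2
    fi
done
```

If the two counts differ, or `gzip` reports an error, one of the downloads is probably truncated and should be fetched again.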

Assemble using the megahit metagenome assembler:

megahit -1 sample.R1.fq.gz -2 sample.R2.fq.gz -o mystery.assembly -m 5e9

(Digression: where did this "megahit" software come from!? See: "What is bioconda?" to learn about the magical software installation system known as "conda", and the amazing set of community-maintained packages that is "bioconda".)

After about 5 minutes, you should see output like this:

--- [STAT] 7774 contigs, total 4987609 bp, min 200 bp, max 8658 bp, avg 642 bp, N50 1049 bp

what does this mean?
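
One way to unpack that line is to recompute it. The sketch below reads contig lengths from a FASTA file and prints the same fields; `assembly_stats` is a hypothetical helper name, not a megahit or QUAST command. It uses the usual definition of N50: sort contigs from longest to shortest, and N50 is the length of the contig at which the running total first reaches half of all assembled bases. The rounding of the average is an assumption and may not match megahit exactly.

```shell
# Summarize contig lengths from a FASTA file:
# number of contigs, total bp, min, max, mean, and N50.
assembly_stats() {
    awk '
        /^>/ { if (len > 0) print len; len = 0; next }  # header: emit previous contig length
        { len += length($0) }                           # sequence line: accumulate length
        END { if (len > 0) print len }                  # emit the last contig
    ' "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }                  # lengths arrive longest first
        END {
            if (NR == 0) { print "no contigs"; exit }
            running = 0
            for (i = 1; i <= NR; i++) {
                running += lens[i]
                # N50: first length at which the running sum covers half the total
                if (running * 2 >= total) { n50 = lens[i]; break }
            }
            printf "%d contigs, total %d bp, min %d bp, max %d bp, avg %d bp, N50 %d bp\n",
                NR, total, lens[NR], lens[1], int(total / NR + 0.5), n50
        }
    '
}

# Run on the megahit output if it is present.
if [ -f mystery.assembly/final.contigs.fa ]; then
    assembly_stats mystery.assembly/final.contigs.fa
fi
```

Read this way, "N50 1049 bp" in the output above says that contigs of 1049 bp or longer together contain at least half of the 4987609 assembled bases.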

Questions you might ask yourself at this point:

  • what are the basic assembly metrics?
  • is this a good assembly? (what does "good" mean?)
  • what microbial genome(s) are present in this assembly?

We can get a preliminary set of metrics with QUAST.

quast mystery.assembly/final.contigs.fa
cat quast_results/latest/report.txt

Open topics for discussion!

How much should I sequence?

How accurate/effective is functional classification on shotgun metagenome data?