Skip to content

Commit

Permalink
adding mmseqs2
Browse files Browse the repository at this point in the history
  • Loading branch information
linsalrob committed Jul 19, 2023
1 parent 8dbbf0d commit 6400ca0
Show file tree
Hide file tree
Showing 2 changed files with 134 additions and 1 deletion.
114 changes: 114 additions & 0 deletions MMSeqs2/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,114 @@
# MMSeqs2

[MMSeqs2](https://github.com/soedinglab/MMseqs2) is a fast sequence searching algorithm that we use in replace of [blast](https://blast.ncbi.nlm.nih.gov/Blast.cgi) or [diamond](https://github.com/bbuchfink/diamond)

MMSeqs2 has a lot of different options, and we have not yet included them all here, but you can find full details about MMSeqs2 in their [detailed manual](https://mmseqs.com/latest/userguide.pdf).

# MMSeqs2 databases

Like many tools, MMSeqs2 has precomputed databases that you can download.

There is a [complete list on their website](https://github.com/soedinglab/MMseqs2/wiki#downloading-databases)

You can download a database with the `databases` command. For example, to download the [UniRef50](https://www.uniprot.org/help/uniref) database:

```bash
mkdir -p UniRef50
mmseqs databases --threads 8 UniRef50 UniRef50/UniRef50 /tmp
```

Some of the databases have taxonomy included with them, and that enables you to use `mmseqs easy-taxonomy` to explore the metagenome.

# Easy Taxonomy

We use the MMSeqs2 easy taxonomy a _lot_ for analysing metagenomes, especially by comparing to the [UniRef50](https://www.uniprot.org/help/uniref) database.

First, `mmseqs easy-taxonomy` _requires_ `fasta` files and does not work with `fastq` files. We have a [fast way to convert fastq to fasta](https://edwards.flinders.edu.au/fastq-to-fasta/) or you can find some tools online.

We also take advantage of `mmseqs` [sensitivity sweep](https://github.com/soedinglab/MMseqs2/wiki#set-sensitivity--s-parameter) but you should consider comparing [sensitivity and resources](https://github.com/soedinglab/MMseqs2/wiki#optimizing-sensitivity-and-consumption-of-resources). There is a lot of discussion on the [MMSeqs2 wiki](https://github.com/soedinglab/MMseqs2/wiki) about setting sensitivity.


We typically use this command to run the easy taxonomy:


```bash
mkdir easy-taxonomy
mmseqs easy-taxonomy sequence.fasta UniRef50/UniRef50 easy-taxonomy/sequence_taxonomy /tmp --start-sens 1 --sens-steps 3 -s 7 --threads 32
```

The results will be in a series of files in the `easy-taxonomy` directory, whose names start with `sequence_taxonomy`:

SAGCFN_22_00809_S34_lca.tsv.gz SAGCFN_22_00809_S34_report.gz SAGCFN_22_00809_S34_tophit_aln.gz SAGCFN_22_00809_S34_tophit_report.gz

- `sequence_taxonomy_lca.tsv.gz`: The lowest common ancestor of the sequences in tab separated text.

Example output:

```
R100400180029:20220829140225:V350082744:2:1145432:5:58/2/2 310915 species Pangasianodon hypophthalmus 2 2 1 0.540
```

Columns are:
1. the sequencing read
2. the taxonomy ID from [NCBI taxonomy](https://www.ncbi.nlm.nih.gov/datasets/taxonomy/tree). For example, this is [310915](https://www.ncbi.nlm.nih.gov/datasets/taxonomy/310915/)
3. the taxonomic clade. `Species` in this example
4. The organism name. `Pangasianodon hypophthalmus`
5.


- `sequence_taxonomy_report.gz` a Kraken2 style output report

Example output:

```
0.8561 9653 9653 species 310915 Pangasianodon hypophthalmus
```

- `sequence_taxonomy_tophit_aln.gz` the `blast m8` format

Example output:

```
R100400180029:20220829140225:V350082744:2:1145432:5:58/2/2 UniRef50_UPI00147C5152 0.382 163 30 0 0 50 0 163 1.796E-26 108
```

The columns are:

1. Sequence Read
2. Match database ID. In this case from the [UniRef50](https://www.uniprot.org/) we have sequence [UniRef50_UPI00147C5152](https://www.uniprot.org/uniref/UniRef50_UPI00147C5152)
3. Similarity (38.2% identity)
4. Alignment length (163 bases)
5. Gaps (30 bases)
6. Mismatches (0 bases)
7. Start on the sequence read (0)
8. End on the sequence read (50)
9. Start on the database sequence (0)
10. End on the database sequence (163)
11. E-value (1.796E-26)
12. Bit score


- `sequence_taxonomy_tophit_report.gz` the taxonomy and matches to all of the proteins

Example output

```
UniRef50_UPI00147C5152 6970 0.312 1849.374 0.367 310915 species Pangasianodon hypophthalmus
```

The columns are:

1. Database ID. In this case from the [UniRef50](https://www.uniprot.org/) we have sequence [UniRef50_UPI00147C5152](https://www.uniprot.org/uniref/UniRef50_UPI00147C5152)
2. Number of sequences aligning to target
3. Unique coverage of target uniqueAlignedResidues / targetLength
4. Target coverage alignedResidues / targetLength
5. Average sequence identity
6. [NCBI taxonomy](https://www.ncbi.nlm.nih.gov/datasets/taxonomy/tree) identifier
7. Taxonomic level `species`
8. Taxonomic name, in this case `Pangasianodon hypophthalmus`






21 changes: 20 additions & 1 deletion Workshops/INRB2023.md
Original file line number Diff line number Diff line change
Expand Up @@ -181,7 +181,26 @@ Then, we can assemble the filtered bacterial sequences using spades:
spades.py --meta -1 not_human/788707_20180129_S_R1.fastq.gz -2 not_human/788707_20180129_S_R2.fastq.gz -o not_human_assembly
```

Once you have assembled the sequences, the first step is to visualise the assembly with [bandage](https://rrwick.github.io/Bandage/).
Once you have assembled the sequences, the first step is to visualise the assembly with [bandage](https://rrwick.github.io/Bandage/), and hopefully you will find an image like this

![Bandage plot](images/bandage.png)


# Hecatomb

You can [find the Hecatomb tutorial on the readthedocs website](https://hecatomb.readthedocs.io/)

# Kraken annotations

You can download the [kraken2 databases](https://benlangmead.github.io/aws-indexes/k2) and [run Kraken2](../Kraken2/) on your samples.

# Functional annotations

Just as with taxonomy, there are two broad approaches to figuring out the functions that are present.

There are a suite of tools that use [heuristics](https://en.wikipedia.org/wiki/Heuristic) (en Français: heuristique). For example, [superfocus](../SUPER-FOCUS) uses _k_-mers, short sequences, to find the functions that are present.

Another way to find the functions, is to use [MMSeqs2](https://github.com/soedinglab/MMseqs2) easy-taxonomy. You can find more details about the easy-taxonomy in the [MMSeqs2 manual](https://mmseqs.com/latest/userguide.pdf)



Expand Down

0 comments on commit 6400ca0

Please sign in to comment.