Using rextraccnt to extract entries from nt-like databases
Rextraccnt extracts entries from metagenomic databases that use the FASTA format of the NCBI nt database, in which each entry is identified by its accession number. The name of the script is intentionally close to rextract because the two scripts work in analogous ways. Rextraccnt is useful when you have a very large database (e.g., a decontaminated version of NCBI BLAST nt) and need a subset selected by taxonomic criteria, such as the entries belonging to organisms under a given clade or, conversely, everything except some branch of the taxonomic tree. The command layout is detailed below and an example closes the page, but first we cover the expected input format and how to obtain an accession-to-taxid mapping file.
The input is expected to be in the format used by the NCBI BLAST nt FASTA files, which have the accession number as the sequence ID. For example:
>X51700.1 Bos taurus mRNA for bone Gla protein
GTCCACGCAGCCGCTGACAGACACACCATGAGAACCCCCATGCTGCTCGC...
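As an illustration of that header convention, here is a minimal Python sketch that pulls the accession out of each FASTA header by taking the first whitespace-delimited token after `>`. The function name `iter_accessions` is hypothetical and is not part of Recentrifuge; the sample record is the one shown above.

```python
import io


def iter_accessions(handle):
    """Yield the accession (first token after '>') of each FASTA header line."""
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split(maxsplit=1)
            if fields:  # ignore a bare ">" with no sequence ID
                yield fields[0]


# In-memory sample so the sketch runs without any file on disk
sample = io.StringIO(
    ">X51700.1 Bos taurus mRNA for bone Gla protein\n"
    "GTCCACGCAGCCGCTGACAGACACACCATGAGAACCCCCATGCTGCTCGC\n"
)
accessions = list(iter_accessions(sample))
print(accessions)  # ['X51700.1']
```

For a real database you would pass an open file handle instead of the in-memory sample.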
To run the rextraccnt command, you need a file that maps each accession in the input file to its taxonomic identifier (NCBI Taxonomy), one pair per line, in no particular order. For example:
X51700.1 9913
...
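To make the mapping format concrete, the following Python sketch loads such a file into a dictionary. It assumes whitespace-separated `accession taxid` pairs, as in the example line, and silently skips blank or malformed lines; that tolerance is a choice made for this illustration, not documented rextraccnt behavior. The name `load_mapping` is hypothetical.

```python
import io


def load_mapping(handle):
    """Return {accession: taxid} from lines of the form 'ACCESSION TAXID'."""
    mapping = {}
    for line in handle:
        fields = line.split()
        # Keep only well-formed pairs with a numeric taxid (assumption of this sketch)
        if len(fields) != 2 or not fields[1].isdigit():
            continue
        mapping[fields[0]] = int(fields[1])
    return mapping


# In-memory sample: one valid pair, one blank line, one malformed line
sample = io.StringIO("X51700.1 9913\n\nmalformed_line\n")
mapping = load_mapping(sample)
print(mapping)  # {'X51700.1': 9913}
```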
You pass the name of this file to rextraccnt via the --mapfile argument. For example, for the NCBI nt database, you can generate this file with blastdbcmd from the NCBI BLAST+ suite:
blastdbcmd -db nt -entry all -outfmt "%a %T" > nt.fa.taxidmapping
The layout of the Rextraccnt (rextraccnt) command (ver. 1.12.0) is:
usage: rextraccnt [-h] [-d] [-l NUMBER] [-e NUMBER] [-n PATH] [-i TAXID]
                  [-x TAXID] -m FILE [-f FILE] [-c] [-V]

  -h, --help            show this help message and exit
  -d, --debug           increase output verbosity and perform additional checks
  -l NUMBER, --limit NUMBER
                        limit of nt DB entries to extract; default: no limit
  -e NUMBER, --entrymax NUMBER
                        maximum number of nt DB entries to search for the taxa; default: no maximum
  -n PATH, --nodespath PATH
                        path for the nodes information files (nodes.dmp and names.dmp from NCBI)
  -m FILE, --mapfile FILE
                        Mapping (accession to taxid) file
  -c, --compress        Output FASTA file will be gzipped
  -V, --version         show program's version number and exit
  -i TAXID, --include TAXID
                        NCBI taxid code to include a taxon and all underneath
                        (multiple -i is available to include several taxid);
                        by default all the taxa is considered for inclusion
  -x TAXID, --exclude TAXID
                        NCBI taxid code to exclude a taxon and all underneath
                        (multiple -x is available to exclude several taxid)
  -f FILE, --ntfastafile FILE
                        NCBI nt formatted FASTA file
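The phrase "a taxon and all underneath" in the -i and -x options can be illustrated with a small Python sketch over a toy taxonomy. This is not rextraccnt's implementation: the child-to-parent table collapses all intermediate ranks, the helper names are hypothetical, and the rule that exclusion takes precedence when both options match is an assumption of this sketch that the help text does not state.

```python
# Toy child -> parent table (intermediate ranks collapsed for brevity):
# 1 root, 2759 Eukaryota, 4751 Fungi, 33208 Metazoa,
# 4932 Saccharomyces cerevisiae, 9913 Bos taurus
PARENT = {2759: 1, 4751: 2759, 33208: 2759, 4932: 4751, 9913: 33208}


def lineage(taxid, parent):
    """Return the set holding taxid and all of its ancestors up to the root."""
    seen = set()
    while taxid not in seen:  # also stops if a node is its own parent
        seen.add(taxid)
        if taxid not in parent:
            break
        taxid = parent[taxid]
    return seen


def is_selected(taxid, parent, include=(), exclude=()):
    """Decide whether an entry with this taxid passes the include/exclude filters."""
    ancestors = lineage(taxid, parent)
    if ancestors & set(exclude):  # assumption: exclusion wins over inclusion
        return False
    # With no include list, every taxon is a candidate (as the help text says)
    return not include or bool(ancestors & set(include))


print(is_selected(4932, PARENT, include=[4751]))  # True: yeast is under Fungi
print(is_selected(9913, PARENT, include=[4751]))  # False: cattle is not
print(is_selected(9913, PARENT, exclude=[4751]))  # True: only Fungi are excluded
```

In a real run the parent table would come from nodes.dmp, which is why the -n option points to the NCBI taxonomy files.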
For example, if you:
- want to extract all the fungal (taxid: 4751) entries of a decontaminated nt database nt_decon.fa,
- have cloned the repo in ~/recentrifuge,
- have taxonomy files downloaded and expanded to /my/tax/dir (or just use retaxdump!),
- have generated a mapping file named nt.fa.taxidmapping,
- want to get some extra information about the taxonomy,

then you may run:
~/recentrifuge/rextraccnt -d -n /my/tax/dir -i 4751 -m nt.fa.taxidmapping -f nt_decon.fa
Since the NCBI nt database is currently around 1 TB in size, the process may take more than an hour to complete. When it finishes, you will get the file nt_decon_rxnt_incl4751.fa as a result.
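Once the run finishes, a quick sanity check is to count the entries in the resulting FASTA file. The Python sketch below does this by counting header lines. The helper `count_entries` is hypothetical, and treating a `.gz` suffix as the marker of gzipped output (as might be produced with -c) is an assumption, since the exact output name in that case is not given here. The demo uses a small temporary file rather than a real result.

```python
import gzip
import os
import tempfile


def count_entries(path):
    """Count FASTA records in a plain or gzipped file by counting '>' header lines."""
    # Assumption: gzipped files are recognized by their .gz suffix
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Demo on a tiny temporary FASTA file with two made-up records
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.fa")
    with open(demo, "w") as out:
        out.write(">A1.1 first\nACGT\n>B2.1 second\nGGCC\n")
    demo_count = count_entries(demo)

print(demo_count)  # 2
```

On the example above you would call `count_entries("nt_decon_rxnt_incl4751.fa")`.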
If you use Recentrifuge in your research, please consider citing the paper. Thanks!
Martí JM (2019) Recentrifuge: Robust comparative analysis and contamination removal for metagenomics. PLOS Computational Biology 15(4): e1006967. https://doi.org/10.1371/journal.pcbi.1006967