Data format standards
You need ONE Genbank file for every organism. Concatenate the Genbank files for all the contigs (after generating a raw file, if you need one). The following information is taken from the Genbank files at a minimum:
- Organism (in the /organism="[organism name]" line )
- Tax ID (in a /db_xref="Taxon:[taxid]" line )
- Genome sequences
These should be present in all the genbank files from Genbank (ftp.ncbi.nih.gov/genomes/Bacteria). It is assumed that any duplicates of such lines that are present in the Genbank file are all identical. In addition, for convertGenbankToTable.py to work you'll need CDS fields containing the locations of each gene and the translated amino acid sequences.
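As a quick sanity check before running ITEP, the minimum fields listed above can be looked for with a few regular expressions. This is only a sketch based on plain text matching, not the parser ITEP itself uses; the function name `check_genbank_text` and the excerpt are hypothetical, and the taxon match is case-insensitive on the assumption that some sources write `taxon:` in lowercase.

```python
import re

def check_genbank_text(text):
    """Collect the organism names, taxon IDs, and a CDS-translation flag from Genbank text."""
    # Qualifier values may wrap across lines, so collapse internal whitespace.
    organisms = {" ".join(name.split())
                 for name in re.findall(r'/organism="([^"]+)"', text)}
    taxids = set(re.findall(r'/db_xref="taxon:(\d+)"', text, flags=re.IGNORECASE))
    return {
        "organisms": organisms,
        "taxids": taxids,
        "has_translation": '/translation="' in text,
    }

# Toy excerpt for illustration only; a real file has many more fields.
example = '''
     source          1..300
                     /organism="Escherichia coli K-12"
                     /db_xref="taxon:83333"
     CDS             190..198
                     /translation="MKR"
'''
report = check_genbank_text(example)
```

If `organisms` or `taxids` ends up with more than one entry, the duplicate lines in the file are not identical, which contradicts the assumption stated above.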
ITEP will only work if Biopython can successfully parse your Genbank file. This won't be a problem for most data sources (tested to work with JGI, NCBI, RAST, and PUBSEED Genbank files and the ones generated with our KBase interface).
Raw files are tab-delimited files containing the information needed for our analysis. They are automatically generated from Genbank files and placed in the ${ROOTDIR}/raw folder. In case you're curious, this format is identical to the "spreadsheet (tab delimited)" format offered by RAST in its online interface, so if you want to, you can download raw files directly from there instead of running convertGenbankToTable.py.
The columns of a raw file are as follows:
contig_id feature_id type location start stop strand function aliases figfam evidence_codes nucleotide_sequence aa_sequence
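To make the column order concrete, here is a minimal sketch that reads raw-file rows into dictionaries keyed by the column names above. The helper `read_raw` is hypothetical (it is not part of ITEP), and skipping a header row is an assumption, since the text does not say whether raw files carry one.

```python
import csv
import io

RAW_COLUMNS = [
    "contig_id", "feature_id", "type", "location", "start", "stop", "strand",
    "function", "aliases", "figfam", "evidence_codes",
    "nucleotide_sequence", "aa_sequence",
]

def read_raw(handle):
    """Yield one dict per data row of a tab-delimited raw file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines and a header row, if one is present (assumption).
        if not row or row[0] == "contig_id":
            continue
        yield dict(zip(RAW_COLUMNS, row))

# One made-up row, in the column order listed above.
sample = "\t".join([
    "contig1", "fig|83333.1.peg.1", "peg", "contig1_190_198", "190", "198", "+",
    "hypothetical protein", "", "", "", "ATGAAACGC", "MKR",
]) + "\n"
rows = list(read_raw(io.StringIO(sample)))
```

With a real file, pass an open file handle instead of the `io.StringIO` object, for example `open(path, newline="")`.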
- The feature_id for any protein-encoding gene must have the format fig|#.#.peg.# (e.g. fig|83333.1.peg.1)
- The first two numbers (83333.1) must match the organism ID for the organism containing that gene.
- The overall feature ID must be unique for each gene.
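The three feature ID rules can be checked mechanically. The sketch below is an illustration under the rules as stated here, not ITEP's own validation code; the names `organism_id_of` and `find_bad_feature_ids` are hypothetical.

```python
import re

# fig|<organism ID>.peg.<gene number>, where the organism ID is two numbers joined by a dot.
FEATURE_ID_RE = re.compile(r"fig\|(\d+\.\d+)\.peg\.(\d+)")

def organism_id_of(feature_id):
    """Return the organism ID embedded in a feature ID, or None if the ID is malformed."""
    match = FEATURE_ID_RE.fullmatch(feature_id)
    return match.group(1) if match else None

def find_bad_feature_ids(feature_ids, organism_id):
    """Return IDs that are malformed, belong to another organism, or repeat an earlier ID."""
    seen = set()
    bad = []
    for fid in feature_ids:
        if organism_id_of(fid) != organism_id or fid in seen:
            bad.append(fid)
        seen.add(fid)
    return bad

ids = [
    "fig|83333.1.peg.1",
    "fig|83333.1.peg.1",    # duplicate
    "fig|511145.12.peg.3",  # different organism ID
    "fig|83333.peg.2",      # malformed: organism ID needs two numbers
]
bad = find_bad_feature_ids(ids, "83333.1")
```

Here `bad` holds the last three entries of `ids`; only the first occurrence of `fig|83333.1.peg.1` passes all three rules.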
The Type column should be "peg" for all proteins. Anything that is not a protein is ignored.
The start/stop columns refer to the start/stop of the actual gene on the specified contig (start > stop for - strand genes).
The start/stop values are 1-indexed from the beginning of the contig on which the feature is found.
Strand is + or -
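The coordinate convention is easy to get wrong, so here is a small sketch of how start, stop, and strand relate. The function `gene_span` is hypothetical, and rejecting rows whose coordinate order disagrees with the strand is an assumption; genes that wrap around the origin of a circular contig, for instance, are not handled.

```python
def gene_span(start, stop, strand):
    """Return (leftmost, rightmost, length) for a gene, in 1-indexed contig coordinates."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # + strand genes run left to right; - strand genes have start > stop.
    if (strand == "+" and start > stop) or (strand == "-" and start < stop):
        raise ValueError("start/stop order does not match the strand")
    left, right = min(start, stop), max(start, stop)
    # Both ends are inclusive, hence the + 1.
    return left, right, right - left + 1

forward = gene_span(190, 255, "+")
reverse = gene_span(255, 190, "-")
```

Both calls describe the same 66-base stretch of the contig. To slice it out of a Python string holding the contig sequence, convert to 0-indexed coordinates with `contig[left - 1:right]`.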
Function is the functional annotation.
nucleotide_sequence is the nucleotide sequence encoding the protein and aa_sequence is the translated amino acid sequence.
All other fields (location, aliases, figfam, evidence_codes, ...) are not used for anything by ITEP.
A file called "organisms" is automatically generated from the names of the Genbank files in genbank/ and from the organism field of those Genbank files. It is a two-column table with organism name in the first column and organism ID in the second column.
The organism ID matches the regular expression "\d+\.\d+" (two numbers separated by a period, e.g. 83333.1).
Organism names can have spaces or some special characters, but semicolons and quotes are not allowed. Many functions that output formats that are sensitive to special characters (SVG, Newick) will sanitize the names of organisms and/or their IDs by replacing all non-alphanumeric characters with underscores.
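The rules for the organisms file and the sanitizing step can be sketched as follows. This is not ITEP's code: `parse_organisms` and `sanitize` are hypothetical names, the file is assumed to be tab-delimited, "quotes" is taken to mean both single and double quotes, and the ID pattern is written with the dot escaped so that it matches a literal period.

```python
import re

ORGANISM_ID_RE = re.compile(r"\d+\.\d+")
FORBIDDEN_IN_NAMES = (";", '"', "'")

def parse_organisms(lines):
    """Map organism name -> organism ID from two-column, tab-delimited lines."""
    organisms = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        name, org_id = line.split("\t")
        if any(ch in name for ch in FORBIDDEN_IN_NAMES):
            raise ValueError(f"forbidden character in organism name: {name!r}")
        if not ORGANISM_ID_RE.fullmatch(org_id):
            raise ValueError(f"bad organism ID: {org_id!r}")
        organisms[name] = org_id
    return organisms

def sanitize(text):
    """Replace every non-alphanumeric character with an underscore."""
    return re.sub(r"[^A-Za-z0-9]", "_", text)

organisms = parse_organisms(["Escherichia coli K-12\t83333.1\n"])
safe_name = sanitize("Escherichia coli K-12")
safe_id = sanitize("83333.1")
```

Note that sanitizing is lossy: two names that differ only in punctuation map to the same sanitized string, so sanitized names should not be assumed to be unique keys.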
The groups file is automatically generated with an "all" group containing all organisms in the ITEP database. It is a two-column tab-delimited table; the first column contains the group's name and the second column is a semicolon-delimited list of organisms in that group.
Organism names in the groups file must match the organism names in the organisms file exactly. You are not allowed to have multiple group names for the same group of organisms or to have the same name refer to different groups of organisms.
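The groups file constraints can be checked with a short script along these lines. It is a sketch under the rules as described here, with a hypothetical function name `parse_groups` and made-up organism names; it is not taken from ITEP.

```python
def parse_groups(lines, known_organisms):
    """Map group name -> frozenset of organism names, enforcing the rules above."""
    known = set(known_organisms)
    groups = {}
    name_for_members = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        name, member_field = line.split("\t")
        members = frozenset(m for m in member_field.split(";") if m)
        unknown = members - known
        if unknown:
            raise ValueError(f"organisms not in the organisms file: {sorted(unknown)}")
        if name in groups:
            raise ValueError(f"group name used more than once: {name!r}")
        if members in name_for_members:
            raise ValueError(
                f"{name!r} and {name_for_members[members]!r} list the same organisms")
        groups[name] = members
        name_for_members[members] = name
    return groups

known = {"Organism one", "Organism two"}
groups = parse_groups(
    ["all\tOrganism one;Organism two\n", "subset\tOrganism one\n"],
    known,
)
```

Membership is compared as a set, so two groups that list the same organisms in a different order count as the same group.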
Most of the ITEP scripts output tab-delimited files with various fields (see individual functions for details). The others support specific widely-used file formats:
- Alignments: FASTA
- Trees: Newick
- Graphs: GML
- Images: SVG or PNG