This repository has been archived by the owner on Feb 16, 2019. It is now read-only.

Data format standards

mattb112885 edited this page May 9, 2013 · 9 revisions

Input file formats

Genbank files (required)

You need ONE genbank file for every organism. Concatenate the genbank files for all the contigs (after generating a raw file, if you need one). At a minimum, the following information is taken from the genbank files:

  1. Organism (in the /organism="[organism name]" line)
  2. Tax ID (in a /db_xref="Taxon:[taxid]" line)
  3. Genome sequences

These should be present in all the genbank files from Genbank (ftp.ncbi.nih.gov/genomes/Bacteria). It is assumed that any duplicates of such lines that are present in the Genbank file are all identical. In addition, for convertGenbankToTable.py to work you'll need CDS fields containing the locations of each gene and the translated amino acid sequences.
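If you want a quick sanity check before running anything, the sketch below scans the text of a Genbank file for the organism and taxon lines and confirms that any duplicates agree. It is only an illustration: the function name organism_and_taxid and the regular expressions are assumptions made for this example, not part of ITEP, and the check does not replace parsing the file with Biopython.

```python
import re

# Hypothetical patterns for the two qualifier lines described above.
# The taxon match ignores case because files may write "taxon:" or "Taxon:".
ORGANISM_RE = re.compile(r'/organism="([^"]+)"')
TAXID_RE = re.compile(r'/db_xref="taxon:(\d+)"', re.IGNORECASE)


def organism_and_taxid(genbank_text):
    """Return (organism, taxid) found in the text of one Genbank file.

    Raises ValueError if either line is missing or if duplicates disagree.
    Note: an organism name that wraps onto a second line is not handled.
    """
    organisms = set(ORGANISM_RE.findall(genbank_text))
    taxids = set(TAXID_RE.findall(genbank_text))
    if len(organisms) != 1:
        raise ValueError("expected one organism name, found: %r" % sorted(organisms))
    if len(taxids) != 1:
        raise ValueError("expected one taxon ID, found: %r" % sorted(taxids))
    return organisms.pop(), taxids.pop()
```

Call it on the full contents of the concatenated file, for example organism_and_taxid(open(path).read()), where path is whatever you named your file.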

ITEP will only work if Biopython can successfully parse your Genbank file. This won't be a problem with most data sources (tested to work with JGI, NCBI, RAST, and PUBSEED Genbank files and the ones generated with our KBase interface).
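Concatenating the per-contig Genbank files into one file per organism needs nothing beyond the standard library. The following is a minimal sketch; the function name concatenate_genbank and the file names in the usage note are made up for the example and are not part of ITEP.

```python
def concatenate_genbank(contig_paths, combined_path):
    """Write the contents of each contig file, in order, into one file."""
    with open(combined_path, "w") as out:
        for path in contig_paths:
            with open(path) as src:
                text = src.read()
            out.write(text)
            # Make sure the next record starts on its own line even if
            # this file did not end with a newline.
            if text and not text.endswith("\n"):
                out.write("\n")
```

For example, concatenate_genbank(["contig1.gbk", "contig2.gbk"], "organism.gbk") would produce a single file holding both records.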

Raw file format (automatically generated from Genbank files using convertGenbankToTable.py)

Raw files are tab-delimited files containing the information needed for our analysis. They are automatically generated from Genbank files and placed in the ${ROOTDIR}/raw folder. In case you're curious, this format is identical to the "spreadsheet (tab delimited)" format offered by the RAST online interface, so if you prefer you can download the files directly from RAST instead of running convertGenbankToTable.py.

The columns of a raw file are as follows:

contig_id feature_id type location start stop strand function aliases figfam evidence_codes nucleotide_sequence aa_sequence
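As an illustration of the column layout, here is a small sketch that reads raw-file lines into dictionaries keyed by the column names above. The helper name read_raw_rows is hypothetical, and skipping a leading header line is an assumption (a given raw file may or may not have one).

```python
import csv

# Column names in the order listed above.
RAW_COLUMNS = [
    "contig_id", "feature_id", "type", "location", "start", "stop",
    "strand", "function", "aliases", "figfam", "evidence_codes",
    "nucleotide_sequence", "aa_sequence",
]


def read_raw_rows(lines):
    """Yield one dict per data row from an iterable of tab-delimited lines."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        # Skip blank lines and a header line, if one is present (assumption).
        if not fields or fields[0] == "contig_id":
            continue
        yield dict(zip(RAW_COLUMNS, fields))
```

An open file handle can be passed directly as lines, since iterating over it yields one line at a time.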

  • The feature_id for any protein-encoding gene must have the format:

    fig|#.#.peg.# (e.g. fig|83333.1.peg.1)

  • The first two numbers (83333.1) must match the organism ID for the organism containing that gene.

  • The overall feature ID must be unique for each gene.
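The three feature_id rules above can be checked mechanically. The sketch below is one way to do it; check_feature_ids and the wording of its messages are invented for this example and are not part of ITEP.

```python
import re

# fig|#.#.peg.# : group 1 is the organism ID, group 2 the gene number.
FEATURE_ID_RE = re.compile(r"^fig\|(\d+\.\d+)\.peg\.(\d+)$")


def check_feature_ids(feature_ids, organism_id):
    """Return a list of (feature_id, problem) pairs; empty if all IDs pass."""
    seen = set()
    problems = []
    for fid in feature_ids:
        match = FEATURE_ID_RE.match(fid)
        if match is None:
            problems.append((fid, "bad format"))
        elif match.group(1) != organism_id:
            problems.append((fid, "organism ID mismatch"))
        if fid in seen:
            problems.append((fid, "duplicate"))
        seen.add(fid)
    return problems
```

For the example ID above, check_feature_ids(["fig|83333.1.peg.1"], "83333.1") returns an empty list.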

The type column should be "peg" for all proteins. Anything that is not a protein is ignored.

The start/stop columns refer to the start/stop of the actual gene on the specified contig (start > stop for - strand genes).

The start/stop are 1-indexed from the beginning of the contig on which the feature is found.

Strand is + or -
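To make the coordinate convention concrete, here is a small sketch that checks one row's start, stop, and strand values against each other. The function names are hypothetical, and treating start < stop as the rule for + strand genes is an assumption inferred from the - strand rule above.

```python
def coordinates_consistent(start, stop, strand):
    """Return True if start/stop are 1-indexed and agree with the strand."""
    start, stop = int(start), int(stop)
    if start < 1 or stop < 1:
        return False  # positions are 1-indexed, so 0 or negative is invalid
    if strand == "+":
        return start < stop  # assumption: mirror image of the - strand rule
    if strand == "-":
        return start > stop  # stated above for - strand genes
    return False  # strand must be + or -


def gene_length(start, stop):
    """Length in nucleotides, counting both end positions."""
    return abs(int(stop) - int(start)) + 1
```

Values read from a raw file are strings, which is why both helpers convert with int() first.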

Function is the functional annotation.

nucleotide_sequence is the nucleotide sequence encoding the protein, and aa_sequence is the translated amino acid sequence.

All other fields (location, aliases, figfam, evidence_codes, ...) are not used for anything by ITEP.
