ITEP architecture
This page gives an overview of the files included in the ITEP toolkit and those created when a user's data is fully loaded. It does not describe how to create them; for directions on loading data into ITEP, see here.
ITEP interfaces with SQLite to organize and efficiently extract genomic, homology, clustering and other data. After the user runs the provided database-building scripts, the SQLite database will be placed in the following location:
$root/db/DATABASE.sqlite
All ITEP scripts and library functions are intended to be used after this database has been built. If you know SQL, you can open the database yourself and write your own queries to extract data that the provided interfaces do not expose. Otherwise, file a GitHub issue and the developers will help address your needs.
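As a minimal sketch, you can inspect the database with the sqlite3 command-line shell (the table and column names you will see depend on your build, so the SELECT below uses a placeholder):
$ sqlite3 $root/db/DATABASE.sqlite
sqlite> .tables
sqlite> SELECT * FROM [tablename] LIMIT 5;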
When you load GenBank files into ITEP, an alias file is automatically generated at the following location:
$root/aliases/aliases
The alias file contains two columns: one holds ITEP gene IDs and the other holds aliases (in particular, locus tags and gene names) pulled from the GenBank files. If you want to add your own names, you can do so (after formatting your GenBank files with convertGenbank2Table.py but before running setup_step1.sh). The aliases are automatically added to the annotations for those genes, so you can search for genes by alias.
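For illustration, the file might contain rows like the following (the gene ID and aliases here are hypothetical, and the columns are assumed to be tab-separated):
fig|83333.1.peg.1	thrA
fig|83333.1.peg.1	b0002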
When you run the setup commands (setup_step1.sh through setup_step5.sh) to load the ITEP database, many files are created in new directories; this section describes those directories. The toolkit is set up to avoid re-creating results that already exist (useful, for example, when adding new organisms to an existing database). As a side effect, you must use the standard methods (described TODO) to add and remove organisms in order to avoid consistency problems.
This directory contains all-vs-all BLASTP results (one file for each pair of organisms) in tab-delimited format.
This directory contains all-vs-all BLASTN results (one file for each pair of organisms) in tab-delimited format.
The CDD is automatically downloaded here when setup_step4.sh runs to calculate RPSBLAST results. This directory also contains the translation table used to convert RPSBLAST results to standard cluster names (like pfam####).
This directory will contain clusters in the MCL cluster format. (If you used a clustering method other than MCL but provide files in this format, they will be copied here.)
The MCL cluster format has one row for each cluster:
[gene1] [gene2] [gene3] [gene4] [gene5]
The run ID (which identifies the particular clustering method you used) is assumed to be the same as the name of the file.
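For example, a file named mcl_run_1 (a hypothetical run ID) defining two clusters might look like this, with one tab-separated row per cluster (illustrative gene IDs):
fig|83333.1.peg.1	fig|83334.1.peg.7	fig|83335.1.peg.2
fig|83333.1.peg.5	fig|83334.1.peg.9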
This directory contains the SQLite database as well as the concatenated data files used to build it.
This directory will contain one protein FASTA file for each organism in the database.
This directory contains "flattened" cluster files in this tab-delimited format:
[RunID] [clusterid] [geneid]
If you choose to import clusters from another method in this format, the input files will be copied here. The run ID should match the name of the file.
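For illustration, using the same hypothetical run ID and gene IDs as above:
mcl_run_1	1	fig|83333.1.peg.1
mcl_run_1	1	fig|83334.1.peg.7
mcl_run_1	2	fig|83333.1.peg.5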
This directory will contain one nucleotide (transcript) FASTA file for each organism in the database.
This directory will contain RPSBLAST results against NCBI's CDD database (one file per organism).
An "organisms" file is automatically generated in the root ITEP directory. It contains two columns: the organism's name and the ITEP organism ID for that organism.
ITEP includes many command line scripts in the src/ directory. This section describes their architecture.
Most of these scripts accept input from standard input (stdin) and print their results to standard output (stdout); the main exceptions are scripts intended to sit at the beginning or end of a workflow. For the most part, scripts whose names start with "db_" require access to the SQLite database, while those that do not can be run without it.
If you have a file with data you wish to feed to one of these scripts, you can do so either with the cat command:
$ cat [infile] | [command]
or with the "<" redirection operator:
$ [command] < [infile]
The output of any command that normally prints results to the screen can be redirected to a file instead using the ">" operator (or ">>" to append to an existing file):
$ [command] > [outfile]
The output of one command can be fed as the input to another using pipes (as shown in the cat example above). In general this is done with commands like this:
$ [command1] | [command2] > [results_file]
This is the same as doing the following (which you might want to do in some cases, for example to keep pipelines from getting too long or to inspect intermediate results for issues):
$ [command1] > [intermediate_file]
$ cat [intermediate_file] | [command2] > [results_file]
An arbitrary number of commands can be chained together in either of these ways (up to UNIX command-length limits).
If you just have one item (e.g. a single gene ID) you can pipe it into a command using "echo":
$ echo "[geneid]" | [command]
Make sure you enclose any IDs provided as direct arguments (e.g. to echo) in quotes: if an ID contains a pipe character, the shell will otherwise interpret the pipe as separating two nonexistent commands and you will get an error. You do not need quotes around IDs that are stored in a file.
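For example, with an illustrative gene ID that contains a pipe:
$ echo "fig|83333.1.peg.1" | [command]
Without the quotes, the shell would split the ID at its pipe and try to run the text after it as a command, which fails with a "command not found" error.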
Often you will want to filter results before piping them to another command. UNIX provides many utilities for manipulating text files, such as "sort" (with -u to remove duplicate rows), "cut", and "grep", which come in handy in many situations. See the manual pages for those utilities for details on how to use them.
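As a sketch (the db_ script here is a placeholder for a real ITEP command, and the column numbers and pattern are arbitrary):
$ cat [infile] | [db_command] | cut -f 1,3 | sort -u | grep "[pattern]" > [outfile]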
ITEP also includes a number of utilities in src/utilities that help with setting up pipelines. These include a script (makeTabDelimitedRow.py) for creating a tab-delimited row from a set of strings, a script (transposeFile.py) for transposing a file, and a script (joinColumns.py) for performing inner and outer joins on tab-delimited files on the fly (it is more memory-intensive than the UNIX join command but also less picky).
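For example, one might combine them like this (the exact arguments each script accepts are an assumption here; check each script's built-in help before use):
$ makeTabDelimitedRow.py "[string1]" "[string2]" > [rowfile]
$ transposeFile.py [tablefile] > [transposedfile]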
It is straightforward to write a bash script that captures all of the command-line steps used in a particular analysis: copy the commands you ran into a text file (with a .sh extension), and running that file again (with appropriate input arguments) will reproduce the results you obtained in the interactive shell.
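A minimal sketch (every bracketed command and file name below is a placeholder):
#!/bin/bash
# my_analysis.sh - re-run an analysis end to end
# (assumes it is run from the ITEP root, where SourceMe.sh lives)
source SourceMe.sh
echo "[geneid]" | [command1] | [command2] > [results_file]
Running it with bash my_analysis.sh regenerates [results_file].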
The scripts/ directory contains several examples of scripts that can be used to help you build your own pipelines.
ITEP contains several Python libraries that provide programmatic access to the data in the SQLite database. The PYTHONPATH variable is set appropriately when you source the SourceMe.sh file in the root of the repository:
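$ source SourceMe.sh
After that, all it takes to get access to all the functions in a particular library is an import in Python: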
>>> from [packagename] import *