This is a bash based pipeline wrapping different steps in PacBio Iso-Seq analysis in SGE. It takes RS II cells (i.e. A01_1) as input and generates HTML reports ready to be presented, bigWig/bigBed files ready to be visualized in genome browsers. It was developped internally for PacBio experimentalists to use for RS II data (not for Sequel). It is made available in the hope that it will be useful, but without any warranty. If you have found any issue, please file an issue on github instead of a pacbio custom support ticket.
This pipeline parses a sample csv file and generates a batch of inter-dependent SGE jobs. Smrtportal modules Consensus
and Classify
run first. Barcoded libraries are splitted based on the barcode been identified on the classify
stage. Full-length, non-chimeric
reads generated by Classify
are used to evaluate the RNA integrity, the precision of the 5' and 3' ends, and to quantify mRNAs from the annotation. The outputs are organized into HTML and pdf files. In the meanwhile, Cluster
job runs to generate consensus sequences for each distinct mRNA, different classes of transcripts (i.e., matched, novel, truncated) are identified using cuffcompare
and presented as bigBed
files, which can be loaded into genome browser and different classes are shown in different colors.
For prokaryotic genomes, gmap
alignment is replaced by bwa mem
and Cluster
module does not run.
The following software/tools are needed
- Specify the home of smrtanalysis in the
config/config.sh
file, such as
declare -x SMRT_HOME='/pbi/analysis/smrtportal/beta/install/smrtanalysis_2.3.0.140936'
- Specify the job submission command in the
config/config.sh
file, such as
declare -x SUBMIT_CMD='qsub -V -cwd -pe smp 8 -l h_rt=240:0:0,mem_free=8G'
# explanation of the parameters used here:
# -V: load enviromental variable
# -cwd: use current working directory
# -pe smp 8: use smp 8 as the parallel enviroment; use `qconf -spl` to list all parallel enviroment
# -l: limitations, consult your SGE admin; a typical job runs overnight
- Currently only SGE is supported
- Edit
config/config.sh
to specify the directory related enviromental variables, including
# R libraries will be installed in this directory if not already exist
declare -x RLIBDIR='/home/bhan/Rlib'
# the directory storing genome sequence, index, and annotation files
declare -x ANNOTATION_DIR='/home/bhan/annotation'
- It is likely that each user needs to have write permission in those directories.
- Packages used:
biopython
# to install
pip3 install biopython
virtualenv
is highly recommended and the commandsource /path/to/pyenv/bin/activate
can be put inconfig/config.sh
- Relatively new version; missing packages will be installed automatically during the pipeline run.
- But it is highly recommended to install them beforehand.
# to install used packages
install.packages(c("readr", "dplyr", "tidyr", "rmarkdown", "corrplot", "ggplot2", "ggthemes"))
- If the user desires to use the
annotation
pipeline to prepare for annotation files automatically, MySQL is required to obtain the genome sequence, transcriptome annotation files (in GTF and refFlat formats) from UCSC.
- Please follow their own instructions to build/install
- They need to be made available in the
$PATH
mrna_size_from_gff and colmerge
- Edit
config/config.sh
- Prepare genome sequence, annotation files using the annotation pipeline
isoseq.sh anno -g hg19
isoseq.sh anno -g hg38
isoseq.sh anno -g mm10
isoseq.sh anno -g dm3
# Note: the pipeline submits jobs to build gmap index, so please wait
# until those jobs finish before running analysis!
- Prepare the eukaryotic annotation files manually. For a genome called
foo
, the user needs to place its genomic sequence file and transcriptome annotation file in the annotation directory specified by theconfig.sh
file and name themfoo.fa
andfoo.genes.gtf
respectively. - Prepare the prokaryotic annotation file manually. For a genome called
bar
, the user needs to place its genomic sequence file and transcriptome annotation file in the annotation directory specified by theconfig.sh
file and name thembar.fa
andbar.genes.gtf
respectively. RefFlat.txt
file will be generated from thegtf
file if missing.
# copy example.csv
cp pacbio_isoseq_pipeline/sample/example.csv my_isoseq_sample.csv
# edit my_isoseq_sample.csv by filling it in your sample
vi my_isoseq_sample.csv
- The user will need to fill the CSV file which specify the information needed, including:
1. genome
-
The name of the genome whose annotation has been obtained using the
annotation
pipeline. -
Specifically, the genome fasta and transcriptome genes.gtf files need to be presented in the
${ANNOTATION_DIR}
folder, which is defined inconfig/config.sh
. -
The "annotation" pipeline
isoseq.sh anno
prepares those files for popular genome assemblies using UCSC srouce.
2. tissue
- Libraries of different tissues are compared separately.
3. treatment
- Libraries of different treatment are compared directly.
4. size bin
- The size bin used when the libraries are generated, like "1-3kb".
- Only libraries in the same size bins are compared directly.
- Make sure to put "kb" as suffix (especially when you are doing this in Microsoft Excel).
5. Path to the RS II cells
- The complete, ABSOLUTE path to the RS II cells (usually named as [A-Z]0[12]_1).
- At least three
bax.h5
files are expected in theAnalysis_Results
subdirectory. - There can only be input one path, thus for samples with more than one smrtcells, the user will need to create a dummy folder with symbol links to all the
.h5
files in different smrtcells; the pipeline has provided a module for doing this:
isoseq.sh gather -o /path/to/dummy_cell -i /path/to/cell1 -i /path/to/cell2 -i /path/to/cell3 ...
# then the user should put "/path/to/dummy_cell" in the sample.csv file
6. Path to the custom (barcode) sequence file
- The absolute path to a fasta file with custom barcode sequence.
- A example of barcode fasta file looks like:
>F0
GCAGTCGAACATGTAGCTGACTCAGGTCACTCAGACGATGCGTCATGGTAGTAAGCAGTGGTATCAACGCAGAGTAC
>R0
GTACTCTGCGTTGATACCACTGCTTACTACCATGACGCATCGTCTGAGTGACCTGAGTCAGCTACATGTTCGACTGC
>F1
GCAGTCGAACATGTAGCTGACTCAGGTCACCTGCGTGCTCTACGACGGTAGTAAGCAGTGGTATCAACGCAGAGTAC
>R1
GTACTCTGCGTTGATACCACTGCTTACTACCGTCGTAGAGCACGCAGGTGACCTGAGTCAGCTACATGTTCGACTGC
>F2
GCAGTCGAACATGTAGCTGACTCAGGTCACCATAGCGACTATCGTGGGTAGTAAGCAGTGGTATCAACGCAGAGTAC
>R2
GTACTCTGCGTTGATACCACTGCTTACTACCCACGATAGTCGCTATGGTGACCTGAGTCAGCTACATGTTCGACTGC
- Forward and reverse primer are intervened.
- If default primers are used (no barcode), fill
NA
. - If custom primers without barcodes are used, put its absolute path here and use
NA
in the next field. - If a mixture of barcodes are used, also put its absolute path here and use
NA
in the next field.
7. Which barcode is used
- The position of the barcode FR pairs in the custom barcode fasta file. Start counting from 0.
- It has nothing to do with the names of any PacBio barcoding products.
- If no barcode is used, fill
NA
. - Use underscore
_
to separate words. Avoid using special characters including but not limited to!@#$%.*-,<>?/|\
;
It is recommended to start from the template in the sample/example.csv
file. Do NOT change the header.
- Run the pipeline for eukaryotic species (with introns)
isoseq.sh all -c my_isoseq_sample.csv \
-J myjobname -E my@emailaddress.com \
-F ftp.pacificbiosciences.com -U usrname1 -P mypassword
## options explain
-J: name of the job
-E: email to be notified when the job is done
-F, -U, -P: the ftp address, username, and password. If provided, the pipeline
will upload the bam/bigBed/bigWig files to the ftp and generate custom tracks
that can be loaded into the UCSC genome browser. The password will be explicitly
stored in the track content and visible in the command history or things like top,
so use it with causion.
- Run the pipeline for prokaryotic species (without introns)
isoseq.sh pro -c my_isoseq_sample.csv \
-J myjobname -E my@emailaddress.com \
-F ftp.pacificbiosciences.com -U usrname1 -P mypassword
annotation
- annotation files generated during the run will be stored here
- the reason to store them inside of the run folder instead of the installation directory is to avoid racing between different jobs
bam
- genome alignment
bam
files. Already sorted and indexed and can be loaded into genome browser directly
bed
- genome alignment
bed
andbigBed
files. gffcompare.combined.chained.code.bb
ThisbigBed
file stored the "assembled transcriptome" from your IsoSeq run. Different transcripts are colored according to its relationship with the provided annotation, judged bygffcompare/cuffcompare
It can be directly loaded into UCSC genome browser using custom track or hub.
Such as
track name="My IsoSeq Transcripts" type=bigBed itemRgb="On" visibility=full bigDataUrl="ftp://.../gffcomppare.combined.chained.code.bb"
- The type of the transcripts, defined by
gffcompare/cuffcompare
is prepended to the transcripts names. The complete code can be found here. - Different types of transcripts are colored differently. For example, transcripts matching the intron chain with one of the annotated mRNAs (type
=
) are colored black, while novel transcripts (typej
) are colored red.
bigWig
- bigWig files for genome browser are stored in this folder
- They can also be directly loaded into UCSC genome browser such as
track name="0-2kb.+" type=bigWig visibility=full color=0,0,255 maxHeightPixels=25:25:8 bigDataUrl=ftp://...Forward.bw
track name="0-2kb.-" type=bigWig visibility=full color=255,0,0 maxHeightPixels=25:25:8 bigDataUrl=ftp://...Reverse.bw
- Note that Watson and Crick strands are separated into two files
- And Crick strand signals are all negated so that the signal goes down
fasta
...0-2kb.CCS.fa
,...2-6kb.CCS.fa
: theconsensus
sequences, with smrtbells removed, still with Iso-Seq adaptors on them....0-2kb.isoseq_flnc.fasta
: the full-length, non-chimeric reads, in fasta format. The is the final output you will get when runningclassify
job on smrtportal...0-2kb.isoseq_nfl.fasta
: the non-full-length reads in fasta format...0-2kb.isoseq_flnc.trima.fa
: the full-length, non-chimeric reads with more robust polyA trimming, which has not been incorporated into smrtanalysis...0-2kb.final.consensus.fa
: the output of isoseqcluster
output, beforeQuiver
polishing...0-2kb.all_quivered_hq.100_30_0.99.fasta
: the output of quiver-polished consensus- Find more detailed explanations here;
fofn
- all file of file names
gff
...isoseq_flnc.trima.hg19lnc.gtf
: GTF file storing the alignment offlnc
reads to the genomegffcompare.combined.gtf
: the output ofgffcompare
, which combinesflnc
alignments together for quantification purpose
html
- Where most reports are stored.
gene_abundance.html
andmRNA_abundance.html
: two HTML files to generate interactive scatter-plot. This youtube video explains how to setup the plot- The final report:
setup
smrtpotal report
The images generated by smrtportal. Each Cell will have three images representing itsread length distribution
,Number of passes
, andQuality
.
All barcoded libraries in the same cell are plotted together.
length distribution of full length non-chimeric reads
Transcription Start/End Site (TSS/TES)
The distribution of the abundances of each gene
Scatter plot comparing two different tissues
Correlation matrix between different samples
jobout
- the raw output of each job
jobs
- the job script submitted to SGE
log
- the log files for each SGE job
pdf
- the pdf version of part of the report
table
...isoseq_flnc.trima.sizes
: the size of eachfull-length non-chimeric reads
...isoseq_flnc.trima.coverage.tsv
: the scaled, coverage plot for RNA integrity, calculated usingpiccard
...isoseq_flnc.gene.count
: the count of each gene, usingfull-length non-chimeric reads
by utilizing thetmap
output ofcuffcompare
. Note that all categories (=
,c
,j
, et al) are considered for gene counting...isoseq_flnc.mRNA.count
: the count of each gene, usingfull-length non-chimeric reads
by utilizing therefmap
output ofcuffcompare
. Note that only=
andc
reads are used for mRNA. If one read matches n mRNAs in the annotation, each mRNA gets1/n
counts from this read.gene.counts.tsv
mRNA.counts.tsv
coverage.tsv
mRNA_gene.sizes.tsv
: information with all libraries
track
- The content in those files can be copy-pasted to the custom track of UCSC genome browser directly (make sure the correct genome is selected)
alignments.UCSC.genome_browser.tracks
stores the paths to thebam
files on ftp (for flnc read alignments)signal.UCSC.genome_browser.tracks
stores the paths to thebigWig
files on ftp (for genome-oriented signal of flnc reads)assembly.UCSC.genome_browser.tracks
stores the paths to thebigBed
files on ftp (for transcripts assembled in thecluster
pipeline). This file doesn't exist for prokaryotic genomes sincecluster
module is not run.
THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.