Skip to content

huyuem/PacificBiosciences

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

51 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PacBio Iso-Seq Pipeline Wrapper

Introduction

This is a bash based pipeline wrapping different steps in PacBio Iso-Seq analysis in SGE. It takes RS II cells (i.e. A01_1) as input and generates HTML reports ready to be presented, bigWig/bigBed files ready to be visualized in genome browsers. It was developped internally for PacBio experimentalists to use for RS II data (not for Sequel). It is made available in the hope that it will be useful, but without any warranty. If you have found any issue, please file an issue on github instead of a pacbio custom support ticket.

Flow

This pipeline parses a sample csv file and generates a batch of inter-dependent SGE jobs. Smrtportal modules Consensus and Classify run first. Barcoded libraries are splitted based on the barcode been identified on the classify stage. Full-length, non-chimeric reads generated by Classify are used to evaluate the RNA integrity, the precision of the 5' and 3' ends, and to quantify mRNAs from the annotation. The outputs are organized into HTML and pdf files. In the meanwhile, Cluster job runs to generate consensus sequences for each distinct mRNA, different classes of transcripts (i.e., matched, novel, truncated) are identified using cuffcompare and presented as bigBed files, which can be loaded into genome browser and different classes are shown in different colors.
For prokaryotic genomes, gmap alignment is replaced by bwa mem and Cluster module does not run.

Install

The following software/tools are needed

Linux enviroment with smrtanalysis_2.3.0.140936 installed

  • Specify the home of smrtanalysis in the config/config.sh file, such as
declare -x SMRT_HOME='/pbi/analysis/smrtportal/beta/install/smrtanalysis_2.3.0.140936'

SGE job submission envirment

  • Specify the job submission command in the config/config.sh file, such as
declare -x SUBMIT_CMD='qsub -V -cwd -pe smp 8 -l h_rt=240:0:0,mem_free=8G'
# explanation of the parameters used here:
#   -V: load enviromental variable
#   -cwd: use current working directory 
#   -pe smp 8: use smp 8 as the parallel enviroment; use `qconf -spl` to list all parallel enviroment
#   -l: limitations, consult your SGE admin; a typical job runs overnight 
  • Currently only SGE is supported

User specific directories set-up

  • Edit config/config.sh to specify the directory related enviromental variables, including
# R libraries will be installed in this directory if not already exist
declare -x RLIBDIR='/home/bhan/Rlib'
# the directory storing genome sequence, index, and annotation files
declare -x ANNOTATION_DIR='/home/bhan/annotation'
  • It is likely that each user needs to have write permission in those directories.

Python3

  • Packages used: biopython
# to install
pip3 install biopython
  • virtualenv is highly recommended and the command source /path/to/pyenv/bin/activate can be put in config/config.sh

R

  • Relatively new version; missing packages will be installed automatically during the pipeline run.
  • But it is highly recommended to install them beforehand.
# to install used packages
install.packages(c("readr", "dplyr", "tidyr", "rmarkdown", "corrplot", "ggplot2", "ggthemes"))

MySQL

  • If the user desires to use the annotation pipeline to prepare for annotation files automatically, MySQL is required to obtain the genome sequence, transcriptome annotation files (in GTF and refFlat formats) from UCSC.

Other tools

  • Please follow their own instructions to build/install
  • They need to be made available in the $PATH

             trim_isoseq_polyA

             mrna_size_from_gff and colmerge

             gff2refFlat

             deepTools

             picard

             kent tools

             gmap and gmap_build

Usage

1. Setup the enviroment

  • Edit config/config.sh

2. Prepare the genomic annotation

  • Prepare genome sequence, annotation files using the annotation pipeline
isoseq.sh anno -g hg19
isoseq.sh anno -g hg38
isoseq.sh anno -g mm10
isoseq.sh anno -g dm3
# Note: the pipeline submits jobs to build gmap index, so please wait 
# until those jobs finish before running analysis!
  • Prepare the eukaryotic annotation files manually. For a genome called foo, the user needs to place its genomic sequence file and transcriptome annotation file in the annotation directory specified by the config.sh file and name them foo.fa and foo.genes.gtf respectively.
  • Prepare the prokaryotic annotation file manually. For a genome called bar, the user needs to place its genomic sequence file and transcriptome annotation file in the annotation directory specified by the config.sh file and name them bar.fa and bar.genes.gtf respectively.
  • RefFlat.txt file will be generated from the gtf file if missing.

3. Prepare the sample information

# copy example.csv 
cp pacbio_isoseq_pipeline/sample/example.csv my_isoseq_sample.csv
# edit my_isoseq_sample.csv by filling it in your sample
vi my_isoseq_sample.csv
  • The user will need to fill the CSV file which specify the information needed, including:

  1. genome

  • The name of the genome whose annotation has been obtained using the annotation pipeline.

  • Specifically, the genome fasta and transcriptome genes.gtf files need to be presented in the ${ANNOTATION_DIR} folder, which is defined in config/config.sh.

  • The "annotation" pipeline isoseq.sh anno prepares those files for popular genome assemblies using UCSC srouce.

  2. tissue

  • Libraries of different tissues are compared separately.

  3. treatment

  • Libraries of different treatment are compared directly.

  4. size bin

  • The size bin used when the libraries are generated, like "1-3kb".
  • Only libraries in the same size bins are compared directly.
  • Make sure to put "kb" as suffix (especially when you are doing this in Microsoft Excel).

  5. Path to the RS II cells

  • The complete, ABSOLUTE path to the RS II cells (usually named as [A-Z]0[12]_1).
  • At least three bax.h5 files are expected in the Analysis_Results subdirectory.
  • There can only be input one path, thus for samples with more than one smrtcells, the user will need to create a dummy folder with symbol links to all the .h5 files in different smrtcells; the pipeline has provided a module for doing this:
    isoseq.sh gather -o /path/to/dummy_cell -i /path/to/cell1 -i /path/to/cell2 -i /path/to/cell3 ...
    # then the user should put "/path/to/dummy_cell" in the sample.csv file

  6. Path to the custom (barcode) sequence file

  • The absolute path to a fasta file with custom barcode sequence.
  • A example of barcode fasta file looks like:
>F0
GCAGTCGAACATGTAGCTGACTCAGGTCACTCAGACGATGCGTCATGGTAGTAAGCAGTGGTATCAACGCAGAGTAC
>R0
GTACTCTGCGTTGATACCACTGCTTACTACCATGACGCATCGTCTGAGTGACCTGAGTCAGCTACATGTTCGACTGC
>F1
GCAGTCGAACATGTAGCTGACTCAGGTCACCTGCGTGCTCTACGACGGTAGTAAGCAGTGGTATCAACGCAGAGTAC
>R1
GTACTCTGCGTTGATACCACTGCTTACTACCGTCGTAGAGCACGCAGGTGACCTGAGTCAGCTACATGTTCGACTGC
>F2
GCAGTCGAACATGTAGCTGACTCAGGTCACCATAGCGACTATCGTGGGTAGTAAGCAGTGGTATCAACGCAGAGTAC
>R2
GTACTCTGCGTTGATACCACTGCTTACTACCCACGATAGTCGCTATGGTGACCTGAGTCAGCTACATGTTCGACTGC
  • Forward and reverse primer are intervened.
  • If default primers are used (no barcode), fill NA.
  • If custom primers without barcodes are used, put its absolute path here and use NA in the next field.
  • If a mixture of barcodes are used, also put its absolute path here and use NA in the next field.

  7. Which barcode is used

  • The position of the barcode FR pairs in the custom barcode fasta file. Start counting from 0.
  • It has nothing to do with the names of any PacBio barcoding products.
  • If no barcode is used, fill NA.
  • Use underscore _ to separate words. Avoid using special characters including but not limited to !@#$%.*-,<>?/|\;

It is recommended to start from the template in the sample/example.csv file. Do NOT change the header.

To run the pipeline

  • Run the pipeline for eukaryotic species (with introns)
isoseq.sh all -c my_isoseq_sample.csv \
    -J myjobname -E my@emailaddress.com \
    -F ftp.pacificbiosciences.com -U usrname1 -P mypassword
## options explain
-J: name of the job 
-E: email to be notified when the job is done
-F, -U, -P: the ftp address, username, and password. If provided, the pipeline
will upload the bam/bigBed/bigWig files to the ftp and generate custom tracks 
that can be loaded into the UCSC genome browser. The password will be explicitly 
stored in the track content and visible in the command history or things like top,
so use it with causion. 
  • Run the pipeline for prokaryotic species (without introns)
isoseq.sh pro -c my_isoseq_sample.csv \
    -J myjobname -E my@emailaddress.com \
    -F ftp.pacificbiosciences.com -U usrname1 -P mypassword

Output Explanation

annotation

  • annotation files generated during the run will be stored here
  • the reason to store them inside of the run folder instead of the installation directory is to avoid racing between different jobs

bam

  • genome alignment bam files. Already sorted and indexed and can be loaded into genome browser directly

bed

  • genome alignment bed and bigBed files.
  • gffcompare.combined.chained.code.bb This bigBed file stored the "assembled transcriptome" from your IsoSeq run. Different transcripts are colored according to its relationship with the provided annotation, judged by gffcompare/cuffcompare
    It can be directly loaded into UCSC genome browser using custom track or hub.
    Such as
track name="My IsoSeq Transcripts" type=bigBed itemRgb="On" visibility=full bigDataUrl="ftp://.../gffcomppare.combined.chained.code.bb"

And it looks like:

  • The type of the transcripts, defined by gffcompare/cuffcompare is prepended to the transcripts names. The complete code can be found here.
  • Different types of transcripts are colored differently. For example, transcripts matching the intron chain with one of the annotated mRNAs (type =) are colored black, while novel transcripts (type j) are colored red.

bigWig

  • bigWig files for genome browser are stored in this folder
  • They can also be directly loaded into UCSC genome browser such as
track name="0-2kb.+" type=bigWig visibility=full color=0,0,255 maxHeightPixels=25:25:8 bigDataUrl=ftp://...Forward.bw
track name="0-2kb.-" type=bigWig visibility=full color=255,0,0 maxHeightPixels=25:25:8 bigDataUrl=ftp://...Reverse.bw
  • Note that Watson and Crick strands are separated into two files
  • And Crick strand signals are all negated so that the signal goes down

fasta

  • ...0-2kb.CCS.fa, ...2-6kb.CCS.fa: the consensus sequences, with smrtbells removed, still with Iso-Seq adaptors on them.
  • ...0-2kb.isoseq_flnc.fasta: the full-length, non-chimeric reads, in fasta format. The is the final output you will get when running classify job on smrtportal
  • ...0-2kb.isoseq_nfl.fasta: the non-full-length reads in fasta format
  • ...0-2kb.isoseq_flnc.trima.fa: the full-length, non-chimeric reads with more robust polyA trimming, which has not been incorporated into smrtanalysis
  • ...0-2kb.final.consensus.fa: the output of isoseq cluster output, before Quiver polishing
  • ...0-2kb.all_quivered_hq.100_30_0.99.fasta: the output of quiver-polished consensus
  • Find more detailed explanations here;

fofn

  • all file of file names

gff

  • ...isoseq_flnc.trima.hg19lnc.gtf: GTF file storing the alignment of flnc reads to the genome
  • gffcompare.combined.gtf: the output of gffcompare, which combines flnc alignments together for quantification purpose

html

  • Where most reports are stored.
  • gene_abundance.html and mRNA_abundance.html: two HTML files to generate interactive scatter-plot. This youtube video explains how to setup the plot
  • The final report:
    setup

    smrtpotal report
    The images generated by smrtportal. Each Cell will have three images representing its read length distribution, Number of passes, and Quality.
    All barcoded libraries in the same cell are plotted together.

length distribution of full length non-chimeric reads

RNA coverage

Transcription Start/End Site (TSS/TES)

The distribution of the abundances of each gene


Scatter plot comparing two different tissues

  • for a more interactive scatter plot, check gene_abundance.html and mRNA_abundance.html.

Correlation matrix between different samples

jobout

  • the raw output of each job

jobs

  • the job script submitted to SGE

log

  • the log files for each SGE job

pdf

  • the pdf version of part of the report

table

  • ...isoseq_flnc.trima.sizes: the size of each full-length non-chimeric reads
  • ...isoseq_flnc.trima.coverage.tsv: the scaled, coverage plot for RNA integrity, calculated using piccard
  • ...isoseq_flnc.gene.count: the count of each gene, using full-length non-chimeric reads by utilizing the tmap output of cuffcompare. Note that all categories (=, c, j, et al) are considered for gene counting
  • ...isoseq_flnc.mRNA.count: the count of each gene, using full-length non-chimeric reads by utilizing the refmap output of cuffcompare. Note that only = and c reads are used for mRNA. If one read matches n mRNAs in the annotation, each mRNA gets 1/n counts from this read.
  • gene.counts.tsv mRNA.counts.tsv coverage.tsv mRNA_gene.sizes.tsv: information with all libraries

track

  • The content in those files can be copy-pasted to the custom track of UCSC genome browser directly (make sure the correct genome is selected)
  • alignments.UCSC.genome_browser.tracks stores the paths to the bam files on ftp (for flnc read alignments)
  • signal.UCSC.genome_browser.tracks stores the paths to the bigWig files on ftp (for genome-oriented signal of flnc reads)
  • assembly.UCSC.genome_browser.tracks stores the paths to the bigBed files on ftp (for transcripts assembled in the cluster pipeline). This file doesn't exist for prokaryotic genomes since cluster module is not run.

DISCLAIMER

THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.

About

No description or website provided.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published