Skip to content

Commit

Permalink
CoRE-ATAC documentation and prediction script.
Browse files Browse the repository at this point in the history
  • Loading branch information
ajt986 committed May 18, 2020
1 parent 1cc1933 commit 7642f34
Show file tree
Hide file tree
Showing 8 changed files with 85 additions and 62 deletions.
4 changes: 2 additions & 2 deletions .classpath
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@
<attribute name="owner.project.facets" value="java"/>
</attributes>
</classpathentry>
<classpathentry kind="lib" path="/Users/athib/Desktop/DeepLearning_ATAC/DataExtraction/htsjdk-2.18.1.jar"/>
<classpathentry kind="lib" path="/Users/athib/Downloads/commons-math3-3.6.1/commons-math3-3.6.1.jar"/>
<classpathentry kind="lib" path="/Users/athib/Desktop/scATAC/htsjdk/htsjdk-2.21.2-SNAPSHOT.jar"/>
<classpathentry kind="lib" path="/Users/athib/Documents/commons-math3-3.6.1/commons-math3-3.6.1.jar"/>
<classpathentry kind="output" path="bin"/>
</classpath>
Binary file added CoRE-ATAC_Manual.docx
Binary file not shown.
Binary file added CoRE-ATAC_Manual.pdf
Binary file not shown.
65 changes: 65 additions & 0 deletions CoREATAC_PredictionTool.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,65 @@
from __future__ import print_function
import tensorflow as tf
import keras
from tensorflow.keras.models import load_model
from keras import backend as K
from keras.layers import Input
import numpy as np
import subprocess
from tensorloader import TensorLoader as tl
import partitioning_util as part
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
from sklearn import preprocessing
from sklearn.metrics import accuracy_score, roc_curve, auc, precision_recall_curve,average_precision_score, confusion_matrix
import pandas as pd
from sklearn import impute
import argparse
import os
import time


#Step 0: Process arguments
parser = argparse.ArgumentParser(description='CoRE-ATAC Prediction Tool')
parser.add_argument("datadirectory")
parser.add_argument("model")
parser.add_argument("outputfile")
args = parser.parse_args()

datadirectory = args.datadirectory
model = args.model
outputfile = args.outputfile

def predict(datadirectory, model, outputfile):
basename = os.path.basename(os.path.normpath(datadirectory))
model = load_model(model)

featurefile = "./PEAS/features.txt"
labelencoder = "./PEAS/labelencoder.txt"

#Step 1: Load the data
start_time = time.time()
seqdata,sigdata,annot,summitpeaks,peaks = tl.readTensors(basename, datadirectory, 600, sequence=True, signal=True)
peasfeatures = tl.getPEASFeatures(datadirectory+"/peak_features/"+basename+"_features.txt", featurefile, labelencoder, peaks)
num_classes = 4
peasfeatures = np.expand_dims(peasfeatures, axis=2)
sigseqdata = tl.getSeqSigTensor(seqdata, sigdata)

print("--- Data loaded in %s seconds ---" % (time.time() - start_time))

x_test_sigseq = sigseqdata
x_test_sigseq = np.moveaxis(x_test_sigseq, 1, -1) #Originally had channels first, but CPU tensorflow requires channels last
x_test_peas = peasfeatures

#Step 2: Make predictions
start_time = time.time()
sig_predictions, peas_predictions, predictions = model.predict([x_test_sigseq, x_test_peas])

print("--- Data predicted in %s seconds ---" % (time.time() - start_time))

#Write the output file:
columns = ["Chr", "Start", "End", "Promoter Probability", "Enhancer Probability", "Insulator Probability", "Other Probability"]
pd.DataFrame(np.concatenate((peaks, predictions), axis=1), columns=columns).to_csv(outputfile, header=None, index=None, sep="\t")


predict(datadirectory, model, outputfile)
5 changes: 3 additions & 2 deletions DeepLearningPEAS.sh
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@ cd peak_features

jarpath="${args[9]}/"

keepdups="${args[10]}"

##Start with peaks and extract features.
##Sort BAM #TODO add option to skip this step if already sorted
Expand Down Expand Up @@ -58,7 +59,7 @@ annotatePeaks.pl "${inputpeaks}" "${homerref}" -m "${ctcfmotifs}" -nmotifs > ${p

#Get the insert size threshold to remove outlier inserts
echo "--- Getting insert size threshold. ---"
java -jar "${jarpath}PEASTools.jar" insertsizethresh "${prefix}_sorted.bam" "${outDir}/peak_features" --keepduplicates #!!!TODO!!!#
java -jar "${jarpath}PEASTools.jar" insertsizethresh "${prefix}_sorted.bam" "${outDir}/peak_features" ${keepdups}
thresh=$(cat "thresh.txt")


Expand All @@ -67,7 +68,7 @@ echo "--- Getting insert features. ---"
for i in {1..22}
do
chr=chr$i
java -jar "${jarpath}PEASTools.jar" insertmetrics "${chr}" "${chr}.bam" "${inputpeaks}" "${prefix}_${chr}_insertmetrics.txt" "$thresh" --keepduplicates #!!!TODO!!!#
java -jar "${jarpath}PEASTools.jar" insertmetrics "${chr}" "${chr}.bam" "${inputpeaks}" "${prefix}_${chr}_insertmetrics.txt" "$thresh" ${keepdups}
rm ${chr}.bam
cat ${prefix}_${chr}_insertmetrics.txt >> ${prefix}_insertmetrics.txt
rm "${prefix}_${chr}_insertmetrics.txt"
Expand Down
19 changes: 13 additions & 6 deletions FeatureExtractor.sh
Original file line number Diff line number Diff line change
Expand Up @@ -11,9 +11,16 @@ done
bamfile="${args[0]}"
peakfile="${args[1]}"
outdir="${args[2]}"
fasta="${args[3]}"
sepfasta="${args[4]}"
path="${args[5]}/"
refgenome="${args[3]}"
fasta="${args[4]}"
sepfasta="${args[5]}"
path="${args[6]}/"

keepdups=""
if ${args[7]} == "TRUE"
then
keepdups="--keepduplicates"
fi

prefix=$(basename ${bamfile} | sed 's/.bam//g')
inDir=$(dirname ${bamfile})
Expand All @@ -22,7 +29,7 @@ reformattedpeaks=${outdir}${prefix}_peaks.txt

INSERTTHRESH=900
CHARPERLINE=50
HOMERREF="hg19"
HOMERREF=${refgenome}
HOMERMOTIFS=${path}PEAS/humantop_Nov2016_HOMER.motifs
CONSERVATION=${path}PEAS/phastCons46wayPlacental.bed
CTCFMOTIFS=${path}PEAS/CTCF.motifs
Expand All @@ -35,11 +42,11 @@ python ${path}PeakFormatter.py ${peakfile} ${sepfasta} ${reformattedpeaks}


#Step 2: Extract features from BAM
java -jar ${path}BAMExtractor.jar ${bamfile} "${reformattedpeaks}" ${sepfasta} ${CHARPERLINE} ${outdir} ${prefix} ${INSERTTHRESH}
java -jar ${path}BAMExtractor.jar ${bamfile} "${reformattedpeaks}" ${sepfasta} ${CHARPERLINE} ${outdir} ${prefix} ${INSERTTHRESH} ${keepdups}


#Step 3: Extract PEAS features
${path}DeepLearningPEAS.sh ${inDir} ${prefix} ${outdir} ${fasta} ${HOMERREF} ${HOMERMOTIFS} ${CONSERVATION} ${CTCFMOTIFS} "${reformattedpeaks}_original.txt" ${path}PEAS
${path}DeepLearningPEAS.sh ${inDir} ${prefix} ${outdir} ${fasta} ${HOMERREF} ${HOMERMOTIFS} ${CONSERVATION} ${CTCFMOTIFS} "${reformattedpeaks}_original.txt" ${path}PEAS ${keepdups}


#Step 4: Move PEAS features to directory
Expand Down
49 changes: 0 additions & 49 deletions FeatureExtractor_hg38.sh

This file was deleted.

5 changes: 2 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,6 @@

Classification of Regulatory Elements with ATAC-seq (CoRE-ATAC).

2-3-2020 - Update
Added code/compiled jar files for extracting features as well as a pre-trained CoRE-ATAC model (CoRE-ATAC_bulk_hg19_Dec_2019.h5).
See CoRE-ATA_Manual for instructions on encoding data and predicting cis-RE function.

I will be adding files used for training the Sequence/Signal and PEAS components of CoRE-ATAC soon..
A pretrained model that works for TensorFlow CPU is available in releases.

0 comments on commit 7642f34

Please sign in to comment.