This project is a reanalysis of a reduced-representation bisulfite sequencing (RRBS) dataset from DNMT1-TET-OFF mESCs generated by my PhD thesis mentor J.R. Chaillet (U. Pitt) and initially analyzed by S. McGraw and J. Trasler (McGill University).
The McGraw et al. RRBS data are available from the European Nucleotide Archive (ENA) under project accession PRJEB6698.
- Download read file TSV report directly from ENA
- Use an awk one-liner to extract the R1 and R2 FTP locations for the 10 samples
cat filereport_read_run_PRJEB6698.tsv | awk -F"\t" '{print $7}' | awk -F";" -v OFS="\n" 'NR>1 {print $1, $2}' > PRJEB6698_ACC_List.txt
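The one-liner above depends on the FTP locations sitting in column 7 of the report. As a minimal sketch of a more defensive variant, the function below looks columns up by header name and also builds an `md5sum`-style checklist, which could serve as the checksum step for the downloads. It assumes the report was downloaded from ENA with both the `fastq_ftp` and `fastq_md5` fields selected; the function name `make_md5_list` and the output file name in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: build an md5sum-compatible checklist from an ENA file report.
# Assumes the TSV has a header row containing fastq_ftp and fastq_md5,
# each holding semicolon-separated values in matching order (R1;R2).
make_md5_list() {
  awk -F'\t' '
    NR == 1 {
      # Map header names to column numbers instead of hard-coding $7.
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("fastq_ftp" in col) || !("fastq_md5" in col)) {
        print "missing fastq_ftp or fastq_md5 column" > "/dev/stderr"
        exit 1
      }
      next
    }
    {
      n = split($(col["fastq_ftp"]), url, ";")
      split($(col["fastq_md5"]), md5, ";")
      for (j = 1; j <= n; j++) {
        # Keep only the file name, since md5sum -c runs in the download directory.
        m = split(url[j], part, "/")
        printf "%s  %s\n", md5[j], part[m]
      }
    }
  ' "$1"
}
```

After the FASTQ files are downloaded, `make_md5_list filereport_read_run_PRJEB6698.tsv > PRJEB6698_md5.txt` followed by `md5sum -c PRJEB6698_md5.txt` in the download directory would report any file whose checksum does not match.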
- Using the wget bash script, download the corresponding _R1 and _R2 FASTQ records, then check the log to make sure there were no errors (to do: add a checksum feature)
- Download the read experiment XML
- Use '.bash' or the below command to extract sample metadata
cat ena_PRJEB6698_read_experiment.xml | \
grep -o -E ">SAMEA[0-9]{7}<|>ERR[0-9]{6}<|refname=\"[dN].*>" | \
tr -d '>|<' | tr '\n' ' ' | \
awk -v OFS="\t" ' {print $3, $2, $1, "\n", $6, $5, $4, "\n", $9, $8, $7, "\n", $12, $11, $10, "\n", $15, $14, $13, "\n", $18, $17, $16, "\n", $21, $20, $19, "\n", $24, $23, $22, "\n", $27, $26, $25, "\n", $30, $29, $28}' | \
sed 's/^\t//' | sed 's/refname=\"//' | sed 's/\"//' > PRJEB6698_metadata.tsv
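The awk step above hard-codes field positions for exactly 10 samples. As a sketch of the same regrouping for any number of samples, the function below folds a token stream into rows of three and reverses the column order. It assumes the upstream `grep -o` emits exactly three tokens per experiment, one per line, in the order the command above expects; the function name `triples_to_tsv` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: read one token per line on stdin, three tokens per sample,
# and print one tab-separated row per sample with the columns reversed.
triples_to_tsv() {
  # paste - - - joins every three input lines with tabs.
  paste - - - | awk -F'\t' -v OFS='\t' '{ print $3, $2, $1 }'
}
```

In the pipeline above, this would take the place of the `tr '\n' ' '` and positional `awk` stages, and no leading-tab cleanup would be needed afterwards.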
- Download the GRC38 Mouse Reference Genome to a Reference folder using the commands below:
wget -O CAST_EiJ_v1.dna.toplevel.fa.gz
- Prepare the bisulfite-converted reference genome: Bismark writes a C-->T converted copy (forward strand) and a G-->A converted copy (reverse strand), mimicking the C-->T deamination transition
module load bismark/0.20.0
bismark_genome_preparation ./
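To illustrate what the two converted genomes contain, here is a toy sketch of the in silico conversion on a single short sequence. It is not how Bismark is implemented, only the same substitution applied with `tr`; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Toy sketch of bisulfite conversion in silico.
# C->T models conversion read from the forward strand;
# G->A is the same event seen from the reverse strand.
ct_convert() { printf '%s\n' "$1" | tr 'Cc' 'Tt'; }
ga_convert() { printf '%s\n' "$1" | tr 'Gg' 'Aa'; }

ct_convert "ACGTCG"   # prints ATGTTG
ga_convert "ACGTCG"   # prints ACATCA
```

Reads are converted the same way before alignment, which is why methylation calls are made afterwards by comparing the original read bases with the original genome.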
- Run the corresponding script to perform pre-trim QC
- Run the corresponding script to perform RRBS trimming and post-trim QC with Trim Galore
- Run multiqc in both pretrim and trim QC directories:
multiqc --filename "PRJEB6698_multiqc_pretrim_report.html" . &
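Note: For RRBS, deduplication is not recommended. Reads start at fixed MspI cut sites, so many reads share the same alignment position without being PCR duplicates.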
- Run the corresponding script to align reads to the bisulfite-converted genome
- Check BAM files [optional]: Bismark BAM output is unsorted and unindexed
samtools view -H *ERR560529_1_val_1_bismark_bt2_pe.bam | grep SO
@HD VN:1.0 SO:unsorted
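Tools that need random access, such as genome browsers, require a coordinate-sorted and indexed BAM. A minimal sketch of producing sorted copies follows, assuming `samtools` is on the PATH; the helper names are hypothetical, and the `_sorted.bam` suffix is chosen to match the sorted file name used later in this README.

```shell
#!/usr/bin/env bash
# Sketch: write coordinate-sorted, indexed copies of Bismark BAM files.

# Derive the output name, e.g. x_pe.bam -> x_pe_sorted.bam
sorted_name() { printf '%s\n' "${1%.bam}_sorted.bam"; }

sort_and_index() {
  local bam out
  for bam in "$@"; do
    out=$(sorted_name "$bam")
    # Only index if sorting succeeded.
    samtools sort -o "$out" "$bam" && samtools index "$out"
  done
}
```

A call such as `sort_and_index *_bismark_bt2_pe.bam` would process every paired-end alignment in the directory. The original unsorted BAMs should be kept, since the Bismark methylation extractor expects the two mates of a paired-end read on adjacent lines rather than in coordinate order.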
The Bismark alignments for this directional RRBS library (MspI/HpaII CCGG cut) are paired-end (PE) and unstranded (genomic).
gunzip -c Mus_musculus.GRCm39.109.gtf.gz > Mus_musculus.GRCm39.109.gtf
module load rseqc/2.6.6
module load bedops/2.4.35
awk '{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \"\";"; }' Mus_musculus.GRCm39.109.gtf | gtf2bed - > Mus_musculus.GRCm39.109.bed
-r /ix1/mmann/KoppesEA/REF_Sequences/Mus_musculus/GRCm39_ref/Mus_musculus.GRCm39.109.bed -i /ix1/mmann/KoppesEA/PRJEB6698/Bismark/ERR560527_1_val_1_bismark_bt2_pe_sorted.bam
- Run the corresponding script to tabulate methylation fractions for each cytosine, with output focused on CpG methylation
- Run the corresponding script (work in progress) to perform differential methylation analysis and annotate CpG islands (CGIs)
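For the methylation tabulation step above, the fraction at each cytosine is the methylated count divided by total coverage. A minimal sketch of computing it from a Bismark coverage file follows, assuming the standard six tab-separated columns (chromosome, start, end, percent methylation, methylated count, unmethylated count). The function name `filter_cov` and the threshold of 10 reads in the usage note are hypothetical illustrations, not settings taken from this project.

```shell
#!/usr/bin/env bash
# Sketch: keep cytosines with at least <min> reads and report the
# methylation fraction as methylated / (methylated + unmethylated).
# Usage: filter_cov <coverage_file> <min_coverage>
filter_cov() {
  awk -F'\t' -v OFS='\t' -v min="$2" '
    ($5 + $6) >= min {
      total = $5 + $6
      print $1, $2, $3, $5 / total, total
    }
  ' "$1"
}
```

For example, `filter_cov sample.bismark.cov 10` would print chromosome, start, end, fraction, and depth for every position covered by at least 10 reads. A coverage filter of this kind is commonly applied before differential methylation testing so that low-depth sites do not contribute noisy fractions.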
- DNMT1-TETOFF paper: McGraw S, Zhang JX, Farag M, Chan D, Caron M, Konermann C, Oakes CC, Mohan KN, Plass C, Pastinen T, Bourque G, Chaillet JR, Trasler JM. Transient DNMT1 suppression reveals hidden heritable marks in the genome. Nucleic Acids Research. 2015.
- FastQC:
- TrimGalore:
- Bismark:
- methylKit: