This project is a re-analysis of a reduced-representation bisulfite sequencing (RRBS) dataset from DNMT1-TET-OFF mESCs, generated by my PhD thesis mentor J.R. Chaillet (U. Pitt) and initially analyzed by S. McGraw and J. Trasler (McGill University).
1. https://pubmed.ncbi.nlm.nih.gov/25578964/
2. https://www.ebi.ac.uk/ena/browser/view/PRJEB6698
The McGraw et al. RRBS data are available from the European Nucleotide Archive (ENA) under record PRJEB6698.
- Download read file TSV report directly from ENA
- Use an awk one-liner to extract the R1 and R2 FTP locations for the 10 samples
cat filereport_read_run_PRJEB6698.tsv | awk -F"\t" '{print $7}' | awk -F";" -v OFS="\n" 'NR>1 {print $1, $2}' > PRJEB6698_ACC_List.txt
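The one-liner above relies on the FTP links being in column 7 and on every run having exactly two files. A slightly more defensive variant, sketched here under the assumption that the report has a header row with a column named `fastq_ftp` (the function name `extract_fastq_urls` is made up for illustration), looks the column up by name:

```shell
#!/usr/bin/env bash
# Sketch: extract FASTQ FTP URLs from an ENA file report by column name.
# Assumes a tab-separated report whose header row contains "fastq_ftp" and
# whose values are ";"-separated URLs (R1;R2 for paired-end runs).
extract_fastq_urls() {
  awk -F'\t' '
    NR == 1 {                        # locate the column by its header name
      for (i = 1; i <= NF; i++) if ($i == "fastq_ftp") col = i
      if (!col) { print "fastq_ftp column not found" > "/dev/stderr"; exit 1 }
      next
    }
    $col != "" {                     # print one URL per output line
      n = split($col, urls, ";")
      for (j = 1; j <= n; j++) print urls[j]
    }
  ' "$1"
}

# Usage:
#   extract_fastq_urls filereport_read_run_PRJEB6698.tsv > PRJEB6698_ACC_List.txt
```

This should keep working if the report is re-downloaded with a different column selection, and it skips runs with an empty `fastq_ftp` field instead of emitting blank lines.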
- Using the wget bash script
PRJEB6698_ERR_wgetdownload.bash
download the corresponding _R1 and _R2 FASTQ files, then check the log to confirm there were no errors (TODO: add a checksum check)
- Download the read experiment XML from ENA
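The checksum check flagged as a to-do for the FASTQ downloads could look something like the sketch below. It is not part of the existing wget script. It assumes a hypothetical two-column, tab-separated list (`<md5><TAB><filename>`), which could be built from the `fastq_md5` column if that field is selected when exporting the ENA report, and it assumes GNU `md5sum` is available.

```shell
#!/usr/bin/env bash
# Sketch: verify downloaded files against a list of expected MD5 checksums.
# Input list format (an assumption): "<md5>\t<filename>", one file per line.
# Requires GNU md5sum (on macOS, `md5 -q` would be the rough equivalent).
verify_md5_list() {
  local list="$1" dir="${2:-.}" status=0 expected file actual
  while IFS=$'\t' read -r expected file; do
    [ -z "$file" ] && continue                 # skip blank or malformed lines
    if [ ! -f "$dir/$file" ]; then
      echo "MISSING  $file"; status=1; continue
    fi
    actual=$(md5sum "$dir/$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
      echo "OK       $file"
    else
      echo "MISMATCH $file"; status=1
    fi
  done < "$list"
  return "$status"                             # non-zero if any file failed
}

# Usage (file names are placeholders):
#   verify_md5_list PRJEB6698_md5.tsv ./fastq || echo "re-download needed"
```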
- Use '.bash' or the below command to extract sample metadata
cat ena_PRJEB6698_read_experiment.xml | \
grep -o -E ">SAMEA[0-9]{7}<|>ERR[0-9]{6}<|refname=\"[dN].*>" | \
tr -d '>|<' | tr '\n' ' ' | \
awk -v OFS="\t" ' {print $3, $2, $1, "\n", $6, $5, $4, "\n", $9, $8, $7, "\n", $12, $11, $10, "\n", $15, $14, $13, "\n", $18, $17, $16, "\n", $21, $20, $19, "\n", $24, $23, $22, "\n", $27, $26, $25, "\n", $30, $29, $28}' | \
sed 's/^\t//' | sed 's/refname=\"//' | sed 's/\"//' > PRJEB6698_metadata.tsv
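The awk step above hard-codes field numbers for exactly 10 samples. As a rough alternative, the reshaping can be written so it works for any number of samples. The helper below (`triplets_to_tsv`, a name chosen here for illustration) assumes the tokens arrive one per line, that is, before the `tr '\n' ' '` step, and that each sample contributes exactly three tokens in a fixed order.

```shell
#!/usr/bin/env bash
# Sketch: reshape a stream of tokens (one per line) into rows of three,
# printing each triplet in reverse order as tab-separated columns,
# mirroring the "$3, $2, $1" ordering used in the pipeline above.
triplets_to_tsv() {
  awk -v OFS='\t' '
    { tok[NR % 3] = $0 }                          # slots: 1 = first, 2 = second, 0 = third
    NR % 3 == 0 { print tok[0], tok[2], tok[1] }  # emit third, second, first
  '
}

# Usage (replaces the tr '\n' ' ' and awk steps; the sed clean-up of
# refname= and quotes would still follow):
#   grep -o -E '...' ena_PRJEB6698_read_experiment.xml | tr -d '>|<' | triplets_to_tsv
```

Because each row is printed as a unit, no leading tab is produced and the `sed 's/^\t//'` clean-up is not needed with this variant.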
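- Download the mouse reference genome to a Reference folder using the command below (as written, it fetches the CAST_EiJ_v1 strain assembly from Ensembl release 109, whereas the alignment script and GTF used later refer to GRCm39):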
wget -O CAST_EiJ_v1.dna.toplevel.fa.gz https://ftp.ensembl.org/pub/release-109/fasta/mus_musculus_casteij/dna/Mus_musculus_casteij.CAST_EiJ_v1.dna.toplevel.fa.gz
- Prepare the bisulfite-converted reference genome (Bismark writes a C->T converted copy and a G->A converted copy, the latter representing C->T deamination on the reverse strand)
module load bismark/0.20.0
bismark_genome_preparation ./
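To make the two conversions concrete, here is a toy illustration of in-silico bisulfite conversion on a FASTA stream. It is only meant to show the idea and is not how Bismark is invoked; `bismark_genome_preparation` does all of this itself. The function name `bisulfite_convert` is made up for this example.

```shell
#!/usr/bin/env bash
# Illustration only: convert FASTA sequence lines in silico.
#   mode CT: every C becomes T (forward-strand conversion)
#   mode GA: every G becomes A (equivalent to C->T on the reverse strand)
# Header lines (starting with ">") are passed through unchanged.
bisulfite_convert() {
  awk -v mode="$1" '
    /^>/ { print; next }
    {
      $0 = toupper($0)                    # treat soft-masked bases like the rest
      if (mode == "CT") gsub(/C/, "T")
      else if (mode == "GA") gsub(/G/, "A")
      print
    }
  '
}

# Usage:
#   printf ">chr1\nACGTccgg\n" | bisulfite_convert CT
```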
- Run
PRJEB6698_pretrimqc.bash
to perform pre-trim QC
- Run
PRJEB6698_trimRRBS.bash
to perform RRBS trimming and post-trim QC with Trim Galore
- Run MultiQC in both the pre-trim and post-trim QC directories:
multiqc --filename "PRJEB6698_multiqc_pretrim_report.html" . &
Note: Deduplication is not recommended for RRBS data, because MspI digestion produces many reads that start at the same genomic positions by design.
- Run
PRJEB6698_Bismark_Align_GRCm39_BT2.bash
to align reads to the bisulfite-converted genome
- Check BAM files [optional]: Bismark BAM output is unsorted and unindexed
samtools view -H *ERR560529_1_val_1_bismark_bt2_pe.bam | grep SO
@HD VN:1.0 SO:unsorted
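`grep SO` matches any header line containing those two letters, so a stricter check can be useful in scripts. The sketch below reads the SAM header from standard input and prints only the value of the `SO` tag on the `@HD` line. The function name `bam_sort_order` is an illustrative choice, and it falls back to `unknown`, which is the value the SAM specification treats as the default when the tag is absent.

```shell
#!/usr/bin/env bash
# Sketch: report the sort order recorded in a SAM/BAM header.
# Reads header text on stdin (tab-separated, as written by samtools view -H).
bam_sort_order() {
  awk -F'\t' '
    $1 == "@HD" {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^SO:/) { print substr($i, 4); found = 1 }
    }
    END { if (!found) print "unknown" }   # no @HD line or no SO tag
  '
}

# Usage:
#   samtools view -H sample.bam | bam_sort_order
# If the result is "unsorted", a coordinate-sorted, indexed copy can be made with:
#   samtools sort -o sample_sorted.bam sample.bam && samtools index sample_sorted.bam
```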
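- Check library type [optional]: the Bismark alignment treats the library as directional (MspI/HpaII CCGG cut) and paired-end (PE); because the input is genomic DNA, the reads are expected to look unstranded relative to gene annotation. The commands below check this with RSeQC: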
wget https://ftp.ensembl.org/pub/release-109/gtf/mus_musculus/Mus_musculus.GRCm39.109.gtf.gz
gunzip -c Mus_musculus.GRCm39.109.gtf.gz > Mus_musculus.GRCm39.109.gtf
module load rseqc/2.6.6
module load bedops/2.4.35
awk '{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \"\";"; }' Mus_musculus.GRCm39.109.gtf| gtf2bed - > Mus_musculus.GRCm39.109.bed
infer_experiment.py -r /ix1/mmann/KoppesEA/REF_Sequences/Mus_musculus/GRCm39_ref/Mus_musculus.GRCm39.109.bed -i /ix1/mmann/KoppesEA/PRJEB6698/Bismark/ERR560527_1_val_1_bismark_bt2_pe_sorted.bam
- Run
PRJEB6698_Bismark_MethExtractor.bash
to tabulate the methylation fraction at each cytosine, with output focused on CpG methylation
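As a quick sanity check on the extractor output, per-site methylation fractions can be recomputed from a Bismark coverage file. This sketch assumes the extractor script writes coverage files (the six-column format: chromosome, start, end, methylation percentage, methylated count, unmethylated count), which this README does not confirm. The function name and the default depth cutoff of 10 are illustrative choices, not settings taken from the project scripts.

```shell
#!/usr/bin/env bash
# Sketch: recompute methylation fractions from a Bismark coverage file.
# Reads tab-separated coverage lines on stdin:
#   chr  start  end  percent_meth  count_meth  count_unmeth
# Prints: chr  position  depth  fraction, for sites with depth >= min depth.
cpg_fractions() {
  awk -F'\t' -v min="${1:-10}" '
    {
      depth = $5 + $6
      if (depth > 0 && depth >= min)
        printf "%s\t%d\t%d\t%.3f\n", $1, $2, depth, $5 / depth
    }
  '
}

# Usage (file name is a placeholder):
#   gunzip -c sample.bismark.cov.gz | cpg_fractions 10 > sample_cpg_fractions.tsv
```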
- Run
PRJEB6698_methylKit_script.R
(work in progress) to perform differential methylation analysis and annotate CpG islands (CGIs)
- DNMT1-TETOFF Paper: Transient DNMT1 suppression reveals hidden heritable marks in the genome. 2015. NAR. Serge McGraw, Jacques X Zhang, Mena Farag, Donovan Chan, Maxime Caron, Carolin Konermann, Christopher C Oakes, K Naga Mohan, Christoph Plass, Tomi Pastinen, Guillaume Bourque, J Richard Chaillet, Jacquetta M Trasler.
- FastQC: https://github.com/s-andrews/FastQC
- TrimGalore: https://github.com/FelixKrueger/TrimGalore
- Bismark: https://github.com/FelixKrueger/Bismark
- methylKit: https://bioconductor.org/packages/release/bioc/html/methylKit.html