Scripts for performing alignment of paired-end NGS reads to the reference genomes using BWA-MEM on the BIH HPC cluster.
You can download raw fastq data from the genomics facility file exchange server using your MDC account (by changing user@MDC-BERLIN) as follows (adapt link from Email):
wget -rnd -np --user=user@MDC-BERLIN --ask-password -A '*.gz' https://file-exchange.bihealth.org/c2eda02f-523b-4d1a-8909-aea0cd0f5a2d/
Find more information on how to get raw data here: https://bimsbstatic.mdc-berlin.de/genomics/howto/user_transfer_fileboxes.html. Note that file boxes will be kept for 14 days only!
Depending on how closely the Genomics Platform paid attention to our submission form, you may have to edit the filenames of the FASTQ files to get them into our desired format (i.e. P3069_i301.1.fastq.gz
).
This can be achieved with the script here, which compares the indices in the FASTQ header to those that we submitted , and uses this information to rename the file accordingly.
You can download this repository like so:
git clone -b main git@github.com:Sanders-Lab/bih-alignment.git
And install the required conda environments (which have all required packages) into your BIH cluster conda workspace like so (you only need to do this once and can then re-use it):
# bash environment
conda env create --file bih-alignment/alignmentenv_20231010.yml
# R environment
conda env create --file bih-alignment/alignmentr.yml
Once the repo is cloned you can launch the complete alignment and QC pipeline like so:
sbatch \
-J alignment \
-o /data/cephfs-2/unmirrored/groups/sanders/projects/${myname}/logs/$(date +%Y%m%d)_${project_name}_alignment.txt \
bih-alignment/scripts/alignment_script.sh \
$project_name \
.1.fastq.gz \
human
-
Where
$project_name
is the the name of sample (e.g. set toP1593
if you are aligning reads in/fast/groups/ag_sanders/work/data/P1593/fastq
). -
The second command line variable (
.1.fastq.gz
in this example) should be the shared suffix of the first mate FASTQ files. The suffix of the second mate is assumed to be the same with a 2 in place of the 1. -
The third command line variable is the organism to which you want to align the reference genome, currently this can be set to
human
(hg38),human_t2t
(T2T) ormouse
(mm39).
You can either edit ${myname}
manually, or (to make your life easier) add them as an environmental bash variables (e.g. by adding export myname=benedict
to your ~/.bashrc
file).
To ensure the pipeline can locate and execute the scripts in the exec/
directory, please submit the slurm job from the directory into which you initially cloned this repository (i.e. run sbatch bih-alignment/scripts/alignment_script.sh
as in the example above).
This repo contains the following scripts:
-
alignment_script.sh
is the complete pipeline and is therefore the recommended script to use. -
alignment_qc_script.sh
can be run if you only wish to run quality control on data that is already aligned.
The files in exec/
are called by the main scripts and should not be executed in isolation.
If you already have aligned BAM files and wish to run the quality control script in isolation, you can do so like this:
sbatch \
-J alnQC \
-o /data/cephfs-2/unmirrored/groups/sanders/projects/${myname}/logs/$(date +%Y%m%d)_${project_name}_alignment_qc.txt \
bih-alignment/scripts/alignment_qc_script.sh \
$project_name \
human
Please contact us with any problems or submit them as an issue in this Github repository.