This repository contains the codes of our established Multi-level functional variants-deciphering pipeline based on snakemake, which was used for functional fine-mapping of eQTL SNPs our OA-eQTL project.
The pipeline is organized as a snakemake workflow in which invoked some custom python and R scripts, so we assume the users are familiar with the basic principles of snakemake, python and R.
The pipeline was tested with the snakemake v7.32.4. The three conda environments on which the pipeline depends was deposited as yaml configure files in conda_environments directory. Users could create the corresponding environments using:
conda env create -f enrironment.yaml
All only thing the users need to do is modify the workflow configure file config.yaml
to that located the same directory with snakemake workflow file.
The config.yaml
file include three types of parameters to set
- Biosoftware path in your system
- Parameters for some analysis steps
- Input files path
- Users can get clear insight of the format for the most of required input data by scaning the example
config.yaml
file, such as fastq files of ATAC-seq data, fasta format of reference genome file - For some custom input file, we provided example file in
annoFiles
directory
- Users can get clear insight of the format for the most of required input data by scaning the example
After preparing all input data and correctely set values in config.yaml
file, users can run the analysis workflow in a very easy way:
snakemake --use-conda -s MFVD.snakefile
- Allele specific open chromatin result:
output_dir/ASoC_result/all.txt
, which contain the following columns:- CHROM: Chromosome
- POS: Position of SNP
- REF: Ref allele of the SNP
- ALT: Alternative allele of the SNP
- heterozygosis_sample_count: Number of samples in which the SNP was tested as heterozygote
- REF_count: Number of reads mapped to Ref allele across all heterozygous samples
- ALT_count: Number of reads mapped to Alternative allele across all heterozygous samples
- DP_count: Sum of REF_count and ALT_count
- P_value: Pvalue of Binomial test
- FDR: FDR calulated with p.adjust R package
- TFi-QTL result:
output_dir/tf_regulation/tfieQTL_result_merged/tfieQTL.all.top_eVariant_for_each_TF_target_pair.withGeneName
, which contain the following main columns- TF_id: Ensemble id of TF;
- TF_name; Gene symbol of the TF
- target_id; Ensemble id of eGene corresponding to
SNP
- target_name: Gene symbol of eGene corresponding to
SNP
- SNP: eSNP id
- interactP: Pvalue of interaction term in regression model
Users can download the ATAC-seq data (GSE108301) we used in our paper that deposite in bioRxiv 'https://www.biorxiv.org/content/10.1101/2024.06.11.598401v2' to test the workflow. It will cost several days to get the all results when using 16 ATAC-seq data.