This repository stores input files, example output files, and description of the problem which is designed to assist in evaluation of skills of job candidates.
Given genotypes saved in compressed Variant Call Format (VCF) files (one file per chromosome) and located inside input/
directory, implement a workflow (see Figure below) which:
-
Constructs a list of unrelated individuals. Two individuals are related if kinship coefficient between them is >=0.0442. Use existing published open source software to compute kinship between individuals e.g. PLINK2 or KING.
-
Builds new VCF files which include only unrelated individuals. For each variant, it computes the following VCF INFO fields:
- alternate allele count (AC)
- alternate allele frequency (AF)
- minor allele frequency (MAF)
- total number of alleles (AN)
- average depth (AVG_DP), i.e. average value across FORMAT DP values
- number of heterozygous alternate allele carriers (N_ALT_HET), i.e. with genotypes 0/1 or 1/0
- number of homozygous alternate allele carriers (N_ALT_HOM), i.e. with genotype 1/1
Monomorphic variants (i.e. with AF = 0.0 or AF = 1.0) must be excluded.
-
For each variant, computes average depth's percentile value (AVG_DP_P) in VCF INFO field. If variant vi has the average depth value AVG_DPi, then its average depth's percentile value AVG_DP_Pi is equal to the percentage of variants across all VCF files with AVG_DP < AVG_DPi.
To implement this workflow you may:
- use any workflow system of your choice, which is compatible with HPC (e.g. Nextflow, Snakemake, Luigi, Cromwell)
- use any existing published open source sowtware tools (e.g. bcftools, vcftools, PLINK2, KING, R and R packages)
- use any scripting/programming language (or any combination of them) of your choice (e.g. Python, C/C++, Java, Perl, R, shell scripting)
Please, send us a single compressed archive which includes:
- README. It should provide: (a) a list of all open-source software tools (and their versions) which were used; (b) any additional requirements for the operating system and/or system libraries; (c) any compilation instructions if such exists (d) detailed step-by-step description on how to run the workflow.
- Workflow source code. Include also configuration files if such are needed to run the workflow.
- Source of your custom scripts/code. Please include detailed comments in your source code. This will help us better understand your code.
Don't send VCF output files!
The following will be evaluated:
- Workflow can be easily installed and run.
- Selection of unrelated individuals follows best practices.
- All requested INFO fields are present in the final output VCF files.
- Values in the requested INFO fields in the final output VCF files are correct for a set of identified unrelated individuals.
Depicted sequence of steps in the workflow may not be optimal and any improvements (if such possible) are highly encouraged. Please, leave descriptive comments, which may help us to better understand your choices.