Update manipulating_vcf.md
Netzach authored Aug 22, 2024
1 parent b34315a commit b822907
Showing 1 changed file with 29 additions and 26 deletions.
55 changes: 29 additions & 26 deletions docs/Case_studies/manipulating_vcf.md
@@ -33,22 +33,31 @@ Finally, full gene names and functions found in this [file](https://ftp.flybase.

* For shorter run times, extract chromosome 4 and look only at that.

#### Chromosome 4
### Making awking the data easier
Start by splitting the task into sub-tasks. This makes it easier to see what happens and you might get interesting intermediary results.

??? "_Hint_ **What do we need?**"
Extract chr4 from the vcf and the gff and make new files
### *bonus result*
A table of the different genomic features on chromosome 4, counted and sorted.

??? "_Hint_"
All lines from chromosome 4 start with a *4*
### *bonus result*
SNPs sorted by number. Just like the coins on day one.

??? "_Solution_"
`awk '/^4/{print $0}' Drosophila_melanogaster.BDGP6.28.101.gff3 > Drosophila_melanogaster.chr4.gff3`

`awk '/^4/{print $0}' dgrp2_trimmed.vcf > dgrp2_chr4.vcf`
### The exercise
Identify the steps you need and use awk to do those. Open the hints if you get stuck.


<br>
<br>
<br>
<br>
<br>
<br>


#### Making awking the data easier
Start by splitting the task into sub-tasks. This makes it easier to see what happens and you might get interesting intermediary results.

## Hints, ordered by subject. Don't use them unless necessary.

### Overall, an example of things to look for

??? "_Hint_ **Example**"
Let's say we want to find all genes that contain a variant and all variants that are located within a gene. What do we want to do first? Take a look at the vcf file. That is the one that contains all the variants. Then look at the gff file, which contains the genes and other annotations. Finally, take a look at the DNA sequence. You will need to combine all three to answer the question.
@@ -59,25 +68,19 @@ Start by splitting the task into sub-tasks. This makes it easier to see what hap
* Separation of variants (SNPs and INDELs) into two groups, inside and outside genes (and CDSs)
* Separation of genes/CDSs into those with and without variants (and maybe how many there are per gene)

### The exercise
Identify the steps you need to do and what each step does. Open the hints if you get stuck.

#### *bonus result*
A table of the different genomic features on chromosome 4, counted and sorted.
### Chromosome 4

#### *bonus result*
SNPs sorted by number. Just like the coins on day one.

<br>
<br>
<br>
<br>
<br>
<br>
??? "_Hint_ **What do we need?**"
Extract chr4 from the vcf and the gff and make new files

??? "_Hint_"
All lines from chromosome 4 start with a *4*

??? "_Solution_"
`awk '/^4/{print $0}' Drosophila_melanogaster.BDGP6.28.101.gff3 > Drosophila_melanogaster.chr4.gff3`

`awk '/^4/{print $0}' dgrp2_trimmed.vcf > dgrp2_chr4.vcf`
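
The pattern `/^4/` keeps every line that begins with the character 4, so it drops the VCF header lines (which begin with `#`) and would also keep any sequence whose name merely starts with 4. If that matters for later steps, one possible variant compares the first column exactly and keeps the header. This is a minimal sketch, not the course solution: `sample.vcf` and its contents, including the contig name `4_random`, are made up for illustration and stand in for `dgrp2_trimmed.vcf`.

```shell
# Hypothetical stand-in for dgrp2_trimmed.vcf: two header lines and three
# tab-separated records. The contig name "4_random" is invented for the demo.
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\n4\t1050\tsnp_a\n4_random\t77\tsnp_b\n2L\t4000\tsnp_c\n' > sample.vcf

# Keep header lines (starting with "#") and records whose first column is exactly "4".
# A pattern without an action prints the matching line, same as {print $0}.
awk -F'\t' '/^#/ || $1 == "4"' sample.vcf > sample.chr4.vcf

cat sample.chr4.vcf
```

The same `$1 == "4"` test can be used on the gff file, since its first column also holds the sequence name.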

## Hints, don't use unless necessary.


??? "_bonus result example_ **A table with counted and sorted different genomic features in chromosome 4.**"