-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathmirdeep2_tutorial.html
188 lines (171 loc) · 10.1 KB
/
mirdeep2_tutorial.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta name="author" content="Sebastian Mackowiak">
<title>miRDeep2 tutorial</title>
<style>
h2 {
color: #6495ed;
}
.command {
color: green;
padding-top:0.5cm;
padding-bottom:0.5cm;
word-spacing:10px
}
.main {
margin-left: 100px;
width: 550px;
font-size: 14pt;
}
</style>
</head>
<body>
<div class="main">
<h1>miRDeep2 tutorial</h1>
Extended miRDeep2 tutorial with step by step instructions. It will cover the mapper.pl
for preprocessing and mapping, the miRDeep2.pl for de-novo prediction.
<h2>Disclaimer</h2>
This tutorial comes with no warranty and demands common sense of the reader. I am not
responsible for any damage that happens to your computer by using this tutorial.
For comments or questions just create an 'issue' here
<a
href="https://github.com/Drmirdeep/drmirdeep.github.io/issues">https://github.com/Drmirdeep/drmirdeep.github.io/issues</a>.
Apparently you will need a github account for that.
<h2>Preface</h2>
This is a step by step guide for a full small RNA sequencing data analysis using the
miRDeep2 package. The first part will describe the general workflow
to do de-novo miRNA predictions based on a small RNAseq data seq and the second part will
focus on expression analysis with the quantification module.
<h2> Installation instructions</h2>
If you haven't installed them yet you can obtain the main package from
<a href="https://github.com/rajewsky-lab/mirdeep2">miRDeep2</a>
by clicking on 'Clone or download' and then on 'Download Zip'.
Extract the zipped files and then open a command line window.
If you have git installed you can obtain the packages also directly from the command line
by typing
<div class="command">git clone https://github.com/rajewsky-lab/mirdeep2.git</div>
To install the miRDeep2 package enter the directory to which the package was extracted
to. If you extracted the folder on the Desktop then typing
<div class="command">cd ~/Desktop/mirdeep2</div>
will bring you to the mirdeep2 folder.
Then typing
<div class="command">perl install.pl</div>
will start the installer and download and install third party software. In particular.
bowtie version 1, RNAfold, randfold and the perl packages PDF:API and TTF will be
installed. Please follow the instructions on the screen.
When mirdeep2 was installed successfully please open a new terminal window and just type
<div class="command">miRDeep2.pl</div>
If you see the miRDeep2 usage instructions on the screen you can continue.
Otherwise something went wrong during the installation.
Download the miRBase reference files for version 21 by typing
<div class="command">mirbase.pl 21</div>
This will download the hairpin.fa.gz and mature.fa.gz for version 21 to directory
~/mirbase/21/
<br><br>
(The ~ sign will be expanded by your computer to your home directory).
<br><br>
If you want the gff files as well then you need to type
<div class="command">mirbase.pl 21 1</div>
The second argument can be anything but 0 which will tell the script to also get the gff
files from mirbase.
For miRNA quantification we next extract the miRNAs for our species of interest. For
that you need to know the 3-letter code of miRBase for your species. For humans
this will be 'hsa' and mouse would be mmu.
To extract the mature sequences from the mirbase file we downloaded before
you just need to type
<div class="command">extract_miRNAs.pl ~/mirbase/21/mature.fa.gz hsa >
mature_ref.fa</div>
and to get the hairpin sequences by typing
<div class="command">extract_miRNAs.pl ~/mirbase/21/hairpin.fa.gz hsa >
hairpin_ref.fa</div>
For running miRDeep2 for de-novo miRNA prediction it is beneficial to supply
also mature miRNAs from related species. You could for example use mouse 'mmu' and
chimp 'ptr' as related species. For extracting those miRNAs you can type
<div class="command">extract_miRNAs.pl ~/mirbase/21/mature.fa.gz mmu,ptr >
mature_other.fa</div>
<h2>Tutorial files</h2>
In case you want to follow the tutorial with example data you can get
the files from here
<a href="https://github.com/Drmirdeep/drmirdeep.github.io/archive/master.zip">tutorial_files</a> and type
<div class="command">unzip drmirdeep.github.io-master.zip</div>
change to the unzipped directory with
<div class="command">cd drmirdeep.github.io-master</div> and continue with the tutorial.
<h2>The data and what else you will need</h2>
Usually you will have gotten a small RNA sequencing data file from a collaborator that
wants you to analyze the data file. Before you can start with any kind of analysis you
should either already know the small RNA sequencing adapter that was used for the
sequencing of the sample or ask your collaborator to sent it to you. If you don't clip
the adapter then the majority of the reads having an adapter
are likely not to be aligned to anywhere.
<br><br>
Once you know the adapter sequence you should do a simple check to see how many
of your sequences contain the adapter. This you can do by typing
<div class="command">grep -c TGGAATTC example_small_rna_file.fastq</div>
where 'TGGAATTC' are the first 8 nucleotides of the adapter that has been used
for this sample. Replace it with your own sequencing adapter. MicroRNAs have a mean length
of 22 nucleotides in animals so if you have sequenced one of those it will likely have the
sequencing adapter attached to it. If the resulting number of sequences with an adapter
is around 70% of the number of your input sequences the data set can be considered as
reasonably good. Note: In case that only adapters have been sequenced predominantly you will
also get a high number which is obviously not good. If you only get 10% of sequences with
an adpater then likely something went wrong during the
sequnecing library preparation or your sample doesn't contain too many small RNAs.
For novel miRNA prediction we need to map the reads against a reference database
which has to be indexed by bowtie 1. For this we take a reference database file,
lets call it refdb.fa (This can be a genome file or simply a file
with scaffolds) and build a bowtie index by typing
<div class="command">bowtie-build refdb.fa refdb.fa</div>
The first argument is here the actual file to index, the second argument is the
prefix for the bowtie index files. You can name it differently but for ease of
use I use the same name as my reference database file. Depending on the input
file size it can take several hours (for the human genome for example) to be
indexed. However, the bowtie website has already some prebuild index files
for download. If you decide to download index files you will also need to download
the fasta file with which the index was build. Otherwise the results in the miRDeep2
prediction will be not reliable.
<h2>Data preprocessing for novel miRNA prediction</h2>
Since the miRDeep2 package was designed as a complete solution for miRNA prediction and
quantification it also contains data preprocessing routines that will also clip the
sequencing adapter. The main function of the mapping module is the mapping
of the preprocessed reads file to reference database. The reference database is typically
an annotated genome sequence but can also be simply a scaffold assembly if no genome is
available. The scaffolds itself however should be at least 200 nucleotides long so that
a sane miRNA precursor plus some flanking region fits into it. Apart from clipping
adapters the module does sanity checks on your sequencing reads and also collapse read
sequences to reduce the file size which will save computing time.
<div class="command">mapper.pl example_small_rna_file.fastq -e -h -i -j -k TGGAATTC -l 18
-m -p refdb.fa -s reads_collapsed.fa -t reads_vs_refdb.arf -v -o 4</div>
What does this command do? The first argument needs to be your sequencing file.
Typically, this will be a fastq file. The format of the fastq file is designated by
specifying option '-e'. If your file is in fasta format already you specify option '-c'
instead. If your reads file is not in fasta format you need to specify option '-h' which
advises the mapper module to parse your file to fasta format. Option -i will convert RNA
to DNA and option '-j' will remove sequences that contain characters other than ACGTN.
<br><br> Now comes the actual adapter clipping which is only done if a adapter sequence
is given by option -k. Only the first 6 nucleotides of this sequence will be used to
search for an exact match in the sequencing reads. Option '-m' will collapse the reads
to remove redundancy and decrease the file size. A sequnecing read seen 10 times in your
raw file will occur only once in the collapsed file and have a _x10 in its identifier.
<br><br> After that the reads will be mapped to the given refence genome which index file
was specified by option '-p'. Option -s indicates the preprocessed read file name which
is output by the mapper module and option -t is the file name of the read mappings to the
reference database ('refdb.fa') in miRDeep2's arf format. A mapping file in arf format
can be easily obtained from a standard bowtie 1 output file (This is NOT in 'sam' format
but a proprietary bowtie text file format) by typing
<div class="command">convert_bowtie_output.pl reads_vs_refdb.bwt > reads_vs_refdb.arf</div>
However, if you used the mapping module then the mapped output file is already in
arf format.
<h2>Identification of known and novel miRNAs</h2>
For predicting novel miRNAs the miRDeep2 module from the package is called with a
collapsed reads file and a reference genome file in fasta format. For better
prediction results reference files of miRNAs and related miRNAs should be given since
miRDeep2 considers predicted miRNAs with conserved seeds in other species
more reliable that miRNAs with non-conserved seeds.
<div class="command">miRDeep2.pl reads_collapsed.fa refdb.fa reads_vs_refdb.arf
mature_ref.fa mature_other.fa hairpin_ref.fa -t hsa
2>report.log</div>
</div>
</body>
</html>