<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- saved from url=(0077)file:///Users/ppham/src/cse140/13wi/assignments/2-dna-analysis/homework2.html -->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>Homework 3: DNA analysis (Part 1)</title>
</head>
<body>
<h1>Homework 3: DNA analysis (Part 1)</h1>
<p>
<b>Due</b>: at 5pm on Friday, October 18, via GitHub.
Create a release of your homework submission as <code>hw3</code>.
Follow instructions for tagging (creating a release) given in Lab 2.
</p>
<p>
You will use, modify, and extend a program to compute the GC content of DNA
data.
The GC content of DNA is the percentage of nucleotides that are
either G or C.
</p>
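<p>
As a minimal sketch of that definition (not the code in the provided program),
the hypothetical function <code>gc_content</code> below counts the G and C
nucleotides in a string and divides by the string's length. It assumes a
non-empty sequence and returns a fraction between 0 and 1; multiply by 100 to
get a percentage.
</p>

```python
def gc_content(seq):
    """Return the fraction of nucleotides in seq that are G or C.

    Assumes seq is a non-empty string of nucleotide letters.
    """
    gc_count = 0
    for base in seq:
        if base == "G" or base == "C":
            gc_count += 1
    # float() keeps the division from truncating on Python 2.
    return gc_count / float(len(seq))


print(gc_content("ATGC"))
```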
<p>
DNA can be thought of as a sequence of nucleotides. Each nucleotide is
adenine, cytosine, guanine, or thymine. These are abbreviated as A, C, G,
and T. A nucleotide is also called a nucleotide base, nitrogenous
base, nucleobase, or just a base.
</p>
<p>
Biologists have multiple reasons to be interested in GC content.
</p>
<ul>
<li>GC content can identify genes within the DNA, and can identify types of
genes.
Genes tend to have higher GC content than other parts of the DNA.
Genes with longer coding regions have even higher GC content.
</li>
<li>Regions of DNA with higher GC content
require higher temperatures for certain chemical reactions, such as when
copying/duplicating the DNA.
</li>
<li>GC content can be used in determining
classification of species.
</li>
</ul>
<p>
If you are curious, Wikipedia has more information about
<a href="http://en.wikipedia.org/wiki/GC-content">GC content</a>.
That reading is optional and is not required to complete this assignment.
</p>
<p>
Your program will read files produced by a high-throughput sequencer — a
machine that takes as input some DNA, and produces as output a file
containing a sequence of
nucleotides.
</p>
<p>
Here are the first 8 lines of output from a particular sequencer:
</p>
<pre>@SOLEXA-1GA-2_2_FC30DNN:1:2:574:1722
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+SOLEXA-1GA-2_2_FC30DNN:1:2:574:1722
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
@SOLEXA-1GA-2_2_FC30DNN:1:2:478:1745
GTGGGGGTGATGTCCACGATTACGCCGACCGGCTGG
+SOLEXA-1GA-2_2_FC30DNN:1:2:478:1745
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
</pre>
<p>
The nucleotide data is in the second line, the sixth line, the tenth line,
etc. Your program will not use the rest of the file, which provides
information about the sequencer and the sequencing process that created the
nucleotide data.
</p>
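<p>
One way to pick out those lines, sketched here under the assumption that every
record is exactly four lines long, is to number the lines starting from 1 and
keep the ones whose number leaves a remainder of 2 when divided by 4. The
function name <code>read_nucleotides</code> is made up for this illustration
and is not part of the provided program.
</p>

```python
def read_nucleotides(lines):
    """Concatenate the nucleotide lines (2nd, 6th, 10th, ...) of FASTQ text."""
    seq = ""
    linenum = 0
    for line in lines:
        linenum += 1
        # Lines 2, 6, 10, ... are exactly those with linenum % 4 == 2.
        if linenum % 4 == 2:
            seq += line.strip()
    return seq


# The eight example lines shown above, as a list of strings.
example = [
    "@SOLEXA-1GA-2_2_FC30DNN:1:2:574:1722",
    "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC",
    "+SOLEXA-1GA-2_2_FC30DNN:1:2:574:1722",
    "hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh",
    "@SOLEXA-1GA-2_2_FC30DNN:1:2:478:1745",
    "GTGGGGGTGATGTCCACGATTACGCCGACCGGCTGG",
    "+SOLEXA-1GA-2_2_FC30DNN:1:2:478:1745",
    "hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh",
]
print(read_nucleotides(example))
```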
<h2 id="problem1">Problem 1: Obtain the files, add your name</h2>
<p>
Obtain the files you need by downloading the
<a href="http://courses.cs.washington.edu/courses/cse140/13wi/homework/hw2/homework2.zip">homework2.zip</a> file.
(This is a large download — be
patient.)
</p>
<p>
Unzip the <tt>homework2.zip</tt> file to create a
<tt>homework2</tt> directory/folder. You will do your work here.
The <tt>homework2</tt> directory/folder contains:
</p>
<ul>
<li><tt>dna_analysis.py</tt>, a partial Python program that you will
complete</li>
<li><tt>answers.txt</tt>, a file where you will answer textual questions</li>
<li><tt>data</tt>, a directory that contains the data that you will process:
<ul><li><tt>*.fastq</tt> files, which are output from DNA sequencers; this is
the data that the program analyzes</li>
</ul>
</li>
<li><tt>expected_output</tt>, a directory containing example runs of the
final result of your <tt>dna_analysis.py</tt> program.</li>
</ul>
<p>
You will do your work by modifying two
files — <tt>dna_analysis.py</tt> and <tt>answers.txt</tt> — and
then submitting the modified versions. Add your name to the top of each of
these files. Once you are ready to submit your answers, copy these files
to your local working directory, add/commit them to your local git repo,
then push/sync them to github.com.
DO NOT ADD, COMMIT, OR SYNC ANY OTHER FILES. The other files in the
homework are large, and they will
be the same for all students in the class, so there is no need to upload
them to GitHub.
</p>
<p>
Each problem will ask you to make some changes to the program
<tt>dna_analysis.py</tt> (or to write text in the <tt>answers.txt</tt>
file, or both). When you do so, you will generally add to the program. Do
not remove changes from earlier problems when you work on later problems;
your final program should solve all the problems.
</p>
<p>
In both files, keep every line shorter than 80 characters. One technique for
doing this in Python is to break a long expression into smaller ones by
storing subexpressions in variables.
</p>
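<p>
For instance, a sketch of that technique might look like the following. The
four counts are made-up values chosen only for illustration.
</p>

```python
# Made-up counts, for illustration only.
g_count = 12
c_count = 9
a_count = 10
t_count = 5

# Rather than writing one long expression that sums all four counts inside
# the division, name each subexpression so that every line stays short.
gc_count = g_count + c_count
total_count = gc_count + a_count + t_count
gc_content = gc_count / float(total_count)

print(gc_content)
```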
<p>
By the end of the assignment, we would like <tt>dna_analysis.py</tt> to produce
output of the exact form:
</p>
<pre>GC-content: ___
AT-content: ___
G count: ___
C count: ___
A count: ___
T count: ___
Sum count: ___
Total count: ___
seq length: ___
AT/GC Ratio: ___
GC Classification: ___
</pre>
<p>
Here, ___ stands for a value that you will calculate. The exact values in
each category will vary depending on the input data that you are using. We
expect the formatting of your program output to match this exactly.
</p>
<p>
You will submit <tt>answers.txt</tt> as a <b>text file</b>. Plain text is
the standard for communicating information among programmers, because it
can be read on any computer without installing proprietary software. You
can edit text files using IDLE or another text editor. If you use a word
processor, then be sure to save the files as text. Windows users should
never use Notepad for any purpose, because Notepad will mangle the line
endings in the file; WordPad or Notepad++ are better alternatives.
</p>
<h2 id="problem2">Problem 2: Run the program</h2>
<p>
It is a good idea to check the correctness of your program by comparing it
to a computation done in some other way, such as by hand or by a different
program. We have provided the <tt>test-small.fastq</tt> file for this
purpose. First, examine the file by hand to determine the GC content.
Then, run your program to verify that it provides the correct answer for
this file.</p>
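<p>
The same idea can be applied in code. The sketch below computes the GC content
of one short sequence in two independent ways, with an explicit loop and with
the built-in string method <code>count</code>, and checks that they agree.
Both function names are hypothetical and neither is taken from the provided
program.
</p>

```python
def gc_content_loop(seq):
    """GC fraction computed by examining one nucleotide at a time."""
    gc_count = 0
    for base in seq:
        if base == "G" or base == "C":
            gc_count += 1
    return gc_count / float(len(seq))


def gc_content_count(seq):
    """GC fraction computed with the built-in str.count method."""
    return (seq.count("G") + seq.count("C")) / float(len(seq))


seq = "GTGGGGGTGATGTCCACGATTACGCCGACCGGCTGG"
# If the two methods disagree, at least one of them has a bug.
assert gc_content_loop(seq) == gc_content_count(seq)
print(gc_content_loop(seq))
```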
<p>Run your program by opening a
<a href="file:///Users/ppham/src/cse140/13wi/shell-usage.html">shell
or command prompt</a> (not IDLE's Python interpreter),
navigating to your <tt>homework2</tt> directory, then typing the
following command.
</p>
<p>
On Mac/Linux:
</p>
<pre> python dna_analysis.py data/test-small.fastq
</pre>
<p>
On Windows:
</p>
<pre> python dna_analysis.py data\test-small.fastq
</pre>
<p>
If you get a "can't open file 'dna_analysis.py'" error or a "No such file or
directory" error, then perhaps your working directory is not
<tt>homework2</tt>, or you mistyped the file name.
</p>
<p>
After you have confirmed that your program runs correctly on
<tt>test-small.fastq</tt>,
run your program on each of the 6 real <tt>sample_<em>N</em>.fastq</tt> files provided, by
executing 6 commands such as
</p>
<pre> python dna_analysis.py data/sample_1.fastq
</pre>
<p>
or if you are a Windows user,
</p>
<pre> python dna_analysis.py data\sample_1.fastq
</pre>
<p>
You will have to change <tt>sample_1.fastq</tt> to a different file name in
the subsequent commands. Be patient — you are processing a lot of
data, and it might take a minute or so to run.
<!-- On Mike's laptop (top-of-the-line as of Feb 2008): 56 seconds -->
</p>
<p>
(If you are interested, <tt>sample_3.fastq</tt> and <tt>sample_4.fastq</tt> are from
<a href="http://en.wikipedia.org/wiki/Streptococcus_pneumoniae">Streptococcus
pneumoniae TIGR4</a>, and <tt>sample_5.fastq</tt> is from
<a href="http://en.wikipedia.org/wiki/Cytomegalovirus">Human herpesvirus 5</a>.)
</p>
<p>
If you have already used the Output Comparison Tool (referenced at the bottom
of the page), you might notice that some of your results are different than
the example results. Don't worry about this — this issue will be
resolved in <a href="file:///Users/ppham/src/cse140/13wi/assignments/2-dna-analysis/homework2.html#problem6">Problem
6</a>.
</p>
<p>
Copy and paste the line of output regarding GC-content from
<tt>sample_1.fastq</tt> into your <tt>answers.txt</tt> file. For example, your
answer might look like
</p>
<pre>GC-content: 0.42900139393
</pre>
<p>
(Note that this is not the answer you should expect to get; it is just an
example of the format that your answer should be in.)
</p>
<h2 id="problem3">Problem 3: Remove some lines</h2>
<p>
In your program, comment out these lines
</p>
<pre> seq = ""
linenum = 0
</pre>
<p>
by prefixing them with the <tt>#</tt> character. Re-run the program, just as
you did for <a href="file:///Users/ppham/src/cse140/13wi/assignments/2-dna-analysis/homework2.html#problem2">Problem 2</a>.
In <tt>answers.txt</tt>,
explain what happened, and why it happened. Now, restore the lines to
their original state by removing the <tt>#</tt> that you added.
</p>
<p>
What would happen if you commented out this line?
</p>
<pre> gc_count = 0
</pre>
<p>
Explain (in <tt>answers.txt</tt>).
</p>
<h2 id="problem4">Problem 4: Compute AT content</h2>
<p>
Augment your program so that, in addition to computing and printing the GC
content, it also computes and prints the AT content. The AT content is the
percentage of nucleotides that are A or T.
</p>
<p>
Two ways to compute the AT content are:
</p>
<ul>
<li>Copy the existing loop that examines each nucleotide. You will now have
two loops, one of which computes the GC count and one of which computes
the AT count.
</li>
<li>Add more statements into the existing loop, so that one loop computes
both the GC count and the AT count.
</li>
</ul>
<p>
You may use whichever approach you prefer.
</p>
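<p>
The two approaches can be sketched on a short stand-alone sequence as follows.
This is only an illustration of the loop structure, not the code in
<tt>dna_analysis.py</tt>; the variable names <code>gc_both</code> and
<code>at_both</code> are made up for the example.
</p>

```python
seq = "GTGGGGGTGATGTCCACGATTACGCCGACCGGCTGG"

# Approach 1: two separate loops, one per count.
gc_count = 0
for base in seq:
    if base == "G" or base == "C":
        gc_count += 1

at_count = 0
for base in seq:
    if base == "A" or base == "T":
        at_count += 1

# Approach 2: a single loop that updates both counts.
gc_both = 0
at_both = 0
for base in seq:
    if base == "G" or base == "C":
        gc_both += 1
    elif base == "A" or base == "T":
        at_both += 1

# Both approaches should produce the same counts.
assert (gc_count, at_count) == (gc_both, at_both)
print(at_count / float(len(seq)))
```

<p>
The single loop reads the sequence only once, which may matter for the larger
data files; the two-loop version changes less of the existing code.
</p>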
<p>
Check your work by manually computing the AT content for file
<tt>test-small.fastq</tt>, then comparing it to the output of running your program
on <tt>test-small.fastq</tt>.
</p>
<p>
Run your program on <tt>sample_1.fastq</tt>. Copy and paste the relevant line of
output into <tt>answers.txt</tt>.
</p>
<h2 id="submit">Collaboration and Reflection</h2>
<p>
You are almost done!
</p>
<p>
At the bottom of your <tt>answers.txt</tt> file, in the “Collaboration”
part, state which students or
other people (besides the course staff) helped you with the assignment, or
that no one did.
</p>
<p>
At the bottom of your <tt>answers.txt</tt> file, in the “Reflection” part,
reflect on this assignment.
What did you learn from this assignment? What do you wish you had known
before you started? What would you do differently? What advice would
you offer to future students?
</p>
<p>
Commit the following files to your GitHub repository:
</p>
<ul>
<li><tt>dna_analysis.py</tt></li>
<li><tt>answers.txt</tt></li>
</ul>
<p>
Please validate your <tt>dna_analysis.py</tt> program's output using the
<a href="http://www.diffchecker.com/diff">Diff Checker</a> before submitting
your assignment. You can compare your output to the files given in the
<tt>expected_output</tt> directory of the homework2 files.
</p>
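<p>
If you would rather compare outputs locally, a comparable check can be
sketched with Python's standard <code>difflib</code> module. The helper
<code>diff_outputs</code> and the two sample strings below are made up for
illustration; in practice you would read your program's output and the
matching file from <tt>expected_output</tt> instead.
</p>

```python
import difflib


def diff_outputs(actual, expected):
    """Return unified-diff lines between two outputs (empty list if equal)."""
    return list(difflib.unified_diff(
        expected.splitlines(), actual.splitlines(),
        fromfile="expected", tofile="actual", lineterm=""))


# Made-up sample text; the second line of `actual` is missing a hyphen.
expected = "GC-content: 0.5\nAT-content: 0.5\n"
actual = "GC-content: 0.5\nAT content: 0.5\n"

for line in diff_outputs(actual, expected):
    print(line)
```

<p>
An empty result means the two outputs match exactly, including spacing and
punctuation.
</p>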
</body></html>