-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathpython3.qmd
executable file
·522 lines (363 loc) · 18.7 KB
/
python3.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
---
title: Python Unit 3
execute:
enabled: false
---
## Object-oriented programming
Python supports, amongst others, the object-oriented programming paradigm. While hundreds of books have been written on this paradigm, it is actually quite easy to grasp: Think of a program as a collection of *objects* that communicate with each other. Each object has
- *attributes* describing its state, and
- *methods* that allow to change these attributes.
Before we can create an object, we first have to define a “blueprint” (*class*) for a certain type of object. Then, we may create one or more objects from this class via a process known as *instantiation*. Each of these objects will have the same set of attributes and methods, but the attributes will typically differ in their values.
Since all those concepts may sound quite abstract at first, here's a comprehensive example. We will create a class that represents proteins. Our implementation remains simple, but you will have the opportunity to add functionality in the exercises.
All statements of this section (i.e., the class definitions) should be stored in `simple_protein.py`.
### Attributes and the init method
A protein is characterized by its name and amino acid sequence. Hence, the protein class needs at least two attributes:
```python
class Protein:
def __init__(self, name, sequence):
self.name = name
self.sequence = sequence
```
The code above defines a new class:
- keyword `class`
- class name (here `Protein`)
- colon
- indented statements (suite, class body)
We also define an *init method*, which must be named `__init__()` and is called by Python whenever a new object is instantiated from the class. In the body of the init method, we initialize the instance variables `self.name` and `self.sequence`. These variables store the name and sequence of an object and thereby completely describe its state.
We create a new object `ins_A` of class `Protein`. To this end, we call the class name like a function and supply the arguments required by the init method. Once the object `ins_A` has been created, we may access its attributes (`[object name].[attribute name]`).
```python
ins_A = Protein(name="insulin A chain", sequence="GIVEQCCTSICSLYQLENYCN")
print(ins_A.name)
print(ins_A.sequence)
print(type(ins_A))
```
The built-in function `type()` tells us the class from which `ins_A` has been derived.
:::{.callout-important}
You may wonder what happened to the first parameter (`self`) of `__init__()`. Why did we only specify two arguments (which were obviously assigned to the parameters `name` und `sequence`) when we instantiated the insulin object? Each instance method must be defined with at least one parameter (commonly called `self`), which refers to the current instance. However, we *must not* pass an argument to this parameter when invoking the instance method – Python supplies the appropriate value “automatically”.
:::
### Methods
Let's add a method to our class. (Simply insert the definition of `mutate()` into your file `simple_protein.py`. Mind the correct indentation level!)
```python
class Protein:
def __init__(self, name, sequence):
self.name = name
self.sequence = sequence
# only add the following lines to your file
def mutate(self, pos, residue):
self.sequence = self.sequence[:pos] + residue + self.sequence[pos+1:]
```
The method `mutate()` creates a point mutation by replacing the amino acid at position `pos` by `residue`.
```python
ins_A = Protein("insulin A chain", "GIVEQCCTSICSLYQLENYCN")
print(ins_A.sequence)
ins_A.mutate(2, "W")
print(ins_A.sequence)
```
`mutate()` works as expected: It changed the third residue (i.e., valine with index 2) to tryptophane.
In the following sections, we will learn about classes that are available from the standard library.
## Regular expressions
### Introduction
The `re` module (from the standard library) implements classes and functions for handling regular expressions. As you might remember from the bash course, a *regular expression* (“regex”) is a string that defines a search pattern and thereby describes sets of other strings.
The `re` module is quite comprehensive. Here, we only present some of its functions; for further information, please consult the [documentation](https://docs.python.org/3/library/re.html) and the article [Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html).
`findall()` searches for all non-overlapping occurrences of a regular expression (first argument) in a string (second argument) and returns all matches as list.
```python
from re import findall
findall(r".ython", "Python or rather Jython?")
```
It is recommended to define the regular expression by using a *raw string* (i.e., a string that includes a `r` immediately left of the opening quote). Within a raw string, the backslash is not interpreted as escape character, but as an actual backslash. You will see below why this approach is advantageous.
### Syntax
Python's syntax for regular expressions is similar to the bash syntax:
- The dot `.` denotes an *arbitrary character* (see example above).
- *Character classes* are assembled by surrounding a list of characters in brackets.
```python
findall(r"H[au]nd", "Hand Hund Hend")
```
You may also specify a range of characters …
```python
findall(r"H[a-t]nne", "Hanne Henne Hunne") # characters a to t
```
… or exclude certain characters.
```python
findall(r"H[^e]nne", "Hanne Henne Hunne") # all characters except e
```
Several predefined character classes are available, such as `\d` (all decimal numbers) and `\w` (all alphanumeric characters, including the underscore). If you want to use these classes, raw strings are particularly useful:
```python
findall("\\d\\w", "this is 1a") # many backslashes, right?
findall(r"\d\w", "this is 1a") # better!
```
- *Quantifiers* following a character indicate how often this character may appear.
```python
target = "bt bat baat baaat baaaat baaaaaaaaat"
findall(r"ba*t", target) # zero or more times
findall(r"ba+t", target) # one or more times
findall(r"ba?t", target) # zero or once
findall(r"ba{3}t", target) # exactly three times
findall(r"ba{2,}t", target) # at least two times
findall(r"ba{,3}t", target) # up to three times
findall(r"ba{2,4}t", target) # two to four times
```
- *Alternatives* search for one of two different strings.
```python
findall(r"one|two", "one two three")
```
- *Anchors* restrict matches to the start or end of a string:
```python
findall(r"^a..", "auf dem Haus") # anchor at the start
findall(r"a..$", "auf dem Haus") # anchor at the end
```
- *Groups* are created by parentheses surrounding one or more characters. Groups may be modified by a quantifier and allow to extract parts of a match:
```python
findall(r"animal: (\w+) (\w+)", "animal: Mus musculus (mouse)")
```
:::{.callout-note #example-findall}
#### Example
In the code below, we extract all codes that start with the character `2` from a string. The resulting list contains three elements, each of which is a tuple with two strings representing the first and second matched group, respectively. Of course, we can interate over the list of matches.
```python
codes = "1a3, 1b4, 2c5, 2d6, 2e7, 3f8, 3g9"
codes_starting_with_two = findall(r"(2.)(.)", codes)
codes_starting_with_two
#> [('2c', '5'), ('2d', '6'), ('2e', '7')]
for code in codes_starting_with_two:
print(code[1])
#> 5
#> 6
#> 7
```
:::
### Pattern objects
Functions like `findall()` are useful for simple searches. For more complex cases, we prefer to create a *pattern* object from a regular expression and then use its methods.
```python
from re import compile
p = compile(r"animal: (\w+) (\w+)")
type(p)
print(p.pattern) # attribute: regex used for searching
print(p.groups) # attribute: number of groups
```
The pattern object has a method `search()` that searches for matches in a string and returns a *match* object upon success.
```python
m = p.search("animal: Mus musculus (mouse)")
type(m)
```
Let's have a closer look at the match object:
```python
m.groups() # tuple with found groups
m[0] # whole match
m[1] # first group
m.span() # start and end index of the whole match
m.start(2) # start index of the second group
m.end(1) # end index of the first group
```
## Reading and writing files
### How to read individual characters
By reading user input via `input()` or from command line arguments (`sys.argv`) and by displaying output on the console (`print()`), a program may interact with the user. However, as soon as a program has to read large datasets, it it will become infeasible to enter those data manually. Thus, Python is able to read files existing files and to create new files and write contents to them.
Please copy the file `codon_table.csv` from the folder `/resources/python` into the folder `python3` in your home directory and have a look at its first five lines:
```default
AAA,Lys,K,Lysine
AAC,Asn,N,Asparagine
AAG,Lys,K,Lysine
AAT,Asn,N,Asparagine
ACA,Thr,T,Threonine
```
This file apparently represents a codon table, where an amino acid is specified for each base triplet. We open the file by using the built-in function `open()`, which returns a file object:
```python
f = open("codon_table.csv", "r")
f
```
The second argument of `open()` sets the file access mode – here, we only want to **r**ead the file. Python treats an opened file as a *data stream*, i.e., a (one-dimensional) sequence of data (in our case, characters) that can be accessed randomly.
Python may read from a file in the following manner: Immediately after a file has been opened, the current *stream position* points to the first character (index 0), as the method `tell()`, well, tells us:
```python
f.tell()
```
We read the first 30 characters via `read()` and store them in the variable `s`:
```python
s = f.read(30)
s
```
These 30 characters comprise the first line and part of the second line. After the reading operation, the stream position lies at the 31st character (index 30):
```python
f.tell()
```
We may, for example, go back to index 15 and read ten characters, or jump forward to index 100 and read two characters:
```python
f.seek(15)
f.read(10)
f.tell()
f.seek(100)
f.read(2)
f.tell()
```
### How to read a file line by line
In many cases, we will not need random access to the file. Since text files generally are composed of *lines*, they are usually read line by line (method `readline()`):
```python
f.seek(0)
l1 = f.readline()
l1
l2 = f.readline()
l2
f.tell()
```
The strings `l1` and `l2` now contain the first and second line of the file, respectively, including the end of line character `\n`. The stream position points to the first character in the third line. Note: Unfortunately, the end of a line is marked by different characters depending on the operation system. While Linux uses `\n` (*line feed*), Windows uses `\r\n` (*carriage return* plus line feed), and older versions of operating systems on Mac even used `\r`. Thus, you may run into problems if you process a file that has been created in Windows on Linux.
Fortunately, we don't have to repeatedly call `readline()` to eventually read all files from: A file object may be iterated line by line, which facilitates reading within a `for` loop:
```python
f.seek(0)
for line in f:
print(line)
```
Even more simply, we may save all lines in a list via `readlines()`:
```python
f.seek(0)
all_lines = f.readlines()
all_lines
```
If we don't need a file any longer, we must close it:
```python
f.close()
```
### How to write to a file
In order to be able to write to a file, we have to open it with the respective access mode:
```python
f = open("some_lines.txt", "w")
```
Now the method `write()` allows us to write strings to the file:
```python
f.write("First line\n")
f.write("Second line\n")
```
If we wish to write a collection of strings (e.g., a list of strings), the method `writelines()` allows us to do so. Attention: Python will write the list elements one after each other to the file, without inserting a new line character. Thus, if the list elements represent individual lines, we have to add an explicit newline character to the list elements:
```python
lines = ["three\n", "four\n", "five\n"]
f.writelines(lines)
```
Once we are done, we have to close the file. This is especially important after we have written data to the file, since Python may postpone writing operations. `close()` ensures that any pending operations are finished.
```python
f.close()
```
Finally, have a look at the created file (`cat`).
### Context objects
*Context objects* make file access even simpler. Instead of opening, reading/writing, and closing, a `with` block is used:
```python
with open("codon_table.csv", "r") as f:
all_lines = f.readlines()
for line in all_lines:
print(line)
```
Create a context object by writing
- keyword `with`
- function that returns a context object (here: `open()`)
- keyword `as`
- name for the context object (here: `f`)
- colon
- indented statements (suite)
Within the suite, the context object is defined (in our example, the open file is available for reading). As soon as the control flow exits the suite, Python makes sure that the context object is *deinitialized* (in the case of files, any pending writing operations are executed, and the file is closed).
## Exercises
Store all files that you generate for unit 3 in the folder `python3` in your home directory.
### Exercise 3.1 (3 P)
You have successfully solved this exercise as soon as you have worked through this unit. In particular, the folder `python3` must contain the following files, which you have created in the course of this unit:
- `codon_table.csv`
- `simple_protein.py`
- `some_lines.txt`
### Exercise 3.2 (4 P)
Create a class `Protein` with the following properties:
- The init method requires the arguments `name`, `uniprot_id`, and `sequence`, which are stored in attributes of the same name.
- The method `get_length` returns the number of amino acids in the protein.
- The method `contains` has one parameter `peptide` and returns `True` if the protein sequence contains the given `peptide` sequence.
- The method `get_mw` returns the molecular weight of the protein. The method has an optional parameter `disulfides` that specifies the number of disulfide bridges.
The molecular weight of a protein is given by the following formula:
```
sum of the masses of the amino acid residues
+ mass of a water molecule
- 2 * mass of hydrogen * number of disulfide bridges
```
Please use the following masses:
```python
AA_MASS = dict(
G=57.05132, A=71.0779, S=87.0773, P=97.11518, V=99.13106,
T=101.10388, C=103.1429, L=113.15764, I=113.15764, N=114.10264,
D=115.0874, Q=128.12922, K=128.17228, E=129.11398, M=131.19606,
H=137.13928, F=147.17386, R=156.18568, Y=163.17326, W=186.2099
)
WATER_MASS = 18.01528
H_MASS = 1.00784
```
Store the class in `exercise_3_2.py`.
You may check that your program works correctly by using the following exemplary calls:
```python
galanin = Protein("Galanin", "P22466", "GWTLNSAGYLLGPHAVGNHRSFSDKNGLTS")
print(type(galanin))
#> <class '__main__.Protein'>
print(galanin.get_length())
#> 30
print(galanin.contains("CGSHLV"))
#> False
print(galanin.get_mw())
#> 3157.4102999999996
insulin_B = Protein("Insulin B chain", "P01308", "FVNQHLCGSHLVEALYLVCGERGFFYTPKT")
print(insulin_B.get_length())
#> 30
print(insulin_B.contains("CGSHLV"))
#> True
print(insulin_B.get_mw(disulfides=1))
#> 3427.9056799999994
```
:::{.callout-tip collapse="true"}
- You may build upon the `Protein` class we defined above. The [init method](#attributes-and-the-init-method) has to define three attributes.
- You have also learned how to implement [methods](#methods). Think about the number of parameters, required vs optional, and the return value – these decisions are the same as for [function design](python2.qmd#functions).
- In `get_mw()`, you likely will need a [counter variable](python2.qmd#example-counter) and have to [access dict values](python2.qmd#dictionaries).
:::
### Exercise 3.3 (3 P)
Implement a function `read_masses` which reads a table of atomic weights.
- The function has a single parameter, which receives the name of the file containing the table.
- A table in CSV format is provided (`/resources/python/average_mass.csv`), whose first five lines read
```
H,1.008
He,4.0026
Li,6.94
Be,9.0122
B,10.81
```
Each line contains two records separated by a comma: The element symbol and the average atomic mass. Therefore, you may process each line with a regular expression containing two groups.
- The function should return a dict, where element symbols and masses are the key-value pairs.
Store the function in `exercise_3_3.py`.
You may check that your program works correctly by using the following exemplary calls:
```python
m = read_masses("average_mass.csv")
print(m["N"])
#> 14.007
print(m["S"])
#> 32.06
print(2 * m["H"] + m["O"])
#> 18.015
```
:::{.callout-tip collapse="true"}
- You will have to [open](#how-to-read-a-file-line-by-line) the file and read its contents.
- The regex for parsing each line will contain two [groups](#pattern-objects) that capture the element and weight, respectively.
- Combine your knowledge on [counter variables](python2.qmd#example-counter) and [dicts](python2.qmd#dictionaries).
:::
### Exercise 3.4 (3 P)
Implement a function `calculate_mass`, which calculates the mass of a chemical compound. The function is called with a string that contains the molecular formula (e.g., `"C6 H12 O6"` or `"C Cl4"`) and returns the mass as a float.
Hints:
- Extract individual element symbols and their counts via a regular expression containing two groups. The first group should match one uppercase letter followed by an optional lowercase letter. The second group should match any number of digits.
Please use the following masses:
```python
MASSES = dict(
H=1.008, He=4.0026, Li=6.94, Be=9.0122, B=10.81,
C=12.011, N=14.007, O=15.999, F=18.998, Ne=20.18,
Na=22.99, Mg=24.305, Al=26.982, Si=28.085, P=30.974,
S=32.06, Cl=35.45, Ar=39.948, K=39.098, Ca=40.078
)
```
Store the function in `exercise_3_4.py`.
You may check that your program works correctly by using the following exemplary calls:
```python
print(calculate_mass("C6 H12 O6"))
#> 180.156
print(calculate_mass("H2 O"))
#> 18.015
print(calculate_mass("C34 H46 Cl N3 O10"))
#> 692.203
```
:::{.callout-tip collapse="true"}
- For designing the regex to extract individual elements and their counts, think about how they are written: Each element starts with one uppercase letter, followed by one optional lowercase letter. The count has an arbitrary number of digits (i.e., it may also be missing, which indicates a count of 1).
You will have to [open](#how-to-read-a-file-line-by-line) the file and read its contents.
- You will have to [iterate](#example-findall) over `findall()` results.
:::