-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathbash.qmd
executable file
·851 lines (605 loc) · 25.9 KB
/
bash.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
---
title: Bash exercises
execute:
enabled: false
---
## Connecting to the cluster (Day 1)
The command ssh lets you connect to the cluster. You will need VPN active to be able to connect if you are not using the university LAN or "eduroam" WLAN (when you are working from home).
```bash
ssh [username]@corso.came.sbg.ac.at
# For example:
ssh nfortelny@corso.came.sbg.ac.at
```
::: {.callout-tip}
You will see a number of different commands over the next days. The key commands are listed at the [bottom of this page](#commands).
:::
Next you will change your password. Make sure to remember this password, which you will you throughout this course.
```bash
passwd
```
Now we are connected to the cluster. In the following we will explore this file system. Details will be explored in the next lecture of this course. For now, just execute the commands and see if you can make sense of the results.
The cluster has a file system like any computer. Currently you are in your home directory. The current "path" is shown by the following command:
```bash
pwd
```
You can list the content of your home directory using
```bash
ls
ls -l
```
::: {.callout-tip}
Do not confuse the lowercase letter `l` (as used in `ls -l`) with the number `1` ("one").
:::
You can clear the terminal by typing
```bash
clear
```
Now, use the up- and down-arrows on your keyboard to look through the history of commands.
Now create a directory:
```bash
mkdir day1
```
List content of the directory again:
```bash
ls
ls -l # This shows you the home directory (and the day1 directory that is within the home directory)
ls -l day1/ # This shows you the content of the day1 directory
```
::: {.callout-tip}
Note: `folder` and `directory` mean the same thing. We will use `directory` from now on.
:::
What does `mkdir` do? Bring up the manual (press `q` to exit the manual).
```bash
man mkdir
```
Now create some files. There are many ways to create a file.
```bash
# Create an empty file in directory "day1"
touch day1/this_file_is_empty.txt
# Write the text "abcdefgh" in a file.
echo "abcdefgh" > day1/letters.txt
```
List files again.
```bash
ls
ls -l
ls -l day1/
```
Take a look at the file content.
```bash
head day1/*
```
## Files and file systems (Day 2)
### Navigate the file system
First, let's clean up.
```bash
clear
```
Go to the home directory (absolute path).
```bash
cd ~/
ls -l
pwd
```
::: {.callout-tip}
The "tilde" sign `~` is very useful. The names of special symbols are listed at the [bottom of this page](#special-symbols).
:::
Go to the directory `/home` (absolute path), which contains the home directories of all users.
```bash
cd /home
ls -l
pwd
```
Go back to your home directory (absolute path).
```bash
cd ~/
ls -l
pwd
```
Go into the directory created yesterday (relative path - from your home directory).
```bash
cd day1/
ls -l
pwd
```
Go back one level (relative path - from the `day1` directory).
```bash
cd ../
ls -l
pwd
```
Let's create another directory (relative path - from your home directory). What does the `-p` stand for?
```bash
pwd
mkdir -p day2/documents/texts
pwd
```
### Permissions
Change permissions in a file created in the above directory. We will use the `chmod` command (<https://en.wikipedia.org/wiki/Chmod>). This command uses three digits between 0 and 7 to set the permissions of (i) the user, (ii) the group, and (iii) everyone else. The digit defines the permission, as explained in the following table.
x | Permission | rwx | Binary
--- | --- | --- | ---
7 | read, write and execute | `rwx` | 111
6 | read and write | `rw-` | 110
5 | read and execute | `r-x` | 101
4 | read only | `r--` | 100
3 | write and execute | `-wx` | 011
2 | write only | `-w-` | 010
1 | execute only | `--x` | 001
0 | none | `---` | 000
Now create a test file. Then we will change permissions with `chmod` and look at them with `ls -l`. Rund these lines one by one and understand the output.
```bash
# Create the file
echo "Hello!" > day2/greeting.txt
ls -l day2
# Change permissions
chmod 711 day2/greeting.txt
ls -l day2
chmod 722 day2/greeting.txt
ls -l day2
chmod 733 day2/greeting.txt
ls -l day2
chmod 744 day2/greeting.txt
ls -l day2
chmod 755 day2/greeting.txt
ls -l day2
chmod 766 day2/greeting.txt
ls -l day2
chmod 777 day2/greeting.txt
ls -l day2
chmod 700 day2/greeting.txt
ls -l day2
man chmod
```
#### Exercises
- Create a new directory called `secrets` within in the directory `~/day2`.
- Enable full access to this directory for the owner of the directory, read access for the group, and no access for others
### Getting files
Make sure you are in your home directory.
```bash
pwd
cd ~/
# same as "cd " or "cd ~", since the default value
# for the dir argument is HOME (see "man cd")
pwd
```
Obtain the list of gRNAs.
```bash
# look at the man pages of the following commands – what is their purpose?
man mv
man cp
# now try it
mv /resources/bash/gRNAs.txt ~/ # this will not work
cp /resources/bash/gRNAs.txt ~/ # this works
# Why does mv not work? Look at file permissions:
ls -l /resources/bash/gRNAs.txt
ls -l ~/gRNAs.txt
```
Now explore the copied file.
```bash
head gRNAs.txt
head -5 gRNAs.txt
tail -5 gRNAs.txt
less gRNAs.txt
more gRNAs.txt
# Count the number of lines
wc -l gRNAs.txt
```
::: {.callout-tip}
You may be used to a workflow where you first copy a file (e.g., Ctrl+C), then go to the destination directory, and paste it there (e.g., Ctrl+V). By contrast, the copy command `cp` does both the copying *and* pasting.
:::
#### Exercises
Copy the file `gRNAs.txt` from your home directory into the directory `day2`. Then rename the copied file in this directory to `gRNAs_exercise.txt`.
### Editing in nano
Now add a line at the end of the file `gRNAs.txt` that is located in your home directory, adding the gRNA sequence "ACTGACTG". Use the `nano` editor for this purpose. To quit the nano-editor you need to press Ctrl+X. Then type y+Enter to save the changes. Nano commands are shown in the editor and can be found on the internet. A list is provided below.
Nano commands:
Command | Function
--- | ---
ctrl+r | read/insert file
ctrl+o | save file
ctrl+x | close file
alt+a | start selecting text
ctrl+k | cut selection
ctrl+u | uncut (paste) selection
alt+/ | go to end of the file
ctrl+a | go to start of the line
ctrl+e | go to end of the line
ctrl+c | show line number
ctrl+_ | go to line number
ctrl+w | find matching word
alt+w | find next match
ctrl+\ | find and replace
```bash
nano gRNAs.txt
```
Now use nano to modify the shell to make things prettier. To do so change the file `.bash_profile`. This file contains settings for each user (the naming is just by convention). It starts with a `.`, which for Linux means the file is hidden.
```bash
nano ~/.bash_profile
# This creates the file and also opens it in nano
```
Add the following lines to `.bash_profile` in nano, then exit the file and save it - see the commands above.
::: {.callout-tip}
In Windows PowerShell, copy/paste works best if you enable the option to do so via Ctrl+Shift+V, as shown below.

:::
```bash
if [ -x /usr/bin/dircolors ]; then
test -r ~/.dircolors && eval "$(dircolors -b ~/.dircolors)" || eval "$(dircolors -b)"
alias ls='ls --color=auto'
alias grep='grep --color=auto'
fi
```
The changes we added to `.bash_profile` will come into effect next time you log in. To also activate them for your current login, you can `source` the file, executing the commands stored within.
```bash
ls -l *
source ~/.bash_profile
ls -l *
```
Also, notice the difference in `ls` commands to show hidden files (like `.bash_profile`).
```bash
ls -l
ls -al
```
::: {.callout-tip}
If your grade sheet does not show that the content of .bash_profile is correct but it still works, then leave it. There may be a small difference in between your version and the expected one that does not impact the functionality.
:::
### Zipped files
Now let's download all human gene sequences from Ensembl. Download the file to your home directory.
```bash
man wget
wget http://ftp.ensembl.org/pub/release-103/fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz
```
If the above fails, then you can also copy the file from the resources directory into you home directory.
``` bash
cp /resources/bash/Homo_sapiens.GRCh38.cds.all.fa.gz ~/
```
To make sure you have the entire file properly downloaded, compare the MD5 hash of the file. MD5 hash functions are a compact digital fingerprint of a file. The MD5 hash of the file should be `b16d46bf09c3b8b7909624f1e6c414ce`.
``` bash
md5sum ~/Homo_sapiens.GRCh38.cds.all.fa.gz
md5sum /resources/bash/Homo_sapiens.GRCh38.cds.all.fa.gz
```
Use the `du` command to determine the file size.
```bash
du Homo_sapiens.GRCh38.cds.all.fa.gz
# the -h argument displays the file size in a human-readable format
du -h Homo_sapiens.GRCh38.cds.all.fa.gz
```
Have a look at this file.
```bash
head Homo_sapiens.GRCh38.cds.all.fa.gz
```
This doesn't look great. Remember to clean up your terminal.
```bash
clear
```
The above file is zipped. Now unzip it.
```bash
gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz
# This command will run through the entire file which is very long.
# Press Ctrl+C to stop the command.
man gunzip
# -c --stdout --to-stdout
# Write output on standard output; keep original files unchanged.
# If there are several input files, the output consists of a sequence
# of independently compressed members. To obtain better compression,
# concatenate all input files before compressing them.
```
Again, remember to clean up your terminal.
```bash
clear
```
Can we use `head` on the unzipped output? Yes - this is done using a *pipe*.
### Pipes
Linux pipes enables you to pass the output of one command to another command.
Pipe command | Function
--- | ---
`cmd < file` | use `file` as input for command `cmd`
`cmd > file` | write output to `file`
`cmd >> file` | append output to `file`
`cmd 2> stderr` | write error output to `file`
`cmd &> file` | send output and error to `file`
`cmd1 | cmd2` | send output of `cmd1` to `cmd2`
Let's have a look at the first few lines of this file.
```bash
gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | head
```
Some programs let you look at decompressed output, for example
```bash
zless Homo_sapiens.GRCh38.cds.all.fa.gz
# very similar to:
gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | zless
```
Now we can also count the number of lines in this file:
```bash
gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | wc -l
```
#### Exercises
Place the following files into the directory `~/day2`, using pipes:
- Store the number of lines of `Homo_sapiens.GRCh38.cds.all.fa.gz` into the file `lineNumber.txt`.
- Write the first 15 lines of `Homo_sapiens.GRCh38.cds.all.fa.gz` into the file `lines1.txt`.
- Write the 31th to 35th line of `Homo_sapiens.GRCh38.cds.all.fa.gz` into the file `lines2.txt`.
- Store the size of `Homo_sapiens.GRCh38.cds.all.fa.gz` in Megabytes into the file `size.txt`.
## Patterns and regular expressions (Day 3)
### File pattern matches
Commands can be executed on multiple files at the same time using pattern matches (we have used this already above).
```bash
ls -l
ls -l *
ls -l day*
wc -l *
wc -l *.gz
```
Description | Pattern
--- | ---
Match zero or more characters | `*`
Match any single character | `?`
Match any of the characters in a set | `[...]`
Match zero or one occurrences of the patterns (extglob) | `?`(patterns)
Match zero or more occurrences of the patterns (extglob) | `*`(patterns)
Match one or more occurrences of the patterns (extglob) | `+`(patterns)
Match one occurrence of the patterns (extglob) | `@`(patterns)
Match anything that doesn't match one of the patterns (extglob) | `!`(patterns)
### Simple regular expressions
Regular expressions can be executed on file names but also content within files. Below is the example from the PDF presented during the lecture.
::: {.callout-tip}
The file `sample` needs to be copied to your home directory from the directory `/resources/bash`.
:::
```bash
cat sample
grep ^a sample
grep -E p\{2} sample
grep "a\+t" sample
sed "s/a/XXX/g" sample
```
Descriptions | Symbol
--- | ---
replaces any character | `.`
matches start of string | `^`
matches end of string | `$`
matches up zero or more times the preceding character | `*`
Represent special characters | `\`
Groups regular expressions | `()`
Matches up exactly one character | `?`
### Checking the nucleotides in the Ensembl fasta file
Now let's use regulare expressions to explore the fasta file. Have a look at the first few lines.
```bash
gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | head -50
```
We expect all DNA sequences to be made up of A, C, T, and Gs. Now, we will verify this. First, we will get all DNA-sequences from this file, excluding lines starting with a `>` (in a FASTA files, these lines are the lines describing the sequence; see <https://en.wikipedia.org/wiki/FASTA_format>).
```bash
gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | head -500 | grep -v ">"
# from "man grep":
# -v Invert the sense of matching, to select non-matching lines.
```
Next we will use `tr` to delete newline characters. This will place all sequences in one (very long) line.
```bash
gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | head -500 | grep -v ">" | tr -d '\n'
```
Next we place every character on a different line.
```bash
gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | head -500 | grep -v ">" | tr -d '\n' | grep -o .
```
Finally, we want to remove repetition, to only see the unique nucleotides. To do so, we have to first sort the lines and then make them unique.
```bash
gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | head -500 | grep -v ">" | tr -d '\n' | grep -o . | sort | uniq
# what happens if we only make them unique?
gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | head -500 | grep -v ">" | tr -d '\n' | grep -o . | head -20
gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | head -500 | grep -v ">" | tr -d '\n' | grep -o . | head -20 | uniq
gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | head -500 | grep -v ">" | tr -d '\n' | grep -o . | wc -l
gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | head -500 | grep -v ">" | tr -d '\n' | grep -o . | uniq | wc -l
gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | head -500 | grep -v ">" | tr -d '\n' | grep -o . | sort | uniq | wc -l
```
#### Exercises
- Create a directory `day3` in your home directory.
- Count the number of A, C, T, and G of the DNA sequences of the first 1000 lines of file `Homo_sapiens.GRCh38.cds.all.fa.gz`.
- Store each number in a file called `nt_A.txt`, `nt_C.txt`,... in the directory `day3`.
### Reformatting the Ensembl fasta file
Next, we will reformat the fasta file, such that every entry is on one line and the DNA sequence is at the end of the line.
Remind ourselves of how this file looks like:
```bash
gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | head -500
```
In the original format, the `>` separates different entries and sequences are in the lines below the line with `>`. First, we will remove newline characters that separate the sequences. To remove the newline characters, we will use the `sed` command.
```bash
gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | head -500 | sed -zE 's/([ACTG])\n([ACTG])/\1\2/g'
```
Let breack this up:
- `s///` tells sed to substitute.
- `s///g` tells sed to substitue globally - replacing each occurance of `x`.
- `s/x/y/g` means we substitute `x` by `y`.
- `([ACTG])` matches any of the four letters A, C, T, and G. The brackets `()` tell the regex to *store* the match for later (see `\1` and `\2` below). If we want to use this *storing* we need to use `sed -E`.
- `\n` matches the newline.
- `\1` is going to be replaced by the first match within the first brackets `()`. Therefore, here it will be replaced by the nucleotide before the new line.
- `\2` is going to be replaced by the second match within the second brackets `()`. Therefore, here it will be replaced by the nucleotide after the new line.
- So ultimately the nulceotide before and after the newline are stored, and then they are written again but without the newline.
Next, we will remove the newline character that separates the sequence from the line with the entry information, which starts with `>`, and replace it with ` seq:`.
```bash
gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | head -500 | sed -zE 's/([ACTG])\n([ACTG])/\1\2/g' | sed -z 's/\n[^>]/ seq:/g'
```
Explanation:
- `[^>]` means to NOT match `>`.
- `\n[^>]` replaces all newline characters that are NOT before a `>`.
We developed the above approach on the first lines. Now run it an the full file. This may take a couple of minutes. The `gzip` command will compress the results. The `>` will store it in a new file `GRCh38_reformatted.gz`. The reformatted file should end up in your home directory.
```bash
gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | sed -zE 's/([ACTG])\n([ACTG])/\1\2/g' | sed -z 's/\n[^>]/ seq:/g' | gzip > GRCh38_reformatted.gz
```
Explore the generated file.
```bash
gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | wc -l
gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | grep ">" | wc -l
gunzip -c GRCh38_reformatted.gz | wc -l
zless GRCh38_reformatted.gz
```
### Identifying gRNA matches
Now, with the newly formatted file, we can easily identify genes in the database that match a specific gRNA.
```bash
gunzip -c GRCh38_reformatted.gz | head -30
```
We check that all entries have gene symbols:
```bash
gunzip -c GRCh38_reformatted.gz | grep "gene_symbol" | head -30
gunzip -c GRCh38_reformatted.gz | grep -v "gene_symbol" | head -30
# remember: grep -v means to *not* match
```
Next we extract those sequences matching the guide "TTAAGACA":
```bash
gunzip -c GRCh38_reformatted.gz | grep "seq:[ACTG]*TTAAGACA" | head -30
```
Now we extract the gene symbols of these matches. First we match all text (`.*`) from the beginning of the line (`^`) before the text `gene_symbol` and delete it.
```bash
gunzip -c GRCh38_reformatted.gz | grep "seq:[ACTG]*TTAAGACA" | sed 's/^.*gene_symbol://g' | head -30
```
Then we also remove all text after the gene symbol, in this case the space and everything (`.*`) after that until the end of the line (`$`).
```bash
gunzip -c GRCh38_reformatted.gz | grep "seq:[ACTG]*TTAAGACA" | sed 's/^.*gene_symbol://g' | sed 's/ .*$//g' | head -30
```
Finally, we get unique gene IDs.
```bash
gunzip -c GRCh38_reformatted.gz | grep "seq:[ACTG]*TTAAGACA" | sed 's/^.*gene_symbol://g' | sed 's/ .*$//g' | sort | uniq
```
#### Exercises
Create the following files in the directory `day3`:
- Write all entries with sequences matching the guide "GCGGTTTC" into the file `guideMatch_GCGGTTTC.txt`.
- Write the count of entries with sequences starting with "TGC" into the file `count_TGC.txt`.
- Write the count of entries of gene "MMP2" into the file `count_MMP2.txt`. Note: Do not count genes MMP20, MMP21,...
- Write the count of unique genes whose symbol starts with "RPL" into the file `count_RPL.txt`.
- Write the count of all unique protein coding genes into the file `count_protein_coding.txt`.
## Loops and variables (Day 4)
### Variables
Variables can hold information which can be passed to different programs.
```bash
x="test variable"
echo $x
echo $x | sed 's/a//g'
```
Now we will write the gRNAs to test and the regexp patterns into variables.
```bash
guide=TTAAGACA
echo $guide
pattern="seq:[ACTG]*${guide}"
echo $pattern
gunzip -c GRCh38_reformatted.gz | grep $pattern | head -30
gunzip -c GRCh38_reformatted.gz | grep $pattern | sed 's/^.*gene_symbol://g' | sed 's/ .*$//g' | head -30
```
### Loops
We will use `while` loops. They have the following syntax:
```bash
while [condition] do [something] done
```
Different types of loops exist (<https://ryanstutorials.net/bash-scripting-tutorial/bash-loops.php>), but those are not important for this course.
The following loop starts with x = 1. It then repeats printing "Welcome ... times" and adding +1 to x, until x is greater than 5 (while x is lower or equal to 5 `$x -le 5`).
```bash
x=1
while [ $x -le 5 ]
do
echo "Welcome $x times"
x=$(( $x + 1 )) # syntax to add a number to x
done
```
Importantly, the loop is ONE construct. Thus when entering `while [ $x -le 5 ]`, the terminal expects additional information, in particular `do [...] done`. While waiting for additional input, the terminal will start lines with `>`. Keep entering the `do`, etc in those lines. The final results will look like this:
```
[user]@corso:~$ while [ $x -le 5 ]
> do
> echo "Welcome $x times"
> x=$(( $x + 1 ))
> done
Welcome 1 times
Welcome 2 times
Welcome 3 times
Welcome 4 times
Welcome 5 times
```
Try to make the above loop count to 10. Or from 10 to 5.
Here are different ways to compare numbers. Assume variable `a` holds 10 and variable `b` holds 20 then:
Operator | Description | Example
--- | --- | ---
`-eq` | Checks if the value of two operands are equal or not; if yes, then the condition becomes true. | `[ $a -eq $b ]` is not true.
`-ne` | Checks if the value of two operands are equal or not; if values are not equal, then the condition becomes true. | `[ $a -ne $b ]` is true.
`-gt` | Checks if the value of left operand is greater than the value of right operand; if yes, then the condition becomes true. | `[ $a -gt $b ]` is not true.
`-lt` | Checks if the value of left operand is less than the value of right operand; if yes, then the condition becomes true. | `[ $a -lt $b ]` is true.
`-ge` | Checks if the value of left operand is greater than or equal to the value of right operand; if yes, then the condition becomes true. | `[ $a -ge $b ]` is not true.
`-le` | Checks if the value of left operand is less than or equal to the value of right operand; if yes, then the condition becomes true. | `[ $a -le $b ]` is true.
You can also use nano to write the code above into a file (for example here: `script.sh`) and then use the following to execute it:
```bash
bash script.sh
```
We can also loop through the content of a file.
```bash
while read p; do
echo "$p"
done < gRNAs.txt
```
#### Exercise
Now we will combine the above variables and loopes to test all guides in file `gRNAs.txt` against all DNA sequences in `GRCh38_reformatted.gz`. To do so you need the following steps (also see the hints below!!!):
- Use the code above (under [Variables](#variables)) that matches guides against DNA sequences and extracts gene symbols for matched sequences.
- Place this function into the loop that iterates through the file `gRNAs.txt`, using the while loop shown above that iterates through a file.
- Store all matches (not just the top 30 coming from `head -30`) of each guide into a file that is named `results_${guide}.txt`. Place all files into a new directory `day4` - ideally already within the loop.
- Double-check the results by hand (no points).
- How many files did you just create?
- Look at one of the created files. Is the content what you would expect?
- When you run it with only one guide (without the loop) and compare to the results generated within the loop: Do you get the right genes for this guide? Do you get the correct number of genes for this guide?
::: {.callout-tip}
Hint: To complete the exercise, you can use the following code to extract genes for one guide. In this code, the guide needs to be defined first using `guide=[...]`. However, when placing this code block inside of the loop, you don't define the guide - it is defined by the loop.
:::
```bash
echo $guide
pattern="seq:[ACTG]*${guide}"
echo $pattern
gunzip -c GRCh38_reformatted.gz | grep $pattern | sed 's/^.*gene_symbol://g' | sed 's/ .*$//g' | sort | uniq
```
Place the above in the loop that iterates throught the gRNA file.
```bash
while read guide; do
echo $guide
done < gRNAs.txt
```
::: {.callout-tip}
Remember: you need to save the result of the last line (`gunzip ... | uniq`) to a file `results_${guide}.txt`.
:::
### Comparing files
Have a look at the files we generated.
```bash
cd ~/day4
ls results_*.txt
ls -l results_*.txt
head results_*.txt
wc -l results_*.txt
```
Now let's compare results in a pairwise manner. The command `comm` will provide the gene symbols unique to one or the other file, plus the intersection of both.
```bash
cd ~/day4
comm results_AAGTTGGC.txt results_GCCATACA.txt
comm -12 results_AAGTTGGC.txt results_GCCATACA.txt
```
The count of genes in the overlap (intersection) can be stored in a new variable. To store results of an expression in a new variable, place the expression into parenthesis `x=$()`.
```bash
cd ~/
guide1=AAGTTGGC
guide2=GCCATACA
overlap=$(comm -12 day4/results_${guide1}.txt day4/results_${guide2}.txt | wc -l)
echo "$guide1 $guide2 $overlap"
```
#### Exercise
Now we will compare each pair of guides to test the overlap of genes. Place the result in the directory `day4`:
- Use one loop (that iterates through all guides) within a second loop (that also iterates through all guides) to compare the overlap between all pairs of guides (thus, for simplicity, we will compare all pairs of guides also the overlap of each guide with itself).
- Use the `comm` command as shown above to extract the overlap, counting the number of genes, and writing the result into a file named `overlap_${guide1}_${guide2}.txt`.
- "Manually" check results (no points):
- How many files did you create? How many did you expect to create?
- Look at one of the created files. Is the content what you would expect?
- Compare individual examples (for example "AAGTTGGC" vs "GCCATACA"). Did you get the right overlap?
## Special symbols
- The `(` is called a `parenthesis`
- The `[` is called a `squared bracket` (sometimes only `bracket`)
- The `{` is called a `curly bracket` (sometimes only `brace`)
- The `~` is called a `tilde`
- The `*` is called an `asterisk`
## Useful links
Linux for bioinformatics
- <https://bioinformaticsworkbook.org/Appendix/Unix/unix-basics-1.html#gsc.tab=0>
- <https://bioinformaticsworkbook.org/Appendix/Unix/UnixCheatSheet.html#gsc.tab=0>
- <https://bioinformatics.uconn.edu/unix-basics/#>
- <https://decodebiology.github.io/bioinfotutorials/>
- <https://www.melbournebioinformatics.org.au/tutorials/tutorials/unix/unix/>
General linux commands
- <https://www.thegeekstuff.com/2010/11/50-linux-commands/>
Regular expressions:
- <https://www.guru99.com/linux-regular-expressions.html>