# Getting started with R and RStudio {#chap1}
Although R is not new, its popularity has increased rapidly over the last 10 years or so (see [here][r4stats] for some interesting data). It was originally created and developed by **R**oss Ihaka and **R**obert Gentleman during the 1990s, with the first stable version released in 2000. Nowadays R is maintained by the [R Development Core Team][cran-core]. So, why has R become so popular and why should you learn how to use it? Some reasons include:
- R is open source and freely available.
- R is available for Windows, Mac and Linux operating systems.
- R has an extensive and coherent set of tools for statistical analysis.
- R has an extensive and highly flexible graphical facility capable of producing publication quality figures.
- R has an expanding set of freely available ‘packages’ to extend R's capabilities.
- R has an extensive support network with numerous online and freely available documents.
All of the reasons above are great reasons to use R. However, in our opinion, the single biggest reason to use R is that it facilitates robust and reproducible research practices. In contrast to more traditional 'point and click' software, writing code to perform your analysis ensures you have a permanent and accurate record of all the methods you used (and decisions you made) whilst analysing your data. You are then able to share this code (and your data) with other researchers / colleagues / journal reviewers who will be able to reproduce your analysis exactly. This is one of the tenets of [open science][open-sci]. We will cover other topics to facilitate open science throughout this book, including creating [reproducible reports](#rmarkdown_r) and [version control](#github_r).
In this Chapter we'll show you how to download and install R and RStudio on your computer, give you a brief RStudio orientation (including working with RStudio Projects), show you how to install and work with R packages to extend R's capabilities, suggest some good habits to get into when working on projects and finally offer some advice on documenting your workflow and writing nice readable R code.
## Installing R {#install_r}
To get up and running the first thing you need to do is install R. R is freely available for Windows, Mac and Linux operating systems from the [Comprehensive R Archive Network (CRAN) website][cran]. For Windows and Mac users we suggest you download and install the pre-compiled binary versions.
\
```{block2, vid-text1, type='rmdvideo'}
See this [video][install-vid] for step-by-step instructions on how to download and install R and RStudio
```
### Windows users
For Windows users select the 'Download R for Windows' link and then click on the 'base' link and finally the download link `r paste("'", 'Download R ', newest.vers, ' for Windows', "'", sep = "")`. This will begin the download of the '.exe' installation file. When the download has completed double click on the R executable file and follow the on-screen instructions. Full installation instructions can be found at the [CRAN website][cran-windows].
### Mac users
For Mac users select the '[Download R for (Mac) OS X][cran-mac]' link. The binary can be downloaded by selecting the `r paste("'", 'R-', newest.vers, '.pkg', "'", sep = "")`. Once downloaded, double click on the file icon and follow the on-screen instructions to guide you through the necessary steps. See the '[R for Mac OS X FAQ][cran-mac-faq]' for further information on installation.
### Linux users
For Linux users, the installation method will depend on which flavour of Linux you are using. There are reasonably comprehensive instructions [here][cran-linux] for Debian, Redhat, Suse and Ubuntu. In most cases you can just use your OS package manager to install R from the official repository. On Ubuntu fire up a shell (Terminal) and use (you will need root permission to do this):
```{block}
sudo apt update
sudo apt install r-base r-base-dev
```
which will install base R and also the tools and header files needed to build R packages (you only need `r-base-dev` if you want to compile R packages from source but it doesn't hurt to have it).
If you receive an error after running the code above you may need to add a 'source.list' entry to your /etc/apt/sources.list file. To do this open the /etc/apt/sources.list file in your favourite text editor (gedit, vim, nano etc) and add the following line (you will need root permission to do this):
```{block }
deb https://cloud.r-project.org/bin/linux/ubuntu disco-cran35/
```
This is the source.list for the latest version of Ubuntu (19.04 Disco Dingo at the time of writing). If you're using an earlier version of Ubuntu then replace the source.list entry with the one which corresponds to the version of Ubuntu you are using (see [here][cran-ubuntu] for an up to date list). Once you have done this then re-run the apt commands above and you should be good to go.
### Testing R
Whichever operating system you're using, once you have installed R you need to check it's working properly. The easiest way to do this is to start R by double clicking on the R icon (Windows or Mac) or by typing `R` into the Terminal (Linux). You should see the R Console and you should be able to type R commands into the Console after the command prompt `>`. Try typing the following R code and then press enter (don't worry if you don't understand this - we're just checking if R works)
```{r, out.width="75%", fig.align="center"}
plot(1:10)
```
A plot of the numbers 1 to 10 on both the x and y axes should appear. If you see this, you're good to go. If not then we suggest you make a note of any errors produced and then use Google to troubleshoot.
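Once the plot appears, it can also be handy to confirm which version of R you are running, as some packages require a minimum version. The following is a minimal sketch using only base R; the version number `3.5.0` is just an example threshold chosen for illustration.

```{r, echo=TRUE, eval=FALSE}
# Print the version string of the running R installation
R.version.string

# getRversion() returns the version in a form that can be compared;
# "3.5.0" is an arbitrary example threshold
getRversion() >= "3.5.0"

# sessionInfo() summarises the R version, operating system and loaded packages
sessionInfo()
```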
\
## Installing RStudio
Whilst it's eminently possible to just use the base installation of R (many people do), we will be using a popular **I**ntegrated **D**evelopment **E**nvironment (IDE) called RStudio. RStudio can be thought of as an add-on to R which provides a more user-friendly interface, incorporating the R Console, a script editor and other useful functionality (like R markdown and GitHub integration). You can find more information about [RStudio here][rstudio].
RStudio is freely available for Windows, Mac and Linux operating systems and can be downloaded from the [RStudio site][rstudio-download]. You should select the 'RStudio Desktop' version. Note: you must install R before you install RStudio (see [previous section](#install_r) for details).
\
```{block2, vid-text2, type='rmdvideo'}
See this [video][install-vid] for step-by-step instructions on how to download and install R and RStudio
```
### Windows and Mac users
For Windows and Mac users you should be presented with the appropriate link for downloading. Click on this link and once downloaded run the installer and follow the instructions. If you don't see the link then scroll down to the 'All Installers' section and choose the link manually.
### Linux users
For Linux users scroll down to the 'All Installers' section and choose the appropriate link to download the binary for your Linux operating system. RStudio for Ubuntu (and Debian) is available as a `*.deb` package. The easiest way to install `deb` files on Ubuntu is by using the `gdebi` command. If `gdebi` is not available on your system you can install it by using the following command in the Terminal (you will need root permission to do this)
```{block}
sudo apt update
sudo apt install gdebi-core
```
To install the `*.deb` file navigate to where you downloaded the file and then enter the following command with root permission
```{block}
sudo gdebi rstudio-xenial-1.2.5XXX-amd64.deb
```
where '-1.2.5XXX' is the current version for Ubuntu (rstudio-xenial-1.2.5019-amd64.deb at the time of writing). You can then start RStudio from the Console by simply typing
```{block}
rstudio
```
or you can create a shortcut on your Desktop for easy startup.
### Testing RStudio
Once installed, you can check everything is working by starting up RStudio (you don't need to start R as well, just RStudio). You should see something like the image below (if you're on a Windows or Linux computer there may be small cosmetic differences).
\
```{r rstudio, echo=FALSE, out.width="75%", fig.align="center"}
knitr::include_graphics(path = "images/rstudio.png")
```
## RStudio orientation {#rstudio_orient}
When you open RStudio for the first time you should see the following layout (it might look slightly different on a Windows computer).
\
```{r rstudio_start, echo=FALSE, out.width="75%", fig.align="center"}
knitr::include_graphics(path = "images/rstudio.png")
```
\
The large window (aka pane) on the left is the **Console** window. The window on the top right is the **Environment / History / Connections** pane and the bottom right window is the **Files / Plots / Packages / Help / Viewer** window. We will discuss each of these panes in turn below. You can customise the location of each pane by clicking on the 'Tools' menu then selecting Global Options --> Pane Layout. You can resize the panes by clicking and dragging the middle of the window borders in the direction you want. There are a plethora of other ways to [customise][rstudio-cutomise] RStudio.
\
```{block2, vid-text3, type='rmdvideo'}
See this [video][rstudio-vid] for a quick introduction to RStudio
```
### Console {#cons}
The Console is the workhorse of R. This is where R evaluates all the code you write. You can type R code directly into the Console at the command line prompt, `>`. For example, if you type `2 + 2` into the Console you should obtain the answer `4` (reassuringly). Don't worry about the `[1]` at the start of the line for now.
\
```{r rstudio_console, echo=FALSE, out.width="75%", fig.align="center"}
knitr::include_graphics(path = "images/rconsole_eval.png")
```
\
However, once you start writing more R code this becomes rather cumbersome. Instead of typing R code directly into the Console a better approach is to create an R script. An R script is just a plain text file with a `.R` file extension which contains your lines of R code. These lines of code are then sourced into the R Console line by line. To create a new R script click on the 'File' menu then select New File --> R Script.
\
```{r rstudio_newscript, echo=FALSE, out.width="75%", fig.align="center"}
knitr::include_graphics(path = "images/new_script.png")
```
\
Notice that you have a new window (called the Source pane) in the top left of RStudio and the Console is now in the bottom left position. The new window is a script editor and where you will write your code.
\
```{r rstudio_new, echo=FALSE, out.width="75%", fig.align="center"}
knitr::include_graphics(path = "images/rstudio_new.png")
```
\
To source your code from your script editor to the Console simply place your cursor on the line of code and then click on the 'Run' button in the top right of the script editor pane.
\
```{r rstudio_run, echo=FALSE, out.width="75%", fig.align="center"}
knitr::include_graphics(path = "images/rstudio_run.png")
```
\
You should see the result in the Console window. If clicking on the 'Run' button starts to become tiresome you can use the keyboard shortcut 'ctrl + enter' (on Windows and Linux) or 'cmd + enter' (on Mac). You can save your R scripts as a `.R` file by selecting the 'File' menu and clicking on 'Save'. Notice that the file name in the tab turns red whenever you have unsaved changes. To open your R script in RStudio select the 'File' menu and then 'Open File...'. Finally, it's worth noting that although R scripts are saved with a `.R` extension they are actually just plain text files which can be opened with any text editor.
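Besides running a script line by line, a whole script can be run in one go with the base R function `source()`. The sketch below writes a tiny two-line script to a temporary file purely for illustration (the object name `x` and the use of `tempfile()` are our own choices) and then sources it.

```{r, echo=TRUE, eval=FALSE}
# Write a two-line script to a temporary file (tempfile() is used here
# only so that the example does not clutter your working directory)
script_path <- tempfile(fileext = ".R")
writeLines(c("x <- 2 + 2", "print(x)"), script_path)

# source() evaluates every line of the script in order
source(script_path)
```

In practice you would pass `source()` the path to a script you have saved yourself.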
### Environment/History/Connections
The Environment / History / Connections window shows you lots of useful information. You can access each component by clicking on the appropriate tab in the pane.
- The 'Environment' tab displays all the objects you have created in the current (global) environment. These objects can be things like data you have imported or functions you have written. Objects can be displayed as a List or in Grid format by selecting your choice from the drop down button on the top right of the window. If you're in the Grid format you can remove objects from the environment by placing a tick in the empty box next to the object name and then click on the broom icon. There's also an 'Import Dataset' button which will import data saved in a variety of file formats. However, we would suggest that you don't use this approach to import your data as it's not reproducible and therefore not robust (see [Chapter 3](#data_r) for more details).
- The 'History' tab contains a list of all the commands you have entered into the R Console. You can search back through your history for a line of code you have forgotten and send selected code back to the Console or Source window. We rarely use this as we always refer back to our R script.
- The 'Connections' tab allows you to connect to various data sources such as external databases.
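What the 'Environment' tab shows can also be inspected and tidied up from the Console. As a small sketch (the object names below are arbitrary examples), `ls()` lists the objects in the current environment and `rm()` removes them.

```{r, echo=TRUE, eval=FALSE}
# Create two example objects (the names are arbitrary)
my_numbers <- c(1, 5, 10)
my_text <- "hello"

# ls() lists the names of the objects in the current environment
ls()

# rm() removes the named object, so it disappears from the Environment tab
rm(my_text)
ls()
```

Because these commands can be saved in a script, they leave a record of what was removed, which clicking on the broom icon does not.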
### Files/Plots/Packages/Help/Viewer
- The 'Files' tab lists all external files and directories in the current working directory on your computer. It works like file explorer (Windows) or Finder (Mac). You can open, copy, rename, move and delete files listed in the window.
- The 'Plots' tab is where all the plots you create in R are displayed (unless you tell R otherwise). You can 'zoom' into the plots to make them larger using the magnifying glass button, and scroll back through previously created plots using the arrow buttons. There is also the option of exporting plots to an external file using the 'Export' drop down menu. Plots can be exported in various file formats such as jpeg, png, pdf, tiff or copied to the clipboard (although you are probably better off using the appropriate R functions to do this - see [Chapter 4](#graphics_base_r) for more details).
- The 'Packages' tab lists all of the packages that you have installed on your computer. You can also install new packages and update existing packages by clicking on the 'Install' and 'Update' buttons respectively.
- The 'Help' tab displays the R help documentation for any function. We will go over how to view the help files and how to search for help in [Chapter 2](#basics_r).
- The 'Viewer' tab displays local web content such as web graphics generated by some packages.
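As mentioned for the 'Plots' tab, plots can be exported with R functions rather than the 'Export' menu. A minimal sketch using the base `pdf()` graphics device follows; the file name and the temporary directory are placeholders chosen for this example.

```{r, echo=TRUE, eval=FALSE}
# Open a pdf graphics device; the file name here is just an example
plot_file <- file.path(tempdir(), "my_plot.pdf")
pdf(file = plot_file, width = 7, height = 5)

# Anything plotted now is written to the file instead of the Plots tab
plot(1:10)

# Close the device to finish writing the file
dev.off()
```

Keeping the export step in your script means the figure can be recreated exactly by rerunning the code.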
## Alternatives to RStudio
Although RStudio is becoming increasingly popular it might not be the best choice for everyone and you certainly don't have to use it to use R effectively. Rather than using an 'all in one' IDE many people choose to use R and a separate script editor to write and execute R code. If you're not familiar with what a script editor is, you can think of it as a bit like a word processor but specifically designed for writing code. Happily, there are many script editors freely available so feel free to download and experiment until you find one you like. Some script editors are only available for certain operating systems and not all are specific to R. Suggestions for script editors are provided below. Which one you choose is up to you: one of the great things about R is that *YOU* get to choose how you want to use R.
### Advanced text editors
A lightweight yet efficient way to write your R scripts is to use an advanced text editor such as:
- [Atom][atom] (all operating systems)
- [BBedit][BBedit] (Mac OS)
- [gedit][gedit] (Linux; comes with most Linux distributions)
- [MacVim][macvim] (Mac OS)
- [Nano][nano] (Linux)
- [Notepad++][notepad] (Windows)
- [Sublime Text][sublime] (all operating systems)
### Integrated development environments
These environments are more powerful than simple text editors, and are similar to RStudio:
- [Emacs][emacs] and its extension [Emacs Speaks Statistics][ess] (all operating systems)
- [RKWard][rkward] (Linux)
- [Tinn-R][tinn-r] (Windows)
- [vim][vim] and its extension [NVim-R][nvim-r] (Linux)
## R packages {#packages}
The base installation of R comes with many useful packages as standard. These packages will contain many of the functions you will use on a daily basis. However, as you start using R for more diverse projects (and as your own use of R evolves) you will find that there comes a time when you will need to extend R's capabilities. Happily, many thousands of R users have developed useful code and shared this code as installable packages. You can think of a package as a collection of functions, data and help files collated into a well defined standard structure which you can download and install in R. These packages can be downloaded from a variety of sources but the most popular are [CRAN][cran-packages], [Bioconductor][bioconductor] and [GitHub][github]. Currently, CRAN hosts over 15000 packages and is the official repository for user contributed R packages. Bioconductor provides open source software oriented towards bioinformatics and hosts over 1800 R packages. GitHub is a website that hosts git repositories for all sorts of software and projects (not just R). Often, cutting edge development versions of R packages are hosted on GitHub so if you need all the new bells and whistles then this may be an option. However, a potential downside of using the development version of an R package is that it might not be as stable as the version hosted on CRAN (it's in development!) and updating packages won't be automatic.
### CRAN packages {#cran_packages}
```{block2, vid-text4, type='rmdvideo'}
See this [video][pack-vid] for step-by-step instructions on how to install, use and update packages from CRAN
```
\
To install a package from CRAN you can use the `install.packages()`\index{install.packages()} function. For example if you want to install the `remotes`\index{remotes package} package enter the following code into the Console window of RStudio (note: you will need a working internet connection to do this)
```{r, echo = TRUE, eval=FALSE}
install.packages('remotes', dependencies = TRUE)
```
If you are asked to select a CRAN mirror, just select '0-cloud' or a mirror near your location. The `dependencies = TRUE` argument ensures that additional packages that are required will also be installed.
It's good practice to occasionally update your previously installed packages to get access to new functionality and bug fixes. To update CRAN packages you can use the `update.packages()`\index{update.packages()} function (you will need a working internet connection for this).
```{r, echo = TRUE, eval=FALSE}
update.packages(ask = FALSE)
```
The `ask = FALSE` argument avoids having to confirm every package download which can be a pain if you have many packages installed.
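Before installing or updating anything, it can help to see what is already in your library. The sketch below uses `installed.packages()` and `packageVersion()` from the `utils` package that ships with R; the package names checked are only examples.

```{r, echo=TRUE, eval=FALSE}
# Names of all packages currently installed in your R library
installed <- rownames(installed.packages())

# Check whether particular packages are installed (returns TRUE or FALSE for each)
c("stats", "remotes") %in% installed

# Report the installed version of a package
packageVersion("stats")
```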
### Bioconductor packages
To install packages from Bioconductor the process is a [little different][bioc-install]. You first need to install the `BiocManager` package. You only need to do this once unless you subsequently reinstall or upgrade R.
```{r, echo=TRUE, eval=FALSE}
install.packages('BiocManager', dependencies = TRUE)
```
Once the BiocManager package has been installed you can either install all of the 'core' Bioconductor packages with
```{r, echo=TRUE, eval=FALSE}
BiocManager::install()
```
or install specific packages such as the 'GenomicRanges' and 'edgeR' packages.
```{r, echo=TRUE, eval=FALSE}
BiocManager::install(c("GenomicRanges", "edgeR"))
```
To update Bioconductor packages just use the `BiocManager::install()` function again.
```{r, echo=TRUE, eval=FALSE}
BiocManager::install(ask = FALSE)
```
Again, you can use the `ask = FALSE` argument to avoid having to confirm every package download.
### GitHub packages
There are multiple options for installing packages hosted on GitHub. Perhaps the most efficient method is to use the `install_github()`\index{install\_github()} function from the `remotes` package (you installed this package [previously](#cran_packages)). Before you use the function you will need to know the GitHub username of the repository owner and also the name of the repository. For example, the development version of `dplyr` from Hadley Wickham is hosted on the tidyverse GitHub account and has the repository name 'dplyr' (just Google 'github dplyr'). To install this version from GitHub, use
```{r, echo=TRUE, eval=FALSE}
remotes::install_github('tidyverse/dplyr')
```
The safest way (that we know of) to update a package installed from GitHub is to just reinstall it using the above command.
### Using packages
Once you have installed a package onto your computer it is not immediately available for you to use. To use a package you first need to load the package by using the `library()`\index{library()} function. For example, to load the `remotes` package you previously installed, use
```{r, echo=TRUE, eval=FALSE}
library(remotes)
```
The `library()` function will also load any additional packages required and may print out additional package information. It is important to realise that every time you start a new R session (or restore a previously saved session) you need to load the packages you will be using. We tend to put all our `library()` statements required for our analysis near the top of our R scripts to make them easily accessible and easy to add to as our code develops. If you try to use a function without first loading the relevant R package you will receive an error message that R could not find the function. For example, if you try to use the `install_github()` function without loading the `remotes` package first you will receive the following error.
```{r, echo=TRUE, eval=FALSE}
install_github('tidyverse/dplyr')
# Error in install_github("tidyverse/dplyr") :
# could not find function "install_github"
```
Sometimes it can be useful to use a function without first using the `library()` function. If, for example, you will only be using one or two functions in your script and don't want to load all of the other functions in a package then you can access the function directly by specifying the package name followed by two colons and then the function name.
```{r, echo=TRUE, eval=FALSE}
remotes::install_github('tidyverse/dplyr')
```
This is how we were able to use the `install()` and `install_github()` functions [above][Bioconductor packages] without first loading the packages `BiocManager` and `remotes`. Most of the time we recommend using the `library()` function.
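The same `package::function` pattern can be tried without installing anything. As an illustrative sketch, the `tools` package ships with R but is not normally attached at startup, so its `file_ext()` function is a convenient example (our choice, not one the rest of this book depends on).

```{r, echo=TRUE, eval=FALSE}
# search() lists the packages currently attached with library();
# this checks whether 'tools' is among them
"package:tools" %in% search()

# Calling the function with the package prefix works without library()
tools::file_ext("mydata.txt")  # returns "txt"
```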
## Projects in RStudio {#rsprojs}
As with most things in life, when it comes to dealing with data and data analysis things are so much simpler if you're organised. Clear project organisation makes it easier for both you (especially the future you) and your collaborators to make sense of what you've done. There's nothing more frustrating than coming back to a project months (sometimes years) later and having to spend days (or weeks) figuring out where everything is, what you did and why you did it. A well documented project that has a consistent and logical structure increases the likelihood that you can pick up where you left off with minimal fuss no matter how much time has passed. In addition, it's much easier to write code to automate tasks when files are well organised and are sensibly named. This is even more relevant nowadays as it's never been easier to collect vast amounts of data which can be saved across 1000's or even 100,000's of separate data files. Lastly, having a well organised project reduces the risk of introducing bugs or errors into your workflow and if they do occur (which inevitably they will at some point), it makes it easier to track down these errors and deal with them efficiently.
Thankfully, there are some nice features in R and RStudio that make it quite easy to manage a project. There are also a few simple steps you can take right at the start of any project to help keep things shipshape.
A great way of keeping things organised is to use RStudio Projects. An RStudio Project keeps all of your R scripts, R markdown documents, R functions and data together in one place. The nice thing about RStudio Projects is that each project has its own directory, workspace, history and source documents so different analyses that you are working on are kept completely separate from each other. This means that you can have multiple instances of RStudio open at the same time (if that's your thing) or you can very easily switch between projects without fear of them interfering with each other.
\
```{block2, vid-text2b, type='rmdvideo'}
See this [video][rstudio-prog-vid] for step-by-step instructions on how to create and work with RStudio projects
```
\
To create a project, open RStudio and select `File` -> `New Project...` from the menu. You can create either an entirely new project, a project from an existing directory or a version controlled project (see the [GitHub Chapter](#github_r) for further details about this). In this Chapter we will create a project in a new directory.
\
```{r new_proj, echo=FALSE, out.width="60%", fig.align="center"}
knitr::include_graphics(path = "images/new_proj.png")
```
\
You can also create a new project by clicking on the 'Project' button in the top right of RStudio and selecting 'New Project...'
\
```{r new_proj1, echo=FALSE, out.width="30%", fig.align="center"}
knitr::include_graphics(path = "images/new_proj1.png")
```
\
In the next window select 'New Project'.
\
```{r new_proj2, echo=FALSE, out.width="60%", fig.align="center"}
knitr::include_graphics(path = "images/new_proj2.png")
```
\
Now enter the name of the directory you want to create in the 'Directory name:' field (we'll call it `first_project` for this Chapter). If you want to change the location of the directory on your computer click the 'Browse...' button and navigate to where you would like to create the directory. We always tick the 'Open in new session' box as well. Finally, hit the 'Create Project' button to create the new project.
\
```{r new_proj3, echo=FALSE, out.width="60%", fig.align="center"}
knitr::include_graphics(path = "images/new_proj3.png")
```
\
Once your new project has been created you will now have a new folder on your computer that contains an RStudio project file called `first_project.Rproj`. This `.Rproj` file contains various project options (but you shouldn't really interact with it) and can also be used as a shortcut for opening the project directly from the file system (just double click on it). You can check this out in the 'Files' tab in RStudio (or in Finder if you're on a Mac or File Explorer in Windows).
\
```{r new_proj4, echo=FALSE, out.width="80%", fig.align="center"}
knitr::include_graphics(path = "images/new_proj4.png")
```
\
The last thing we suggest you do is select `Tools` -> `Project Options...` from the menu. Click on the 'General' tab on the left hand side and then change the values for 'Restore .RData into workspace at startup' and 'Save workspace to .RData on exit' from 'Default' to 'No'. This ensures that every time you open your project you start with a clean R session. You don't have to do this (many people don't) but we prefer to start with a completely clean workspace whenever we open our projects to avoid any potential conflicts with things we have done in previous sessions. The downside to this is that you will need to rerun your R code every time you open your project.
\
```{r new_proj5, echo=FALSE, out.width="60%", fig.align="center"}
knitr::include_graphics(path = "images/new_proj5.png")
```
\
Now that you have an RStudio project set up you can start creating R scripts (or [R markdown](#rmarkdown_r) documents) or whatever you need to complete your project. All of the R scripts will now be contained within the RStudio project and saved in the project folder.
## Working directories {#work-d}
The working directory is the default location where R will look for files you want to load and where it will put any files you save. One of the great things about using RStudio Projects is that when you open a project it will automatically set your working directory to the appropriate location. You can check the file path of your working directory by looking at the bar at the top of the Console pane. Note: the `~` symbol in the file path is shorthand for your home directory (`/Users/nhy163/` on our Mac computer; the same idea applies on Linux computers).
\
```{r dir_struct, echo=FALSE, out.width="80%", fig.align="center"}
knitr::include_graphics(path = "images/dir_struct.png")
```
\
You can also use the `getwd()`\index{getwd()} function in the Console which returns the file path of the current working directory.
\
```{r dir_struct2, echo=FALSE, out.width="60%", fig.align="center"}
knitr::include_graphics(path = "images/dir_struct2.png")
```
\
In the example above, the working directory is a folder called 'first_project' which is a subfolder of 'Teaching' in the 'Alex' folder which in turn is in a 'Documents' folder located in the 'nhy163' folder which itself is in the 'Users' folder. On a Windows based computer our working directory would also include a drive letter (e.g. `C:/Users/nhy163/Documents/Alex/Teaching/first_project`).
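The pieces of the path returned by `getwd()` can also be pulled apart in code. The following is a small sketch using only base R functions; the object name `wd` is arbitrary and the output will depend on your own computer.

```{r, echo=TRUE, eval=FALSE}
# Store the current working directory as a character string
wd <- getwd()
wd

# basename() gives the final folder and dirname() the path leading to it
basename(wd)
dirname(wd)

# list.files() shows what the working directory contains
list.files(wd)
```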
If you weren't using an RStudio Project then you would have to set your working directory using the `setwd()`\index{setwd()} function at the start of every R script (something we did for many years).
```{r wd, echo=TRUE, eval=FALSE}
setwd('/Users/nhy163/Documents/Alex/Teaching/first_project')
```
However, the problem with using `setwd()` in this way is that it relies on an *absolute* file path which is specific to the computer you are working on. If you want to send your script to someone else (or if you're working on a different computer) this absolute file path is not going to work on your friend's or colleague's computer as their directory configuration will be different (they are unlikely to have a directory structure `/Users/nhy163/Documents/Alex/Teaching/` on their computer). This results in a project that is not self-contained and not easily portable. RStudio solves this problem by allowing you to use *relative* file paths which are relative to the *Root* project directory. The Root project directory is just the directory that contains the `.Rproj` file (`first_project.Rproj` in our case). If you want to share your analysis with someone else, all you need to do is copy the entire project directory and send it to your collaborator. They would then just need to open the project file and any R scripts that contain references to relative file paths will just work. For example, let's say that you've created a subdirectory called `raw_data` in your Root project directory that contains a tab delimited datafile called `mydata.txt` (we will cover directory structures [below](#dir_struct)). To import this datafile in an RStudio project using the `read.table()` function (don't worry about this now, we will cover this in much more detail in [Chapter 3](#data_r)) all you need to include in your R script is
```{r rel-fp1, echo=TRUE, eval=FALSE}
dataf <- read.table('raw_data/mydata.txt', header = TRUE,
sep = '\t')
```
Because the file path `raw_data/mydata.txt` is relative to the project directory, it doesn't matter where your collaborator saves the project directory on their computer; it will still work.
If you weren't using an RStudio Project then you would have to use one of the options below, neither of which would work on a different computer.
```{r rel-fp2, echo=TRUE, eval=FALSE, tidy=TRUE}
setwd('/Users/nhy163/Documents/Alex/Teaching/first_project/')
dataf <- read.table('raw_data/mydata.txt', header = TRUE, sep = '\t')
# or
dataf <- read.table('/Users/nhy163/Documents/Alex/Teaching/first_project/raw_data/mydata.txt', header = TRUE, sep = '\t')
```
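If you're ever unsure where a relative file path actually points, you can ask R. The following is just a minimal sketch using base R functions and the `raw_data/mydata.txt` path from the examples above; the object names `rel_path` and `abs_path` are our own and the file will only be found if it really exists in your project.

```{r rel-fp-check, echo=TRUE, eval=FALSE}
# the relative file path used in the examples above
rel_path <- file.path("raw_data", "mydata.txt")

# the directory R is currently working in
# (the Root project directory if you opened an RStudio Project)
getwd()

# the absolute file path that the relative path resolves to
# on this particular computer
abs_path <- file.path(getwd(), rel_path)
abs_path

# TRUE only if the file actually exists at that location
file.exists(rel_path)
```

The absolute path printed will differ from computer to computer, which is exactly why we avoid writing it into our scripts.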
For those of you who want to take the notion of relative file paths a step further, take a look at the `here()` function in the `here` [package][here]. The `here()`\index{here()} function allows you to automagically build file paths for any file relative to the project root directory that are also operating system agnostic (works on a Mac, Windows or Linux machine). For example, to import our `mydata.txt` file from the `raw_data` directory just use
```{r rel-fp23, echo=TRUE, eval=FALSE}
library(here) # you may need to install the here package first
dataf <- read.table(here("raw_data", "mydata.txt"),
header = TRUE, sep = '\t',
stringsAsFactors = TRUE)
# or without loading the here package
dataf <- read.table(here::here("raw_data", "mydata.txt"),
header = TRUE, sep = '\t',
stringsAsFactors = TRUE)
```
## Directory structure {#dir_struct}
In addition to using RStudio Projects, it's also really good practice to structure your working directory in a consistent and logical way to help both you and your collaborators. We frequently use the following directory structure in our R based projects
\
```{r dir_struct2.1, echo=FALSE, out.width="25%", fig.align="center"}
knitr::include_graphics(path = "images/directory_structure.png")
```
\
In our working directory we have the following directories:
- **Root** - This is your project directory containing your `.Rproj` file.
- **data** - We store all our data in this directory. The subdirectory called `raw_data` contains raw data files and only raw data files. These files should be treated as **read only** and should not be changed in any way. If you need to process/clean/modify your data do this in R (not MS Excel) as you can document (and justify) any changes made. Any processed data should be saved to a separate file and stored in the `processed_data` subdirectory. Information about data collection methods, details of data download and any other useful metadata should be saved in a text document (see README text files below) in the `metadata` subdirectory.
- **R** - This is an optional directory where we save all of the custom R functions we've written for the current analysis. These can then be sourced into R using the `source()` function.
- **Rmd** - An optional directory where we save our R markdown documents.
- **scripts** - All of the main R scripts we have written for the current project are saved here.
- **output** - Outputs from our R scripts such as plots, HTML files and data summaries are saved in this directory. This helps us and our collaborators distinguish what files are outputs and which are source files.
Of course, the structure described above is just what works for us most of the time and should be viewed as a starting point for your own needs. We tend to have a fairly consistent directory structure across our projects as this allows us to quickly orientate ourselves when we return to a project after a while. Having said that, different projects will have different requirements so we happily add and remove directories as required.
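If you do keep your custom functions in the `R` directory, you will probably want to source them all at the start of a script. Here's one possible way of doing this. Note that `source_r_dir()` is a small helper function we've made up for illustration (it's not part of base R) and it assumes your function files end in `.R` or `.r`.

```{r source-r-dir, echo=TRUE, eval=FALSE}
# hypothetical helper: source every R script found in a directory
source_r_dir <- function(path = "R") {
  # find all files ending in .R or .r, keeping the directory in the path
  r_files <- list.files(path, pattern = "\\.[Rr]$", full.names = TRUE)
  # source each file in turn so the functions become available
  for (f in r_files) {
    source(f)
  }
  # return the file names invisibly in case you want to check them
  invisible(r_files)
}
```

You could then include `source_r_dir("R")` near the top of each script in your `scripts` directory. As the path is relative to the Root project directory, this should also work on your collaborator's computer.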
You can create your directory structure using Windows Explorer (or Finder on a Mac) or within RStudio by clicking on the 'New folder' button in the 'Files' pane.
\
```{r dir_struct3, echo=FALSE, out.width="60%", fig.align="center"}
knitr::include_graphics(path = "images/dir_struct3.png")
```
\
An alternative approach is to use the `dir.create()`\index{dir.create()} and `list.files()`\index{list.files()} functions in the R Console.
```{r dir, echo=TRUE, eval=FALSE}
# create directory called 'data'
dir.create('data')
# create subdirectory raw_data in the data directory
dir.create('data/raw_data')
# list the files and directories
list.files(recursive = TRUE, include.dirs = TRUE)
# [1] "data" "data/raw_data" "first_project.Rproj"
```
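If you find yourself creating the same set of directories for every new project, you could wrap this up in a function. The sketch below recreates the directory structure we described above. The function name `create_project_dirs()` is our own invention, so feel free to change it and the list of directories to suit your own needs.

```{r dir-all, echo=TRUE, eval=FALSE}
# hypothetical helper: create our usual project directories
create_project_dirs <- function(root = ".") {
  dirs <- c("data/raw_data", "data/processed_data", "data/metadata",
            "R", "Rmd", "scripts", "output")
  new_dirs <- file.path(root, dirs)
  for (d in new_dirs) {
    # recursive = TRUE also creates the parent 'data' directory
    # showWarnings = FALSE stays quiet if a directory already exists
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }
  # return the directory paths invisibly
  invisible(new_dirs)
}
```

Running `create_project_dirs()` in the Console of a new RStudio Project will create the directories in your Root project directory. You can then check the result with `list.files(recursive = TRUE, include.dirs = TRUE)` as before.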
## File names {#file_names}
What you call your files matters more than you might think. Naming files is also more difficult than you think. The key requirement for a 'good' file name is that it's informative whilst also being relatively short. This is not always an easy compromise and often requires some thought. Ideally you should try to avoid the following!
\
```{r fn, echo=FALSE, out.width="30%", fig.align="center", fig.cap="source:https://xkcd.com/1459/"}
knitr::include_graphics(path = "images/xkcd_files.png")
```
\
Although there's not really a recognised standard approach to naming files (actually [there is][file_wiki], just not everyone uses it), there are a couple of things to bear in mind.
- First, avoid using spaces in file names by replacing them with underscores or even hyphens. Why does this matter? One reason is that some command line software (especially many bioinformatic tools) won't recognise a file name with a space and you'll have to go through all sorts of shenanigans using escape characters to make sure spaces are handled correctly. Even if you don't think you will ever use command line software you may be doing so indirectly. Take R markdown for example: if you want to render an R markdown document to pdf using the `rmarkdown` package you will actually be using a command line document converter under the hood (called [Pandoc][pandoc]), which in turn calls a LaTeX engine. Another good reason not to use spaces in file names is that it makes searching for file names (or parts of file names) using [regular expressions][regex] in R (or any other language) much more difficult.
- For the reasons given above, also avoid using special characters (e.g. @£$%^&*(:/) in your file names.
- If you are versioning your files with sequential numbers (e.g. file1, file2, file3 ...) and you have more than 9 files you should use 01, 02, 03 ... 10 as this will ensure the files are printed in the correct order (see what happens if you don't). If you have more than 99 files then use 001, 002, 003 ... etc.
- If your file names include dates, use the ISO 8601 format YYYY-MM-DD (or YYYYMMDD) to ensure your files are listed in proper chronological order.
- Never use the word *final* in any file name - it never is!
Whatever file naming convention you decide to use, try to adopt it early, stick with it and be consistent. You'll thank us!
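You can also let R do some of the hard work of building sensible file names for you. The short sketch below illustrates a few of the points above using base R functions. The file names (e.g. `snouter_counts.csv`) are made up for illustration, and we've used `method = "radix"` when sorting so the ordering doesn't depend on your computer's language settings.

```{r fn-build, echo=TRUE, eval=FALSE}
# replace spaces with underscores
gsub(" ", "_", "my data file.txt")
# [1] "my_data_file.txt"

# sequential file names without and with zero padding
unpadded <- paste0("file", 1:10, ".txt")
padded <- sprintf("file%02d.txt", 1:10)

# without padding file10.txt sneaks in before file2.txt
head(sort(unpadded, method = "radix"), n = 3)
# [1] "file1.txt"  "file10.txt" "file2.txt"

# with padding the files are listed in the order you expect
head(sort(padded, method = "radix"), n = 3)
# [1] "file01.txt" "file02.txt" "file03.txt"

# include a date in ISO 8601 format (YYYY-MM-DD)
file_date <- format(as.Date("2019-12-02"), "%Y-%m-%d")
dated_name <- paste0(file_date, "_snouter_counts.csv")
dated_name
# [1] "2019-12-02_snouter_counts.csv"
```

In your own scripts you would probably swap `as.Date("2019-12-02")` for `Sys.Date()` to use today's date.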
## Project documentation {#proj_doc}
A quick note or two about writing R code and creating R scripts. Unless you're doing something really quick and dirty we suggest that you always write your R code as an R script. R scripts are what make R so useful. Not only do you have a complete record of your analysis, from data manipulation and visualisation through to statistical analysis, you can also share this code (and data) with friends, colleagues and importantly when you submit and publish your research to a journal. With this in mind, make sure you include in your R script all the information required to make your work reproducible (author names, dates, sampling design etc). This information could be included as a series of comments `#` or, even better, by mixing executable code with narrative into an [R markdown](#rmarkdown_r) document. It's also good practice to include the output of the `sessionInfo()`\index{sessionInfo()} function at the end of any script which prints the R version, details of the operating system and also loaded packages. A really good alternative is to use the `session_info()`\index{session\_info()} function from the `xfun` package for a more concise summary of your session environment.
Here's an example of including meta-information at the start of an R script.
```{r, echo=TRUE, eval=FALSE}
# Title: Time series analysis of snouters
# Purpose : This script performs a time series analysis on
# snouter count data.
# Data consists of counts of snouter species
# collected from 18 islands in the Hy-yi-yi
# archipelago between 1950 and 1957.
# For details of snouter biology see:
# https://en.wikipedia.org/wiki/Rhinogradentia
# Project number: #007
# DataFile:'data/snouter_pop.txt'
# Author: A. Nother
# Contact details: a.nother@uir.ac.uk
# Date script created: Mon Dec 2 16:06:44 2019 -----------
# Date script last modified: Thu Dec 12 16:07:12 2019 ----
# package dependencies
library(PopSnouter)
library(ggplot2)
print('put your lovely R code here')
# good practice to include session information
xfun::session_info()
```
This is just one example and there are no hard and fast rules so feel free to develop a system that works for you. A really useful shortcut in RStudio is to automatically include a time and date stamp in your R script. To do this, write `ts` where you want to insert your time stamp in your R script and then press the 'shift + tab' keys. RStudio will magically convert `ts` into the current date and time and also automatically comment out this line with a `#`. Another really useful RStudio shortcut is to comment out multiple lines in your script with a `#` symbol. To do this, highlight the lines of text you want to comment and then press 'ctrl + shift + c' (or 'cmd + shift + c' on a Mac). To uncomment the lines just use 'ctrl + shift + c' again.
In addition to including metadata in your R scripts it's also common practice to create a separate text file to record important information. By convention these text files are named `README`. We often include a `README` file in the directory where we keep our raw data. In this file we include details about when data were collected (or downloaded), how data were collected, information about specialised equipment, preservation methods, type and version of any machines used (e.g. sequencing equipment) etc. You can create a README file for your project in RStudio by clicking on the `File` -> `New File` -> `Text File` menu.
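If you prefer, you can also create a README file directly from R. The sketch below is just one way to do this using the base R `writeLines()` function. The helper function `write_readme()`, the `README.txt` file name and the headings in `readme_lines` are our own suggestions, and the text in angle brackets is a placeholder for you to fill in.

```{r readme, echo=TRUE, eval=FALSE}
# hypothetical helper: write a README text file into a directory
write_readme <- function(dir, lines, file_name = "README.txt") {
  # create the directory if it doesn't already exist
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  readme_path <- file.path(dir, file_name)
  # each element of 'lines' becomes one line in the text file
  writeLines(lines, readme_path)
  # return the file path invisibly
  invisible(readme_path)
}

# a skeleton README, replace the placeholders with your own details
readme_lines <- c(
  "Data file: snouter_pop.txt",
  "Date collected (or downloaded): <date>",
  "Collected by: <name and contact details>",
  "Collection methods: <how the data were collected>",
  "Equipment: <type and version of any machines used>"
)
```

Running `write_readme("data/raw_data", readme_lines)` would then save the file alongside your raw data.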
## R style guide
How you write your code is more or less up to you although your goal should be to make it as easy to read as possible (for you and others). Whilst there are no rules (and no code police), we encourage you to get into the habit of writing readable R code by adopting a particular style. We suggest that you follow Google's [R style guide][style-google] whenever possible. This style guide will help you decide where to use spaces, how to indent code and how to use square `[ ]` and curly `{ }` brackets amongst other things. If all that sounds like too much hard work you can install the `styler` package which includes an RStudio add-in to allow you to automatically restyle selected code (or entire files and projects) with the click of your mouse. You can find more information about the `styler` package including how to install [here][styler]. Once installed, you can highlight the code you want to restyle, click on the 'Addins' button at the top of RStudio and select the 'Style Selection' option. Here is an example of poorly formatted R code.
\
```{r poor_code, echo=FALSE, out.width="75%", fig.align="center"}
knitr::include_graphics(path = "images/before_rcode.png")
```
\
Now highlight the code and use the `styler` package to reformat
\
```{r styler, echo=FALSE, out.width="60%", fig.align="center"}
knitr::include_graphics(path = "images/styler.png")
```
\
To produce some nicely formatted code
\
```{r better_code, echo=FALSE, out.width="75%", fig.align="center"}
knitr::include_graphics(path = "images/after_rcode.png")
```
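If you'd rather not use the add-in, the `styler` package can also be used directly from the R Console. Here's a minimal sketch using the `style_text()` function on a single line of deliberately untidy code (the object `messy_code` is just our example). Bear in mind that, as far as we're aware, `styler` follows the tidyverse style guide by default, which is similar but not identical to Google's.

```{r styler-console, echo=TRUE, eval=FALSE}
# a line of poorly formatted R code stored as a character string
messy_code <- "x=c(1,2,3)"

# only run if the styler package is installed
if (requireNamespace("styler", quietly = TRUE)) {
  # restyle the code and print the result
  styled_code <- styler::style_text(messy_code)
  print(styled_code)
}
```

To restyle an entire script, take a look at the `style_file()` function in the same package.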
## Backing up projects
Don't be that person who loses hard won (and often expensive) data and analyses. Don't be that person who thinks it'll never happen to me - it will! Always think of the absolute worst case scenario, something that makes you wake up in a cold sweat at night, and do all you can to make sure this never happens. Just to be clear, if you're relying on copying your precious files to an external hard disk or USB stick this is **NOT** an effective backup strategy. These things go wrong all the time as you lob them into your rucksack or 'bag for life' and then lug them between your office and home. Even if you do leave them plugged into your computer what happens when the building burns down (we did say worst case!)?
Ideally, your backups should be offsite and incremental. Happily there are numerous options for backing up your files. The first place to look is in your own institute. Most (all?) Universities have some form of network based storage that should be easily accessible and is also underpinned by a comprehensive disaster recovery plan. Other options include cloud based services such as Google Drive and Dropbox (to name but a few), but make sure you're not storing sensitive data on these services and are comfortable with the often eye watering privacy policies.
Whilst these services are pretty good at storing files, they don't really help with incremental backups. Finding previous versions of files often involves spending inordinate amounts of time trawling through multiple files named *'final.doc'*, *'final_v2.doc'* and *'final_usethisone.doc'* etc until you find the one you were looking for. The best way we know for both backing up files and managing different versions of files is to use Git and GitHub. To find out more about how you can use RStudio, Git and GitHub together see the Git and GitHub [Chapter](#github_r).
## Citing R
Many people have invested huge amounts of time and energy making R the great piece of software you're now using. If you use R in your work (and we hope you do) please remember to give appropriate credit by citing R. To get the most up to date citation for R you can use the `citation()`\index{citation()} function.
```{r citation, echo=TRUE, collapse=TRUE}
citation()
```
If you want to cite a particular package you've used for your data analysis, pass the name of the package to the `package` argument.
```{r pack-citation, echo=TRUE, collapse=TRUE, warning=FALSE}
citation(package = "here")
```
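If you manage your references with BibTeX (or a reference manager that can import BibTeX entries), you can convert a citation using the `toBibtex()` function from the `utils` package that comes with R. This is just a sketch and the `references.bib` file name in the comment is hypothetical.

```{r citation-bib, echo=TRUE, eval=FALSE}
# convert the citation for R into a BibTeX entry
r_bib <- toBibtex(citation())
r_bib

# optionally save the entry to a file (hypothetical file name)
# writeLines(r_bib, "references.bib")
```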
## Exercise 1
```{block2, note-text, type='rmdtip'}
Congratulations, you've reached the end of Chapter 1! Perhaps now's a good time to practise some of what you've learned. You can find an exercise we've prepared for you (and our solutions) on the course website.
```
```{r links, child="links.md"}
```