-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path02_introduction_r_rstudio.Rmd
396 lines (285 loc) · 15.5 KB
/
02_introduction_r_rstudio.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
---
title: "Introduction to R and RStudio"
author: "Mireia Ramos-Rodriguez"
date: "March 14, 2019"
output:
unilur::tutorial_html_solution:
theme: cerulean
toc: TRUE
toc_float: TRUE
number_sections: TRUE
unilur::tutorial_pdf: default
unilur::tutorial_pdf_solution:
latex_engine: xelatex
unilur::tutorial_html: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
warning = FALSE,
message = FALSE)
knitr::opts_template$set(solution = list(box.title = "Look at the result",
box.body = list(fill = "#ACFFAF4C"),
box.collapse = TRUE,
box.icon = "fa-lightbulb"),
alert = list(box.title = "Watch out!",
box.body = list(fill = "#fa5b42", colour = "#fdf6d4"),
box.collapse = TRUE,
box.icon = "fa-exclamation-triangle"),
tip = list(box.title = "Pro-tip",
box.body = list(fill = "lightblue"),
box.collapse = TRUE,
box.icon = "fa-star"),
exercise = list(box.title = "Exercise",
box.body = list(fill = "plum"),
box.collapse = TRUE,
box.icon = "fa-pencil"))
```
# Introduction
__R__ is a programming language designed originally for performing statistical analyses. However, it's potential has been increased and nowadays it can be used for general data analysis, creating atractive plots or even for creating websites! You can download R from the official repository: [CRAN](https://www.r-project.org/).
> If you have used spreadsheets to calculate stuff, you have already programmed.
In R, you can assign values to variables using `<-`. You don't have to type it everytime, just `Alt`+`-`!
```{r}
variable <- 2342
variable
```
You can use R from the unix command line or download an Integrated Development Environment (IDE) such as __RStudio__.
# Working with R in RStudio
```{r, echo=FALSE, fig.cap="RStudio overview. Source: http://www.sthda.com/english/wiki/r-basics-quick-and-easy", fig.align="center"}
knitr::include_graphics("http://www.sthda.com/sthda/RDoc/images/rstudio.png")
```
1. __Code editor__. In your code editor you can create, open and save your `Rscripts`. Write your code and run it whenever you want using `Ctrl + Enter`.
2. __R Console__. This is where your R code is ran. You can run it from the code editor or directly typing things in the console.
3. __Environment and History__. Here you can find the objects and variables that are loaded in R and the history of commands you have ran.
4. __Plots, help, files__. All plots you generate and R documentation will appear in this pane.
# R basics
## Data types
The basic types of variables that can be represented in R are the following:
- _Numeric_. `332`, `0.234`.
- _Character_. `"Barcelona"`, `"woman"`. A special type of character is the class _factor_.
- _Logical_. `TRUE` or `FALSE`.
You can check the type of variable using `class()`.
```{block, opts.label="exercise"}
Create one variable of each of the above-mentioned basic types and check that the type is correct using the `class()` function.
```
```{r, solution=TRUE}
var <- "text" # character
class(var)
var <- 3242 # numeric
class(var)
var <- TRUE # logical
class(var)
```
## Arithmetic operators
Operator | Description
------------|---------------
`+` | addition
`-` | subtraction
`*` | multiplication
`/` | division
`^` or `**` | exponentiation
```{r}
3+4*5.2/2^2
```
## Logical operators
Operator | Description
---------|-------------------
`>` | greater than
`>=` | greater than or equal to
`==` | exactly equal to
`!=` | not equal to
```{r}
# Works with numerics
3 > 2
# And also with characters
"class"=="class"
```
## Data structures
We can combine several data types into more complex structures. The different structures one can create in R are the following:
- __Vectors__. Consists on a concatenation `c()` of values. The type of all values should be the same.
```{r}
vector <- c(0,1,2,3,4,6)
vector
```
- __Matrices__. Are basically vectors, but with 2 dimensions (rows and columns). You can create a matrix using `matrix()`. You can check which arguments you can pass to the matrix function typing `?matrix` or `help(matrix)`
```{r}
mat <- matrix(vector, nrow=2)
mat
```
```{block, opts.label="exercise"}
As an example we just generated a vector of numbers. Try to generate a __vector with letters (characters)__ from 'a' to 'f' and then convert it to a matrix with 2 columns.
```
```{r, solution=TRUE}
vector <- c('a', 'b', 'c', 'd', 'e', 'f')
mat <- matrix(vector, ncol=2)
mat
```
- __Dataframes__. Dataframes are very much like spreadsheets: each column is information on one variable and each row is an instance (for example, a patient). When constructing a dataframe, you can use different vectors to represent different information. The only requierement is that all vectors have to be the same length.
```{r}
patients <- data.frame(patientID=1:4,
gender=c("male", "female", "male", "female"),
age=c(23, 45, 55, 22),
dead=c(F, T, T, F))
patients
patients$age # to select one column
```
- __Lists__. Lists are concatenations of any data type or data structure. The first element of your list can be a vector, the second one a matrix and the third one a data frame. You can number elements in your list and access each data structure using `listName[[1]]`.
```{r}
list <- list(name="donors",
patientList=patients)
list
list[[1]]
list[["name"]]
```
## Functions
There are many basic functions that are already loaded into R. For example, `data.frame()` is a function that generates data.frames and `help()` is a function that loads the manual for a specific R function.
You can create you own functions in R as follows:
```{r}
sq <- function(x) {
square <- x*x
return(square)
}
sq(2)
```
Functions are approppriate when you are copy&pasting operations many times, but changing tiny details. It might be more time consuming to create a function, but then you an run it with the parameters you want without any additional effort.
Many times, there are functions out there for the things you want to do. This functions are wrapped in __packages__ and you can download them and use them freely!
# Functions and packages
Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data. You can find R packages at official repositories like CRAN or in personal repositories like github. Packages found at official repositories are more consistent than those found at personal repositories, which could be still in development.
There are many packages that are already preinstalled in R. You can load those packages which will allow you to access their functions using `library(NameOfPackage)`. Once a package is loaded with `library`, all its function will be available to you untill you close your R session (i.e. close RStudio).
```{block, opts.label="tip"}
If you just want to use just one function from a specific package, you can use `NameOfPackage::name_of_function(arguments)` so you don't have to load all the package!
```
The most important R _repositories_ from which you can donwload packages are:
- __[CRAN](https://cran.r-project.org/)__. Is the official R repository and contains packages for many different purposes. To install packages from CRAN, you just need to type in your _R console_ `installPackages("NameOfPackage")`.
- __[Bioconductor](https://www.bioconductor.org/)__. This repository contains packages dedicated to the analysis and comprehension of high-throughput genomic data. To install packages from Bioconductor, you will first need to install the CRAN package _BiocManager_, if it is not already installed (`install.packages("BiocManager")`). If it is installed, you can use the function `install` from that package with the name of the Bioconductor package you want to install: `BioManager::install("NameOfPackage")`.
# Analyzing ChIP-seq data with R
## GenomicRanges
The ability to efficiently represent and manipulate genomic annotations and alignments is playing a central role when it comes to analyzing high-throughput sequencing data (a.k.a. NGS data). The [GenomicRanges](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html) package defines general purpose containers for storing and manipulating genomic intervals and variables defined along a genome.
To generate a GenomicRanges object, we just need to provide the following parameters:
- `seqnames`. Name of the chromosome (for example `chr1`).
- `ranges`. Range of the region you are trying to define (`start` and `end`). If your range only has one bp (a SNP, for example) you can include only the `start` position.
- `strand`. If you do not specify any strand (`+` or `-`), it will default to `*`.
- `mcols`. You can include a `data.frame` with additional information that will be appended to your ranges.
Below you can see how to create a `GRanges` object with only one range:
```{r}
library(GenomicRanges)
gr <- GRanges(seqnames="chr1",
ranges=IRanges(start=3, end=60))
gr
```
Or you can create and object with multiple ranges at once:
```{r}
gr <- GRanges(seqnames=c("chr1", "chr1", "chr3"),
ranges=IRanges(start=c(2,65,23),
end=c(43, 192, 60)),
mcols=data.frame(id=c("region1", "region2", "region3")))
gr
```
You can access the different elements in your `GRanges` object with different functions:
```{r}
seqnames(gr) # Returns all chr names
start(gr) # Returns start position, same with end(gr)
width(gr)
mcols(gr) # Returns mcols data.frame
gr$mcols.id # Returns a specific mcols data.frame
length(granges) # Number of regions in your GRanges object
```
You can load and convert to `GRanges` the peak files generated from MACS2. The easiest way is using the function `toGRanges` from the package `regioneR`. You just need to provide the path to your peak file and it will load it and convert it to a `GRanges` object. Since we only need one specific function from the GRanges package, we are going to use `regioneR::toGRanges`.
```{r}
peaks <- regioneR::toGRanges("data/ChIP-seq/peaks/NL1_h3k27ac_peaks.broadPeak")
peaks
```
## Differential ChIP-seq sites
There are many packages for detecting differential sites for ChIP-seq data. One of the most popular is __[DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html)__.
For detecting differential sites with DESeq2 you need the following:
1. A __count matrix__. A `matrix` in which each row will be a genomic region and each column a sample. The values will be the number of reads found in each region, for each sample.
2. __Column data__ or sample data. A `data.frame` including phenotyping or grouping information for each of the samples found in the columns of the count matrix.
3. A __design__. This design will be used by DESeq2 to compute the comparisons and detect differential sites.
### Step 0: Install DESeq2 from Bioconductor {.unnumbered}
DESeq2 is a Bioconductor package, so you can install it using:
```{r, eval=FALSE}
BiocManager::install("DESeq2")
```
### Step 1: Generate your count matrix {.unnumbered}
The comparison we are going to do is normal tissue (__NL__) vs. hepatocellular carcinoma (__HCC__). Therefore, you will need to download the following files from `~/workshop_ChIPseq/ChIPseq/peaks/`:
- `NL1_h3k27ac_peaks.broadPeak`
- `HCC1_h3k27ac_peaks.broadPeak`
- `HCC3_h3k27ac_peaks.broadPeak`
Next, we will load this peak files into R and then merge the overlapping genomic regions to have a single peak dataset with all the H3K27ac enriched regions.
```{r, warning=FALSE, message=FALSE}
# First, we load the peaks into R
files <- paste0("data/ChIP-seq/peaks/",
c("NL1_h3k27ac_peaks.broadPeak", "HCC1_h3k27ac_peaks.broadPeak", "HCC3_h3k27ac_peaks.broadPeak"))
peak.list <- lapply(files, read.delim, header=F) # read peaks in the 3 input files
peaks <- do.call(rbind, peak.list) # convert 3 data.frames in 1 data.frame (row-bind)
head(peaks)
# Convert data.frame to GenomicRanges
library(GenomicRanges)
peaks.gr <- GRanges(seqnames=peaks$V1,
ranges=IRanges(start=peaks$V2,
end=peaks$V3))
peaks.gr
# Keep only autosomal chr
peaks.gr <- peaks.gr[seqnames(peaks.gr) %in% paste0("chr", 1:22),]
# Merge overlapping ranges
peaks.m <- reduce(peaks.gr)
peaks.m
# Convert GRanges to SAF (needed for Rsubread::featureCounts)
saf <- data.frame(peaks.m)
head(saf)
saf <- saf[,-4]
colnames(saf) <- c("Chr", "Start", "End", "Strand")
saf$Strand <- "+"
saf$GeneID <- paste0("region_", 1:nrow(saf))
# Get read counts in each region from BAM files
bam.files <- c("data/ChIP-seq/BAM/NL1_h3k27ac.rmdup.bam",
"data/ChIP-seq/BAM/HCC1_h3k27ac.rmdup.bam",
"data/ChIP-seq/BAM/HCC3_h3k27ac.rmdup.bam")
reads <- Rsubread::featureCounts(bam.files,
annot.ext=saf,
nthreads=10)
# the output of Rsubread::featureCounts() function is a list
names(reads)
counts <- reads$counts
colnames(counts) <- c("NL1", "HCC1", "HCC3") # Set colnames for samples
head(counts) # This is the matrix we need for DESeq2
# Save file for using afterwards
dir.create("data/differentialAnalysis/", F)
save(counts, file="data/differentialAnalysis/countMatrix.rda")
```
### Step 2: Generate your column data
In this step we will need to add information that we feel is important for the differential analysis. The important part is that the rownames of the dataframe generated in this step, need to coincide with the colnames of the count matrix generated in the step above.
```{r}
coldata <- data.frame(id=c("NL1", "HCC1", "HCC3"),
tissue=c("NormalLiver", "HCCarcinoma", "HCCarcinoma"),
patient=c(0, 1, 3))
rownames(coldata) <- coldata$id
# To use normal as reference we have to convert to factor and establish levels
coldata$tissue <- factor(coldata$tissue,
levels=c("NormalLiver", "HCCarcinoma"))
coldata
```
### Step 3: Run DESeq2 with the dataset
```{r, warning=F, message=F}
library(DESeq2)
# Create DESeq dataset
load("data/differentialAnalysis/countMatrix.rda") # Load count matrix
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = coldata,
design = ~ tissue)
# Now run DESeq2 differential analysis
dds <- DESeq(dds)
res <- results(dds)
res
# Filter significant regions: padj =< 0.05
res <- as.data.frame(res)
res.sign <- res[!is.na(res$padj) & res$padj <= 0.1,]
knitr::kable(res.sign)
```
DESeq2 documentation is pretty extensive and with lots of examples, you can find it [here](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html).
## Annotating peaks to the nearest gene
# Additional resources
- [Datacamp: Introduction to R](https://www.datacamp.com/courses/free-introduction-to-r)
- [Beginner's guide to R: Introduction](https://www.computerworld.com/article/2497143/business-intelligence/business-intelligence-beginner-s-guide-to-r-introduction.html)
- [`swirl`, Learn R, in R](https://swirlstats.com/)
- [R for Data Science](https://r4ds.had.co.nz/)
- [R style guide](http://adv-r.had.co.nz/Style.html)