---
title: 'Bioinformatics for Big Omics Data: Statistical prediction'
author: "Raphael Gottardo"
date: "February 23, 2015"
output:
ioslides_presentation:
fig_caption: yes
fig_retina: 1
keep_md: yes
smaller: yes
---
## Setting up some options
Let's first turn on caching for increased performance and set some global options for code styling:
```{r, cache=FALSE}
# Set some global knitr options
library("knitr")
opts_chunk$set(tidy=TRUE, tidy.opts=list(blank=FALSE, width.cutoff=60), cache=TRUE, message=FALSE)
```
and load the packages that we will need:
```{r, cache=FALSE}
library(GEOquery)
library(limma)
library(data.table)
library(glmnet)
library(ggplot2)
```
## Prediction
In statistics, we are often interested in predicting an outcome variable based on a set of predictor (independent) variables.
In our case, we might be interested in predicting antibody response at day 28 based on gene expression data at day 3 or day 7.
As another example, Netflix has been very interested in predicting movie ratings for its users: http://www2.research.att.com/~volinsky/netflix/bpc.html
## Statistical issues involved in prediction
In our case, we are often dealing with the traditional large $p$, small $n$ problem. We have few observations but many potential predictors.
**What should we be worried about?**
## Statistical issues involved in prediction
In our case, we are often dealing with the traditional large $p$, small $n$ problem. We have few observations but many potential predictors.
**What should we be worried about?**
- If we use a traditional (generalized) linear model, we won't have enough degrees of freedom to estimate all of the regression coefficients
- If we use too many predictors, we are likely to run into an overfitting problem
An approach that has been proposed to solve this problem is regularized or penalized regression.
## Penalized regression
In a traditional `glm` framework, we assume the following model:
$$g(\mathbb{E}(\mathbf{Y}))=\mathbf{X}\boldsymbol{\beta}$$
where $g$ is the link function (e.g. the identity for linear regression, the logit for logistic regression). Our goal is to estimate the vector $\boldsymbol{\beta}$ that leads to the best prediction of $\mathbf{Y}$. In traditional linear regression, this can be done via least squares.
In regularized regression, we add a penalty term to force the coefficients to be well behaved even when the number of predictors is large. For example, in *ridge regression*, this is done by minimizing the following quantity:
$$ \|\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2 $$
In L1-regularized regression (aka the Lasso), we minimize the following function:
$$ \|\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_1 $$
The elastic net uses a convex combination of the L1 and L2 penalties.
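Concretely, with mixing parameter $\alpha \in [0, 1]$ (the parameterization used by `glmnet`), the elastic net penalty is
$$ \lambda \left( \alpha \|\boldsymbol{\beta}\|_1 + \frac{1-\alpha}{2} \|\boldsymbol{\beta}\|_2^2 \right) $$
so that $\alpha = 1$ recovers the Lasso and $\alpha = 0$ ridge regression.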
## L1 versus L2 regularization
Although these two regularization strategies share a common goal, L1 regularization offers a significant advantage: it leads to a sparse solution in which many of the estimated coefficients are exactly zero.
In practice, this is very convenient, as we are usually interested in selecting the most informative variables. In fact, we are often more interested in variable selection than in the actual prediction.
## Estimation in the L1 case
For a fixed value of $\lambda$, estimation can be made very efficient using convex optimization. The problem is that we still need to choose $\lambda$, as its value influences the final inference (i.e. the variables selected). A popular approach for selecting $\lambda$ is cross-validation, where a small portion of the data is left out at the estimation stage and used to evaluate the selected model (error rate, deviance, etc.). The $\lambda$ value that leads to the best performance is then selected.
However, this requires estimating our model for many different values of $\lambda$!
Fortunately, the folks at Stanford (Friedman, Hastie, Tibshirani, Efron, etc.) have come up with clever algorithms that compute the full regularization path, i.e. the solutions of the Lasso, ridge, and elastic net problems over the whole range of $\lambda$ values.
## Penalized regression in R
Fortunately for us, the `glmnet` package provides everything that we need to perform statistical inference/prediction using penalized regression.
We first need to install the package:
```{r, eval=FALSE}
install.packages("glmnet")
```
and load it
```{r, cache=FALSE}
library(glmnet)
```
## A quick example
Here we will use the `mtcars` data that is part of `R`. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
```{r}
data(mtcars)
fit_cars <- glmnet(as.matrix(mtcars[,-1]), mtcars[,1])
plot(fit_cars)
legend(0, 2.8, lty=1, legend=colnames(mtcars[,-1]), col=1:10, cex=1, bty="n")
```
## Finding the optimal lambda by cross validation
```{r}
cv_cars <- cv.glmnet(as.matrix(mtcars[,-1]), mtcars[,1])
cv_cars$lambda.min
plot(cv_cars)
```
## Finding the optimal lambda by cross validation
Now we can refit the model at the optimal $\lambda$ values found by cross-validation, as follows:
```{r}
fit_cars_min <- glmnet(as.matrix(mtcars[,-1]), mtcars[,1], lambda=cv_cars$lambda.min)
fit_cars_min$beta
fit_cars_1se <- glmnet(as.matrix(mtcars[,-1]), mtcars[,1], lambda=cv_cars$lambda.1se)
fit_cars_1se$beta
```
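Equivalently, the coefficients at a given $\lambda$ can be extracted directly from the cross-validation fit, without refitting:
```{r, eval=FALSE}
# Coefficients at the two standard choices of lambda
coef(cv_cars, s = "lambda.min")
coef(cv_cars, s = "lambda.1se")
```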
## A systems biology example
Let's go back to our HAI (hemagglutination inhibition) and gene expression dataset.
```{r, cache=TRUE}
# Download the mapping information and processed data
gds <- getGEO("GSE29619", destdir="Data/GEO/")
```
Our goal here will be to find genes at day 3 or day 7 that predict antibody levels (as measured by HAI) at day 28.
```{r sanitize-pdata, cache=TRUE, echo=TRUE}
### Sanitize data and metadata
gds_new <- gds
sanitize_pdata <- function(pd){
keepCols <- c(
"characteristics_ch1.1", "characteristics_ch1.2",
"description",
"supplementary_file")
pd <- pd[, keepCols]
colnames(pd) <- c("ptid", "time", "description", "filename")
pd$ptid <- gsub(".*: ", "", pd$ptid)
pd$time <- gsub(".*: ", "", pd$time)
pd$time<-gsub("Day", "D", pd$time)
pd$description<-gsub("(-\\w*){2}$", "", pd$description)
pd$filename<-basename(as.character(pd$filename))
pd$filename<-gsub(".CEL.gz", "", pd$filename)
pd
}
pData(gds_new[[1]]) <- sanitize_pdata(pData(gds_new[[1]]))
pData(gds_new[[2]]) <- sanitize_pdata(pData(gds_new[[2]]))
pData(gds_new[[3]]) <- sanitize_pdata(pData(gds_new[[3]]))
```
## A systems biology example (continued)
Now we set up our data for the three cohorts. We will use the TIV_08 cohort for our exercise.
```{r data-setup, cache=TRUE}
TIV_08 <- gds_new[[1]][ , grepl("2008-TIV", pData(gds_new[[1]])$description)]
LAIV_08 <- gds_new[[1]][ , grepl("2008-LAIV", pData(gds_new[[1]])$description)]
TIV_07 <- gds_new[[3]][ , grepl("2007-TIV", pData(gds_new[[3]])$description)]
```
TIV_08, LAIV_08 and TIV_07 are ExpressionSets containing expression data for many probes (rows) and samples (columns; GEO accessions of the form GSMxxxx), at three time points per patient (variable "time", with values D0, D3 and D7; patients are identified by the variable "ptid").
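As a quick, optional sanity check on the sanitized metadata (not run here):
```{r, eval=FALSE}
# Each subject should contribute one sample per time point
table(pData(TIV_08)$time)
head(pData(TIV_08))
```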
## A systems biology example (continued)
Now we work on the pData to extract the HAI results.
```{r, eval=TRUE, echo=TRUE}
pd <- pData(gds[[1]][ , grepl("2008-TIV", pData(gds[[1]])$description)])
hai_0 <- sapply(strsplit(as.character(pd$characteristics_ch1.6), ": "), "[", 2)
hai_28 <- sapply(strsplit(as.character(pd$characteristics_ch1.7), ":"), "[", 2)
hai_fold <- log2(as.double(hai_28))-log2(as.double(hai_0))
pData(TIV_08)$hai_fold <- hai_fold
```
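Note that `hai_fold` is simply the log2 fold change in HAI titer between day 28 and day 0, since $\log_2(t_{28}) - \log_2(t_0) = \log_2(t_{28}/t_0)$. A quick, optional look at its distribution (a sketch; not run by default):
```{r, eval=FALSE}
summary(hai_fold)
qplot(hai_fold, binwidth = 0.5)
```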
## A systems biology example (continued)
We are now ready to perform our prediction. First, for simplicity, I select the subjects that have data at all time points, as follows,
```{r, cache=FALSE}
# We first need to keep only subjects with complete data
dt_pd <- data.table(pData(TIV_08))
# Create a new variable to see if a subject has data at all three timepoints
dt_pd[, is_complete := (.N == 3), by = "ptid"]
# Subset the ExpressionSet to remove subjects with missing data
TIV_08_small <- TIV_08[, pData(TIV_08)$ptid%in%dt_pd[is_complete==TRUE, ptid]]
```
and then compute the log fold change values at day 7, which will be used as the predictor variables,
```{r, cache=FALSE}
# Order by subjects
TIV_08_small <- TIV_08_small[,order(pData(TIV_08_small)$ptid)]
X <- exprs(TIV_08_small)
X_7_fold <- X[, pData(TIV_08_small)$time=="D7"]-X[, pData(TIV_08_small)$time=="D0"]
hai_fold <- pData(TIV_08_small)$hai_fold[pData(TIV_08_small)$time=="D0"]
```
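Before fitting, it is worth checking that the predictor matrix and the outcome are aligned; a quick optional sanity check:
```{r, eval=FALSE}
# One column per subject in X_7_fold and one HAI fold change per subject
dim(X_7_fold)
length(hai_fold)
stopifnot(ncol(X_7_fold) == length(hai_fold))
```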
## A systems biology example (continued)
We now compute the regularization path using `glmnet`,
```{r}
fit_hai <- glmnet(t(X_7_fold), hai_fold)
plot(fit_hai)
```
and perform cross-validation to select the best $\lambda$ value:
```{r}
cv_hai <- cv.glmnet(t(X_7_fold), hai_fold)
cv_hai$lambda.min
plot(cv_hai)
```
## A systems biology example (continued)
We now look at the number of selected variables for $\lambda_{min}$ and a fixed $\lambda$ value:
```{r}
fit_hai_min <- glmnet(t(X_7_fold), hai_fold, lambda=cv_hai$lambda.min)
sum(fit_hai_min$beta != 0)
fit_hai_fixed <- glmnet(t(X_7_fold), hai_fold, lambda=0.1)
sum(fit_hai_fixed$beta != 0)
```
and extract the corresponding gene symbols:
```{r}
# beta@i holds the zero-based row indices of the nonzero coefficients, so add 1 for R indexing
fData(TIV_08_small)[fit_hai_fixed$beta@i + 1, c("ID", "Gene Symbol")]
```
## A systems biology example (continued)
Sometimes it can be desirable to reduce the gene space by filtering the set of genes/probes to be considered. Here we do this by requiring a minimum average fold change (at least 1.5-fold, on average, across subjects).
```{r}
# Filter based on log-fold change
ind <- abs(rowMeans(X_7_fold))>log2(1.5)
X_7_fold_small <- X_7_fold[ind,]
cv_hai <- cv.glmnet(t(X_7_fold_small), hai_fold)
cv_hai$lambda.min
plot(cv_hai)
```
and let's look at the selected genes/probes:
```{r}
fit_hai_min <- glmnet(t(X_7_fold_small), hai_fold, lambda=cv_hai$lambda.min)
# beta@i is zero-based, hence the +1
fData(TIV_08_small)[rownames(X_7_fold_small)[fit_hai_min$beta@i + 1], c("ID", "Gene Symbol")]
```
## A systems biology example (end)
We can now look at the relationship between two of the selected probes and our HAI variable, as follows:
```{r}
qplot(hai_fold, X_7_fold["211868_PM_x_at",])+geom_smooth(method="lm")
```
## A systems biology example (end)
```{r}
qplot(hai_fold, X_7_fold["230170_PM_at",])+geom_smooth(method="lm")
```
## Summary
Here we have looked at the `glmnet` package for predicting an outcome variable based on a set of predictor variables. Although there exist many other methods/packages for prediction in R, `glmnet` is a powerful framework that can perform variable selection while building a prediction model. Note that the package offers many other distributional options that can be used to model both discrete and continuous data.
**Exercise:** Now it's your turn: try using another cohort (e.g. LAIV), and perhaps use the binomial family after thresholding the `hai` variable.
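As a possible starting point, here is a sketch only; the 4-fold rise threshold (i.e. a log2 fold change of at least 2) is a common seroconversion criterion and an assumption here, not something fixed by the data:
```{r, eval=FALSE}
# Dichotomize the HAI response (the threshold is an assumption)
y_bin <- as.numeric(hai_fold >= 2)
cv_bin <- cv.glmnet(t(X_7_fold), y_bin, family = "binomial", type.measure = "class")
cv_bin$lambda.min
```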
Finally, here we've used gene expression data to predict HAI, but many other variables could also be included to predict antibody response or, more generally, to predict an immune response to a stimulus (e.g. vaccination). Such integrated analyses of different data types are referred to as systems biology/immunology.