---
title: "Training and test indices for Zinser data"
author: "Gregory Britten and Kristof Glauninger and Sangwon (Justin) Hyun"
output:
  html_document:
    toc: true
    toc_float: true
    code_folding: hide
---
```{r setup}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE,
                      message = FALSE, eval = TRUE, cache = TRUE,
                      fig.width = 20, fig.height = 20)
```
Load packages and set the working directory.
```{r more-setup}
library(ncdf4)
library(viridis)
library(fields)
library(rstan)
options(mc.cores = parallel::detectCores())
## setwd('d:/dropbox/working/bayesian-matrixmodel/')
repodir = '/home/shyun/repos/matrix-model/' ## The directory of your repository.
setwd(repodir)
```
# Visualize Zinser data
Take the latest calibrated Zinser data. In the top plot, there are $m=52$ size
categories and 25 time points $t = 0, \cdots, 24$: 24 time points at 2-hour
increments over the two days, plus the initial time point $t=0$. The color and
thickness of each line show the size class, from small (blue, thin) to large
(red, thick). The sunlight variable `PAR` is overlaid (not to scale) as well.
```{r retrieve-data-helper}
##' Helper to open Zinser data.
##' @param filename NC file filename, like
##'   \code{"Zinser_SizeDist_calibrated-52-12.nc"}.
##' @param datadir Directory containing the NC file.
##' @return List containing data \code{obs} (T x m matrix), \code{time} (T
##'   length vector of minutes passed since the beginning),
##'   \code{num_sizeclass} (number of size classes, m), \code{v}
##'   (left-hand-side border of the size classes), \code{nc} (the open NC
##'   handle), and \code{PAR} (sunlight).
open_zinser <- function(filename = "Zinser_SizeDist_calibrated-52-12.nc",
                        datadir = "data"){

  ## Read data, e.g. "data/Zinser_SizeDist_calibrated-52-12.nc".
  nc <- nc_open(file.path(datadir, filename))
  PAR <- ncvar_get(nc, 'PAR')     ## Sunlight
  w_obs <- ncvar_get(nc, 'w_obs') ## Size distribution data

  ## Variables available in the nc file (listed for reference).
  names(nc$var)

  ## Number of size classes
  m <- ncvar_get(nc, 'm')

  ## Time (in number of minutes since the beginning)
  time <- ncvar_get(nc, 'time')

  ## Size classes
  delta_v_inv <- ncvar_get(nc, 'delta_v_inv')
  v_min <- ncvar_get(nc, 'v_min')
  delta_v <- 1 / delta_v_inv
  v <- v_min * 2^(((1:m) - 1) * delta_v)

  ## Name the rows and columns of w_obs.
  rownames(w_obs) = time
  colnames(w_obs) = v

  ## Return
  return(list(obs = w_obs,
              time = time,
              num_sizeclass = m,
              v = v,
              nc = nc,
              PAR = PAR))
}
```
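For example (a minimal sketch, not evaluated here, assuming the NC files sit in the repository's `data/` directory as above), the 52-class file yields a $25 \times 52$ observation matrix:
```{r quick-check, eval=FALSE}
## Quick sanity check of the helper.
dat52 <- open_zinser("Zinser_SizeDist_calibrated-52-12.nc", datadir = "data")
dim(dat52$obs)        ## Expect 25 x 52 (time points x size classes).
length(dat52$time)    ## Expect 25 time points.
range(dat52$v)        ## Left-hand borders of the size classes.
```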
You can see that growth and division are in concordance with the sunlight
variable `PAR` over the two-day period of the data.
The bottom plot shows another version of the data with a different number of
size classes: half as many size categories (26 of them), and the same number
($25$) of time points.
```{r plot, fig.width=14, fig.height=7}
## Retrieve data
for(filename in c("Zinser_SizeDist_calibrated-52-12.nc",
                  "Zinser_SizeDist_calibrated-26-6.nc")){

  ## Load data
  dat = open_zinser(filename)

  ## Plot *calibrated* Zinser data.
  ncat = ncol(dat$obs) ## alternatively, ncat = dat$num_sizeclass
  cols = colorRamps::blue2red(ncat)
  ylim = range(dat$obs)
  matplot(dat$obs, col = cols, lwd = seq(from = 0.1, to = 5, length = ncat),
          lty = 1, type = 'l', ylab = "", xlab = "")
  title(main = paste0("Zinser data, ", ncat, " size classes"))

  ## Make i_test: every d-th time point is a test point.
  d = 3
  i_test = rep(0, length(dat$time))
  i_test[seq(from = d, to = length(dat$time), by = d)] = 1

  ## Add grey regions for the time points that we're going to leave out.
  cex = 1
  abline(v = which(i_test == 1), col = 'grey', lwd = 3, lty = 1)
  for(ii in which(i_test == 1)){
    points(x = rep(ii, dat$num_sizeclass), y = dat$obs[ii,], pch = 16, col = "grey", cex = cex)
    points(x = rep(ii, dat$num_sizeclass), y = dat$obs[ii,], pch = 16, col = "white", cex = cex * 0.7)
    points(x = rep(ii, dat$num_sizeclass), y = dat$obs[ii,], pch = 1, col = "grey", cex = cex)
  }

  ## Add sunlight (rescaled to the range of the data).
  scpar = scale(dat$PAR)
  scpar = scpar - min(scpar)
  scpar = scpar / max(scpar) * max(dat$obs)
  lines(scpar, lwd = 5)

  ## Add legend
  legend("topright", lwd = 3, col = 'grey', pch = 1, legend = "Test data")
}
```
# Training and test indices for Zinser data
The `i_test` variable is a numeric vector containing 0's and 1's; the time
points left out are shown as the grey vertical lines (and empty grey points)
above. The test data is every 3rd data point out of $t = 0, \cdots, 24$:

    0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0
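As a quick standalone check (a minimal sketch of the same construction as in the plotting chunk above, with $T = 25$ time points and every $d = 3$rd point held out):
```{r itest-check, eval=FALSE}
## Rebuild i_test outside the plotting loop and print it.
T_obs <- 25                                    ## t = 0, ..., 24
d <- 3                                         ## hold out every 3rd point
i_test <- rep(0, T_obs)
i_test[seq(from = d, to = T_obs, by = d)] <- 1
print(i_test)                                  ## 0 0 1 0 0 1 ... 0 0 1 0
```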
We'll save these indices in both an `.Rdata` file and a `.csv` file for each dataset, named:

* `"Zinser_SizeDist_calibrated-52-12-itest.Rdata"`
* `"Zinser_SizeDist_calibrated-52-12-itest.csv"`
* `"Zinser_SizeDist_calibrated-26-6-itest.Rdata"`
* `"Zinser_SizeDist_calibrated-26-6-itest.csv"`
These files are written into the `data` directory:
```{r write-itest}
filenames = c("Zinser_SizeDist_calibrated-52-12",
              "Zinser_SizeDist_calibrated-26-6")
datadir = file.path(repodir, "data")
for(filename in filenames){
  dat = open_zinser(paste0(filename, ".nc"))

  ## Recompute i_test for this dataset: every 3rd time point is a test point.
  d = 3
  i_test = rep(0, length(dat$time))
  i_test[seq(from = d, to = length(dat$time), by = d)] = 1

  ## Write the indices out in both formats.
  save(i_test, file = file.path(datadir, paste0(filename, "-itest.Rdata")))
  write.table(i_test, file = file.path(datadir, paste0(filename, "-itest.csv")),
              row.names = FALSE, col.names = FALSE)
}

## Reading it back in:
i_test = unlist(read.csv(file = file.path(datadir, paste0(filenames[1], "-itest.csv")),
                         header = FALSE))
```
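When fitting, these indices go into the Stan data list alongside the observations. A minimal sketch (not evaluated; the entries `obs`, `nt_obs`, `m`, and the model file `matrixmodel.stan` are placeholders for whatever the existing fitting scripts supply):
```{r stan-data-sketch, eval=FALSE}
## Hypothetical sketch of passing the train/test split to Stan; the remaining
## entries mirror the existing fitting scripts.
dat = open_zinser("Zinser_SizeDist_calibrated-52-12.nc")
stan_data <- list(m      = dat$num_sizeclass,   ## number of size classes
                  nt_obs = length(dat$time),    ## number of observed time points
                  obs    = t(dat$obs),          ## m x nt_obs observation matrix
                  i_test = as.integer(i_test),  ## 1 = test, 0 = training
                  n_test = sum(i_test))         ## number of test time points
fit <- stan(file = "matrixmodel.stan", data = stan_data)  ## placeholder model file
```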
# Stan code for model fitting
Lastly, using this vector `i_test`, from now on **you should experiment with model
variations only on the training points** and **evaluate on the test
points**. Compared to existing scripts, the key difference is the use of the
variable `log_like_test` -- we are explicitly looking for models that yield
higher out-of-sample likelihoods.

Here is roughly what to do in Stan. In fitting (training), only the training
points (`i_test[it] == 0`) are used. In calculating the test score, use only the
time points for which `i_test[it] == 1`.
```stan
model {
  real diff;

  // priors
  delta_max ~ uniform(0.0, 1440.0/dt); // T[0.0, 1440.0/dt];
  gamma_max ~ uniform(0.0, 1440.0/dt);
  E_star ~ normal(3000.0, 10.0);
  sigma ~ exponential(1000.0);

  // fitting observations: only the training time points (i_test[it] == 0)
  for (it in 1:nt_obs){
    diff = 0.0;
    if (i_test[it] == 0){
      for (iv in 1:m){
        diff += fabs(mod_obspos[iv,it] - obs[iv,it]);
      }
      diff = diff/sigma;
      diff ~ normal(0.0, 1.0) T[0,];
    }
  }
}
generated quantities {
  real log_like_test = 0;
  {
    real diff;
    // test score: only the held-out time points (i_test[it] == 1)
    for (it in 1:nt_obs){
      if (i_test[it] == 1){
        diff = 0.0;
        for (iv in 1:m){
          diff += fabs(mod_obspos[iv,it] - obs[iv,it]);
        }
        diff = diff/sigma;
        log_like_test += normal_lpdf(diff | 0.0, 1.0);
      }
    }
    log_like_test = log_like_test/n_test;
  }
}
```
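After fitting, the posterior mean of `log_like_test` gives a single out-of-sample score per model; higher is better. A minimal sketch (not evaluated), assuming a fitted `stanfit` object named `fit` as in the earlier sketch:
```{r extract-test-score, eval=FALSE}
## Compare model variants by their average held-out log likelihood.
ll_test <- rstan::extract(fit, pars = "log_like_test")$log_like_test
mean(ll_test)  ## posterior mean of the (per-time-point) test log likelihood
```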
# Other notes
1. The Zinser model evaluation is fishy (one-day-long iterations, using the
   previous day's *real* data rather than the model's own output).
2. How do we rerun the Zinser model to get a fairer model fit? This is needed in
   order to compare our model to theirs.
3. For the SeaFlow data, within the time frame (e.g. 1 day) we will apply
   the model to, which time points do we leave out as a "test" set?
# References
* Zinser 2009 paper: https://github.com/fribalet/Bayesian-matrixmodel/blob/master/papers/journal.pone.0005135-2.PDF or https://www.dropbox.com/s/wb4732l3lzjqrz7/zinser-2009.PDF?dl=0
* Sosik 2003 paper: https://www.dropbox.com/s/tsqmyz8sqr8mpj7/sosik-2003.pdf?dl=0