diff --git a/docs/GP_Practical_files/figure-html/unnamed-chunk-6-1.png b/docs/GP_Practical_files/figure-html/unnamed-chunk-6-1.png index a7c1e3f..a2200f6 100644 Binary files a/docs/GP_Practical_files/figure-html/unnamed-chunk-6-1.png and b/docs/GP_Practical_files/figure-html/unnamed-chunk-6-1.png differ diff --git a/docs/GP_Solutions.html b/docs/GP_Solutions.html index 6f308c9..6a615b0 100644 --- a/docs/GP_Solutions.html +++ b/docs/GP_Solutions.html @@ -254,7 +254,7 @@
Challenges
# Pulling the data from the NEON data base. 
 target <- readr::read_csv("https://data.ecoforecast.org/neon4cast-targets/ticks/ticks-targets.csv.gz", guess_max = 1e1)
-
Rows: 601 Columns: 5
+
Rows: 637 Columns: 5
 ── Column specification ────────────────────────────────────────────────────────
 Delimiter: ","
 chr  (3): site_id, variable, iso_week
@@ -756,8 +756,8 @@ 
X, Y, sometimes with bold or with subscripts – to denote the RVs. In contrast we use lower case letters, e.g. x, y, k, to denote the values that the RV takes. For instance, let's say that the heights of women at Virginia Tech are the RV, X, and X has a normal distribution with mean 62 inches and variance 6^2, i.e., X \sim \mathrm{N}(62,6^2). Say we then observe the heights of 3 individuals drawn from this distribution – we would write this as: x=( 60.3, 62.9, 63.8 ).

+

We usually use capital letters – e.g. X, Y, sometimes with bold or with subscripts – to denote the RVs. In contrast we use lower case letters, e.g. x, y, k, to denote the values that the RV takes. For instance, let's say that the heights of women at Virginia Tech are the RV, X, and X has a normal distribution with mean 62 inches and variance 6^2, i.e., X \sim \mathrm{N}(62,6^2). Say we then observe the heights of 3 individuals drawn from this distribution – we would write this as: x=( 69.2, 63.3, 61.3 ).



@@ -577,7 +577,7 @@

Probability Distributions in R

rnorm(3, mean=0, sd=1) ## random draws
-
[1] 1.410971 1.186608 0.194505
+
[1] -1.1472029 -0.2967273 -1.4990834
diff --git a/docs/Stats_review_files/figure-html/unnamed-chunk-7-1.png b/docs/Stats_review_files/figure-html/unnamed-chunk-7-1.png index 5f84ef0..9fc043c 100644 Binary files a/docs/Stats_review_files/figure-html/unnamed-chunk-7-1.png and b/docs/Stats_review_files/figure-html/unnamed-chunk-7-1.png differ diff --git a/docs/Stats_review_files/figure-html/unnamed-chunk-8-1.png b/docs/Stats_review_files/figure-html/unnamed-chunk-8-1.png index 224288c..bc3d2f8 100644 Binary files a/docs/Stats_review_files/figure-html/unnamed-chunk-8-1.png and b/docs/Stats_review_files/figure-html/unnamed-chunk-8-1.png differ diff --git a/docs/Stats_review_files/figure-html/unnamed-chunk-9-1.png b/docs/Stats_review_files/figure-html/unnamed-chunk-9-1.png index 70cf592..caf7ef7 100644 Binary files a/docs/Stats_review_files/figure-html/unnamed-chunk-9-1.png and b/docs/Stats_review_files/figure-html/unnamed-chunk-9-1.png differ diff --git a/docs/VB_RegDiagTrans_files/figure-revealjs/unnamed-chunk-15-1.png b/docs/VB_RegDiagTrans_files/figure-revealjs/unnamed-chunk-15-1.png index 69cdf4c..8c53272 100644 Binary files a/docs/VB_RegDiagTrans_files/figure-revealjs/unnamed-chunk-15-1.png and b/docs/VB_RegDiagTrans_files/figure-revealjs/unnamed-chunk-15-1.png differ diff --git a/docs/VB_RegDiagTrans_files/figure-revealjs/unnamed-chunk-19-1.png b/docs/VB_RegDiagTrans_files/figure-revealjs/unnamed-chunk-19-1.png index 0a58c91..e7c7a83 100644 Binary files a/docs/VB_RegDiagTrans_files/figure-revealjs/unnamed-chunk-19-1.png and b/docs/VB_RegDiagTrans_files/figure-revealjs/unnamed-chunk-19-1.png differ diff --git a/docs/VB_RegDiagTrans_files/figure-revealjs/unnamed-chunk-20-1.png b/docs/VB_RegDiagTrans_files/figure-revealjs/unnamed-chunk-20-1.png index 4373d14..f4dc586 100644 Binary files a/docs/VB_RegDiagTrans_files/figure-revealjs/unnamed-chunk-20-1.png and b/docs/VB_RegDiagTrans_files/figure-revealjs/unnamed-chunk-20-1.png differ diff --git a/docs/VB_RegDiagTrans_files/figure-revealjs/unnamed-chunk-7-1.png b/docs/VB_RegDiagTrans_files/figure-revealjs/unnamed-chunk-7-1.png index 3c41c0f..5cd07c7 100644 Binary files a/docs/VB_RegDiagTrans_files/figure-revealjs/unnamed-chunk-7-1.png and b/docs/VB_RegDiagTrans_files/figure-revealjs/unnamed-chunk-7-1.png differ diff --git a/docs/VB_TimeDepData_practical_soln.html b/docs/VB_TimeDepData_practical_soln.html new file mode 100644 index 0000000..306669c --- /dev/null +++ b/docs/VB_TimeDepData_practical_soln.html @@ -0,0 +1,1355 @@ + + + + + + + + + + + +VectorByte Training 2024 - VectorByte Methods Training: Regression Methods for Time Dependent Data (practical - solution) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

VectorByte Methods Training: Regression Methods for Time Dependent Data (practical - solution)

Author
Leah R. Johnson

Affiliation
Virginia Tech and VectorByte

Published
July 23, 2024


+
+

Overview and Instructions

+

The goal of this practical is to practice building models for time-dependent data using simple regression-based techniques. This includes incorporating possible transformations, trying out different time-dependent predictors (including lagged variables), and assessing model fit using diagnostic plots.

+


+
+
+

Guided example: Monthly average mosquito counts in Walton County, FL

+

The file Culex_erraticus_walton_covariates_aggregated.csv on the course website contains data on average monthly counts of mosquitoes (sample_value) in Walton County, FL, together with the monthly average maximum temperature (MaxTemp, in C) and precipitation (Precip, in inches) for each month from January 2015 through December 2017 (Month_Yr).

+
+

Exploring the Data

+

As always, we first want to take a look at the data, to make sure we understand it, and that we don’t have missing or weird values.

+
+
mozData<-read.csv("data/Culex_erraticus_walton_covariates_aggregated.csv")
+summary(mozData)
+
+
   Month_Yr          sample_value        MaxTemp          Precip      
+ Length:36          Min.   :0.00000   Min.   :16.02   Min.   : 0.000  
+ Class :character   1st Qu.:0.04318   1st Qu.:22.99   1st Qu.: 2.162  
+ Mode  :character   Median :0.73001   Median :26.69   Median : 4.606  
+                    Mean   :0.80798   Mean   :26.23   Mean   : 5.595  
+                    3rd Qu.:1.22443   3rd Qu.:30.70   3rd Qu.: 7.864  
+                    Max.   :3.00595   Max.   :33.31   Max.   :18.307  
+
+
+

We can see that the minimum observed average number of mosquitoes is zero, and the max is only about 3 (there are likely many zeros averaged over many days in the month). There don't appear to be any NAs in the data. In this case the dataset itself is small enough that we can print the whole thing to ensure it's complete:

+
+
mozData
+
+
   Month_Yr sample_value  MaxTemp       Precip
+1   2015-01  0.000000000 17.74602  3.303991888
+2   2015-02  0.018181818 17.87269 16.544265802
+3   2015-03  0.468085106 23.81767  2.405651215
+4   2015-04  1.619047619 26.03559  8.974406168
+5   2015-05  0.821428571 30.01602  0.567960943
+6   2015-06  3.005952381 31.12094  4.841342729
+7   2015-07  2.380952381 32.81130  3.849010353
+8   2015-08  1.826347305 32.56245  5.562845324
+9   2015-09  0.648809524 30.55155 10.409724627
+10  2015-10  0.988023952 27.22605  0.337750269
+11  2015-11  0.737804878 24.86768 18.306749680
+12  2015-12  0.142857143 22.46588  5.621475377
+13  2016-01  0.000000000 16.02406  3.550622029
+14  2016-02  0.020202020 19.42057 11.254680803
+15  2016-03  0.015151515 23.13610  4.785664728
+16  2016-04  0.026143791 24.98082  4.580424519
+17  2016-05  0.025252525 28.72884  0.053057634
+18  2016-06  0.833333333 30.96990  6.155417473
+19  2016-07  1.261363636 33.30509  4.496368193
+20  2016-08  1.685279188 32.09633 11.338749182
+21  2016-09  2.617142857 31.60575  2.868288451
+22  2016-10  1.212121212 29.14275  0.000000000
+23  2016-11  1.539772727 24.48482  0.005462681
+24  2016-12  0.771573604 20.46054 11.615521725
+25  2017-01  0.045454545 18.35473  0.000000000
+26  2017-02  0.036363636 23.65584  3.150710053
+27  2017-03  0.194285714 22.53573  1.430094952
+28  2017-04  0.436548223 26.15299  0.499381616
+29  2017-05  1.202020202 28.00173  6.580562663
+30  2017-06  0.834196891 29.48951 13.333939858
+31  2017-07  1.765363128 32.25135  7.493927035
+32  2017-08  0.744791667 31.86476  6.082113434
+33  2017-09  0.722222222 30.60566  4.631037395
+34  2017-10  0.142131980 27.73453 11.567112214
+35  2017-11  0.289772727 23.23140  1.195760473
+36  2017-12  0.009174312 18.93603  4.018254442
+
+
+
+
+

Plotting the data

+

First we’ll examine the data itself, including the predictors:

+
+
months<-dim(mozData)[1]
+t<-1:months ## counter for months in the data set
+par(mfrow=c(3,1))
+plot(t, mozData$sample_value, type="l", lwd=2, 
+     main="Average Monthly Abundance", 
+     xlab ="Time (months)", 
+     ylab = "Average Count")
+plot(t, mozData$MaxTemp, type="l",
+     col = 2, lwd=2, 
+     main="Average Maximum Temp", 
+     xlab ="Time (months)", 
+     ylab = "Temperature (C)")
+plot(t, mozData$Precip, type="l",
+     col="dodgerblue", lwd=2,
+     main="Average Monthly Precip", 
+     xlab ="Time (months)", 
+     ylab = "Precipitation (in)")

Visually, we notice that there may be a bit of clumping in the values for abundance (this is subtle) – in particular, since we have a lot of very small/nearly-zero counts, a transform, such as a square root, may spread out the abundances. It also looks like both the abundance and temperature data are more cyclical than the precipitation, and thus more likely to be related to each other. There's also not much visual indication of a trend, but it's usually worthwhile to consider one anyway. Replotting the abundance data with a transformation:

+
+
months<-dim(mozData)[1]
+t<-1:months ## counter for months in the data set
+plot(t, sqrt(mozData$sample_value), type="l", lwd=2, 
+     main="Sqrt Average Monthly Abundance", 
+     xlab ="Time (months)", 
+     ylab = "Average Count")

That looks a little bit better. I suggest we go with this for our response.
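If you want to double-check that choice, it's easy to compare the square root against another common transform for count-like data, such as log(x+1). This is a quick sketch (an editor's addition rather than part of the original solution), reusing the mozData object loaded above:

## compare two candidate transforms side by side;
## log1p(x) computes log(x + 1), which is safe for the zero counts
par(mfrow=c(1,2))
hist(sqrt(mozData$sample_value), main="sqrt transform",
     xlab="sqrt(average count)")
hist(log1p(mozData$sample_value), main="log(x+1) transform",
     xlab="log(average count + 1)")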

+
+
+

Building a data frame

+

Before we get into model building, we always want to build a data frame to contain all of the predictors that we want to consider, at the potential lags that we’re interested in. In the lecture we saw building the AR, sine/cosine, and trend predictors:

+
+
t <- 2:months ## to make building the AR1 predictors easier
+
+mozTS <- data.frame(
+  Y=sqrt(mozData$sample_value[t]), # transformed response
+  Yl1=sqrt(mozData$sample_value[t-1]), # AR1 predictor
+  t=t, # trend predictor
+  sin12=sin(2*pi*t/12), 
+  cos12=cos(2*pi*t/12) # periodic predictors
+  )
+
+

We will also put in the temperature and precipitation predictors, but we need to think about what an appropriate lag might be. If this were daily or weekly data, we'd probably want a fairly sizable lag – mosquitoes take a while to develop, so the number we see today is not likely related to today's temperature. However, since these data are aggregated across a whole month, as are the temperature and precipitation, the current month's values are likely to be useful. It's possible that last month's values may be useful too, so we'll add those in as well:

+
+
mozTS$MaxTemp<-mozData$MaxTemp[t] ## current temps
+mozTS$MaxTempl1<-mozData$MaxTemp[t-1] ## previous temps
+mozTS$Precip<-mozData$Precip[t] ## current precip
+mozTS$Precipl1<-mozData$Precip[t-1] ## previous precip
+
+

Thus, our full data frame:

+
+
summary(mozTS)
+
+
       Y               Yl1               t            sin12         
+ Min.   :0.0000   Min.   :0.0000   Min.   : 2.0   Min.   :-1.00000  
+ 1st Qu.:0.2951   1st Qu.:0.2951   1st Qu.:10.5   1st Qu.:-0.68301  
+ Median :0.8590   Median :0.8590   Median :19.0   Median : 0.00000  
+ Mean   :0.7711   Mean   :0.7684   Mean   :19.0   Mean   :-0.01429  
+ 3rd Qu.:1.1120   3rd Qu.:1.1120   3rd Qu.:27.5   3rd Qu.: 0.68301  
+ Max.   :1.7338   Max.   :1.7338   Max.   :36.0   Max.   : 1.00000  
+     cos12             MaxTemp        MaxTempl1         Precip      
+ Min.   :-1.00000   Min.   :16.02   Min.   :16.02   Min.   : 0.000  
+ 1st Qu.:-0.68301   1st Qu.:23.18   1st Qu.:23.18   1st Qu.: 1.918  
+ Median : 0.00000   Median :27.23   Median :27.23   Median : 4.631  
+ Mean   :-0.02474   Mean   :26.47   Mean   :26.44   Mean   : 5.660  
+ 3rd Qu.: 0.50000   3rd Qu.:30.79   3rd Qu.:30.79   3rd Qu.: 8.234  
+ Max.   : 1.00000   Max.   :33.31   Max.   :33.31   Max.   :18.307  
+    Precipl1     
+ Min.   : 0.000  
+ 1st Qu.: 1.918  
+ Median : 4.631  
+ Mean   : 5.640  
+ 3rd Qu.: 8.234  
+ Max.   :18.307  
+
+
+
+
head(mozTS)
+
+
          Y       Yl1 t         sin12         cos12  MaxTemp MaxTempl1
+1 0.1348400 0.0000000 2  8.660254e-01  5.000000e-01 17.87269  17.74602
+2 0.6841675 0.1348400 3  1.000000e+00  6.123234e-17 23.81767  17.87269
+3 1.2724180 0.6841675 4  8.660254e-01 -5.000000e-01 26.03559  23.81767
+4 0.9063270 1.2724180 5  5.000000e-01 -8.660254e-01 30.01602  26.03559
+5 1.7337683 0.9063270 6  1.224647e-16 -1.000000e+00 31.12094  30.01602
+6 1.5430335 1.7337683 7 -5.000000e-01 -8.660254e-01 32.81130  31.12094
+      Precip   Precipl1
+1 16.5442658  3.3039919
+2  2.4056512 16.5442658
+3  8.9744062  2.4056512
+4  0.5679609  8.9744062
+5  4.8413427  0.5679609
+6  3.8490104  4.8413427
+
+
+
+
+

Building a first model

+

We will first build a very simple model – just a trend – to practice building the model, checking diagnostics, and plotting predictions.

+
+
mod1<-lm(Y ~ t, data=mozTS)
+summary(mod1)
+
+

+Call:
+lm(formula = Y ~ t, data = mozTS)
+
+Residuals:
+     Min       1Q   Median       3Q      Max 
+-0.81332 -0.47902  0.03671  0.37384  0.87119 
+
+Coefficients:
+             Estimate Std. Error t value Pr(>|t|)    
+(Intercept)  0.904809   0.178421   5.071  1.5e-05 ***
+t           -0.007038   0.008292  -0.849    0.402    
+---
+Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+Residual standard error: 0.4954 on 33 degrees of freedom
+Multiple R-squared:  0.02136,   Adjusted R-squared:  -0.008291 
+F-statistic: 0.7204 on 1 and 33 DF,  p-value: 0.4021
+
+
+

The model output indicates that this model is not useful – the trend is not significant and it only explains about 2% of the variability. Let’s plot the predictions:

+
+
## plot points and fitted lines
+plot(Y~t, data=mozTS, col=1, type="l")
+lines(t, mod1$fitted, col="dodgerblue", lwd=2)

Not good – we'll definitely need to try something else! Remember that since we're using a linear model, we should check our residual plots as usual, and then also plot the ACF of the residuals:

+
+
par(mfrow=c(1,3), mar=c(4,4,2,0.5))   
+
+## studentized residuals vs fitted
+plot(mod1$fitted, rstudent(mod1), col=1,
+     xlab="Fitted Values", 
+     ylab="Studentized Residuals", 
+     pch=20, main="trend only model")
+
+## qq plot of studentized residuals
+qqnorm(rstudent(mod1), pch=20, col=1, main="" )
+abline(a=0,b=1,lty=2, col=2)
+
+## histogram of studentized residuals
+hist(rstudent(mod1), col=1, 
+     xlab="Studentized Residuals", 
+     main="", border=8)

This doesn't look really bad, although the histogram might be a bit odd. Finally, the ACF:

+
+
acf(mod1$residuals)

This is where we can see that we definitely aren't capturing the pattern: there's substantial autocorrelation left at a 1-month lag, and again at around 6 months.
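If you'd like a formal test to go along with the visual read of the ACF (an addition here, not something the original solution used), the Ljung–Box test in base R checks whether the residual autocorrelations up to a chosen lag are jointly zero:

## small p-value = evidence of leftover autocorrelation in the residuals
Box.test(mod1$residuals, lag = 12, type = "Ljung-Box")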

+

Finally, to set up the comparisons below, we can extract the BIC for this model so that we can compare it with the other models that you'll build next.

+
+
n<-length(t)
+extractAIC(mod1, k=log(n))[2]
+
+
[1] -44.11057
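Since we'll extract the BIC the same way for every model below, it may be worth wrapping the call in a small helper (getBIC is a hypothetical convenience function, not part of the original solution):

## helper: BIC for an lm fit based on n observations
getBIC <- function(mod, n) extractAIC(mod, k = log(n))[2]
getBIC(mod1, n = length(t))   ## same value as above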
+
+
+
+
+
+

Build and compare your own models (Example solution)

+

Follow the procedure I showed for the model with a simple trend, and build at least 4 more models:

+
  1. one that contains an AR term
  2. one with the sine/cosine terms
  3. one with the environmental predictors
  4. one with a combination

Check diagnostics/model assumptions as you go. Then at the end compare all of your models via BIC. What is your best model by that metric? We’ll share among the group what folks found to be good models.

+

NOTE: The solutions I show below are examples of what one could do; your models might be a bit different.

+
+

Example Solution: AR1 model only

+
+
mod2<-lm(Y ~ Yl1, data=mozTS)
+summary(mod2)
+
+

+Call:
+lm(formula = Y ~ Yl1, data = mozTS)
+
+Residuals:
+    Min      1Q  Median      3Q     Max 
+-0.6338 -0.2173 -0.0678  0.2463  0.8675 
+
+Coefficients:
+            Estimate Std. Error t value Pr(>|t|)    
+(Intercept)   0.2410     0.1130   2.132   0.0405 *  
+Yl1           0.6899     0.1240   5.562 3.51e-06 ***
+---
+Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+Residual standard error: 0.3598 on 33 degrees of freedom
+Multiple R-squared:  0.4839,    Adjusted R-squared:  0.4682 
+F-statistic: 30.94 on 1 and 33 DF,  p-value: 3.507e-06
+
+
+

The model is better than the original trend only model – the AR1 term explains about 48% of the variability. Let’s plot the predictions:

+
+
## plot points and fitted lines
+plot(Y~t, data=mozTS, col=1, type="l")
+lines(t, mod2$fitted, col=2, lwd=2)

Pretty good! Look at all of the diagnostic plots:

+
+
par(mfrow=c(1,3), mar=c(4,4,2,0.5))   
+
+## studentized residuals vs fitted
+plot(mod2$fitted, rstudent(mod2), col=2,
+     xlab="Fitted Values", 
+     ylab="Studentized Residuals", 
+     pch=20, main="AR 1 only model")
+
+## qq plot of studentized residuals
+qqnorm(rstudent(mod2), pch=20, col=2, main="" )
+abline(a=0,b=1,lty=2, col=1)
+
+## histogram of studentized residuals
+hist(rstudent(mod2), col=2, 
+     xlab="Studentized Residuals", 
+     main="", border=8)

Maybe one outlier, but not too bad.

+
+
acf(mod2$residuals)

We seem to have taken care of all of the autocorrelation, even at multiple lags!

+
+
n<-length(t)
+extractAIC(mod2, k=log(n))[2]
+
+
[1] -66.50482
+
+
+

The BIC is much lower – overall a much better model than the first one.

+
+
+

Example Solution: sine/cosine terms only

+
+
mod3<-lm(Y ~ sin12 + cos12, data=mozTS)
+summary(mod3)
+
+

+Call:
+lm(formula = Y ~ sin12 + cos12, data = mozTS)
+
+Residuals:
+     Min       1Q   Median       3Q      Max 
+-0.70116 -0.21655 -0.03611  0.19213  0.67992 
+
+Coefficients:
+            Estimate Std. Error t value Pr(>|t|)    
+(Intercept)  0.75706    0.05750  13.165 1.83e-14 ***
+sin12       -0.38804    0.08072  -4.807 3.48e-05 ***
+cos12       -0.34298    0.08192  -4.187 0.000207 ***
+---
+Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+Residual standard error: 0.3399 on 32 degrees of freedom
+Multiple R-squared:  0.5533,    Adjusted R-squared:  0.5254 
+F-statistic: 19.82 on 2 and 32 DF,  p-value: 2.512e-06
+
+
+

The model is better than the original trend only model – it explains about 55% of the variability (we expect R^2 to increase as we have more predictors). Let’s plot the predictions:

+
+
## plot points and fitted lines
+plot(Y~t, data=mozTS, col=1, type="l")
+lines(t, mod3$fitted, col=3, lwd=2)

Pretty good! Look at all of the diagnostic plots:

+
+
par(mfrow=c(1,3), mar=c(4,4,2,0.5))   
+
+## studentized residuals vs fitted
+plot(mod3$fitted, rstudent(mod3), col=3,
+     xlab="Fitted Values", 
+     ylab="Studentized Residuals", 
+     pch=20, main="sin/cos only model")
+
+## qq plot of studentized residuals
+qqnorm(rstudent(mod3), pch=20, col=3, main="" )
+abline(a=0,b=1,lty=2, col=2)
+
+## histogram of studentized residuals
+hist(rstudent(mod3), col=3, 
+     xlab="Studentized Residuals", 
+     main="", border=8)

Maybe one outlier, but not too bad.

+
+
acf(mod3$residuals)

We seem to have taken care of the longer-lag autocorrelation, but there's still some left at lag 1.

+
+
n<-length(t)
+extractAIC(mod3, k=log(n))[2]
+
+
[1] -68.00597
+
+
+

This model is even better than the AR1 model. We’ll keep this in mind….

+
+
+

Example Solution: environmental predictors only

+

I’ll put in the predictors at the current time period. Since this is monthly averaged data we could probably do either current or lagged.

+
+
mod4<-lm(Y ~ MaxTemp + Precip, data=mozTS)
+summary(mod4)
+
+

+Call:
+lm(formula = Y ~ MaxTemp + Precip, data = mozTS)
+
+Residuals:
+     Min       1Q   Median       3Q      Max 
+-0.76043 -0.17925 -0.01671  0.15491  0.64193 
+
+Coefficients:
+             Estimate Std. Error t value Pr(>|t|)    
+(Intercept) -1.248452   0.323576  -3.858 0.000521 ***
+MaxTemp      0.075450   0.011641   6.481 2.72e-07 ***
+Precip       0.003928   0.011870   0.331 0.742852    
+---
+Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+Residual standard error: 0.3344 on 32 degrees of freedom
+Multiple R-squared:  0.5676,    Adjusted R-squared:  0.5406 
+F-statistic:    21 on 2 and 32 DF,  p-value: 1.493e-06
+
+
+

The model is even better than the last – it explains about 57% of the variability, although Precip isn't significant, so we might want to consider dropping it.
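Dropping Precip is a one-line check if you're curious – a quick sketch (mod4b is my name for it; it was not fit in the original solution):

## temperature-only model; its BIC can be compared with the others below
mod4b <- lm(Y ~ MaxTemp, data = mozTS)
extractAIC(mod4b, k = log(length(t)))[2]

Let's plot the predictions: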

+
+
## plot points and fitted lines
+plot(Y~t, data=mozTS, col=1, type="l")
+lines(t, mod4$fitted, col=4, lwd=2)

Pretty good! Look at all of the diagnostic plots:

+
+
par(mfrow=c(1,3), mar=c(4,4,2,0.5))   
+
+## studentized residuals vs fitted
+plot(mod4$fitted, rstudent(mod4), col=4,
+     xlab="Fitted Values", 
+     ylab="Studentized Residuals", 
+     pch=20, main="weather model")
+
+## qq plot of studentized residuals
+qqnorm(rstudent(mod4), pch=20, col=4, main="" )
+abline(a=0,b=1,lty=2, col=2)
+
+## histogram of studentized residuals
+hist(rstudent(mod4), col=4, 
+     xlab="Studentized Residuals", 
+     main="", border=8)

Maybe one outlier again, but not too bad.

+
+
acf(mod4$residuals)

We seem to have taken care of most of the autocorrelation, except maybe a bit at lag 1.

+
+
n<-length(t)
+extractAIC(mod4, k=log(n))[2]
+
+
[1] -69.14372
+
+
+

Even better, although it's not much different from the sin/cos model.

+
+
+

Example Solution: AR1 plus sin/cos

+

Ok, now to combine things:

+
+
mod5<-lm(Y ~ Yl1 + sin12 + cos12, data=mozTS)
+summary(mod5)
+
+

+Call:
+lm(formula = Y ~ Yl1 + sin12 + cos12, data = mozTS)
+
+Residuals:
+     Min       1Q   Median       3Q      Max 
+-0.49092 -0.25028 -0.02153  0.17287  0.60748 
+
+Coefficients:
+            Estimate Std. Error t value Pr(>|t|)    
+(Intercept)  0.38035    0.12935   2.940 0.006148 ** 
+Yl1          0.49652    0.15681   3.166 0.003453 ** 
+sin12       -0.13417    0.10729  -1.251 0.220457    
+cos12       -0.29593    0.07386  -4.007 0.000358 ***
+---
+Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+Residual standard error: 0.3002 on 31 degrees of freedom
+Multiple R-squared:  0.6625,    Adjusted R-squared:  0.6298 
+F-statistic: 20.28 on 3 and 31 DF,  p-value: 1.835e-07
+
+
+

This combined model is the best yet – together the AR1 and seasonal terms explain about 66% of the variability. Let's plot the predictions:

+
+
## plot points and fitted lines
+plot(Y~t, data=mozTS, col=1, type="l")
+lines(t, mod5$fitted, col=5, lwd=2)

Pretty good! Look at all of the diagnostic plots:

+
+
par(mfrow=c(1,3), mar=c(4,4,2,0.5))   
+
+## studentized residuals vs fitted
+plot(mod5$fitted, rstudent(mod5), col=5,
+     xlab="Fitted Values", 
+     ylab="Studentized Residuals", 
+     pch=20, main="AR1 + sin/cos model")
+
+## qq plot of studentized residuals
+qqnorm(rstudent(mod5), pch=20, col=5, main="" )
+abline(a=0,b=1,lty=2, col=2)
+
+## histogram of studentized residuals
+hist(rstudent(mod5), col=5, 
+     xlab="Studentized Residuals", 
+     main="", border=8)

That's really good!

+
+
acf(mod5$residuals)

We seem to have taken care of all of the autocorrelation!
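As a formal complement (again an editor's addition, not part of the original solution), the same Ljung–Box test from earlier can be used to double-check this:

## a large p-value is consistent with no remaining autocorrelation
Box.test(mod5$residuals, lag = 12, type = "Ljung-Box")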

+
+
n<-length(t)
+extractAIC(mod5, k=log(n))[2]
+
+
[1] -74.25862
+
+
+

And definitely the best so far. Just to compare more easily:

+
+
c(mod1 = extractAIC(mod1, k=log(n))[2],
+  mod2 = extractAIC(mod2, k=log(n))[2],
+  mod3 = extractAIC(mod3, k=log(n))[2],
+  mod4 = extractAIC(mod4, k=log(n))[2],
+  mod5 = extractAIC(mod5, k=log(n))[2])
+
+
     mod1      mod2      mod3      mod4      mod5 
+-44.11057 -66.50482 -68.00597 -69.14372 -74.25862 
+
+
+

We're looking for a difference of about 5 in BIC to call one model better than another. Model 5 is about 5 better than model 4, and models 2-4 are all roughly even. It may be that AR1 plus temperature would be even better, but it's easier to forecast with a sine/cosine than with temperature, so I went with that (a quick check is sketched below).
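The AR1-plus-temperature idea is quick to try if you're curious – a sketch (mod6 is my name for it; it was not fit in the original solution, so no BIC for it is reported above):

## AR1 term plus current-month temperature
mod6 <- lm(Y ~ Yl1 + MaxTemp, data = mozTS)
extractAIC(mod6, k = log(n))[2]  ## compare against mod5's -74.26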

+
+
+
+

Extra Practice

+

Imagine that you are missing a few months at random – how would you need to modify the analysis? Try it out by removing about 5 months, not at the beginning or end of the time series.
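One way to set this up, as a sketch (mozMiss, mozTS2, and mod5miss are my names, not part of the original solution; the key wrinkle is that deleting a month also invalidates the next month's AR1 predictor, which lm() handles by dropping incomplete rows):

## make 5 interior months missing at random, rebuild the lagged
## data frame, and refit; lm() silently drops rows containing NA
set.seed(1)
drop <- sample(5:30, 5)            # interior months only
mozMiss <- mozData
mozMiss$sample_value[drop] <- NA   # pretend these months are missing
t2 <- 2:nrow(mozMiss)
mozTS2 <- data.frame(
  Y     = sqrt(mozMiss$sample_value[t2]),
  Yl1   = sqrt(mozMiss$sample_value[t2-1]),
  sin12 = sin(2*pi*t2/12),
  cos12 = cos(2*pi*t2/12))
mod5miss <- lm(Y ~ Yl1 + sin12 + cos12, data = mozTS2)
summary(mod5miss)

Note that once rows are dropped, BIC values are based on fewer observations and are no longer directly comparable to the fits above.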

+ + +
+ +

Citation

BibTeX citation:
@online{r. johnson2024,
+  author = {R. Johnson, Leah},
+  title = {VectorByte {Methods} {Training:} {Regression} {Methods} for
+    {Time} {Dependent} {Data} (Practical - Solution)},
+  date = {2024-07-23},
+  langid = {en}
+}
+
For attribution, please cite this work as:
R. Johnson, Leah. 2024. “VectorByte Methods Training: Regression Methods for Time Dependent Data (Practical - Solution).” July 23, 2024.
+ +
+ + + + + \ No newline at end of file diff --git a/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-10-1.png b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-10-1.png new file mode 100644 index 0000000..92d91fb Binary files /dev/null and b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-10-1.png differ diff --git a/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-11-1.png b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-11-1.png new file mode 100644 index 0000000..021a990 Binary files /dev/null and b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-11-1.png differ diff --git a/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-12-1.png b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-12-1.png new file mode 100644 index 0000000..d0e16aa Binary files /dev/null and b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-12-1.png differ diff --git a/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-15-1.png b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-15-1.png new file mode 100644 index 0000000..d7f46cf Binary files /dev/null and b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-15-1.png differ diff --git a/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-16-1.png b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-16-1.png new file mode 100644 index 0000000..d27c705 Binary files /dev/null and b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-16-1.png differ diff --git a/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-17-1.png b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-17-1.png new file mode 100644 index 0000000..2a397ec Binary files /dev/null and b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-17-1.png differ diff --git a/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-20-1.png b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-20-1.png new file mode 100644 index 0000000..d8ddef1 Binary files /dev/null and b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-20-1.png differ diff --git a/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-21-1.png b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-21-1.png new file mode 100644 index 0000000..0bbae3e Binary files /dev/null and b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-21-1.png differ diff --git a/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-22-1.png b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-22-1.png new file mode 100644 index 0000000..a0b050b Binary files /dev/null and b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-22-1.png differ diff --git a/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-25-1.png b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-25-1.png new file mode 100644 index 0000000..fa1733b Binary files /dev/null and b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-25-1.png differ diff --git a/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-26-1.png b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-26-1.png new file mode 100644 index 0000000..d485610 Binary files /dev/null and 
b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-26-1.png differ diff --git a/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-27-1.png b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-27-1.png new file mode 100644 index 0000000..7122209 Binary files /dev/null and b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-27-1.png differ diff --git a/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-3-1.png b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-3-1.png new file mode 100644 index 0000000..a1b87b6 Binary files /dev/null and b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-3-1.png differ diff --git a/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-30-1.png b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-30-1.png new file mode 100644 index 0000000..5783c25 Binary files /dev/null and b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-30-1.png differ diff --git a/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-31-1.png b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-31-1.png new file mode 100644 index 0000000..8fd6176 Binary files /dev/null and b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-31-1.png differ diff --git a/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-32-1.png b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-32-1.png new file mode 100644 index 0000000..bb5160a Binary files /dev/null and b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-32-1.png differ diff --git a/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-4-1.png b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-4-1.png new file mode 100644 index 0000000..4c166ec Binary files /dev/null and b/docs/VB_TimeDepData_practical_soln_files/figure-html/unnamed-chunk-4-1.png differ diff --git a/docs/search.json b/docs/search.json index 8f37c19..76dfd5a 100644 --- a/docs/search.json +++ b/docs/search.json @@ -53,7 +53,7 @@ "href": "GP_Solutions.html", "title": "GP_Solutions", "section": "", - "text": "Libraries\n\nlibrary(mvtnorm)\nlibrary(laGP)\nlibrary(hetGP)\nlibrary(ggplot2)\n\n\n\nHetGP (sin wave eg)\n\n# Your turn\nset.seed(26)\nn <- 8 # number of points\nX <- matrix(seq(0, 2*pi, length= n), ncol=1) # build inputs \ny <- 5*sin(X) + rnorm(n, 0 , 2) # response with some noise\n\n# Predict on this set\nXX <- matrix(seq(-0.5, 2*pi + 0.5, length= 100), ncol=1)\n\n# Data visualization\nplot(X, y)\n\n\n\n\n\n\n\n# ------ Solutions ------------------------------\n\nhet_fit <- hetGP::mleHetGP(X, y)\nhet_pred <- predict(het_fit, XX)\n\nmean <- het_pred$mean\ns2 <- het_pred$sd2 + het_pred$nugs\n\nyy <- 5*sin(XX)\n\npar(mfrow = c(1, 1), mar = c(4, 4, 4, 1))\nplot(X, y, ylim = c(-10, 10))\nlines(XX, yy, col = 3)\nlines(XX, mean, col = 2)\nlines(XX, mean + 2 * sqrt(s2), col = 4)\nlines(XX, mean - 2 * sqrt(s2), col = 4)\n\n\n\n\n\n\n\n# You can check the nuggets (each one will be different)\nnugs <- het_pred$nugs\nsummary(nugs)\n\n Min. 1st Qu. Median Mean 3rd Qu. Max. \n 3.904 3.927 3.931 3.936 3.949 3.971 \n\n\n\n\nChallenges\nWe need to load the data and the functions\n\n# Pulling the data from the NEON data base. 
\ntarget <- readr::read_csv(\"https://data.ecoforecast.org/neon4cast-targets/ticks/ticks-targets.csv.gz\", guess_max = 1e1)\n\nRows: 601 Columns: 5\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (3): site_id, variable, iso_week\ndbl (1): observation\ndate (1): datetime\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n\n# transforms y\nf <- function(x) {\n y <- log(x + 1)\n return(y)\n}\n\n# This function back transforms the input argument\nfi <- function(y) {\n x <- exp(y) - 1\n return(x)\n}\n\n# This function tells us the iso-week number given the date\nfx.iso_week <- function(datetime){\n # Gives ISO-week in the format yyyy-w## and we extract the ##\n x1 <- as.numeric(stringr::str_sub(ISOweek::ISOweek(datetime), 7, 8)) # find iso week #\n return(x1)\n}\n\nfx.sin <- function(datetime, f1 = fx.iso_week){\n # identify iso week#\n x <- f1(datetime) \n # calculate sin value for that week\n x2 <- (sin(2*pi*x/106))^2 \n return(x2)\n}\n\n\nFit a GP Model for the location “SERC” i.e. site_number = 7.\nJust change site = 7\n\nsite_number <- 7 # (site_number = 4) for the other challenge\n\n# Obtaining site name\nsite_names <- unique(target$site_id)\n\n# Subsetting all the data at that location\ndf <- subset(target, target$site_id == site_names[site_number])\n\n# extracting only the datetime and obs columns\ndf <- df[, c(\"datetime\", \"observation\")]\n\n# Selecting a date before which we consider everything as training data and after this is testing data.\ncutoff = as.Date('2020-12-31')\ndf_train <- subset(df, df$datetime <= cutoff)\ndf_test <- subset(df, df$datetime > cutoff)\n\n# Setting up iso-week and sin wave predictors by calling the functions\nX1 <- fx.iso_week(df_train$datetime) # range is 1-53\nX2 <- fx.sin(df_train$datetime) # range is 0 to 1\n\n# Centering the iso-week by diving by 53\nX1c <- X1/ 53\n\n# We combine columns centered X1 and X2, into a matrix as our input space\nX <- as.matrix(cbind.data.frame(X1c, X2))\nhead(X)\n\n X1c X2\n[1,] 0.3584906 0.8150439\n[2,] 0.3962264 0.8974272\n[3,] 0.4528302 0.9782005\n[4,] 0.5094340 0.9991219\n[5,] 0.6226415 0.8587536\n[6,] 0.6792453 0.7150326\n\ny_obs <- df_train$observation\ny <- f(y_obs) # transform y\n\n# A very small value for stability\neps <- sqrt(.Machine$double.eps) \n \n# Priors for theta and g. \nd <- darg(list(mle=TRUE, min =eps, max=5), X)\ng <- garg(list(mle=TRUE, min = eps, max = 1), y)\n\n# Fitting a GP with our data, and some starting values for theta and g\ngpi <- newGPsep(X, y, d = 0.1, g = 1, dK = T)\n\n# Jointly infer MLE for all parameters\nmle <- jmleGPsep(gpi, drange = c(d$min, d$max), grange = c(g$min, g$max), \n dab = d$ab, gab= g$ab)\n\n# Create a grid from start date in our data set to one year in future (so we forecast for next season)\nstartdate <- as.Date(min(df$datetime))# identify start week\ngrid_datetime <- seq.Date(startdate, Sys.Date() + 365, by = 7) # create sequence\n\n# Build the input space for the predictive space (All weeks from 04-2014 to 07-2025)\nXXt1 <- fx.iso_week(grid_datetime)\nXXt2 <- fx.sin(grid_datetime)\n\n# Standardize\nXXt1c <- XXt1/53\n\n# Store inputs as a matrix\nXXt <- as.matrix(cbind.data.frame(XXt1c, XXt2))\n\n# Make predictions using predGP with the gp object and the predictive set\nppt <- predGPsep(gpi, XXt) \n\n# Now we store the mean as our predicted response i.e. 
density along with quantiles\nyyt <- ppt$mean\nq1t <- ppt$mean + qnorm(0.025,0,sqrt(diag(ppt$Sigma))) #lower bound\nq2t <- ppt$mean + qnorm(0.975,0,sqrt(diag(ppt$Sigma))) # upper bound\n\n# Back transform our data to original\ngp_yy <- fi(yyt)\ngp_q1 <- fi(q1t)\ngp_q2 <- fi(q2t)\n\n# Plot the observed points\nplot(as.Date(df$datetime), df$observation,\n main = paste(site_names[site_number]), col = \"black\",\n xlab = \"Dates\" , ylab = \"Abundance\",\n # xlim = c(as.Date(min(df$datetime)), as.Date(cutoff)),\n ylim = c(min(df_train$observation, gp_yy, gp_q1), max(df_train$observation, gp_yy, gp_q2)* 1.05))\n\n# Plot the testing set data \npoints(as.Date(df_test$datetime), df_test$observation, col =\"black\", pch = 19)\n\n# Line to indicate seperation between train and test data\nabline(v = as.Date(cutoff), lwd = 2)\n\n# Add the predicted response and the quantiles\nlines(grid_datetime, gp_yy, col = 4, lwd = 2)\nlines(grid_datetime, gp_q1, col = 4, lwd = 1.2, lty = 2)\nlines(grid_datetime, gp_q2, col = 4, lwd = 1.2, lty =2)\n\n\n\n\n\n\n\n# Obtain true observed values for testing set\nyt_true <- f(df_test$observation)\n\n# FInd corresponding predictions from our model in the grid we predicted on\nyt_pred <- yyt[which(grid_datetime %in% df_test$datetime)]\n\n# calculate RMSE\nrmse <- sqrt(mean((yt_true - yt_pred)^2))\nrmse\n\n[1] 0.9553652\n\n\n\n\nUse an environmental predictor in your model. Following is a function fx.green that creates the variable given the datetime and the location.\nHere is a snippet of the supporting file that you will use; You can look into the data.frame and try to plot ker for one site at a time and see what it yields.\n\nsource('code/df_spline.R') # sources the cript to make greenness predictor\nhead(df_green) # how the dataset looks\n\n site iso ker\n1 BLAN 1 0\n2 BLAN 2 0\n3 BLAN 3 0\n4 BLAN 4 0\n5 BLAN 5 0\n6 BLAN 6 0\n\n# The function to create the environmental predictor similar to iso-week and sin wave\nfx.green <- function(datetime, site, site_info = df_green){\n ker <- NULL\n iso <- fx.iso_week(datetime) # identify iso week\n df.iso <- cbind.data.frame(datetime, iso) # combine date with iso week\n sites.ker <- subset(site_info, site == site)[,2:3] # obtain kernel for location\n df.green <- df.iso %>% left_join(sites.ker, by = 'iso') # join dataframes by iso week\n ker <- df.green$ker # return kernel\n return(ker)\n}\n\n\nChoose a site\nset up X3 using fx_green\nScale X3\n\nSetting up the target dataframe\n\n# Obtaining site name\nsite_names <- unique(target$site_id)\n\n# Subsetting all the data at that location\ndf <- subset(target, target$site_id == site_names[site_number])\n\n# extracting only the datetime and obs columns\ndf <- df[, c(\"datetime\", \"observation\")]\n\n# Selecting a date before which we consider everything as training data and after this is testing data.\ncutoff = as.Date('2020-12-31')\ndf_train <- subset(df, df$datetime <= cutoff)\ndf_test <- subset(df, df$datetime > cutoff)\n\nAdding Greenness\n\n# Choose location \nsite_number = 7\ndf_green_site2 <- subset(df_green, site == site_names[site_number])\n\n# Setting up iso-week and sin wave predictors by calling the functions\nX1 <- fx.iso_week(df_train$datetime) # range is 1-53\nX2 <- fx.sin(df_train$datetime) # range is 0 to 1\n\n# you need datetime, site name and the df_green dataset.\nX3 <- fx.green(df_train$datetime, site = site_names[site_number], site_info = df_green_site2)\n\n# Centering the iso-week by diving by 53\nX1c <- X1/ 53\n\n# Scale X3\nX3c <- (X3 - min(X3))/ (max(X3)- 
min(X3))\n\n# We combine columns centered X1 and X2, into a matrix as our input space\nX <- as.matrix(cbind.data.frame(X1c, X2, X3c))\nhead(X)\n\n X1c X2 X3c\n[1,] 0.3584906 0.8150439 0.84585476\n[2,] 0.3962264 0.8974272 0.97749530\n[3,] 0.4528302 0.9782005 0.96806621\n[4,] 0.5094340 0.9991219 0.75535447\n[5,] 0.6226415 0.8587536 0.23957183\n[6,] 0.6792453 0.7150326 0.09822483\n\ny_obs <- df_train$observation\ny <- f(y_obs) # transform y\n\n# A very small value for stability\neps <- sqrt(.Machine$double.eps) \n \n# Priors for theta and g. \nd <- darg(list(mle=TRUE, min =eps, max=5), X)\ng <- garg(list(mle=TRUE, min = eps, max = 1), y)\n\n# Fitting a GP with our data, and some starting values for theta and g\ngpi <- newGPsep(X, y, d = 0.1, g = 1, dK = T)\n\n# Jointly infer MLE for all parameters\nmle <- jmleGPsep(gpi, drange = c(d$min, d$max), grange = c(g$min, g$max), \n dab = d$ab, gab= g$ab)\n\n# Create a grid from start date in our data set to one year in future (so we forecast for next season)\nstartdate <- as.Date(min(df$datetime))# identify start week\ngrid_datetime <- seq.Date(startdate, Sys.Date() + 365, by = 7) # create sequence\n\n# Build the input space for the predictive space (All weeks from 04-2014 to 07-2025)\nXXt1 <- fx.iso_week(grid_datetime)\nXXt2 <- fx.sin(grid_datetime)\nXXt3 <- fx.green(grid_datetime, site = site_names[site_nunber], site_info = df_green_site2)\n\n# Standardize\nXXt1c <- XXt1/53\nXXt3 <- (XXt3 - min(XXt3))/ (max(XXt3)- min(XXt3))\n\n# Store inputs as a matrix\nXXt <- as.matrix(cbind.data.frame(XXt1c, XXt2, XXt3))\n\n# Make predictions using predGP with the gp object and the predictive set\nppt <- predGPsep(gpi, XXt) \n\n# Now we store the mean as our predicted response i.e. density along with quantiles\nyyt <- ppt$mean\nq1t <- ppt$mean + qnorm(0.025,0,sqrt(diag(ppt$Sigma))) #lower bound\nq2t <- ppt$mean + qnorm(0.975,0,sqrt(diag(ppt$Sigma))) # upper bound\n\n# Back transform our data to original\ngp_yy <- fi(yyt)\ngp_q1 <- fi(q1t)\ngp_q2 <- fi(q2t)\n\n# Plot the observed points\nplot(as.Date(df$datetime), df$observation,\n main = paste(site_names[site_number]), col = \"black\",\n xlab = \"Dates\" , ylab = \"Abundance\",\n # xlim = c(as.Date(min(df$datetime)), as.Date(cutoff)),\n ylim = c(min(df_train$observation, gp_yy, gp_q1), max(df_train$observation, gp_yy, gp_q2)* 1.05))\n\n# Plot the testing set data \npoints(as.Date(df_test$datetime), df_test$observation, col =\"black\", pch = 19)\n\n# Line to indicate seperation between train and test data\nabline(v = as.Date(cutoff), lwd = 2)\n\n# Add the predicted response and the quantiles\nlines(grid_datetime, gp_yy, col = 4, lwd = 2)\nlines(grid_datetime, gp_q1, col = 4, lwd = 1.2, lty = 2)\nlines(grid_datetime, gp_q2, col = 4, lwd = 1.2, lty =2)\n\n\n\n\n\n\n\n# Obtain true observed values for testing set\nyt_true <- f(df_test$observation)\n\n# FInd corresponding predictions from our model in the grid we predicted on\nyt_pred <- yyt[which(grid_datetime %in% df_test$datetime)]\n\n# calculate RMSE\nrmse <- sqrt(mean((yt_true - yt_pred)^2))\nrmse\n\n[1] 0.9532001\n\n\n\n\nFit a GP Model for all the locations (More advanced).\n\n# GP function. This can be varied but easiest way is to just take in X, y, XX and return the predicted means and bounds. \n\ngpfit <- function(X, y , XXt){\n eps <- sqrt(.Machine$double.eps) \n \n # Priors for theta and g. 
\n d <- darg(list(mle=TRUE, min =eps, max=5), X)\n g <- garg(list(mle=TRUE, min = eps, max = 1), y)\n\n # Fitting a GP with our data, and some starting values for theta and g\n gpi <- newGPsep(X, y, d = 0.1, g = 1, dK = T)\n\n # Jointly infer MLE for all parameters\n mle <- jmleGPsep(gpi, drange = c(d$min, d$max), grange = c(g$min, g$max), \n dab = d$ab, gab= g$ab)\n\n ppt <- predGPsep(gpi, XXt) \n\n # Now we store the mean as our predicted response i.e. density along with quantiles\n yyt <- ppt$mean\n q1t <- ppt$mean + qnorm(0.025,0,sqrt(diag(ppt$Sigma))) #lower bound\n q2t <- ppt$mean + qnorm(0.975,0,sqrt(diag(ppt$Sigma))) # upper bound\n\n # Back transform our data to original\n gp_yy <- fi(yyt)\n gp_q1 <- fi(q1t)\n gp_q2 <- fi(q2t)\n \n return(list(mean = gp_yy, s2 = diag(ppt$Sigma), q1 = gp_q1, q2 = gp_q2))\n}\n\n\nsite_number <- 7 # (site_number = 4) for the other challenge\n\n# Obtaining site name\nsite_names <- unique(target$site_id)\n\n# extracting only the datetime and obs columns\ndf <- target[, c(\"datetime\", \"site_id\", \"observation\")]\n\ncutoff = as.Date('2020-12-31')\n\n# This was always be prediction set\nstartdate <- as.Date(min(df$datetime))# identify start week\ngrid_datetime <- seq.Date(startdate, Sys.Date() + 365, by = 7) # create sequence\n\n# You can pre process to have y transformed or have it in the loop.\nrmse <- matrix(nrow = length(site_names), ncol = 1) # if rmse\n\nfor(i in 1:length(site_names)){\n \n df_site <- subset(df, site_id == site_names[i])\n \n # cutoff for sites\n df_train <- subset(df_site, df_site$datetime <= cutoff)\n df_test <- subset(df_site, df_site$datetime > cutoff)\n \n df_green_site <- subset(df_green, site == site_names[i])\n \n X1 <- fx.iso_week(df_train$datetime) # range is 1-53\n X2 <- fx.sin(df_train$datetime) # range is 0 to 1\n X3 <- fx.green(df_train$datetime, site = site_names[site_number], site_info = df_green_site) # optional add\n\n X1c <- X1/ 53\n X3c <- (X3 - min(X3))/ (max(X3)- min(X3))\n X <- as.matrix(cbind.data.frame(X1c, X2, X3c))\n\n y_obs <- df_train$observation # only at this location\n y <- f(y_obs) # transform y\n\n XXt1 <- fx.iso_week(grid_datetime)\n XXt2 <- fx.sin(grid_datetime)\n XXt3 <- fx.green(grid_datetime, site = site_names[site_nunber], site_info =\n df_green_site)\n\n # Standardize\n XXt1c <- XXt1/53\n XXt3 <- (XXt3 - min(XXt3))/ (max(XXt3)- min(XXt3))\n XXt <- as.matrix(cbind.data.frame(XXt1c, XXt2, XXt3))\n \n fit <- gpfit(X = X, y = y, XX = XXt)\n \n # Make plots\n plot(as.Date(df_site$datetime), df_site$observation,\n main = paste(site_names[i]), col = \"black\",\n xlab = \"Dates\" , ylab = \"Abundance\",\n # xlim = c(as.Date(min(df$datetime)), as.Date(cutoff)),\n ylim = c(min(df_train$observation, df_test$observation, fit$q1),\n max(df_train$observation,df_test$observation, fit$q2)* 1.05))\n\n points(as.Date(df_test$datetime), df_test$observation, col =\"black\", pch = 19)\n abline(v = as.Date(cutoff), lwd = 2)\n\n # Add the predicted response and the quantiles\n lines(grid_datetime, fit$mean, col = 4, lwd = 2)\n lines(grid_datetime, fit$q1, col = 4, lwd = 1.2, lty = 2)\n lines(grid_datetime, fit$q2, col = 4, lwd = 1.2, lty =2)\n \n \n yt_true <- f(df_test$observation)\n yt_pred <- f(fit$mean[which(grid_datetime %in% df_test$datetime)])\n\n # calculate RMSE\n rmse[i, ] <- sqrt(mean((yt_true - yt_pred)^2))\n}\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nrownames(rmse) <- site_names\nprint(rmse)\n\n [,1]\nBLAN 
1.1173576\nKONZ 0.7779042\nLENO 0.6851390\nORNL 0.9205952\nOSBS 1.2057049\nSCBI 0.8610973\nSERC 0.9532001\nTALL 0.8883320\nUKFS 0.9521605" + "text": "Libraries\n\nlibrary(mvtnorm)\nlibrary(laGP)\nlibrary(hetGP)\nlibrary(ggplot2)\n\n\n\nHetGP (sin wave eg)\n\n# Your turn\nset.seed(26)\nn <- 8 # number of points\nX <- matrix(seq(0, 2*pi, length= n), ncol=1) # build inputs \ny <- 5*sin(X) + rnorm(n, 0 , 2) # response with some noise\n\n# Predict on this set\nXX <- matrix(seq(-0.5, 2*pi + 0.5, length= 100), ncol=1)\n\n# Data visualization\nplot(X, y)\n\n\n\n\n\n\n\n# ------ Solutions ------------------------------\n\nhet_fit <- hetGP::mleHetGP(X, y)\nhet_pred <- predict(het_fit, XX)\n\nmean <- het_pred$mean\ns2 <- het_pred$sd2 + het_pred$nugs\n\nyy <- 5*sin(XX)\n\npar(mfrow = c(1, 1), mar = c(4, 4, 4, 1))\nplot(X, y, ylim = c(-10, 10))\nlines(XX, yy, col = 3)\nlines(XX, mean, col = 2)\nlines(XX, mean + 2 * sqrt(s2), col = 4)\nlines(XX, mean - 2 * sqrt(s2), col = 4)\n\n\n\n\n\n\n\n# You can check the nuggets (each one will be different)\nnugs <- het_pred$nugs\nsummary(nugs)\n\n Min. 1st Qu. Median Mean 3rd Qu. Max. \n 3.904 3.927 3.931 3.936 3.949 3.971 \n\n\n\n\nChallenges\nWe need to load the data and the functions\n\n# Pulling the data from the NEON data base. \ntarget <- readr::read_csv(\"https://data.ecoforecast.org/neon4cast-targets/ticks/ticks-targets.csv.gz\", guess_max = 1e1)\n\nRows: 637 Columns: 5\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\nchr (3): site_id, variable, iso_week\ndbl (1): observation\ndate (1): datetime\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n\n# transforms y\nf <- function(x) {\n y <- log(x + 1)\n return(y)\n}\n\n# This function back transforms the input argument\nfi <- function(y) {\n x <- exp(y) - 1\n return(x)\n}\n\n# This function tells us the iso-week number given the date\nfx.iso_week <- function(datetime){\n # Gives ISO-week in the format yyyy-w## and we extract the ##\n x1 <- as.numeric(stringr::str_sub(ISOweek::ISOweek(datetime), 7, 8)) # find iso week #\n return(x1)\n}\n\nfx.sin <- function(datetime, f1 = fx.iso_week){\n # identify iso week#\n x <- f1(datetime) \n # calculate sin value for that week\n x2 <- (sin(2*pi*x/106))^2 \n return(x2)\n}\n\n\nFit a GP Model for the location “SERC” i.e. 
site_number = 7.\nJust change site = 7\n\nsite_number <- 7 # (site_number = 4) for the other challenge\n\n# Obtaining site name\nsite_names <- unique(target$site_id)\n\n# Subsetting all the data at that location\ndf <- subset(target, target$site_id == site_names[site_number])\n\n# extracting only the datetime and obs columns\ndf <- df[, c(\"datetime\", \"observation\")]\n\n# Selecting a date before which we consider everything as training data and after this is testing data.\ncutoff = as.Date('2020-12-31')\ndf_train <- subset(df, df$datetime <= cutoff)\ndf_test <- subset(df, df$datetime > cutoff)\n\n# Setting up iso-week and sin wave predictors by calling the functions\nX1 <- fx.iso_week(df_train$datetime) # range is 1-53\nX2 <- fx.sin(df_train$datetime) # range is 0 to 1\n\n# Centering the iso-week by diving by 53\nX1c <- X1/ 53\n\n# We combine columns centered X1 and X2, into a matrix as our input space\nX <- as.matrix(cbind.data.frame(X1c, X2))\nhead(X)\n\n X1c X2\n[1,] 0.3584906 0.8150439\n[2,] 0.3962264 0.8974272\n[3,] 0.4528302 0.9782005\n[4,] 0.5094340 0.9991219\n[5,] 0.6226415 0.8587536\n[6,] 0.6792453 0.7150326\n\ny_obs <- df_train$observation\ny <- f(y_obs) # transform y\n\n# A very small value for stability\neps <- sqrt(.Machine$double.eps) \n \n# Priors for theta and g. \nd <- darg(list(mle=TRUE, min =eps, max=5), X)\ng <- garg(list(mle=TRUE, min = eps, max = 1), y)\n\n# Fitting a GP with our data, and some starting values for theta and g\ngpi <- newGPsep(X, y, d = 0.1, g = 1, dK = T)\n\n# Jointly infer MLE for all parameters\nmle <- jmleGPsep(gpi, drange = c(d$min, d$max), grange = c(g$min, g$max), \n dab = d$ab, gab= g$ab)\n\n# Create a grid from start date in our data set to one year in future (so we forecast for next season)\nstartdate <- as.Date(min(df$datetime))# identify start week\ngrid_datetime <- seq.Date(startdate, Sys.Date() + 365, by = 7) # create sequence\n\n# Build the input space for the predictive space (All weeks from 04-2014 to 07-2025)\nXXt1 <- fx.iso_week(grid_datetime)\nXXt2 <- fx.sin(grid_datetime)\n\n# Standardize\nXXt1c <- XXt1/53\n\n# Store inputs as a matrix\nXXt <- as.matrix(cbind.data.frame(XXt1c, XXt2))\n\n# Make predictions using predGP with the gp object and the predictive set\nppt <- predGPsep(gpi, XXt) \n\n# Now we store the mean as our predicted response i.e. 
density along with quantiles\nyyt <- ppt$mean\nq1t <- ppt$mean + qnorm(0.025,0,sqrt(diag(ppt$Sigma))) #lower bound\nq2t <- ppt$mean + qnorm(0.975,0,sqrt(diag(ppt$Sigma))) # upper bound\n\n# Back transform our data to original\ngp_yy <- fi(yyt)\ngp_q1 <- fi(q1t)\ngp_q2 <- fi(q2t)\n\n# Plot the observed points\nplot(as.Date(df$datetime), df$observation,\n main = paste(site_names[site_number]), col = \"black\",\n xlab = \"Dates\" , ylab = \"Abundance\",\n # xlim = c(as.Date(min(df$datetime)), as.Date(cutoff)),\n ylim = c(min(df_train$observation, gp_yy, gp_q1), max(df_train$observation, gp_yy, gp_q2)* 1.05))\n\n# Plot the testing set data \npoints(as.Date(df_test$datetime), df_test$observation, col =\"black\", pch = 19)\n\n# Line to indicate seperation between train and test data\nabline(v = as.Date(cutoff), lwd = 2)\n\n# Add the predicted response and the quantiles\nlines(grid_datetime, gp_yy, col = 4, lwd = 2)\nlines(grid_datetime, gp_q1, col = 4, lwd = 1.2, lty = 2)\nlines(grid_datetime, gp_q2, col = 4, lwd = 1.2, lty =2)\n\n\n\n\n\n\n\n# Obtain true observed values for testing set\nyt_true <- f(df_test$observation)\n\n# FInd corresponding predictions from our model in the grid we predicted on\nyt_pred <- yyt[which(grid_datetime %in% df_test$datetime)]\n\n# calculate RMSE\nrmse <- sqrt(mean((yt_true - yt_pred)^2))\nrmse\n\n[1] 0.9553652\n\n\n\n\nUse an environmental predictor in your model. Following is a function fx.green that creates the variable given the datetime and the location.\nHere is a snippet of the supporting file that you will use; You can look into the data.frame and try to plot ker for one site at a time and see what it yields.\n\nsource('code/df_spline.R') # sources the cript to make greenness predictor\nhead(df_green) # how the dataset looks\n\n site iso ker\n1 BLAN 1 0\n2 BLAN 2 0\n3 BLAN 3 0\n4 BLAN 4 0\n5 BLAN 5 0\n6 BLAN 6 0\n\n# The function to create the environmental predictor similar to iso-week and sin wave\nfx.green <- function(datetime, site, site_info = df_green){\n ker <- NULL\n iso <- fx.iso_week(datetime) # identify iso week\n df.iso <- cbind.data.frame(datetime, iso) # combine date with iso week\n sites.ker <- subset(site_info, site == site)[,2:3] # obtain kernel for location\n df.green <- df.iso %>% left_join(sites.ker, by = 'iso') # join dataframes by iso week\n ker <- df.green$ker # return kernel\n return(ker)\n}\n\n\nChoose a site\nset up X3 using fx_green\nScale X3\n\nSetting up the target dataframe\n\n# Obtaining site name\nsite_names <- unique(target$site_id)\n\n# Subsetting all the data at that location\ndf <- subset(target, target$site_id == site_names[site_number])\n\n# extracting only the datetime and obs columns\ndf <- df[, c(\"datetime\", \"observation\")]\n\n# Selecting a date before which we consider everything as training data and after this is testing data.\ncutoff = as.Date('2020-12-31')\ndf_train <- subset(df, df$datetime <= cutoff)\ndf_test <- subset(df, df$datetime > cutoff)\n\nAdding Greenness\n\n# Choose location \nsite_number = 7\ndf_green_site2 <- subset(df_green, site == site_names[site_number])\n\n# Setting up iso-week and sin wave predictors by calling the functions\nX1 <- fx.iso_week(df_train$datetime) # range is 1-53\nX2 <- fx.sin(df_train$datetime) # range is 0 to 1\n\n# you need datetime, site name and the df_green dataset.\nX3 <- fx.green(df_train$datetime, site = site_names[site_number], site_info = df_green_site2)\n\n# Centering the iso-week by diving by 53\nX1c <- X1/ 53\n\n# Scale X3\nX3c <- (X3 - min(X3))/ (max(X3)- 
min(X3))\n\n# We combine columns centered X1 and X2, into a matrix as our input space\nX <- as.matrix(cbind.data.frame(X1c, X2, X3c))\nhead(X)\n\n X1c X2 X3c\n[1,] 0.3584906 0.8150439 0.84585476\n[2,] 0.3962264 0.8974272 0.97749530\n[3,] 0.4528302 0.9782005 0.96806621\n[4,] 0.5094340 0.9991219 0.75535447\n[5,] 0.6226415 0.8587536 0.23957183\n[6,] 0.6792453 0.7150326 0.09822483\n\ny_obs <- df_train$observation\ny <- f(y_obs) # transform y\n\n# A very small value for stability\neps <- sqrt(.Machine$double.eps) \n \n# Priors for theta and g. \nd <- darg(list(mle=TRUE, min =eps, max=5), X)\ng <- garg(list(mle=TRUE, min = eps, max = 1), y)\n\n# Fitting a GP with our data, and some starting values for theta and g\ngpi <- newGPsep(X, y, d = 0.1, g = 1, dK = T)\n\n# Jointly infer MLE for all parameters\nmle <- jmleGPsep(gpi, drange = c(d$min, d$max), grange = c(g$min, g$max), \n dab = d$ab, gab= g$ab)\n\n# Create a grid from start date in our data set to one year in future (so we forecast for next season)\nstartdate <- as.Date(min(df$datetime))# identify start week\ngrid_datetime <- seq.Date(startdate, Sys.Date() + 365, by = 7) # create sequence\n\n# Build the input space for the predictive space (All weeks from 04-2014 to 07-2025)\nXXt1 <- fx.iso_week(grid_datetime)\nXXt2 <- fx.sin(grid_datetime)\nXXt3 <- fx.green(grid_datetime, site = site_names[site_nunber], site_info = df_green_site2)\n\n# Standardize\nXXt1c <- XXt1/53\nXXt3 <- (XXt3 - min(XXt3))/ (max(XXt3)- min(XXt3))\n\n# Store inputs as a matrix\nXXt <- as.matrix(cbind.data.frame(XXt1c, XXt2, XXt3))\n\n# Make predictions using predGP with the gp object and the predictive set\nppt <- predGPsep(gpi, XXt) \n\n# Now we store the mean as our predicted response i.e. density along with quantiles\nyyt <- ppt$mean\nq1t <- ppt$mean + qnorm(0.025,0,sqrt(diag(ppt$Sigma))) #lower bound\nq2t <- ppt$mean + qnorm(0.975,0,sqrt(diag(ppt$Sigma))) # upper bound\n\n# Back transform our data to original\ngp_yy <- fi(yyt)\ngp_q1 <- fi(q1t)\ngp_q2 <- fi(q2t)\n\n# Plot the observed points\nplot(as.Date(df$datetime), df$observation,\n main = paste(site_names[site_number]), col = \"black\",\n xlab = \"Dates\" , ylab = \"Abundance\",\n # xlim = c(as.Date(min(df$datetime)), as.Date(cutoff)),\n ylim = c(min(df_train$observation, gp_yy, gp_q1), max(df_train$observation, gp_yy, gp_q2)* 1.05))\n\n# Plot the testing set data \npoints(as.Date(df_test$datetime), df_test$observation, col =\"black\", pch = 19)\n\n# Line to indicate seperation between train and test data\nabline(v = as.Date(cutoff), lwd = 2)\n\n# Add the predicted response and the quantiles\nlines(grid_datetime, gp_yy, col = 4, lwd = 2)\nlines(grid_datetime, gp_q1, col = 4, lwd = 1.2, lty = 2)\nlines(grid_datetime, gp_q2, col = 4, lwd = 1.2, lty =2)\n\n\n\n\n\n\n\n# Obtain true observed values for testing set\nyt_true <- f(df_test$observation)\n\n# FInd corresponding predictions from our model in the grid we predicted on\nyt_pred <- yyt[which(grid_datetime %in% df_test$datetime)]\n\n# calculate RMSE\nrmse <- sqrt(mean((yt_true - yt_pred)^2))\nrmse\n\n[1] 0.9532001\n\n\n\n\nFit a GP Model for all the locations (More advanced).\n\n# GP function. This can be varied but easiest way is to just take in X, y, XX and return the predicted means and bounds. \n\ngpfit <- function(X, y , XXt){\n eps <- sqrt(.Machine$double.eps) \n \n # Priors for theta and g. 
\n d <- darg(list(mle=TRUE, min = eps, max = 5), X)\n g <- garg(list(mle=TRUE, min = eps, max = 1), y)\n\n # Fitting a GP with our data, and some starting values for theta and g\n gpi <- newGPsep(X, y, d = 0.1, g = 1, dK = TRUE)\n\n # Jointly infer MLE for all parameters\n mle <- jmleGPsep(gpi, drange = c(d$min, d$max), grange = c(g$min, g$max), \n dab = d$ab, gab = g$ab)\n\n ppt <- predGPsep(gpi, XXt) \n\n # Free the C-side GP object; important when fitting many GPs in a loop\n deleteGPsep(gpi)\n\n # Now we store the mean as our predicted response i.e. density along with quantiles\n yyt <- ppt$mean\n q1t <- ppt$mean + qnorm(0.025, 0, sqrt(diag(ppt$Sigma))) # lower bound\n q2t <- ppt$mean + qnorm(0.975, 0, sqrt(diag(ppt$Sigma))) # upper bound\n\n # Back transform our predictions to the original scale\n gp_yy <- fi(yyt)\n gp_q1 <- fi(q1t)\n gp_q2 <- fi(q2t)\n \n return(list(mean = gp_yy, s2 = diag(ppt$Sigma), q1 = gp_q1, q2 = gp_q2))\n}\n\n\nsite_number <- 7 # (use site_number = 4 for the other challenge)\n\n# Obtaining site names\nsite_names <- unique(target$site_id)\n\n# Extracting the datetime, site_id and obs columns\ndf <- target[, c(\"datetime\", \"site_id\", \"observation\")]\n\ncutoff = as.Date('2020-12-31')\n\n# This will always be the prediction set\nstartdate <- as.Date(min(df$datetime)) # identify start week\ngrid_datetime <- seq.Date(startdate, Sys.Date() + 365, by = 7) # create sequence\n\n# You can pre-process to have y transformed, or do it in the loop.\nrmse <- matrix(nrow = length(site_names), ncol = 1) # to store the RMSE for each site\n\nfor(i in 1:length(site_names)){\n \n df_site <- subset(df, site_id == site_names[i])\n \n # cutoff for sites\n df_train <- subset(df_site, df_site$datetime <= cutoff)\n df_test <- subset(df_site, df_site$datetime > cutoff)\n \n df_green_site <- subset(df_green, site == site_names[i])\n \n X1 <- fx.iso_week(df_train$datetime) # range is 1-53\n X2 <- fx.sin(df_train$datetime) # range is 0 to 1\n X3 <- fx.green(df_train$datetime, site = site_names[i], site_info = df_green_site) # optional addition\n\n X1c <- X1/ 53\n X3c <- (X3 - min(X3))/ (max(X3)- min(X3))\n X <- as.matrix(cbind.data.frame(X1c, X2, X3c))\n\n y_obs <- df_train$observation # only at this location\n y <- f(y_obs) # transform y\n\n XXt1 <- fx.iso_week(grid_datetime)\n XXt2 <- fx.sin(grid_datetime)\n XXt3 <- fx.green(grid_datetime, site = site_names[i], site_info =\n df_green_site)\n\n # Scale, as with the training inputs\n XXt1c <- XXt1/53\n XXt3 <- (XXt3 - min(XXt3))/ (max(XXt3)- min(XXt3))\n XXt <- as.matrix(cbind.data.frame(XXt1c, XXt2, XXt3))\n \n fit <- gpfit(X = X, y = y, XXt = XXt)\n \n # Make plots\n plot(as.Date(df_site$datetime), df_site$observation,\n main = paste(site_names[i]), col = \"black\",\n xlab = \"Dates\", ylab = \"Abundance\",\n # xlim = c(as.Date(min(df$datetime)), as.Date(cutoff)),\n ylim = c(min(df_train$observation, df_test$observation, fit$q1),\n max(df_train$observation, df_test$observation, fit$q2)* 1.05))\n\n points(as.Date(df_test$datetime), df_test$observation, col = \"black\", pch = 19)\n abline(v = as.Date(cutoff), lwd = 2)\n\n # Add the predicted response and the quantiles\n lines(grid_datetime, fit$mean, col = 4, lwd = 2)\n lines(grid_datetime, fit$q1, col = 4, lwd = 1.2, lty = 2)\n lines(grid_datetime, fit$q2, col = 4, lwd = 1.2, lty = 2)\n \n \n yt_true <- f(df_test$observation)\n yt_pred <- f(fit$mean[which(grid_datetime %in% df_test$datetime)]) # back to the transformed scale\n\n # Calculate RMSE\n rmse[i, ] <- sqrt(mean((yt_true - yt_pred)^2))\n}\n\n\nrownames(rmse) <- site_names\nprint(rmse)\n\n [,1]\nBLAN 
1.1173576\nKONZ 0.7779042\nLENO 0.6851390\nORNL 0.9205952\nOSBS 1.2057049\nSCBI 0.8610973\nSERC 0.9532001\nTALL 1.2216955\nUKFS 2.0705090" }, { "objectID": "Stats_review_soln.html", @@ -811,6 +811,62 @@ "section": "References", "text": "References\n\n\n\n\n\n\n\n\nBinois, Mickael, Robert B Gramacy, and Mike Ludkovski. 2018. “Practical Heteroscedastic Gaussian Process Modeling for Large Simulation Experiments.” Journal of Computational and Graphical Statistics 27 (4): 808–21.\n\n\nGramacy, Robert B. 2020. Surrogates: Gaussian Process Modeling, Design, and Optimization for the Applied Sciences. Chapman; Hall/CRC.\n\n\nThomas, R Quinn, Carl Boettiger, Cayelan C Carey, Michael C Dietze, Leah R Johnson, Melissa A Kenney, Jason S Mclachlan, et al. 2022. “The NEON Ecological Forecasting Challenge.” Authorea Preprints." }, + { + "objectID": "VB_TimeDepData_practical_soln.html#exploring-the-data", + "href": "VB_TimeDepData_practical_soln.html#exploring-the-data", + "title": "VectorByte Methods Training: Regression Methods for Time Dependent Data (practical - solution)", + "section": "Exploring the Data", + "text": "Exploring the Data\nAs always, we first want to take a look at the data, to make sure we understand it, and that we don’t have missing or weird values.\n\nmozData<-read.csv(\"data/Culex_erraticus_walton_covariates_aggregated.csv\")\nsummary(mozData)\n\n Month_Yr sample_value MaxTemp Precip \n Length:36 Min. :0.00000 Min. :16.02 Min. : 0.000 \n Class :character 1st Qu.:0.04318 1st Qu.:22.99 1st Qu.: 2.162 \n Mode :character Median :0.73001 Median :26.69 Median : 4.606 \n Mean :0.80798 Mean :26.23 Mean : 5.595 \n 3rd Qu.:1.22443 3rd Qu.:30.70 3rd Qu.: 7.864 \n Max. :3.00595 Max. :33.31 Max. :18.307 \n\n\nWe can see that the minimum observed average number of mosquitoes is zero, and the max is only 3 (there are likely many zeros averaged over many days in the month). There don’t appear to be any NAs in the data. 
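A quick programmatic check (a minimal base-R sketch on the mozData frame loaded above) should return a zero count for each of the four columns:\n\ncolSums(is.na(mozData))\n\n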
In this case the dataset itself is small enough that we can print the whole thing to ensure it’s complete:\n\nmozData\n\n Month_Yr sample_value MaxTemp Precip\n1 2015-01 0.000000000 17.74602 3.303991888\n2 2015-02 0.018181818 17.87269 16.544265802\n3 2015-03 0.468085106 23.81767 2.405651215\n4 2015-04 1.619047619 26.03559 8.974406168\n5 2015-05 0.821428571 30.01602 0.567960943\n6 2015-06 3.005952381 31.12094 4.841342729\n7 2015-07 2.380952381 32.81130 3.849010353\n8 2015-08 1.826347305 32.56245 5.562845324\n9 2015-09 0.648809524 30.55155 10.409724627\n10 2015-10 0.988023952 27.22605 0.337750269\n11 2015-11 0.737804878 24.86768 18.306749680\n12 2015-12 0.142857143 22.46588 5.621475377\n13 2016-01 0.000000000 16.02406 3.550622029\n14 2016-02 0.020202020 19.42057 11.254680803\n15 2016-03 0.015151515 23.13610 4.785664728\n16 2016-04 0.026143791 24.98082 4.580424519\n17 2016-05 0.025252525 28.72884 0.053057634\n18 2016-06 0.833333333 30.96990 6.155417473\n19 2016-07 1.261363636 33.30509 4.496368193\n20 2016-08 1.685279188 32.09633 11.338749182\n21 2016-09 2.617142857 31.60575 2.868288451\n22 2016-10 1.212121212 29.14275 0.000000000\n23 2016-11 1.539772727 24.48482 0.005462681\n24 2016-12 0.771573604 20.46054 11.615521725\n25 2017-01 0.045454545 18.35473 0.000000000\n26 2017-02 0.036363636 23.65584 3.150710053\n27 2017-03 0.194285714 22.53573 1.430094952\n28 2017-04 0.436548223 26.15299 0.499381616\n29 2017-05 1.202020202 28.00173 6.580562663\n30 2017-06 0.834196891 29.48951 13.333939858\n31 2017-07 1.765363128 32.25135 7.493927035\n32 2017-08 0.744791667 31.86476 6.082113434\n33 2017-09 0.722222222 30.60566 4.631037395\n34 2017-10 0.142131980 27.73453 11.567112214\n35 2017-11 0.289772727 23.23140 1.195760473\n36 2017-12 0.009174312 18.93603 4.018254442" }, + { + "objectID": "VB_TimeDepData_practical_soln.html#plotting-the-data", + "href": "VB_TimeDepData_practical_soln.html#plotting-the-data", + "title": "VectorByte Methods Training: Regression Methods for Time Dependent Data (practical - solution)", + "section": "Plotting the data", + "text": "Plotting the data\nFirst we’ll examine the data itself, including the predictors:\n\nmonths<-dim(mozData)[1]\nt<-1:months ## counter for months in the data set\npar(mfrow=c(3,1))\nplot(t, mozData$sample_value, type=\"l\", lwd=2, \n main=\"Average Monthly Abundance\", \n xlab =\"Time (months)\", \n ylab = \"Average Count\")\nplot(t, mozData$MaxTemp, type=\"l\",\n col = 2, lwd=2, \n main=\"Average Maximum Temp\", \n xlab =\"Time (months)\", \n ylab = \"Temperature (C)\")\nplot(t, mozData$Precip, type=\"l\",\n col=\"dodgerblue\", lwd=2,\n main=\"Average Monthly Precip\", \n xlab =\"Time (months)\", \n ylab = \"Precipitation (in)\")\n\n\nVisually, we notice that there may be a bit of clumping in the abundance values (this is subtle) – in particular, since we have a lot of very small/nearly-zero counts, a transform, such as a square root, may spread the abundances out. It also looks like both the abundance and temperature data are more cyclical than the precipitation, and thus more likely to be related to each other. Visually there’s also not much indication of a trend, but it’s usually worthwhile to consider one anyway. 
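One quick numeric check on these visual impressions (a sketch using base R’s cor on the columns above):\n\ncor(mozData$sample_value, mozData$MaxTemp) ## abundance vs. temperature\ncor(mozData$sample_value, mozData$Precip) ## abundance vs. precipitation\n\nIf the cyclical patterns really do line up, the first correlation should be clearly the larger of the two. 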
Replotting the abundance data with a transformation:\n\nmonths<-dim(mozData)[1]\nt<-1:months ## counter for months in the data set\nplot(t, sqrt(mozData$sample_value), type=\"l\", lwd=2, \n main=\"Sqrt Average Monthly Abundance\", \n xlab =\"Time (months)\", \n ylab = \"Average Count\")\n\n\nThat looks a little bit better. I suggest we go with this for our response." }, + { + "objectID": "VB_TimeDepData_practical_soln.html#building-a-data-frame", + "href": "VB_TimeDepData_practical_soln.html#building-a-data-frame", + "title": "VectorByte Methods Training: Regression Methods for Time Dependent Data (practical - solution)", + "section": "Building a data frame", + "text": "Building a data frame\nBefore we get into model building, we always want to build a data frame containing all of the predictors that we want to consider, at the potential lags that we’re interested in. In the lecture we saw how to build the AR, sine/cosine, and trend predictors:\n\nt <- 2:months ## to make building the AR1 predictors easier\n\nmozTS <- data.frame(\n Y=sqrt(mozData$sample_value[t]), # transformed response\n Yl1=sqrt(mozData$sample_value[t-1]), # AR1 predictor\n t=t, # trend predictor\n sin12=sin(2*pi*t/12), \n cos12=cos(2*pi*t/12) # periodic predictors\n )\n\nWe will also put in the temperature and precipitation predictors. But we need to think about what might be an appropriate lag. If this were daily or weekly data, we’d probably want a fairly sizable lag – mosquitoes take a while to develop, so the number we see today is not likely related to the temperature today. However, since these data are aggregated across a whole month, as are the temperature/precipitation, the current month’s values are likely to be useful. It’s also possible that last month’s values matter, so we’ll add those in as well:\n\nmozTS$MaxTemp<-mozData$MaxTemp[t] ## current temps\nmozTS$MaxTempl1<-mozData$MaxTemp[t-1] ## previous temps\nmozTS$Precip<-mozData$Precip[t] ## current precip\nmozTS$Precipl1<-mozData$Precip[t-1] ## previous precip\n\nThus our full dataframe:\n\nsummary(mozTS)\n\n Y Yl1 t sin12 \n Min. :0.0000 Min. :0.0000 Min. : 2.0 Min. :-1.00000 \n 1st Qu.:0.2951 1st Qu.:0.2951 1st Qu.:10.5 1st Qu.:-0.68301 \n Median :0.8590 Median :0.8590 Median :19.0 Median : 0.00000 \n Mean :0.7711 Mean :0.7684 Mean :19.0 Mean :-0.01429 \n 3rd Qu.:1.1120 3rd Qu.:1.1120 3rd Qu.:27.5 3rd Qu.: 0.68301 \n Max. :1.7338 Max. :1.7338 Max. :36.0 Max. : 1.00000 \n cos12 MaxTemp MaxTempl1 Precip \n Min. :-1.00000 Min. :16.02 Min. :16.02 Min. : 0.000 \n 1st Qu.:-0.68301 1st Qu.:23.18 1st Qu.:23.18 1st Qu.: 1.918 \n Median : 0.00000 Median :27.23 Median :27.23 Median : 4.631 \n Mean :-0.02474 Mean :26.47 Mean :26.44 Mean : 5.660 \n 3rd Qu.: 0.50000 3rd Qu.:30.79 3rd Qu.:30.79 3rd Qu.: 8.234 \n Max. : 1.00000 Max. :33.31 Max. :33.31 Max. :18.307 \n Precipl1 \n Min. : 0.000 \n 1st Qu.: 1.918 \n Median : 4.631 \n Mean : 5.640 \n 3rd Qu.: 8.234 \n Max. 
:18.307 \n\nhead(mozTS)\n\n Y Yl1 t sin12 cos12 MaxTemp MaxTempl1\n1 0.1348400 0.0000000 2 8.660254e-01 5.000000e-01 17.87269 17.74602\n2 0.6841675 0.1348400 3 1.000000e+00 6.123234e-17 23.81767 17.87269\n3 1.2724180 0.6841675 4 8.660254e-01 -5.000000e-01 26.03559 23.81767\n4 0.9063270 1.2724180 5 5.000000e-01 -8.660254e-01 30.01602 26.03559\n5 1.7337683 0.9063270 6 1.224647e-16 -1.000000e+00 31.12094 30.01602\n6 1.5430335 1.7337683 7 -5.000000e-01 -8.660254e-01 32.81130 31.12094\n Precip Precipl1\n1 16.5442658 3.3039919\n2 2.4056512 16.5442658\n3 8.9744062 2.4056512\n4 0.5679609 8.9744062\n5 4.8413427 0.5679609\n6 3.8490104 4.8413427" }, + { + "objectID": "VB_TimeDepData_practical_soln.html#building-a-first-model", + "href": "VB_TimeDepData_practical_soln.html#building-a-first-model", + "title": "VectorByte Methods Training: Regression Methods for Time Dependent Data (practical - solution)", + "section": "Building a first model", + "text": "Building a first model\nWe will first build a very simple model – just a trend – to practice building the model, checking diagnostics, and plotting predictions.\n\nmod1<-lm(Y ~ t, data=mozTS)\nsummary(mod1)\n\n\nCall:\nlm(formula = Y ~ t, data = mozTS)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.81332 -0.47902 0.03671 0.37384 0.87119 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.904809 0.178421 5.071 1.5e-05 ***\nt -0.007038 0.008292 -0.849 0.402 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.4954 on 33 degrees of freedom\nMultiple R-squared: 0.02136, Adjusted R-squared: -0.008291 \nF-statistic: 0.7204 on 1 and 33 DF, p-value: 0.4021\n\n\nThe model output indicates that this model is not useful – the trend is not significant and it only explains about 2% of the variability. Let’s plot the predictions:\n\n## plot points and fitted lines\nplot(Y~t, data=mozTS, col=1, type=\"l\")\nlines(t, mod1$fitted, col=\"dodgerblue\", lwd=2)\n\n\nNot good – we’ll definitely need to try something else! Remember that since we’re using a linear model for this, we should check our residual plots as usual, and then also plot the acf of the residuals:\n\npar(mfrow=c(1,3), mar=c(4,4,2,0.5)) \n\n## studentized residuals vs fitted\nplot(mod1$fitted, rstudent(mod1), col=1,\n xlab=\"Fitted Values\", \n ylab=\"Studentized Residuals\", \n pch=20, main=\"trend only model\")\n\n## qq plot of studentized residuals\nqqnorm(rstudent(mod1), pch=20, col=1, main=\"\" )\nabline(a=0,b=1,lty=2, col=2)\n\n## histogram of studentized residuals\nhist(rstudent(mod1), col=1, \n xlab=\"Studentized Residuals\", \n main=\"\", border=8)\n\n\nThis doesn’t look too bad, although the histogram might be a bit weird. Finally, the acf:\n\nacf(mod1$residuals)\n\n\nThis is where we can see that we definitely aren’t able to capture the pattern. 
There’s substantial autocorrelation left at a 1 month lag, and around 6 months.\nFinally, moving forward, we can extract the BIC for this model so that we can compare it with the other models that you’ll build next.\n\nn<-length(t)\nextractAIC(mod1, k=log(n))[2]\n\n[1] -44.11057" }, + { + "objectID": "VB_TimeDepData_practical_soln.html#example-solution-ar1-model-only", + "href": "VB_TimeDepData_practical_soln.html#example-solution-ar1-model-only", + "title": "VectorByte Methods Training: Regression Methods for Time Dependent Data (practical - solution)", + "section": "Example Solution: AR1 model only", + "text": "Example Solution: AR1 model only\n\nmod2<-lm(Y ~ Yl1, data=mozTS)\nsummary(mod2)\n\n\nCall:\nlm(formula = Y ~ Yl1, data = mozTS)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.6338 -0.2173 -0.0678 0.2463 0.8675 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.2410 0.1130 2.132 0.0405 * \nYl1 0.6899 0.1240 5.562 3.51e-06 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.3598 on 33 degrees of freedom\nMultiple R-squared: 0.4839, Adjusted R-squared: 0.4682 \nF-statistic: 30.94 on 1 and 33 DF, p-value: 3.507e-06\n\n\nThe model is better than the original trend-only model – the AR1 term explains about 48% of the variability. Let’s plot the predictions:\n\n## plot points and fitted lines\nplot(Y~t, data=mozTS, col=1, type=\"l\")\nlines(t, mod2$fitted, col=2, lwd=2)\n\n\nPretty good! Look at all of the diagnostic plots:\n\npar(mfrow=c(1,3), mar=c(4,4,2,0.5)) \n\n## studentized residuals vs fitted\nplot(mod2$fitted, rstudent(mod2), col=2,\n xlab=\"Fitted Values\", \n ylab=\"Studentized Residuals\", \n pch=20, main=\"AR 1 only model\")\n\n## qq plot of studentized residuals\nqqnorm(rstudent(mod2), pch=20, col=2, main=\"\" )\nabline(a=0,b=1,lty=2, col=1)\n\n## histogram of studentized residuals\nhist(rstudent(mod2), col=2, \n xlab=\"Studentized Residuals\", \n main=\"\", border=8)\n\n\nMaybe one outlier, but not too bad.\n\nacf(mod2$residuals)\n\n\nWe seem to have taken care of all of the autoregression, even at multiple lags!\n\nn<-length(t)\nextractAIC(mod2, k=log(n))[2]\n\n[1] -66.50482\n\n\nBIC is much lower – overall a much better model than the first one." }, + { + "objectID": "VB_TimeDepData_practical_soln.html#example-solution-sinecosine-terms-only", + "href": "VB_TimeDepData_practical_soln.html#example-solution-sinecosine-terms-only", + "title": "VectorByte Methods Training: Regression Methods for Time Dependent Data (practical - solution)", + "section": "Example Solution: sine/cosine terms only", + "text": "Example Solution: sine/cosine terms only\n\nmod3<-lm(Y ~ sin12 + cos12, data=mozTS)\nsummary(mod3)\n\n\nCall:\nlm(formula = Y ~ sin12 + cos12, data = mozTS)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.70116 -0.21655 -0.03611 0.19213 0.67992 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.75706 0.05750 13.165 1.83e-14 ***\nsin12 -0.38804 0.08072 -4.807 3.48e-05 ***\ncos12 -0.34298 0.08192 -4.187 0.000207 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.3399 on 32 degrees of freedom\nMultiple R-squared: 0.5533, Adjusted R-squared: 0.5254 \nF-statistic: 19.82 on 2 and 32 DF, p-value: 2.512e-06\n\n\nThe model is better than the original trend-only model – it explains about 55% of the variability (we expect R^2 to increase as we add more predictors). 
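Since a sine plus a cosine at the same frequency is just one shifted sinusoid, we can also recover the amplitude and phase of the fitted seasonal cycle from these two coefficients (a quick sketch using the fitted model object):\n\nb <- coef(mod3)[c(\"sin12\", \"cos12\")]\nsqrt(sum(b^2)) ## amplitude of the seasonal cycle (about 0.52 here)\natan2(b[2], b[1]) ## phase shift, in radians\n\n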
Let’s plot the predictions:\n\n## plot points and fitted lines\nplot(Y~t, data=mozTS, col=1, type=\"l\")\nlines(t, mod3$fitted, col=3, lwd=2)\n\n\nPretty good! Look at all of the diagnostic plots:\n\npar(mfrow=c(1,3), mar=c(4,4,2,0.5)) \n\n## studentized residuals vs fitted\nplot(mod3$fitted, rstudent(mod3), col=3,\n xlab=\"Fitted Values\", \n ylab=\"Studentized Residuals\", \n pch=20, main=\"sin/cos only model\")\n\n## qq plot of studentized residuals\nqqnorm(rstudent(mod3), pch=20, col=3, main=\"\" )\nabline(a=0,b=1,lty=2, col=2)\n\n## histogram of studentized residuals\nhist(rstudent(mod3), col=3, \n xlab=\"Studentized Residuals\", \n main=\"\", border=8)\n\n\nMaybe one outlier, but not too bad.\n\nacf(mod3$residuals)\n\n\nWe seem to have taken care of the longer-lag autocorrelation, but there’s still some lag 1 left.\n\nn<-length(t)\nextractAIC(mod3, k=log(n))[2]\n\n[1] -68.00597\n\n\nThis model is even better than the AR1 model. We’ll keep this in mind…" }, + { + "objectID": "VB_TimeDepData_practical_soln.html#example-solution-environmental-predictors-only", + "href": "VB_TimeDepData_practical_soln.html#example-solution-environmental-predictors-only", + "title": "VectorByte Methods Training: Regression Methods for Time Dependent Data (practical - solution)", + "section": "Example Solution: environmental predictors only", + "text": "Example Solution: environmental predictors only\nI’ll put in the predictors at the current time period. Since this is monthly averaged data we could probably do either current or lagged.\n\nmod4<-lm(Y ~ MaxTemp + Precip, data=mozTS)\nsummary(mod4)\n\n\nCall:\nlm(formula = Y ~ MaxTemp + Precip, data = mozTS)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.76043 -0.17925 -0.01671 0.15491 0.64193 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) -1.248452 0.323576 -3.858 0.000521 ***\nMaxTemp 0.075450 0.011641 6.481 2.72e-07 ***\nPrecip 0.003928 0.011870 0.331 0.742852 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.3344 on 32 degrees of freedom\nMultiple R-squared: 0.5676, Adjusted R-squared: 0.5406 \nF-statistic: 21 on 2 and 32 DF, p-value: 1.493e-06\n\n\nThe model is even better than the last – it explains about 57% of the variability, although Precip isn’t significant and we might want to consider dropping it. Let’s plot the predictions:\n\n## plot points and fitted lines\nplot(Y~t, data=mozTS, col=1, type=\"l\")\nlines(t, mod4$fitted, col=4, lwd=2)\n\n\nPretty good! 
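Since Precip wasn’t significant, one natural side check is a temperature-only model, compared via BIC (a quick sketch; mod4b is a name introduced here just for illustration):\n\nmod4b <- lm(Y ~ MaxTemp, data=mozTS)\nn <- length(t)\nextractAIC(mod4b, k=log(n))[2]\n\nIf the BIC drops, the simpler model is preferable; either way, we continue with mod4 here. 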
Look at all of the diagnostic plots:\n\npar(mfrow=c(1,3), mar=c(4,4,2,0.5)) \n\n## studentized residuals vs fitted\nplot(mod4$fitted, rstudent(mod4), col=4,\n xlab=\"Fitted Values\", \n ylab=\"Studentized Residuals\", \n pch=20, main=\"weather model\")\n\n## qq plot of studentized residuals\nqqnorm(rstudent(mod4), pch=20, col=4, main=\"\" )\nabline(a=0,b=1,lty=2, col=2)\n\n## histogram of studentized residuals\nhist(rstudent(mod4), col=4, \n xlab=\"Studentized Residuals\", \n main=\"\", border=8)\n\n\nMaybe one outlier again, but not too bad.\n\nacf(mod4$residuals)\n\n\nWe seem to have taken care of all of the autoregression, except maybe a bit of AR1.\n\nn<-length(t)\nextractAIC(mod4, k=log(n))[2]\n\n[1] -69.14372\n\n\nEven better, although it’s not much different from the sin/cos model." }, + { + "objectID": "VB_TimeDepData_practical_soln.html#example-solution-ar1-plus-sincos", + "href": "VB_TimeDepData_practical_soln.html#example-solution-ar1-plus-sincos", + "title": "VectorByte Methods Training: Regression Methods for Time Dependent Data (practical - solution)", + "section": "Example Solution: AR1 plus sin/cos", + "text": "Example Solution: AR1 plus sin/cos\nOk, now to combine things:\n\nmod5<-lm(Y ~ Yl1 + sin12 + cos12, data=mozTS)\nsummary(mod5)\n\n\nCall:\nlm(formula = Y ~ Yl1 + sin12 + cos12, data = mozTS)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.49092 -0.25028 -0.02153 0.17287 0.60748 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.38035 0.12935 2.940 0.006148 ** \nYl1 0.49652 0.15681 3.166 0.003453 ** \nsin12 -0.13417 0.10729 -1.251 0.220457 \ncos12 -0.29593 0.07386 -4.007 0.000358 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.3002 on 31 degrees of freedom\nMultiple R-squared: 0.6625, Adjusted R-squared: 0.6298 \nF-statistic: 20.28 on 3 and 31 DF, p-value: 1.835e-07\n\n\nThis combined model is the best yet – together the AR1 and seasonal terms explain about 66% of the variability. Let’s plot the predictions:\n\n## plot points and fitted lines\nplot(Y~t, data=mozTS, col=1, type=\"l\")\nlines(t, mod5$fitted, col=5, lwd=2)\n\n\nPretty good! Look at all of the diagnostic plots:\n\npar(mfrow=c(1,3), mar=c(4,4,2,0.5)) \n\n## studentized residuals vs fitted\nplot(mod5$fitted, rstudent(mod5), col=5,\n xlab=\"Fitted Values\", \n ylab=\"Studentized Residuals\", \n pch=20, main=\"AR1 + sin/cos model\")\n\n## qq plot of studentized residuals\nqqnorm(rstudent(mod5), pch=20, col=5, main=\"\" )\nabline(a=0,b=1,lty=2, col=2)\n\n## histogram of studentized residuals\nhist(rstudent(mod5), col=5, \n xlab=\"Studentized Residuals\", \n main=\"\", border=8)\n\n\nThat’s really good!\n\nacf(mod5$residuals)\n\n\nWe seem to have taken care of all of the autoregression!\n\nn<-length(t)\nextractAIC(mod5, k=log(n))[2]\n\n[1] -74.25862\n\n\nAnd definitely the best so far. Just to compare more easily:\n\nc(mod1 = extractAIC(mod1, k=log(n))[2],\n mod2 = extractAIC(mod2, k=log(n))[2],\n mod3 = extractAIC(mod3, k=log(n))[2],\n mod4 = extractAIC(mod4, k=log(n))[2],\n mod5 = extractAIC(mod5, k=log(n))[2])\n\n mod1 mod2 mod3 mod4 mod5 \n-44.11057 -66.50482 -68.00597 -69.14372 -74.25862 \n\n\nWe’re looking for a difference of about 5 to determine whether a model is better. Model 5 is about 5 better than model 4, and models 2–4 are all about even. AR1 plus temperature might be even better, but it’s easier to forecast with a sine/cosine than with temperature, so I went for that…."
+ }, { "objectID": "GP_Practical.html", "href": "GP_Practical.html",