Commit 9367f95 (parent c64b772): finishing the linear models for time dep data, including adding data

Showing 7 changed files with 176,749 additions and 1 deletion.

@@ -0,0 +1,177 @@
---
title: "VectorByte Methods Training"
subtitle: "Practical: Intro to Time Dependent Data"
author: "The VectorByte Team (Leah R. Johnson, Virginia Tech)"
format:
  html:
    toc: true
    toc-location: left
    html-math-method: katex
    css: styles.css
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
set.seed(123)
```
<br>

# Overview and Instructions
The goal of this practical is to practice building models for time-dependent data using simple regression-based techniques. This includes incorporating possible transformations, trying out different time-dependent predictors (including lagged variables), and assessing model fit using diagnostic plots.
<br>

# Guided example: Monthly average mosquito counts in Walton County, FL
The file [Culex_erraticus_walton_covariates_aggregated.csv](data/Culex_erraticus_walton_covariates_aggregated.csv) on the course website contains data on **average monthly counts of mosquitoes** (`sample_value`) in Walton County, FL, together with monthly average maximum temperature (`MaxTemp`, in C) and precipitation (`Precip`, in inches) for each month from January 2015 through December 2017 (`Month_Yr`).
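Since `Month_Yr` is stored as text in `"YYYY-MM"` form, it can be handy to convert it to a proper `Date` for plotting or joining; a minimal sketch (appending a first-of-month day, which is an assumption about how you want to anchor each month):

```r
## sketch: "YYYY-MM" strings -> Date, by appending a first-of-month day
month_yr <- c("2015-01", "2015-02", "2017-12")  # toy values in the documented format
dates <- as.Date(paste0(month_yr, "-01"))
format(dates, "%Y-%m")  # round-trips back to the original labels
```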
## Exploring the Data

As always, we first want to take a look at the data to make sure we understand it and that we don't have missing or weird values.
```{r}
mozData <- read.csv("data/Culex_erraticus_walton_covariates_aggregated.csv")
summary(mozData)
```

We can see that the minimum observed average number of mosquitoes is zero, and the maximum is only 3 (there are likely many zeros averaged over many days in each month). There don't appear to be any `NA`s in the data. In this case the dataset is small enough that we can print the whole thing to ensure it's complete:

```{r}
mozData
```
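A summary won't reveal a silently missing month, so it can also be worth checking that `Month_Yr` covers every month in its range. A small sketch of that check, run here on a toy complete sequence (with the real data you would set `observed <- mozData$Month_Yr`):

```r
## build the months we'd expect between the first and last observed labels
observed <- format(seq(as.Date("2015-01-01"), as.Date("2017-12-01"), by = "month"), "%Y-%m")
first <- as.Date(paste0(min(observed), "-01"))
last  <- as.Date(paste0(max(observed), "-01"))
expected <- format(seq(first, last, by = "month"), "%Y-%m")
setdiff(expected, observed)  # character(0) means no gaps
```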
## Plotting the data

First we'll examine the data itself, including the predictors:
```{r, fig.align='center', fig.height=10, fig.width=8}
months <- dim(mozData)[1]
t <- 1:months ## counter for months in the data set
par(mfrow=c(3,1))
plot(t, mozData$sample_value, type="l", lwd=2,
     main="Average Monthly Abundance",
     xlab="Time (months)",
     ylab="Average Count")
plot(t, mozData$MaxTemp, type="l",
     col=2, lwd=2,
     main="Average Maximum Temp",
     xlab="Time (months)",
     ylab="Temperature (C)")
plot(t, mozData$Precip, type="l",
     col="dodgerblue", lwd=2,
     main="Average Monthly Precip",
     xlab="Time (months)",
     ylab="Precipitation (in)")
```

Visually, there may be a bit of clumping in the abundance values (this is subtle) -- in particular, since we have a lot of very small, nearly zero counts, a transform such as a square root may spread the abundances out. It also looks like both the abundance and temperature data are more cyclical than the precipitation, and thus more likely to be related to each other. There's not much visual indication of a trend, but it's usually worthwhile to consider one anyway. Replotting the abundance data with a transformation:
```{r, fig.align='center', fig.width=8}
months <- dim(mozData)[1]
t <- 1:months ## counter for months in the data set
plot(t, sqrt(mozData$sample_value), type="l", lwd=2,
     main="Sqrt Average Monthly Abundance",
     xlab="Time (months)",
     ylab="Sqrt Average Count")
```
That looks a little bit better. I suggest we go with this for our response.

## Building a data frame
Before we get into model building, we want to build a data frame containing all of the predictors we want to consider, at the lags we're interested in. In the lecture we saw how to build the AR, sine/cosine, and trend predictors:

```{r}
t <- 2:months ## to make building the AR1 predictors easier
mozTS <- data.frame(
  Y = sqrt(mozData$sample_value[t]),     # transformed response
  Yl1 = sqrt(mozData$sample_value[t-1]), # AR1 predictor
  t = t,                                 # trend predictor
  sin12 = sin(2*pi*t/12),
  cos12 = cos(2*pi*t/12)                 # periodic predictors
)
```
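A quick sanity check on the periodic predictors: with a period of 12, the sine and cosine columns should repeat exactly every 12 months. A small sketch using the same construction:

```r
t <- 2:36  # same index as in the data frame above
sin12 <- sin(2*pi*t/12)
cos12 <- cos(2*pi*t/12)
## values 12 positions apart should match (up to floating point error)
all.equal(sin12[1:23], sin12[13:35])
all.equal(cos12[1:23], cos12[13:35])
```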
We will also put in the temperature and precipitation predictors. But we need to think about what an appropriate lag might be. If this were daily or weekly data, we'd probably want a fairly sizable lag -- mosquitoes take a while to develop, so the number we see today is not likely related to the temperature today. However, since these data are aggregated across a whole month, as are the temperature and precipitation, the current month's values are likely to be useful. Last month's values may be useful as well, so we'll add those in too:

```{r}
mozTS$MaxTemp <- mozData$MaxTemp[t]     ## current temps
mozTS$MaxTempl1 <- mozData$MaxTemp[t-1] ## previous temps
mozTS$Precip <- mozData$Precip[t]       ## current precip
mozTS$Precipl1 <- mozData$Precip[t-1]   ## previous precip
```
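Building each lagged column by indexing with `t` and `t-1` works, but is easy to get wrong; a generic lag helper (hypothetical, not part of the practical's code) makes the intent explicit:

```r
## pad the front with k NAs and drop the last k values
lag_vec <- function(x, k = 1) c(rep(NA, k), x[seq_len(length(x) - k)])

x <- c(10, 20, 30, 40)
lag_vec(x, 1)  # NA 10 20 30
```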
Thus our full data frame:

```{r}
summary(mozTS)
```

```{r}
head(mozTS)
```
## Building a first model

We will first build a very simple model -- just a trend -- to practice building the model, checking diagnostics, and plotting predictions.

```{r}
mod1 <- lm(Y ~ t, data=mozTS)
summary(mod1)
```
The model output indicates that this model is not useful -- the trend is not significant, and it explains only about 2% of the variability. Let's plot the predictions:

```{r, fig.align='center', fig.height=4, fig.width=5}
## plot points and fitted lines
plot(Y ~ t, data=mozTS, col=1, type="l")
lines(t, mod1$fitted, col="dodgerblue", lwd=2)
```
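The fit statistics quoted above (the R² and the trend's p-value) can also be pulled out of `summary()` programmatically, which is handy when you start comparing many models. A sketch on toy data:

```r
set.seed(7)
toy <- data.frame(t = 1:35, Y = rnorm(35))  # pure noise: no real trend
m <- lm(Y ~ t, data = toy)
summary(m)$r.squared                        # proportion of variance explained
summary(m)$coefficients["t", "Pr(>|t|)"]    # p-value for the trend term
```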
Not good -- we'll definitely need to try something else! Remember that since we're using a linear model, we should check our residual plots as usual, and then also plot the `acf` of the residuals:

```{r, fig.align='center', fig.height=4, fig.width=8}
par(mfrow=c(1,3), mar=c(4,4,2,0.5))
## studentized residuals vs fitted
plot(mod1$fitted, rstudent(mod1), col=1,
     xlab="Fitted Values",
     ylab="Studentized Residuals",
     pch=20, main="Trend only model")
## qq plot of studentized residuals
qqnorm(rstudent(mod1), pch=20, col=1, main="")
abline(a=0, b=1, lty=2, col=2)
## histogram of studentized residuals
hist(rstudent(mod1), col=1,
     xlab="Studentized Residuals",
     main="", border=8)
```
This doesn't look really bad, although the histogram might be a bit weird. Finally, the `acf`:

```{r, fig.align='center', fig.height=4, fig.width=8}
acf(mod1$residuals)
```

This is where we can see that we definitely aren't capturing the pattern: there's substantial autocorrelation left at a 1 month lag, and again around 6 months.
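The bars in the `acf` plot are just sample autocorrelations of the residual series; the lag-1 value, for instance, can be computed almost exactly by hand. A sketch on a simulated AR(1) series, so the answer is known to be large (the small discrepancy comes from `acf` using a slightly different divisor than `cor`):

```r
set.seed(1)
e <- as.numeric(arima.sim(list(ar = 0.7), n = 200))
r1 <- acf(e, plot = FALSE)$acf[2]  # index 1 is lag 0, so index 2 is lag 1
## nearly the same thing by hand: correlate the series with itself shifted by one
r1_hand <- cor(e[-1], e[-length(e)])
```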
Finally, looking ahead, we can extract the BIC for this model so that we can compare it with the other models you'll build next.

```{r}
n <- length(t)
extractAIC(mod1, k=log(n))[2]
```
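When you have several candidates fit to the same response, collecting the BIC values into one named vector makes the comparison easy. A sketch with a hypothetical helper on toy data (smaller BIC is better):

```r
bic_of <- function(m, n) extractAIC(m, k = log(n))[2]

set.seed(42)
toy <- data.frame(t = 1:35, Y = rnorm(35))
mods <- list(intercept = lm(Y ~ 1, data = toy),
             trend     = lm(Y ~ t, data = toy))
sort(sapply(mods, bic_of, n = nrow(toy)))  # best (smallest) model first
```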
# Build and compare your own models

Follow the procedure I showed for the model with a simple trend, and build ***at least*** 4 more models:

1. one that contains an AR term
2. one with the sine/cosine terms
3. one with the environmental predictors
4. one with a combination

Check diagnostics/model assumptions as you go. Then at the end compare all of your models via BIC. What is your best model by that metric? We'll share among the group what folks found to be good models.
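As a template for the first item, an AR(1) model fits the same way the trend model did; here is a sketch on simulated data (with the real data you would use `mozTS` and its `Yl1` column instead):

```r
set.seed(99)
y <- as.numeric(arima.sim(list(ar = 0.6), n = 36))
d <- data.frame(Y = y[2:36], Yl1 = y[1:35])  # response and its one-step lag
modAR <- lm(Y ~ Yl1, data = d)
extractAIC(modAR, k = log(nrow(d)))[2]       # BIC, comparable across models on the same data
```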
@@ -27,5 +27,5 @@ format:
   css: styles.css
   toc: true

-editor: visual
+editor: source

@@ -0,0 +1,83 @@
## code to clean data on mosquito abundance
## and combine it with climate predictors
## based on code from previous VectorBiTE training

require(dplyr)
require(tidyr)

## mozzie data

Complete_data <- read.csv("../data/Culex_erraticus_walton.csv")

# select only the columns of interest
Main_data <- select(Complete_data, c("sample_end_date", "sample_value",
                                     "sample_lat_dd", "sample_long_dd"))

# Make sure the sample end date is in date format
Main_data$sample_end_date <- as.Date(Main_data$sample_end_date, format = "%Y-%m-%d")

# Order by date
Main_data <- Main_data[order(Main_data$sample_end_date, decreasing=FALSE),]

# We can now create columns for Month/Year and Month
Main_data$Month_Yr <- format(as.Date(Main_data$sample_end_date), "%Y-%m")
Main_data$Month <- format(as.Date(Main_data$sample_end_date), "%m")

Main_data$MaxTemp <- Main_data$Precip <- NA

## Climate Data

Climate_data <- read.csv("../data/vectorbase_locations_dates_climate.csv")

## We can now populate these columns by matching up the date for each row,
## and the closest coordinates we have in our climate data.

# For each row in Main_data
for (row in 1:nrow(Main_data)){

  # extract the date associated with the row
  date <- as.character(Main_data[row, "sample_end_date"])

  # subset the climate data to only those rows with the same date
  data <- subset(Climate_data, Climate_data$Collection.date.range == date)

  if (nrow(data) > 0){

    # define the lat and long desired
    lat <- as.numeric(Main_data[row, "sample_lat_dd"])
    long <- as.numeric(Main_data[row, "sample_long_dd"])

    # find the absolute differences between the desired lat/long and the climate data's
    x <- abs(data$Initial_latitude - lat)
    y <- abs(data$Initial_longitude - long)

    # find the index with the minimum overall difference between lat/longs
    z <- which(x + y == min(x + y))

    # draw out the max temp and precip and place them into the main data frame
    Main_data[row, "MaxTemp"] <- data[z, "Max.Temp"]
    Main_data[row, "Precip"] <- data[z, "Precipitation"]

  } else {

    # If there aren't any data to extract for a given date, input NAs
    Main_data[row, "MaxTemp"] <- NA
    Main_data[row, "Precip"] <- NA
  }
}

summary(Main_data) ## looks good

## take averages for each month
Aggregated_data <- aggregate(cbind(sample_value, MaxTemp, Precip) ~ Month_Yr,
                             data = Main_data, mean)
print(Aggregated_data)

write.csv(Aggregated_data, file="Culex_erraticus_walton_covariates_aggregated.csv",
          row.names = FALSE)

plot(Aggregated_data$sample_value, type="l")