Skip to content

Commit

Permalink
made a few changes to get GP content into the format that matches the…
Browse files Browse the repository at this point in the history
… rest of the content
  • Loading branch information
lrjohnson0 committed Jul 19, 2024
1 parent 2ce5c11 commit 31bc49e
Show file tree
Hide file tree
Showing 6 changed files with 349 additions and 40 deletions.
16 changes: 11 additions & 5 deletions GP.qmd
Original file line number Diff line number Diff line change
@@ -1,8 +1,14 @@
---
title: "Gaussian Processes"
title: "VectorByte Methods Training"
subtitle: "Introduction to Gaussian Processes for Time Dependent Data"
editor: source
author: "The VectorByte Team (Parul Patil, Virginia Tech)"
title-slide-attributes:
data-background-image: VectorByte-logo_lg.png
data-background-size: contain
data-background-opacity: "0.2"
format: revealjs
editor: visual
fontsize: 20pt
fontsize: 22pt
html-math-method: katex
---

Expand Down Expand Up @@ -120,11 +126,11 @@ ggplot() +

- Recall that in Linear Regression, you have $\ \Sigma \ = \sigma^2 \mathbb{I}$

- For a GP, the cavariance matrix ( \$ \Sigma \$ ) is defined by a kernel.
- For a GP, the covariance matrix ( $\Sigma$ ) is defined by a kernel.

- Consider,

\$\Sigma\_n = \tau\^2  C_n  \$;    \$C_n = \exp \left( - \vert \vert x - x' \vert \vert\^2 \right) \$, where $x$ and $x'$ are input locations.
$\Sigma_n = \tau^2 C_n$ where $C_n = \exp \left( - \vert \vert x - x' \vert \vert^2 \right)$, and $x$ and $x'$ are input locations.

. . .

Expand Down
40 changes: 26 additions & 14 deletions GP_Notes.qmd
Original file line number Diff line number Diff line change
@@ -1,13 +1,25 @@
---
title: "Notes"
author: "Parul Patil"
date: "2024-07-18"
output: html_document
title: "VectorByte Methods Training"
subtitle: "Introduction to Gaussian Processes for Time Dependent Data"
author: "The VectorByte Team (Parul Patil, Virginia Tech)"
title-slide-attributes:
data-background-image: VectorByte-logo_lg.png
data-background-size: contain
data-background-opacity: "0.2"
format:
html:
toc: true
toc-location: left
html-math-method: katex
css: styles.css
---

### Gaussian Processes
# Introduction to Gaussian Processes for Time Dependent Data

#### Introduction
This document introduces the conceptual background to Gaussian Process (GP) regression, along with mathematical concepts. We also demonstrate briefly fitting GPs using the `laGP` package in R. The material here is intended to give a more verbose introduction to what is covered in the [lecture](GP.qmd) in order to support a student to work through the [practical component](GP_Practical.qmd). This material has been adapted from chapter 5 of the book [Surrogates: Gaussian process modeling, design and optimization for the applied sciences](https://bobby.gramacy.com/surrogates/) by Robert Gramacy.


# Gaussian Processes

A Gaussian Process (GP) is a non-parameteric and flexible method to model data. Here, we assume that the data follow a Multivariate Normal Distribution. It is widely used in the fields of spatial statistics commonly known as *kriging* as well as machine learning as a surrogate model due to it's ability of making fast predictions with good uncertainty quantification.

Expand All @@ -17,7 +29,7 @@ For e.g. Suppose we wish to estimate the amount of energy released when a bomb e

In the field of ecology, as we will see in this tutorial, we will use a Gaussian Process model as a Non-Parametric Regression tool, similar to Linear Regression. We will assume our data follows a GP, and use Bayesian Methods to infer the distribution of new locations given the data. The predictions obtained will have good uncertainty quantification which is ofcourse valuable in this field due to the noisy nature of our data.

##### Gaussian Process Prior
## Gaussian Process Prior

In our setup, we assume that the data follows a Multivariate Normal Distribution. We can think of this as a Prior. Mathematically, we can write it as:

Expand All @@ -35,7 +47,7 @@ $$\Sigma(x, x') = \exp \left( \ \vert \vert{x - x'} \vert \vert^2 \ \right)$$

This creates a positive semi-definite correlation kernel. It also uses the proximity of $x$ and $x'$ as a measure of correlation i.e. the closer two points in the input space are, the more highly their corresponding responses are correlated. We will learn the exact form of $\Sigma_n$ later in the tutorial. First, we need to learn about MVN Distribution and the Posterior Distribution given the data.

**Multivariate Normal Distribution**
## Multivariate Normal Distribution

Suppose we have $X = (X_1, X_2)$

Expand Down Expand Up @@ -116,7 +128,7 @@ Y(x_p) \vert \ D_n \ & \sim \mathcal{N} \left(\mu(x_p) \ , \ \sigma^2(x_p) \righ
\end{aligned}
\end{equation}

**Visualization**
## Example: GP for Toy Example

Suppose we have data from the following function,
$$Y(x) = 5 \ \sin (x)$$
Expand Down Expand Up @@ -180,7 +192,7 @@ ggplot() +

<!-- Do it for one location -->

##### Covariance Function
## Covariance Function

As mentioned before, the main action is in the specification of $\Sigma_n$. Let's express $\Sigma_n = \tau^2 C_n$ where $C_n$ is the correlation function. We will be using the squared exponential distance based correlation function. The kernel can be written down mathematically as,

Expand All @@ -190,7 +202,7 @@ Here, if $x$ and $x'$ are closer in distance, the responses are more highly corr

We have three main hyper-parameters here:

###### $\tau^2$: Scale
### $\tau^2$: Scale

This parameter dictates the amplitude of our function. A MVN Distribution with scale = 1 will usually have data between -2 to 2. As the scale increases, this range expands. Here is a plo that shows two different scales for the same function, with all other parameters fixed.

Expand Down Expand Up @@ -224,7 +236,7 @@ matplot(X, t(Y_scaled), type = 'l', main = expression(paste(tau^2, " = 25")),
ylab = "Y", xlab = "X", lwd = 2, col = "red")
```

###### $\theta$: Length-scale
### $\theta$: Length-scale

The length-scale controls the wiggliness of the function. It is also referred to as the rate of decay of correlation. The smaller it's value, the more wiggly the function gets. This is because we expect the change in directionalilty of the function to be rather quick. We will once again demonstrate how a difference in magnitude of $\theta$ affects the function, keeping everything else constant.

Expand Down Expand Up @@ -262,7 +274,7 @@ $$C_n = \exp{ \left( - \sum_{k=1}^{m} \frac{\vert\vert x - x' \vert \vert ^2}{\

This type of modelling is also referred to a seperable GP since we can take the sum outside the exponent and it will be a product of $m$ independent dimensions. If we set all $\theta_k= \theta$, it would be an isotropic GP.

###### $g$: Nugget
### $g$: Nugget

This parameter adds some noise into the function. For $g > 0$, it determines the magnitude of discontinuity as $x'$ tends to $x$. It is also called the nugget effect. For $g=0$, there would be no noise and the function would interpolate between the points. This effect is only added to the diagonals of the correlation matrix. We never add $g$ to the off-diagonal elements. This also allows for numeric stability. Below is a snippet of what different magnitudes of $g$ look like.

Expand Down Expand Up @@ -306,7 +318,7 @@ For now, let's get back to GP and fitting and learn how to use it. We have alrea
- $\tau^2$: An estimate is obtained as a closed form solution once we plug in g.


### Heteroskedastic Gaussian Processes
# Heteroskedastic Gaussian Processes

Let $Y_N$ be the response vector of size N. Let $X_1, X_2 ... X_d$ be the input space.

Expand Down
51 changes: 31 additions & 20 deletions GP_Practical.qmd
Original file line number Diff line number Diff line change
@@ -1,16 +1,22 @@
---
title: "Gaussian Processes: Practical"
author: "Parul Patil"
date: "2024-06-21"
output: html_document
title: "VectorByte Methods Training"
subtitle: "Practical: Introduction to Gaussian Processes for Time Dependent Data"
author: "The VectorByte Team (Parul Patil, Virginia Tech)"
format:
html:
toc: true
toc-location: left
html-math-method: katex
css: styles.css
---

## Gaussian Processes
# Objectives

This practical will lead you through fitting a few versions of GPs using two R packages: `laGP` and `hetGP`. We will begin with a toy example from the lecture and then move on to a real data example to forecast tick abundances for a NEON site.

### Basics: Fitting a GP Model
# Basics: Fitting a GP Model

Remember the function we looked at before?
Remember the function we looked at before:
$$Y(x) = 5 \ \sin(x)$$
Now, let's learn how we actually use the library `laGP` to fit a GP and make predictions at new locations. Let's begin by loading some libraries

Expand Down Expand Up @@ -117,10 +123,13 @@ ggplot() +

Looks pretty cool.


# Using GPs for data on tick abundances over time

We will try all this on a simple dataset: Tick Data from NEON Forecasting Challenge. We will first learn a little bit about this dataset, followed by setting up our predictors and using them in our model to predict tick density for the future season. We will also learn how to fit a separable GP and specify priors for our parameters. Finally, we will learn some basics about a HetGP (Heteroskedastic GP) and try and fit that model as well.


### Overview of the Data
## Overview of the Data

- **Objective**: Forecast tick density for 4 weeks into the future

Expand All @@ -144,7 +153,7 @@ target <- readr::read_csv("https://data.ecoforecast.org/neon4cast-targets/ticks/
head(target)
```

#### Initial Setup
## Initial Setup

- For a GP model, we assume the response ($Y$) should be normally distributed.

Expand Down Expand Up @@ -180,7 +189,7 @@ fi <- function(y) {
}
```

#### Predictors
## Predictors

- The goal is to forecast tick populations for a season so our response (Y) here, is the tick density. However, we do not have a traditional data set with an obvious input space. What is the X? <!-- interactive -->

Expand Down Expand Up @@ -278,7 +287,7 @@ $X^* = (X_1^*, X_2^* ... X_n^*)$ will be the standarized $X$'s with all $X_i^*$

We can either write a function for this, or in our case, we can just divide Iso-week by 53 since that would result effectively be the same. Our Sin Predictor already lies in the interval [0, 1].

### Model Fitting
## Model Fitting

Now, let's start with modelling. We will start with one random location out of the 9 locations.

Expand Down Expand Up @@ -310,7 +319,7 @@ df_train <- subset(df, df$datetime <= cutoff)
df_test <- subset(df, df$datetime > cutoff)
```

#### GP Model
## GP Model

Now we will setup our X's. We already have the functions to do this and can simply pass in the datetime. We then combine $X_1$ and $X_2$ to create out input matrix $X$. Remember, everything is ordered as in our dataset.

Expand Down Expand Up @@ -433,11 +442,10 @@ rmse
```


Next, we can attempt a hetGP.

#### HetGP Model
## HetGP Model

We are now interested in fitting a vector of nuggets rather than a single value.
Next, we can attempt a hetGP. We are now interested in fitting a vector of nuggets rather than a single value.

Let's use the same data we have to fit a hetGP. We already have our data `(X, y)` as well as our prediction set `XXt`. We use the `mleHetGP` command to fit a GP and pass in our data. The default covariance structure is the Squared Exponential structure. We use the `predict` function in base R and pass the hetGP object i.e. `het_gpi` to make predictions on our set `XXt`.

Expand Down Expand Up @@ -552,11 +560,13 @@ plot(X, y)
# Add code to fit a hetGP model and visualise it as above
```

### Challenges
# Challenges

- Fit a GP Model for the location "SERC" i.e. `site_number = 7`.
Now it's your turn to try to fit a GP on your own. Here are some options:

- Use an environmental predictor in your model. Following is a function `fx.green` that creates the variable given the `datetime` and the `location`.
### Fit a GP Model for the location "SERC" i.e. `site_number = 7`.

### Use an environmental predictor in your model. Following is a function `fx.green` that creates the variable given the `datetime` and the `location`.

Here is a snippet of the supporting file that you will use; You can look into the data.frame and try to plot `ker` for one site at a time and see what it yields.

Expand All @@ -576,10 +586,11 @@ fx.green <- function(datetime, site, site_info = df_green){
}
```

- Fit a GP Model for all the locations (*More advanced*).
### Fit a GP Model for all the locations (*More advanced*).

Hint: Write a *function* that can fit a GP and make predictions. Then, write a *for loop* where you subset the data for each location and then call the function to fit the GP.

<!-- Change colors on plots -->

<!-- \link{https://projects.ecoforecast.org/neon4cast-docs/Ticks.htmls} -->
<!-- \link{https://projects.ecoforecast.org/neon4cast-docs/Ticks.htmls} -->

File renamed without changes.
Loading

0 comments on commit 31bc49e

Please sign in to comment.