Commit

slight corrections
stineb committed Sep 2, 2021
1 parent 7bf438a commit 2fbc059
Showing 11 changed files with 24 additions and 14 deletions.
2 changes: 1 addition & 1 deletion .Rproj.user/B7E35732/bibliography-index/biblio-files
@@ -1,2 +1,2 @@
1629982167:/Users/bestocke/ml4ec_workshop/book.bib
1630569059:/Users/bestocke/ml4ec_workshop/packages.bib
1630577408:/Users/bestocke/ml4ec_workshop/packages.bib
4 changes: 2 additions & 2 deletions .Rproj.user/B7E35732/sources/prop/CEF9E438
@@ -1,7 +1,7 @@
{
"source_window_id": "",
"Source": "Source",
"cursorPosition": "20,18",
"scrollLine": "0",
"cursorPosition": "33,155",
"scrollLine": "27",
"docOutlineVisible": "1"
}
8 changes: 6 additions & 2 deletions 07-solutions.Rmd
@@ -73,6 +73,8 @@ summary(linmod_baser)
BIC(linmod_baser)
```

The variable `PA_F` was not significant in the linear model. Therefore, we won't use it for the models below.

```{r warning=FALSE, message=FALSE}
## Fit an lm model on the same data, but with PA_F removed.
linmod_baser_nopaf <- lm(
@@ -186,6 +188,10 @@ eval_model <- function(mod, df_train, df_test){
eval_model(mod = linmod_baser, df_train = ddf_train, df_test = ddf_test)
```

Here, the function `eval_model()` returned an object made up of two plots (`return(gg1 + gg2)` in the function definition). Combining plots with `+` is enabled by the [**patchwork**](https://patchwork.data-imaginist.com/) library. The individual plot objects (`gg1` and `gg2`) are returned by the `ggplot()` calls. The visualisation is a density plot of hexagonal bins: the color of each bin encodes the number of points inside it (see legend "count"). We want the highest density of points along the 1:1 line (the dotted line), because predictions match observations perfectly for points lying on that line. Alternatively, we could use a scatterplot to visualise the model evaluation, but then a large number of points would overlie each other. Since typical machine learning applications use large amounts of data, evaluation scatterplots typically suffer from overlying points, and density plots are a solution.
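
To make the mechanics concrete, here is a minimal sketch (not the tutorial's `eval_model()`), assuming the **ggplot2**, **patchwork**, and **hexbin** packages are installed. The data frame `df_demo` and its columns `obs` and `pred` are synthetic and purely illustrative. The sketch draws the same data once as hexagonal bins and once as a scatterplot, and combines the two plots with `+`.

```r
library(ggplot2)
library(patchwork)  # enables combining ggplot objects with `+`

## Synthetic "observations" and "predictions" (illustrative only)
set.seed(1)
df_demo <- data.frame(obs = rnorm(5000, mean = 5, sd = 2))
df_demo$pred <- df_demo$obs + rnorm(5000, sd = 1)

## Density plot of hexagonal bins (geom_hex() needs the hexbin package to render)
gg_hex <- ggplot(df_demo, aes(x = pred, y = obs)) +
  geom_hex() +
  geom_abline(slope = 1, intercept = 0, linetype = "dotted") +
  labs(title = "Hexagonal bins")

## Plain scatterplot of the same data, for comparison
gg_scatter <- ggplot(df_demo, aes(x = pred, y = obs)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dotted") +
  labs(title = "Scatterplot")

## patchwork places the two plots side by side
gg_combined <- gg_hex + gg_scatter
gg_combined
```

With several thousand points, the scatterplot panel tends to saturate near the 1:1 line, while the hexagonal bins still show where most points fall.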

Metrics are given in the subtitle of the plots. Note that the $R^2$ and the RMSE measure different aspects of model-data agreement. Here, they measure the correlation (fraction of variation explained) and the average magnitude of the error, respectively. We should generally consider multiple metrics measuring multiple aspects of the prediction-observation fit to evaluate models.
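
The following sketch illustrates why the two metrics are complementary. It uses synthetic data, and the helper functions `rsq_demo()` and `rmse_demo()` are hypothetical names introduced here, not functions from the tutorial. It assumes that $R^2$ is computed as the squared Pearson correlation between observations and predictions; other definitions of $R^2$ exist and can give different values.

```r
## Synthetic observations (illustrative only)
set.seed(42)
obs <- rnorm(1000, mean = 5, sd = 2)

## Predictions that are perfectly correlated with the observations,
## but systematically offset by 3 units
pred_biased <- obs + 3

## R-squared as the squared Pearson correlation (an assumption, see text)
rsq_demo <- function(obs, pred) cor(obs, pred)^2

## Root mean square error
rmse_demo <- function(obs, pred) sqrt(mean((pred - obs)^2))

rsq_demo(obs, pred_biased)   # equals 1 up to floating-point error
rmse_demo(obs, pred_biased)  # equals 3 up to floating-point error
```

The correlation-based $R^2$ is blind to the constant offset, whereas the RMSE picks it up. Looking at only one of the two metrics would give an incomplete picture of the fit.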

## KNN

### Check data
@@ -201,8 +207,6 @@ ddf_train %>%
facet_wrap(~variable, scales = "free")
```

The variable `PA_F` looks weird and was not significant in the linear model. Therefore, we won't use it for the models below.

### Training

Fit two KNN models on `ddf_train` (excluding `"PA_F"`), one with $k = 2$ and one with $k = 30$, both without resampling. Use the RMSE as the loss function. Center and scale data as part of the pre-processing and model formulation specification using the function `recipe()`.
2 changes: 1 addition & 1 deletion docs/data-splitting.html
@@ -274,7 +274,7 @@ <h2><span class="header-section-number">3.1</span> Reading and wrangling data</h
<span id="cb2-18"><a href="data-splitting.html#cb2-18" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-19"><a href="data-splitting.html#cb2-19" aria-hidden="true" tabindex="-1"></a> <span class="do">## drop QC variables (no longer needed), except NEE_VUT_REF_QC</span></span>
<span id="cb2-20"><a href="data-splitting.html#cb2-20" aria-hidden="true" tabindex="-1"></a> <span class="fu">select</span>(<span class="sc">-</span><span class="fu">ends_with</span>(<span class="st">&quot;_QC&quot;</span>), NEE_VUT_REF_QC)</span></code></pre></div>
<p>If the style of the code above looks unfamiliar - this is the <strong><a href="https://www.tidyverse.org/">tidyverse</a></strong>. The tidyverse is a R syntax “dialect” and a collection of R functions and packages. They share the structure of arguments and function return values than can be combined to a chain by the <code>%&gt;%</code> (“pipe”) operator. For this, the output of each function is a data frame which is “piped” to the next function, and each function takes a data frame as input. What is piped into a function takes the place of the first argument, normally provided in brackets. This enables ease with typical data wrangling and visualization tasks (<strong><a href="https://ggplot2.tidyverse.org/">ggplot2</a></strong> is part of it). This tutorial is generally written using tidyverse packages and code syntax.</p>
<p>If the style of the code above looks unfamiliar - this is the <strong><a href="https://www.tidyverse.org/">tidyverse</a></strong>. The tidyverse is a R syntax “dialect” and a collection of R functions and packages. They share the structure of arguments and function return values than can be combined to a chain by the <code>%&gt;%</code> (“pipe”) operator. For this, the output of each function is a data frame which is “piped” to the next function, and each function takes a data frame as input. What is piped into a function takes the place of the first argument, normally provided inside the brackets. This enables ease with typical data wrangling and visualization tasks (<strong><a href="https://ggplot2.tidyverse.org/">ggplot2</a></strong> is part of the tidyverse). This tutorial is generally written using tidyverse packages and code syntax.</p>
<p>The column <code>NEE_VUT_REF_QC</code> provides information about the fraction of gap-filled half-hourly data used to calculate daily aggregates. Let’s use only <code>GPP_NT_VUT_REF</code> data, where at least 80% of the underlying half-hourly data was good quality measured data, and not gap-filled. Make sure to not actually remove the respective rows, but rather replace values with NA.</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb3-1"><a href="data-splitting.html#cb3-1" aria-hidden="true" tabindex="-1"></a>ddf <span class="ot">&lt;-</span> ddf <span class="sc">%&gt;%</span> </span>
<span id="cb3-2"><a href="data-splitting.html#cb3-2" aria-hidden="true" tabindex="-1"></a> <span class="fu">mutate</span>(<span class="at">GPP_NT_VUT_REF =</span> <span class="fu">ifelse</span>(NEE_VUT_REF_QC <span class="sc">&lt;</span> <span class="fl">0.8</span>, <span class="cn">NA</span>, GPP_NT_VUT_REF))</span></code></pre></div>
2 changes: 1 addition & 1 deletion docs/index.html
@@ -259,7 +259,7 @@ <h2><span class="header-section-number">1.1</span> Apps</h2>
<div id="libraries" class="section level2" number="1.2">
<h2><span class="header-section-number">1.2</span> Libraries</h2>
<p>Install missing packages for this tutorial.</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb1-1"><a href="index.html#cb1-1" aria-hidden="true" tabindex="-1"></a>list_pkgs <span class="ot">&lt;-</span> <span class="fu">c</span>(<span class="st">&quot;caret&quot;</span>, <span class="st">&quot;recipes&quot;</span>, <span class="st">&quot;rsample&quot;</span>, <span class="st">&quot;tidyverse&quot;</span>, <span class="st">&quot;conflicted&quot;</span>, <span class="st">&quot;modelr&quot;</span>, <span class="st">&quot;forcats&quot;</span>, <span class="st">&quot;yardstick&quot;</span>, <span class="st">&quot;visdat&quot;</span>, <span class="st">&quot;skimr&quot;</span>, <span class="st">&quot;ranger&quot;</span>, <span class="st">&quot;knitr&quot;</span>)</span>
<div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb1-1"><a href="index.html#cb1-1" aria-hidden="true" tabindex="-1"></a>list_pkgs <span class="ot">&lt;-</span> <span class="fu">c</span>(<span class="st">&quot;caret&quot;</span>, <span class="st">&quot;recipes&quot;</span>, <span class="st">&quot;rsample&quot;</span>, <span class="st">&quot;tidyverse&quot;</span>, <span class="st">&quot;conflicted&quot;</span>, <span class="st">&quot;modelr&quot;</span>, <span class="st">&quot;forcats&quot;</span>, <span class="st">&quot;yardstick&quot;</span>, <span class="st">&quot;visdat&quot;</span>, <span class="st">&quot;skimr&quot;</span>, <span class="st">&quot;ranger&quot;</span>, <span class="st">&quot;knitr&quot;</span>, <span class="st">&quot;patchwork&quot;</span>)</span>
<span id="cb1-2"><a href="index.html#cb1-2" aria-hidden="true" tabindex="-1"></a>new_pkgs <span class="ot">&lt;-</span> list_pkgs[<span class="sc">!</span>(list_pkgs <span class="sc">%in%</span> <span class="fu">installed.packages</span>()[, <span class="st">&quot;Package&quot;</span>])]</span>
<span id="cb1-3"><a href="index.html#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="cf">if</span> (<span class="fu">length</span>(new_pkgs) <span class="sc">&gt;</span> <span class="dv">0</span>) <span class="fu">install.packages</span>(new_pkgs)</span></code></pre></div>
<p>This book was compiled with the <em>bookdown</em> library and source files (RMarkdown), available on <a href="https://github.com/stineb/ml4ec_workshop">Github</a>. Navigate there also for working on the exercises (Chapter <a href="exercises.html#exercises">7</a>) and using the solutions (Chapter <a href="solutions.html#solutions">8</a>).</p>
Binary file modified docs/ml4ec_workshop.epub
Binary file not shown.
Binary file modified docs/ml4ec_workshop.pdf
Binary file not shown.
12 changes: 8 additions & 4 deletions docs/ml4ec_workshop.tex
@@ -141,7 +141,7 @@ \section{Libraries}\label{libraries}}

\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{list\_pkgs }\OtherTok{\textless{}{-}} \FunctionTok{c}\NormalTok{(}\StringTok{"caret"}\NormalTok{, }\StringTok{"recipes"}\NormalTok{, }\StringTok{"rsample"}\NormalTok{, }\StringTok{"tidyverse"}\NormalTok{, }\StringTok{"conflicted"}\NormalTok{, }\StringTok{"modelr"}\NormalTok{, }\StringTok{"forcats"}\NormalTok{, }\StringTok{"yardstick"}\NormalTok{, }\StringTok{"visdat"}\NormalTok{, }\StringTok{"skimr"}\NormalTok{, }\StringTok{"ranger"}\NormalTok{, }\StringTok{"knitr"}\NormalTok{)}
\NormalTok{list\_pkgs }\OtherTok{\textless{}{-}} \FunctionTok{c}\NormalTok{(}\StringTok{"caret"}\NormalTok{, }\StringTok{"recipes"}\NormalTok{, }\StringTok{"rsample"}\NormalTok{, }\StringTok{"tidyverse"}\NormalTok{, }\StringTok{"conflicted"}\NormalTok{, }\StringTok{"modelr"}\NormalTok{, }\StringTok{"forcats"}\NormalTok{, }\StringTok{"yardstick"}\NormalTok{, }\StringTok{"visdat"}\NormalTok{, }\StringTok{"skimr"}\NormalTok{, }\StringTok{"ranger"}\NormalTok{, }\StringTok{"knitr"}\NormalTok{, }\StringTok{"patchwork"}\NormalTok{)}
\NormalTok{new\_pkgs }\OtherTok{\textless{}{-}}\NormalTok{ list\_pkgs[}\SpecialCharTok{!}\NormalTok{(list\_pkgs }\SpecialCharTok{\%in\%} \FunctionTok{installed.packages}\NormalTok{()[, }\StringTok{"Package"}\NormalTok{])]}
\ControlFlowTok{if}\NormalTok{ (}\FunctionTok{length}\NormalTok{(new\_pkgs) }\SpecialCharTok{\textgreater{}} \DecValTok{0}\NormalTok{) }\FunctionTok{install.packages}\NormalTok{(new\_pkgs)}
\end{Highlighting}
@@ -347,7 +347,7 @@ \section{Reading and wrangling data}\label{reading-and-wrangling-data}}
\end{Highlighting}
\end{Shaded}

If the style of the code above looks unfamiliar - this is the \textbf{\href{https://www.tidyverse.org/}{tidyverse}}. The tidyverse is a R syntax ``dialect'' and a collection of R functions and packages. They share the structure of arguments and function return values than can be combined to a chain by the \texttt{\%\textgreater{}\%} (``pipe'') operator. For this, the output of each function is a data frame which is ``piped'' to the next function, and each function takes a data frame as input. What is piped into a function takes the place of the first argument, normally provided in brackets. This enables ease with typical data wrangling and visualization tasks (\textbf{\href{https://ggplot2.tidyverse.org/}{ggplot2}} is part of it). This tutorial is generally written using tidyverse packages and code syntax.
If the style of the code above looks unfamiliar - this is the \textbf{\href{https://www.tidyverse.org/}{tidyverse}}. The tidyverse is a R syntax ``dialect'' and a collection of R functions and packages. They share the structure of arguments and function return values than can be combined to a chain by the \texttt{\%\textgreater{}\%} (``pipe'') operator. For this, the output of each function is a data frame which is ``piped'' to the next function, and each function takes a data frame as input. What is piped into a function takes the place of the first argument, normally provided inside the brackets. This enables ease with typical data wrangling and visualization tasks (\textbf{\href{https://ggplot2.tidyverse.org/}{ggplot2}} is part of the tidyverse). This tutorial is generally written using tidyverse packages and code syntax.

The column \texttt{NEE\_VUT\_REF\_QC} provides information about the fraction of gap-filled half-hourly data used to calculate daily aggregates. Let's use only \texttt{GPP\_NT\_VUT\_REF} data, where at least 80\% of the underlying half-hourly data was good quality measured data, and not gap-filled. Make sure to not actually remove the respective rows, but rather replace values with NA.

@@ -1007,6 +1007,8 @@ \subsection{Training}\label{training-4}}
## [1] 15450.89
\end{verbatim}

The variable \texttt{PA\_F} was not significant in the linear model. Therefore, we won't use it for the models below.

\begin{Shaded}
\begin{Highlighting}[]
\DocumentationTok{\#\# Fit an lm model on the same data, but with PA\_F removed.}
@@ -1197,6 +1199,10 @@ \subsection{Prediction}\label{prediction-3}}

\begin{center}\includegraphics{ml4ec_workshop_files/figure-latex/unnamed-chunk-43-1} \end{center}

Here, the function \texttt{eval\_model()} returned an object made up of two plots (\texttt{return(gg1\ +\ gg2)} in the function definition). Combining plots with \texttt{+} is enabled by the \href{https://patchwork.data-imaginist.com/}{\textbf{patchwork}} library. The individual plot objects (\texttt{gg1} and \texttt{gg2}) are returned by the \texttt{ggplot()} calls. The visualisation is a density plot of hexagonal bins: the color of each bin encodes the number of points inside it (see legend ``count''). We want the highest density of points along the 1:1 line (the dotted line), because predictions match observations perfectly for points lying on that line. Alternatively, we could use a scatterplot to visualise the model evaluation, but then a large number of points would overlie each other. Since typical machine learning applications use large amounts of data, evaluation scatterplots typically suffer from overlying points, and density plots are a solution.

Metrics are given in the subtitle of the plots. Note that the \(R^2\) and the RMSE measure different aspects of model-data agreement. Here, they measure the correlation (fraction of variation explained) and the average magnitude of the error, respectively. We should generally consider multiple metrics measuring multiple aspects of the prediction-observation fit to evaluate models.

\hypertarget{knn-1}{%
\section{KNN}\label{knn-1}}

@@ -1218,8 +1224,6 @@ \subsection{Check data}\label{check-data-1}}

\begin{center}\includegraphics{ml4ec_workshop_files/figure-latex/unnamed-chunk-44-1} \end{center}

The variable \texttt{PA\_F} looks weird and was not significant in the linear model. Therefore, we won't use it for the models below.

\hypertarget{training-5}{%
\subsection{Training}\label{training-5}}

2 changes: 1 addition & 1 deletion docs/search_index.json

Large diffs are not rendered by default.

4 changes: 3 additions & 1 deletion docs/solutions.html
@@ -334,6 +334,7 @@ <h3><span class="header-section-number">8.3.1</span> Training</h3>
<span id="cb34-4"><a href="solutions.html#cb34-4" aria-hidden="true" tabindex="-1"></a><span class="do">## tends to be more conservative than the AIC.</span></span>
<span id="cb34-5"><a href="solutions.html#cb34-5" aria-hidden="true" tabindex="-1"></a><span class="fu">BIC</span>(linmod_baser)</span></code></pre></div>
<pre><code>## [1] 15450.89</code></pre>
<p>The variable <code>PA_F</code> was not significant in the linear model. Therefore, we won’t use it for the models below.</p>
<div class="sourceCode" id="cb36"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb36-1"><a href="solutions.html#cb36-1" aria-hidden="true" tabindex="-1"></a><span class="do">## Fit an lm model on the same data, but with PA_F removed.</span></span>
<span id="cb36-2"><a href="solutions.html#cb36-2" aria-hidden="true" tabindex="-1"></a>linmod_baser_nopaf <span class="ot">&lt;-</span> <span class="fu">lm</span>(</span>
<span id="cb36-3"><a href="solutions.html#cb36-3" aria-hidden="true" tabindex="-1"></a> <span class="at">form =</span> GPP_NT_VUT_REF <span class="sc">~</span> ., </span>
@@ -485,6 +486,8 @@ <h3><span class="header-section-number">8.3.2</span> Prediction</h3>
<span id="cb42-64"><a href="solutions.html#cb42-64" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-65"><a href="solutions.html#cb42-65" aria-hidden="true" tabindex="-1"></a><span class="fu">eval_model</span>(<span class="at">mod =</span> linmod_baser, <span class="at">df_train =</span> ddf_train, <span class="at">df_test =</span> ddf_test)</span></code></pre></div>
<p><img src="ml4ec_workshop_files/figure-html/unnamed-chunk-43-1.png" width="696" style="display: block; margin: auto;" /></p>
<p>Here, the function <code>eval_model()</code> returned an object made up of two plots (<code>return(gg1 + gg2)</code> in the function definition). Combining plots with <code>+</code> is enabled by the <a href="https://patchwork.data-imaginist.com/"><strong>patchwork</strong></a> library. The individual plot objects (<code>gg1</code> and <code>gg2</code>) are returned by the <code>ggplot()</code> calls. The visualisation is a density plot of hexagonal bins: the color of each bin encodes the number of points inside it (see legend “count”). We want the highest density of points along the 1:1 line (the dotted line), because predictions match observations perfectly for points lying on that line. Alternatively, we could use a scatterplot to visualise the model evaluation, but then a large number of points would overlie each other. Since typical machine learning applications use large amounts of data, evaluation scatterplots typically suffer from overlying points, and density plots are a solution.</p>
<p>Metrics are given in the subtitle of the plots. Note that the <span class="math inline">\(R^2\)</span> and the RMSE measure different aspects of model-data agreement. Here, they measure the correlation (fraction of variation explained) and the average magnitude of the error, respectively. We should generally consider multiple metrics measuring multiple aspects of the prediction-observation fit to evaluate models.</p>
</div>
</div>
<div id="knn-1" class="section level2" number="8.4">
@@ -500,7 +503,6 @@ <h3><span class="header-section-number">8.4.1</span> Check data</h3>
<span id="cb43-7"><a href="solutions.html#cb43-7" aria-hidden="true" tabindex="-1"></a> <span class="fu">geom_density</span>() <span class="sc">+</span></span>
<span id="cb43-8"><a href="solutions.html#cb43-8" aria-hidden="true" tabindex="-1"></a> <span class="fu">facet_wrap</span>(<span class="sc">~</span>variable, <span class="at">scales =</span> <span class="st">&quot;free&quot;</span>)</span></code></pre></div>
<p><img src="ml4ec_workshop_files/figure-html/unnamed-chunk-44-1.png" width="696" style="display: block; margin: auto;" /></p>
<p>The variable <code>PA_F</code> looks weird and was not significant in the linear model. Therefore, we won’t use it for the models below.</p>
</div>
<div id="training-5" class="section level3" number="8.4.2">
<h3><span class="header-section-number">8.4.2</span> Training</h3>
2 changes: 1 addition & 1 deletion index.Rmd
@@ -31,7 +31,7 @@ For this workshop, you need [R](https://www.r-project.org/) and [RStudio](https:

Install missing packages for this tutorial.
```{r}
list_pkgs <- c("caret", "recipes", "rsample", "tidyverse", "conflicted", "modelr", "forcats", "yardstick", "visdat", "skimr", "ranger", "knitr")
list_pkgs <- c("caret", "recipes", "rsample", "tidyverse", "conflicted", "modelr", "forcats", "yardstick", "visdat", "skimr", "ranger", "knitr", "patchwork")
new_pkgs <- list_pkgs[!(list_pkgs %in% installed.packages()[, "Package"])]
if (length(new_pkgs) > 0) install.packages(new_pkgs)
```
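
The check above compares the requested package names against the table returned by `installed.packages()`. As a hedged alternative sketch, the same idea can be written with `requireNamespace()`, which tests one package at a time. The helper name `find_missing_pkgs()` and the package name `"notARealPackage0"` are hypothetical and only serve as illustration. Note that `requireNamespace()` loads a package's namespace as a side effect when the package is available.

```r
## Hypothetical helper (not part of the tutorial): return the names of
## packages that cannot be loaded in the current R library paths.
find_missing_pkgs <- function(pkgs) {
  is_available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is_available]
}

## "stats" ships with R; "notARealPackage0" is a made-up name
missing_pkgs <- find_missing_pkgs(c("stats", "notARealPackage0"))
missing_pkgs
```

The result could then be passed to `install.packages()` in the same way as `new_pkgs` above, provided the names correspond to packages available from a configured repository.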