---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# Containerised R workflow template
<!-- badges: start -->
[Project Status: WIP](https://www.repostatus.org/#wip)
[Lifecycle: experimental](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[container-workflow-template](https://github.com/ecohealthalliance/container-template/actions/workflows/container-workflow-template.yml)
[License: MIT](https://opensource.org/licenses/MIT)
[License: CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
<!-- badges: end -->
This is a template repository for a containerised R workflow built on the `targets` framework, made portable using `renv`, and run manually or automatically using `GitHub Actions`. To use this template, click on the "Use this template" button and then select "Create a new repository".
Check out the [`containerTemplateUtils`](https://github.com/ecohealthalliance/containerTemplateUtils) package for handling common tasks related to this repo (sending emails, uploading files to AWS, etc.).
Note that `git-crypt` is not part of the template repo. See the [EHA M&A handbook](https://ecohealthalliance.github.io/eha-ma-handbook/16-encryption.html#set-up-encryption-for-a-repo-that-did-not-previously-use-git-crypt) for how to add git-crypt.
Follow the links for more information about:
- [`targets`](https://ecohealthalliance.github.io/eha-ma-handbook/3-projects.html#targets)
- [`renv`](https://ecohealthalliance.github.io/eha-ma-handbook/3-projects.html#package-management-with-renv)
- [git-crypt](https://ecohealthalliance.github.io/eha-ma-handbook/16-encryption.html)
- [Reproducible workflows](https://github.com/ecohealthalliance/building-blocks-of-reproducibility)
Recommendations:
- One function per file in R/
- Non-function R scripts in another directory like `scripts/`
- Use the same names for targets and function arguments for those targets unless a function
- Nouns for targets, verbs for functions
- Use common suffixes for target types: `_file` for files, `_raw` for read-in but unprocessed data
- Use the `fnmate` and `tflow` RStudio add-ins (and create keyboard shortcuts for them; see this [talk](https://www.youtube.com/watch?v=jU1Zv21GvT4)) or the `usethis` package to make this easy
## Quick start
- Create a repo from the template
- Rename the `.Rproj` file
- Streamline packages in `packages.R`
- Modify `.gitattributes` to include any files that may need encryption
- Initialize `git-crypt` for the repo
- Add relevant environment variables to the `.env` file
- Rename the GitHub Actions workflows
- Update the safe repo section of the GitHub Actions workflow
- Add the `git-crypt` key as a secret variable to the repo
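
The `.gitattributes` step above can be sketched in shell as follows. This is a minimal illustration, not part of the template: the helper name `add_encrypted_pattern` and the example patterns (`.env`, `data/private/**`) are assumptions. The rule format `PATTERN filter=git-crypt diff=git-crypt` is the one `git-crypt` expects. The sketch works in a scratch directory so it does not touch a real repository.

```bash
set -eu

# Hypothetical helper: append a git-crypt rule for PATTERN to .gitattributes
# unless an identical line is already present.
add_encrypted_pattern() {
  pattern="$1"
  rule="$pattern filter=git-crypt diff=git-crypt"
  touch .gitattributes
  grep -qxF -- "$rule" .gitattributes || printf '%s\n' "$rule" >> .gitattributes
}

# Demonstrate in a scratch directory rather than a real repository.
demo_dir="$(mktemp -d)"
cd "$demo_dir"

add_encrypted_pattern ".env"             # file holding environment variables
add_encrypted_pattern "data/private/**"  # hypothetical folder of sensitive data
add_encrypted_pattern ".env"             # repeated call adds nothing

cat .gitattributes
```

In a real project you would run the helper from the repository root, then run `git-crypt init` before committing any file that matches the patterns. `git-crypt status` lists which files will be encrypted.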
## GitHub Actions
[GitHub Actions](https://docs.github.com/en/actions) allows automation, customisation, and execution of your research project workflows right in your GitHub repository.
In short, a [GitHub Actions](https://docs.github.com/en/actions) *workflow* is composed of one or more *jobs*. Each *job* is composed of *steps* that control the order in which *actions* are run to complete the *job*. A *workflow* is scheduled or triggered by a specific *event* and runs on a *runner*, a server with the [GitHub Actions](https://docs.github.com/en/actions) runner application installed, that is either hosted by GitHub or self-hosted on your own machines.
The whole **workflow**, including the **event** trigger and the **runner** on which it will run, is specified in a workflow `.yml` file saved inside the `.github/workflows` directory of the GitHub repository in which you want to use [GitHub Actions](https://docs.github.com/en/actions).
<img src="https://miro.medium.com/max/2617/1*8mUtip6z_oydfLi4P86KUw.png" />
This repository contains a template [GitHub Actions](https://docs.github.com/en/actions) workflow with its corresponding `.yml` file that illustrates how [GitHub Actions](https://docs.github.com/en/actions) can be used to run and maintain an R workflow that uses `targets` and `renv`.
## Using containers in a GitHub Actions workflow
A **container** is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another.
**Containers** can be used within a [GitHub Actions](https://docs.github.com/en/actions) workflow and can be specified either at the **job** level or at the **step** level. If specified at the **job** level, all the **steps** within that **job** will be run inside that container. When specified at the **step** level, a different container can be used for each **step**.
The example/template workflow can be found inside the `.github/workflows` folder and is shown below:
```yaml
name: container-workflow-template

on:
  push:
    branches:
      - main
      - master
  pull_request:
    branches:
      - main
      - master
  workflow_dispatch:
    branches:
      - '*'
  #schedule:
  #  - cron: "0 8 * * *"

jobs:
  container-workflow-template:
    runs-on: ubuntu-latest # Run on GitHub Actions runner
    #runs-on: [self-hosted, linux, x64, onprem-aegypti] # Run the workflow on EHA aegypti runner
    #runs-on: [self-hosted, linux, x64, onprem-prospero] # Run the workflow on EHA prospero runner
    container:
      image: rocker/verse:4.1.2

    steps:
      - uses: actions/checkout@v2

      - name: Install system dependencies
        run: |
          apt-get update && apt-get install -y --no-install-recommends \
            libcurl4-openssl-dev \
            libssl-dev

      - name: Restore R packages
        run: |
          renv::restore()
        shell: Rscript {0}

      - name: Run targets workflow
        run: |
          targets::tar_make()
        shell: Rscript {0}
```
In this example, we show a data quality check workflow report for a nutrition survey of children 6-59 months old.
### The trigger
The trigger for GitHub Actions is specified in these lines in the workflow YAML file:
```yaml
on:
  push:
    branches:
      - main
      - master
  pull_request:
    branches:
      - main
      - master
  workflow_dispatch:
    branches:
      - '*'
  #schedule:
  #  - cron: "0 8 * * *"
```
This workflow runs automatically when there is a **push** or **pull request** event on the `main` or `master` branch of the repository. It can also be run manually from the GitHub Actions page, for any branch of the repository, through the `workflow_dispatch` specification in the workflow YAML file.
GitHub Actions workflows can also be scheduled to run at specific times and frequencies using the `schedule` specification in the workflow YAML file, written in [POSIX cron syntax](https://en.wikipedia.org/wiki/Cron). Scheduled workflows run on the latest commit on the default or base branch. The shortest interval at which you can run scheduled workflows is once every 5 minutes. In the example workflow, the `schedule` specification is set to run at 8 am UTC every day, but it has been commented out. If you would like to schedule your workflow runs, remove the leading `#` from both lines and set the cron expression to the frequency that you require. *Note: while GitHub Actions is highly reliable, GitHub does not guarantee that a scheduled job will run on GitHub-hosted runners, and jobs are less likely to run if you choose a popular run time (generally on the hour).*
### The job
The job for GitHub Actions is specified in these lines in the workflow YAML file:
```yaml
jobs:
  container-workflow-template:
    runs-on: ubuntu-latest # Run on GitHub Actions runner
    #runs-on: [self-hosted, linux, x64, onprem-aegypti] # Run the workflow on EHA aegypti runner
    #runs-on: [self-hosted, linux, x64, onprem-prospero] # Run the workflow on EHA prospero runner
    container:
      image: rocker/verse:4.1.2
```
The job named `container-workflow-template` is specified to run on a runner hosted by GitHub. These runners are identified by a label that gives the operating system followed by its version. In the example workflow, the line `runs-on: ubuntu-latest` runs the workflow on a GitHub-hosted machine with the latest available Ubuntu operating system.
The job can also be run on a self-hosted GitHub Actions runner installed on EHA's high performance computing machines, again using the `runs-on` specification. Labels unique to each runner identify the specific machine to use. The syntax for specifying these runners is shown in the workflow but commented out.
To make the GitHub Actions workflow more robust and reproducible, we set up a container at the **job** level. The container specified is a versioned R image that has `tidyverse` and other R publishing tools installed. This image is generally adequate for workflows that need data wrangling and manipulation with the `tidyverse` tools and reporting with `rmarkdown`. Some projects/workflows (like those using spatial packages such as `sf`) may benefit from a different R image, so change the container specification accordingly. To read more about available R images, see https://www.rocker-project.org/images/.
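
Because the job runs inside a versioned image, the restore and run steps can be approximated on your own machine with the same image. The sketch below only builds and prints the command. It assumes Docker is installed locally and that the project root is the current directory. The helper name `local_run_cmd` and the `/project` mount point are illustrative choices, not part of the template. It also skips the system dependency step from the workflow.

```bash
set -eu

# Image tag taken from the workflow; change both together.
IMAGE="rocker/verse:4.1.2"

# Hypothetical helper: print a docker command that mounts PROJECT_DIR at
# /project and runs the two R steps of the workflow inside IMAGE.
local_run_cmd() {
  printf 'docker run --rm -v "%s":/project -w /project %s Rscript -e "renv::restore(); targets::tar_make()"\n' \
    "$1" "$2"
}

cmd="$(local_run_cmd "$PWD" "$IMAGE")"
echo "$cmd"

# Uncomment to actually run it (pulls the image on first use):
# eval "$cmd"
```

Printing the command first makes it easy to check the mount path and image tag before anything is executed.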
## Using this GitHub Actions workflow template
This repository has been set as a private template repository. This means that EHA staff can use it to create new repositories with the same file structure.
This can be done as follows:
1. In your GitHub account, go to the EcoHealth Alliance organisation (https://github.com/ecohealthalliance) then click on the green button labeled `New`.
2. You will now be directed to the `Create new repository` page. Here, right at the top, you will see the `Repository template` heading. Click on the drop down button right below this that says `No template`. You will then see all the available templates within EHA. Select the template named `ecohealthalliance/container-template`.
3. Give your new repository a name, set the appropriate repository visibility, and then click on `Create repository`.
4. You will now have a new repository the contents of which are the same files and structure as this template repository.
5. You can now make the necessary changes and additions that are specific to your workflow.
## Using `git-crypt` to encrypt files in your workflow
Your project may contain a mix of public and private content, so being able to encrypt the private parts is very useful. We recommend [`git-crypt`](https://github.com/AGWA/git-crypt), which encrypts selected files transparently and can use PGP (Pretty Good Privacy) keys to share access with collaborators. It takes a little while to set up, but once activated it makes sharing secure and seamless. To set up PGP and `git-crypt` on a project based on this template, see the [*Encryption* chapter of the EHA Modeling and Analytics Handbook](https://ecohealthalliance.github.io/eha-ma-handbook/14-encryption.html).
Once you have enabled `git-crypt` on your project, edit the `container-workflow-template.yml` file so that the workflow can perform the symmetric key decryption described [here](https://ecohealthalliance.github.io/eha-ma-handbook/14-encryption.html#extra-use-a-symmetric-key-for-automated-processes). Here is the `container-workflow-template.yml` file updated to perform symmetric key decryption:
```yaml
name: container-workflow-encrypted-template

on:
  push:
    branches:
      - main
      - master
  pull_request:
    branches:
      - main
      - master
  workflow_dispatch:
    branches:
      - '*'
  #schedule:
  #  - cron: "0 8 * * *"

env:
  GIT_CRYPT_KEY64: ${{ secrets.GIT_CRYPT_KEY64 }}

jobs:
  container-workflow-encrypted-template:
    runs-on: ubuntu-latest # Run on GitHub Actions runner
    #runs-on: [self-hosted, linux, x64, onprem-aegypti] # Run the workflow on EHA aegypti runner
    #runs-on: [self-hosted, linux, x64, onprem-prospero] # Run the workflow on EHA prospero runner
    container:
      image: rocker/verse:4.1.2

    steps:
      - uses: actions/checkout@v2

      - name: Install system dependencies
        run: |
          apt-get update && apt-get install -y --no-install-recommends \
            git-crypt \
            libcurl4-openssl-dev \
            libssl-dev

      - name: Decrypt repository using symmetric key
        run: |
          echo $GIT_CRYPT_KEY64 > git_crypt_key.key64 && base64 -di git_crypt_key.key64 > git_crypt_key.key && git-crypt unlock git_crypt_key.key
          rm git_crypt_key.key git_crypt_key.key64

      - name: Restore R packages
        run: |
          renv::restore()
        shell: Rscript {0}

      - name: Run targets workflow
        run: |
          targets::tar_make()
        shell: Rscript {0}
```
Once you have edited your workflow YAML file, and before you push the changes to GitHub, add the symmetric key to your GitHub repository as a secret.
First, export the repository's symmetric key by running this in your project directory:
```bash
git-crypt export-key git_crypt_key.key
```
`git_crypt_key.key` can now be used to decrypt the repository, and you can provide it to GitHub Actions as a secret environment variable (see https://docs.github.com/en/actions/security-guides/encrypted-secrets). However, since it is binary data, you’ll need to convert it to base64 first. So run something like:
```bash
cat git_crypt_key.key | base64 | pbcopy
```
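to convert this file to base64 data and copy it to the clipboard (`pbcopy` is a macOS command; on other systems, copy the output of `base64` another way), then paste it into GitHub’s secret field under the name `GIT_CRYPT_KEY64`.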
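
Before pasting the value into GitHub, it can be reassuring to confirm that base64 encoding followed by decoding reproduces the key byte for byte, since that is what the decrypt step in the workflow relies on. The sketch below checks this round trip on a stand-in file of random bytes, not a real `git-crypt` key. The variable names (`workdir`, `key64`, `roundtrip`) are illustrative.

```bash
set -eu

# Work in a throwaway directory so no real key material is touched.
workdir="$(mktemp -d)"
trap 'rm -rf "$workdir"' EXIT

# Stand-in for git_crypt_key.key: 64 random bytes (NOT a real git-crypt key).
head -c 64 /dev/urandom > "$workdir/dummy.key"

# Encode as in the README, stripping line wraps so the value is a single line.
key64="$(base64 < "$workdir/dummy.key" | tr -d '\n')"

# Decode from the single-line string, as a CI step would from a secret.
printf '%s' "$key64" | base64 --decode > "$workdir/decoded.key"

# cmp -s exits non-zero if the two files differ in any byte.
if cmp -s "$workdir/dummy.key" "$workdir/decoded.key"; then
  roundtrip="ok"
else
  roundtrip="mismatch"
fi
echo "round trip: $roundtrip"
```

To check your real key, point the same commands at `git_crypt_key.key`, and delete any decoded copies afterwards.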