-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #6 from arcus/add_clustering_code
Add clustering code
- Loading branch information
Showing
96 changed files
with
3,795 additions
and
0 deletions.
There are no files selected for viewing
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
These are reports for DART data analysis. | ||
|
||
Preliminary data extraction and cleaning happens in `src/scripts`. | ||
Interim data files created during the cleaning process are stored in `data/interim` and/or `data/deidentified`, and then those cleaned data files are used in the notebooks here. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
This subdirectory contains files and reports for generating pathway assignments for the DART study, for both waves 1 and 2. | ||
The process includes the following steps: | ||
|
||
1. Apply hierarchical clustering to the data from the needs assessment survey to identify groups of learners with similar needs/interests: `define_clusters.rmd` | ||
2. Examine patterns within each cluster of learners to get an overall picture of what characterizes each group (note that this also involves proposing a place to cut the clustering tree, i.e. how many clusters to keep): `explore_clusters_w1.rmd` and `explore_clusters_w2.rmd` | ||
3. Read through the descriptions of each cluster and consider them against the available content in our catalog of learning modules. For each cluster, propose a pathway of modules that would meet their needs (if needed, can revise the number of clusters at this point): `proposed_pathways_w1.md` and `proposed_pathways_w2.md` | ||
|
||
Once the pathway assignments are finalized, pathway assignments are entered back into the learner platform -- for wave 1 this was Thinkific, for wave 2 it was [NALMS](https://github.com/arcus/NALMS_Not_Another_LMS) -- and also saved in the main REDCap project for this study (DART Pipeline, pid=56668). |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,240 @@ | ||
--- | ||
title: "Pathways clustering" | ||
author: "Rose Hartman" | ||
date: '2023-01-17' | ||
output: github_document | ||
--- | ||
|
||
```{r setup, include=FALSE} | ||
knitr::opts_chunk$set(echo = TRUE, | ||
out.width = "50%") | ||
library(ggplot2) | ||
# load custom functions for plotting | ||
source(here::here("src", "scripts", "functions_plotting.R")) | ||
# if the figures directory doesn't exist, create it | ||
dir.create(here::here("reports"), showWarnings = FALSE) | ||
dir.create(here::here("reports", "figures"), showWarnings = FALSE) | ||
dir.create(here::here("reports", "tables"), showWarnings = FALSE) | ||
``` | ||
|
||
`r if(!knitr::opts_chunk$get()$echo) ">Note that code chunks are not printed in this report in order to keep the output tidy. To see all of the code to generate these results, open the .Rmd file."` | ||
|
||
```{r wave_assignments} | ||
waves <- readr::read_csv(here::here("participant_waves.csv"), show_col_types = FALSE) | ||
``` | ||
|
||
Note that there are two important decisions inherent in the data cleaning process for the likert items (rate your [expertise/relevance to your work/your desire to learn] the following data science tasks): | ||
|
||
- "Not Sure" responses are treated as missing | ||
- All missing values (including Not Sure) are imputed with 1, the lowest value on the scale. In other words, when participants skipped an item or marked it "Not Sure" we're treating that as them indicating "Very Low" expertise/relevance/desire to learn. | ||
|
||
Also note that these items are being treated as continuous variables in this analysis (i.e. the increase from "very low" to "low" is assumed to be the same size as the increase from "low" to "medium", and so on). | ||
|
||
# Wave 1 | ||
|
||
## Load data | ||
|
||
```{r load_data_w1} | ||
# load the cleaned needs assessment data | ||
needs_assessment <- readRDS(here::here("data", "deidentified", "needs_assessment.rds")) |> | ||
dplyr::left_join(waves, by = "record_id") |> | ||
# only keep participants from the current wave | ||
dplyr::filter(wave == 1) |> | ||
dplyr::select(-wave) | ||
# save just the items to be analyzed (the likert items) | ||
numeric_data <- dplyr::select(needs_assessment, -record_id) | ||
# convert the record_id column to row names | ||
row.names(numeric_data) <- dplyr::pull(needs_assessment, record_id) | ||
# scale the data | ||
numeric_data_scaled <- scale(numeric_data) | ||
``` | ||
|
||
## Hierarchical Clustering | ||
|
||
```{r } | ||
dist <- dist(numeric_data_scaled, method = "euclidean") | ||
``` | ||
|
||
### Finding the clustering method | ||
|
||
#### Ward's | ||
|
||
Minimizes the total **within cluster** variance. | ||
|
||
From the docs: | ||
|
||
> Two different algorithms are found in the literature for Ward clustering. The one used by option "ward.D" (equivalent to the only Ward option "ward" in R versions <= 3.0.3) does not implement Ward's (1963) clustering criterion, whereas option "ward.D2" implements that criterion (Murtagh and Legendre 2014). With the latter, the dissimilarities are squared before cluster updating. Note that agnes(*, method="ward") corresponds to hclust(*, "ward.D2"). | ||
```{r ward_w1} | ||
hc_ward <- hclust(dist, method = "ward.D2") | ||
``` | ||
|
||
```{r, eval=FALSE} | ||
# Plot the result | ||
plot(hc_ward, cex = 0.6, hang = -1) | ||
``` | ||
|
||
#### Comparing dendograms | ||
|
||
Calculate agglomerative coefficient (closer to 1 mean stronger clustering structure) | ||
|
||
```{r} | ||
m <- c("average", "single", "complete", "ward") | ||
names(m) <- c( "average", "single", "complete", "ward") | ||
purrr::map_dbl(m, ~ cluster::agnes(numeric_data_scaled, method = .)$ac) |> | ||
round(2) | ||
``` | ||
|
||
Ward's gives the best cluster cohesion. | ||
|
||
```{r save_tree_w1} | ||
# save a copy of the whole tree | ||
saveRDS(hc_ward, file = here::here("data", "deidentified", "hclust_tree_w1.rds")) | ||
``` | ||
|
||
### Assigning clusters | ||
|
||
```{r cut_tree_w1} | ||
sub_grp4 <- cutree(hc_ward, k = 4) | ||
sub_grp5 <- cutree(hc_ward, k = 5) | ||
sub_grp6 <- cutree(hc_ward, k = 6) | ||
sub_grp7 <- cutree(hc_ward, k = 7) | ||
sub_grp8 <- cutree(hc_ward, k = 8) | ||
sub_grp9 <- cutree(hc_ward, k = 9) | ||
``` | ||
|
||
|
||
```{r save_assignments_w1} | ||
groups <- data.frame(record_id = as.numeric(names(sub_grp4)), | ||
group4 = as.factor(sub_grp4), | ||
group5 = as.factor(sub_grp5), | ||
group6 = as.factor(sub_grp6), | ||
group7 = as.factor(sub_grp7), | ||
group8 = as.factor(sub_grp8), | ||
group9 = as.factor(sub_grp9)) | ||
# join cluster assignments to needs assessment data and save the file | ||
dplyr::left_join(needs_assessment, groups, by = "record_id") |> | ||
saveRDS(file = here::here("data", "deidentified", "needs_assessment_with_clusters_w1.rds")) | ||
``` | ||
|
||
# Wave 2 | ||
|
||
## Load data | ||
|
||
```{r load_data_w2} | ||
# load the cleaned needs assessment data | ||
needs_assessment <- readRDS(here::here("data", "deidentified", "needs_assessment.rds")) |> | ||
dplyr::left_join(waves, by = "record_id") |> | ||
# only keep participants from the current wave | ||
dplyr::filter(wave == 2) |> | ||
dplyr::select(-wave) | ||
# save just the items to be analyzed (the likert items) | ||
numeric_data <- dplyr::select(needs_assessment, -record_id) | ||
# convert the record_id column to row names | ||
row.names(numeric_data) <- dplyr::pull(needs_assessment, record_id) | ||
# scale the data | ||
numeric_data_scaled <- scale(numeric_data) | ||
``` | ||
|
||
## Hierarchical Clustering | ||
|
||
```{r } | ||
dist <- dist(numeric_data_scaled, method = "euclidean") | ||
``` | ||
|
||
### Finding the clustering method | ||
|
||
#### Ward's | ||
|
||
Minimizes the total **within cluster** variance. | ||
|
||
From the docs: | ||
|
||
> Two different algorithms are found in the literature for Ward clustering. The one used by option "ward.D" (equivalent to the only Ward option "ward" in R versions <= 3.0.3) does not implement Ward's (1963) clustering criterion, whereas option "ward.D2" implements that criterion (Murtagh and Legendre 2014). With the latter, the dissimilarities are squared before cluster updating. Note that agnes(*, method="ward") corresponds to hclust(*, "ward.D2"). | ||
```{r ward_w2} | ||
hc_ward <- hclust(dist, method = "ward.D2") | ||
``` | ||
|
||
```{r, eval=FALSE} | ||
# Plot the result | ||
plot(hc_ward, cex = 0.6, hang = -1) | ||
``` | ||
|
||
|
||
#### Comparing dendograms | ||
|
||
Calculate agglomerative coefficient (closer to 1 mean stronger clustering structure) | ||
|
||
```{r} | ||
m <- c("average", "single", "complete", "ward") | ||
names(m) <- c( "average", "single", "complete", "ward") | ||
purrr::map_dbl(m, ~ cluster::agnes(numeric_data_scaled, method = .)$ac) |> | ||
round(2) | ||
``` | ||
|
||
Ward's gives the best cluster cohesion. | ||
|
||
```{r save_tree_w2} | ||
# save a copy of the whole tree | ||
saveRDS(hc_ward, file = here::here("data", "deidentified", "hclust_tree_w2.rds")) | ||
``` | ||
|
||
### Assigning clusters | ||
|
||
```{r cut_tree} | ||
sub_grp04 <- cutree(hc_ward, k = 4) | ||
sub_grp05 <- cutree(hc_ward, k = 5) | ||
sub_grp06 <- cutree(hc_ward, k = 6) | ||
sub_grp07 <- cutree(hc_ward, k = 7) | ||
sub_grp08 <- cutree(hc_ward, k = 8) | ||
sub_grp09 <- cutree(hc_ward, k = 9) | ||
sub_grp10 <- cutree(hc_ward, k = 10) | ||
sub_grp11 <- cutree(hc_ward, k = 11) | ||
sub_grp12 <- cutree(hc_ward, k = 12) | ||
sub_grp13 <- cutree(hc_ward, k = 13) | ||
sub_grp14 <- cutree(hc_ward, k = 14) | ||
sub_grp15 <- cutree(hc_ward, k = 15) | ||
sub_grp16 <- cutree(hc_ward, k = 16) | ||
sub_grp17 <- cutree(hc_ward, k = 17) | ||
sub_grp18 <- cutree(hc_ward, k = 18) | ||
sub_grp19 <- cutree(hc_ward, k = 19) | ||
sub_grp20 <- cutree(hc_ward, k = 20) | ||
``` | ||
|
||
|
||
```{r save_assignments} | ||
groups <- data.frame(record_id = as.numeric(names(sub_grp04)), | ||
group04 = as.factor(sub_grp04), | ||
group05 = as.factor(sub_grp05), | ||
group06 = as.factor(sub_grp06), | ||
group07 = as.factor(sub_grp07), | ||
group08 = as.factor(sub_grp08), | ||
group09 = as.factor(sub_grp09), | ||
group10 = as.factor(sub_grp10), | ||
group11 = as.factor(sub_grp11), | ||
group12 = as.factor(sub_grp12), | ||
group13 = as.factor(sub_grp13), | ||
group14 = as.factor(sub_grp14), | ||
group15 = as.factor(sub_grp15), | ||
group16 = as.factor(sub_grp16), | ||
group17 = as.factor(sub_grp17), | ||
group18 = as.factor(sub_grp18), | ||
group19 = as.factor(sub_grp19), | ||
group20 = as.factor(sub_grp20)) | ||
# join cluster assignments to needs assessment data and save the file | ||
dplyr::left_join(needs_assessment, groups, by = "record_id") |> | ||
saveRDS(file = here::here("data", "deidentified", "needs_assessment_with_clusters_w2.rds")) | ||
``` | ||
|
Oops, something went wrong.