Skip to content

Commit

Permalink
Merge pull request #6 from arcus/add_clustering_code
Browse files Browse the repository at this point in the history
Add clustering code
  • Loading branch information
rosemm authored Oct 22, 2024
2 parents 09f22d1 + b81632b commit fb4625f
Show file tree
Hide file tree
Showing 96 changed files with 3,795 additions and 0 deletions.
Binary file added data/deidentified/hclust_tree_w1.rds
Binary file not shown.
Binary file added data/deidentified/hclust_tree_w2.rds
Binary file not shown.
Binary file not shown.
Binary file not shown.
4 changes: 4 additions & 0 deletions reports/notebooks/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
These are reports for DART data analysis.

Preliminary data extraction and cleaning happens in `src/scripts`.
Interim data files created during the cleaning process are stored in `data/interim` and/or `data/deidentified`, and then those cleaned data files are used in the notebooks here.
8 changes: 8 additions & 0 deletions reports/notebooks/define_pathways/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
This subdirectory contains files and reports for generating pathway assignments for the DART study, for both waves 1 and 2.
The process includes the following steps:

1. Apply hierarchical clustering to the data from the needs assessment survey to identify groups of learners with similar needs/interests: `define_clusters.rmd`
2. Examine patterns within each cluster of learners to get an overall picture of what characterizes each group (note that this also involves proposing a place to cut the clustering tree, i.e. how many clusters to keep): `explore_clusters_w1.rmd` and `explore_clusters_w2.rmd`
3. Read through the descriptions of each cluster and consider them against the available content in our catalog of learning modules. For each cluster, propose a pathway of modules that would meet their needs (if needed, can revise the number of clusters at this point): `proposed_pathways_w1.md` and `proposed_pathways_w2.md`

Once the pathway assignments are finalized, pathway assignments are entered back into the learner platform -- for wave 1 this was Thinkific, for wave 2 it was [NALMS](https://github.com/arcus/NALMS_Not_Another_LMS) -- and also saved in the main REDCap project for this study (DART Pipeline, pid=56668).
240 changes: 240 additions & 0 deletions reports/notebooks/define_pathways/define_clusters.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,240 @@
---
title: "Pathways clustering"
author: "Rose Hartman"
date: '2023-01-17'
output: github_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
out.width = "50%")
library(ggplot2)
# load custom functions for plotting
source(here::here("src", "scripts", "functions_plotting.R"))
# if the figures directory doesn't exist, create it
dir.create(here::here("reports"), showWarnings = FALSE)
dir.create(here::here("reports", "figures"), showWarnings = FALSE)
dir.create(here::here("reports", "tables"), showWarnings = FALSE)
```

`r if(!knitr::opts_chunk$get()$echo) ">Note that code chunks are not printed in this report in order to keep the output tidy. To see all of the code to generate these results, open the .Rmd file."`

```{r wave_assignments}
waves <- readr::read_csv(here::here("participant_waves.csv"), show_col_types = FALSE)
```

Note that there are two important decisions inherent in the data cleaning process for the likert items (rate your [expertise/relevance to your work/your desire to learn] the following data science tasks):

- "Not Sure" responses are treated as missing
- All missing values (including Not Sure) are imputed with 1, the lowest value on the scale. In other words, when participants skipped an item or marked it "Not Sure" we're treating that as them indicating "Very Low" expertise/relevance/desire to learn.

Also note that these items are being treated as continuous variables in this analysis (i.e. the increase from "very low" to "low" is assumed to be the same size as the increase from "low" to "medium", and so on).

# Wave 1

## Load data

```{r load_data_w1}
# load the cleaned needs assessment data
needs_assessment <- readRDS(here::here("data", "deidentified", "needs_assessment.rds")) |>
dplyr::left_join(waves, by = "record_id") |>
# only keep participants from the current wave
dplyr::filter(wave == 1) |>
dplyr::select(-wave)
# save just the items to be analyzed (the likert items)
numeric_data <- dplyr::select(needs_assessment, -record_id)
# convert the record_id column to row names
row.names(numeric_data) <- dplyr::pull(needs_assessment, record_id)
# scale the data
numeric_data_scaled <- scale(numeric_data)
```

## Hierarchical Clustering

```{r }
dist <- dist(numeric_data_scaled, method = "euclidean")
```

### Finding the clustering method

#### Ward's

Minimizes the total **within cluster** variance.

From the docs:

> Two different algorithms are found in the literature for Ward clustering. The one used by option "ward.D" (equivalent to the only Ward option "ward" in R versions <= 3.0.3) does not implement Ward's (1963) clustering criterion, whereas option "ward.D2" implements that criterion (Murtagh and Legendre 2014). With the latter, the dissimilarities are squared before cluster updating. Note that agnes(*, method="ward") corresponds to hclust(*, "ward.D2").
```{r ward_w1}
hc_ward <- hclust(dist, method = "ward.D2")
```

```{r, eval=FALSE}
# Plot the result
plot(hc_ward, cex = 0.6, hang = -1)
```

#### Comparing dendograms

Calculate agglomerative coefficient (closer to 1 mean stronger clustering structure)

```{r}
m <- c("average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
purrr::map_dbl(m, ~ cluster::agnes(numeric_data_scaled, method = .)$ac) |>
round(2)
```

Ward's gives the best cluster cohesion.

```{r save_tree_w1}
# save a copy of the whole tree
saveRDS(hc_ward, file = here::here("data", "deidentified", "hclust_tree_w1.rds"))
```

### Assigning clusters

```{r cut_tree_w1}
sub_grp4 <- cutree(hc_ward, k = 4)
sub_grp5 <- cutree(hc_ward, k = 5)
sub_grp6 <- cutree(hc_ward, k = 6)
sub_grp7 <- cutree(hc_ward, k = 7)
sub_grp8 <- cutree(hc_ward, k = 8)
sub_grp9 <- cutree(hc_ward, k = 9)
```


```{r save_assignments_w1}
groups <- data.frame(record_id = as.numeric(names(sub_grp4)),
group4 = as.factor(sub_grp4),
group5 = as.factor(sub_grp5),
group6 = as.factor(sub_grp6),
group7 = as.factor(sub_grp7),
group8 = as.factor(sub_grp8),
group9 = as.factor(sub_grp9))
# join cluster assignments to needs assessment data and save the file
dplyr::left_join(needs_assessment, groups, by = "record_id") |>
saveRDS(file = here::here("data", "deidentified", "needs_assessment_with_clusters_w1.rds"))
```

# Wave 2

## Load data

```{r load_data_w2}
# load the cleaned needs assessment data
needs_assessment <- readRDS(here::here("data", "deidentified", "needs_assessment.rds")) |>
dplyr::left_join(waves, by = "record_id") |>
# only keep participants from the current wave
dplyr::filter(wave == 2) |>
dplyr::select(-wave)
# save just the items to be analyzed (the likert items)
numeric_data <- dplyr::select(needs_assessment, -record_id)
# convert the record_id column to row names
row.names(numeric_data) <- dplyr::pull(needs_assessment, record_id)
# scale the data
numeric_data_scaled <- scale(numeric_data)
```

## Hierarchical Clustering

```{r }
dist <- dist(numeric_data_scaled, method = "euclidean")
```

### Finding the clustering method

#### Ward's

Minimizes the total **within cluster** variance.

From the docs:

> Two different algorithms are found in the literature for Ward clustering. The one used by option "ward.D" (equivalent to the only Ward option "ward" in R versions <= 3.0.3) does not implement Ward's (1963) clustering criterion, whereas option "ward.D2" implements that criterion (Murtagh and Legendre 2014). With the latter, the dissimilarities are squared before cluster updating. Note that agnes(*, method="ward") corresponds to hclust(*, "ward.D2").
```{r ward_w2}
hc_ward <- hclust(dist, method = "ward.D2")
```

```{r, eval=FALSE}
# Plot the result
plot(hc_ward, cex = 0.6, hang = -1)
```


#### Comparing dendograms

Calculate agglomerative coefficient (closer to 1 mean stronger clustering structure)

```{r}
m <- c("average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
purrr::map_dbl(m, ~ cluster::agnes(numeric_data_scaled, method = .)$ac) |>
round(2)
```

Ward's gives the best cluster cohesion.

```{r save_tree_w2}
# save a copy of the whole tree
saveRDS(hc_ward, file = here::here("data", "deidentified", "hclust_tree_w2.rds"))
```

### Assigning clusters

```{r cut_tree}
sub_grp04 <- cutree(hc_ward, k = 4)
sub_grp05 <- cutree(hc_ward, k = 5)
sub_grp06 <- cutree(hc_ward, k = 6)
sub_grp07 <- cutree(hc_ward, k = 7)
sub_grp08 <- cutree(hc_ward, k = 8)
sub_grp09 <- cutree(hc_ward, k = 9)
sub_grp10 <- cutree(hc_ward, k = 10)
sub_grp11 <- cutree(hc_ward, k = 11)
sub_grp12 <- cutree(hc_ward, k = 12)
sub_grp13 <- cutree(hc_ward, k = 13)
sub_grp14 <- cutree(hc_ward, k = 14)
sub_grp15 <- cutree(hc_ward, k = 15)
sub_grp16 <- cutree(hc_ward, k = 16)
sub_grp17 <- cutree(hc_ward, k = 17)
sub_grp18 <- cutree(hc_ward, k = 18)
sub_grp19 <- cutree(hc_ward, k = 19)
sub_grp20 <- cutree(hc_ward, k = 20)
```


```{r save_assignments}
groups <- data.frame(record_id = as.numeric(names(sub_grp04)),
group04 = as.factor(sub_grp04),
group05 = as.factor(sub_grp05),
group06 = as.factor(sub_grp06),
group07 = as.factor(sub_grp07),
group08 = as.factor(sub_grp08),
group09 = as.factor(sub_grp09),
group10 = as.factor(sub_grp10),
group11 = as.factor(sub_grp11),
group12 = as.factor(sub_grp12),
group13 = as.factor(sub_grp13),
group14 = as.factor(sub_grp14),
group15 = as.factor(sub_grp15),
group16 = as.factor(sub_grp16),
group17 = as.factor(sub_grp17),
group18 = as.factor(sub_grp18),
group19 = as.factor(sub_grp19),
group20 = as.factor(sub_grp20))
# join cluster assignments to needs assessment data and save the file
dplyr::left_join(needs_assessment, groups, by = "record_id") |>
saveRDS(file = here::here("data", "deidentified", "needs_assessment_with_clusters_w2.rds"))
```

Loading

0 comments on commit fb4625f

Please sign in to comment.