---
title: "Predicting song artist from lyrics"
format: html
---
# Load required packages
```{r}
library(tidyverse)
library(tidymodels)
library(stringr)
library(textrecipes)
library(themis)
library(vip)
library(here)
# set seed for reproducibility
set.seed(123)
theme_set(theme_minimal())
```
# Import data
```{r get-data}
# get beyonce and taylor swift lyrics
beyonce_lyrics <- read_csv(here("data", "beyonce_lyrics.csv"))
taylor_swift_lyrics <- read_csv(here("data", "taylor_swift_lyrics.csv"))
extra_lyrics <- read_csv(here("data", "updated-album-lyrics.csv")) # albums released since 2020
# clean lyrics for binding
beyonce_clean <- bind_rows(
beyonce_lyrics,
extra_lyrics
) %>%
# convert to one row per song
group_by(song_id, song_name, artist_name) %>%
summarize(Lyrics = str_flatten(line, collapse = " ")) %>%
ungroup() %>%
# clean column names
select(artist = artist_name, song_title = song_name, lyrics = Lyrics)
taylor_swift_clean <- taylor_swift_lyrics %>%
# clean column names
select(artist = Artist, song_title = Title, lyrics = Lyrics)
# combine into single data file
lyrics <- bind_rows(beyonce_clean, taylor_swift_clean) %>%
mutate(artist = factor(artist)) %>%
drop_na()
lyrics
```
# Preprocess the dataset for modeling
## Resampling folds
- Split the data into training/test sets with 75% allocated for training
- Split the training set into 10 cross-validation folds
```{r rsample, dependson = "get-data"}
# split into training/testing
lyrics_split <- initial_split(data = ______, strata = ______, prop = ______)
lyrics_train <- training(lyrics_split)
lyrics_test <- testing(lyrics_split)
# create cross-validation folds
lyrics_folds <- vfold_cv(data = ______, strata = ______)
```
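The blanks above might be filled in as follows (a sketch; it assumes the `lyrics` data frame from the import step, stratifies by the outcome `artist`, and relies on the `vfold_cv()` default of 10 folds):

```r
# split into training/testing, stratified by artist, 75% for training
lyrics_split <- initial_split(data = lyrics, strata = artist, prop = 0.75)
lyrics_train <- training(lyrics_split)
lyrics_test <- testing(lyrics_split)

# create 10 cross-validation folds from the training set
lyrics_folds <- vfold_cv(data = lyrics_train, strata = artist)
```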
## Define the feature engineering recipe
- Define a feature engineering recipe to predict the song's artist as a function of the lyrics
- Tokenize the song lyrics
- Remove stop words
- Only keep the 500 most frequently appearing tokens
- Calculate tf-idf scores for the remaining tokens
- This will generate one column for every token. Each column will have the standardized name `tfidf_lyrics_*`, where `*` is the specific token. We would prefer the column names to simply be `*`. You can remove the `tfidf_lyrics_` prefix using
```r
# Simplify these names
step_rename_at(starts_with("tfidf_lyrics_"),
fn = ~ str_replace_all(
string = .,
pattern = "tfidf_lyrics_",
replacement = ""
)
)
```
- [Downsample](/notes/supervised-text-classification/#concerns-regarding-multiclass-classification) the observations so there are an equal number of songs by Beyoncé and Taylor Swift in the analysis set
```{r}
# define preprocessing recipe
lyrics_rec <- recipe(artist ~ lyrics, data = lyrics_train) %>%
...
lyrics_rec
```
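One possible completion of the recipe, following the steps listed above (a sketch using `textrecipes` and `themis` steps with their default settings; the step order — tokenize, remove stop words, filter, weight — matters):

```r
lyrics_rec <- recipe(artist ~ lyrics, data = lyrics_train) %>%
  # tokenize the song lyrics
  step_tokenize(lyrics) %>%
  # remove stop words
  step_stopwords(lyrics) %>%
  # keep only the 500 most frequent tokens
  step_tokenfilter(lyrics, max_tokens = 500) %>%
  # calculate tf-idf scores
  step_tfidf(lyrics) %>%
  # simplify the tfidf_lyrics_* column names
  step_rename_at(starts_with("tfidf_lyrics_"),
    fn = ~ str_replace_all(
      string = .,
      pattern = "tfidf_lyrics_",
      replacement = ""
    )
  ) %>%
  # downsample so both artists have equal numbers of songs
  step_downsample(artist)
```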
# Estimate a random forest model
- Define a random forest model grown with 1000 trees using the `ranger` engine.
- Define a workflow using the feature engineering recipe and random forest model specification. Fit the workflow using the cross-validation folds.
- Use `control = control_resamples(save_pred = TRUE)` to save the assessment set predictions. We need these to assess the model's performance.
```{r}
# define the model specification
ranger_spec <- ______
# define the workflow
ranger_workflow <- workflow() %>%
add_recipe(lyrics_rec) %>%
add_model(ranger_spec)
# fit the model to each of the cross-validation folds
ranger_cv <- ranger_workflow %>%
______
```
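A sketch of how the blanks might be filled in (assuming the `lyrics_folds` object from the resampling step; `rand_forest()` needs an explicit classification mode):

```r
# random forest with 1000 trees via the ranger engine
ranger_spec <- rand_forest(trees = 1000) %>%
  set_mode("classification") %>%
  set_engine("ranger")

# fit the workflow to each fold, saving assessment set predictions
ranger_cv <- ranger_workflow %>%
  fit_resamples(
    resamples = lyrics_folds,
    control = control_resamples(save_pred = TRUE)
  )
```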
## Evaluate model performance
- Calculate the model's accuracy and ROC AUC. How did it perform?
- Draw the ROC curve for each validation fold
- Generate the resampled confusion matrix for the model and draw it using a heatmap. How does the model perform predicting Beyoncé songs relative to Taylor Swift songs?
```{r}
# extract metrics and predictions
ranger_cv_metrics <- ______(ranger_cv)
ranger_cv_predictions <- ______(ranger_cv)
# how well did the model perform?
ranger_cv_metrics
# roc curve
ranger_cv_predictions %>%
group_by(id) %>%
...
# confusion matrix
conf_mat_resampled(x = ______, tidy = ______) %>%
autoplot(type = "heatmap")
```
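One way to complete the evaluation chunk (a sketch; it assumes the predicted-probability column is named `.pred_Beyoncé`, which depends on the first level of the `artist` factor in your data):

```r
# extract metrics and predictions from the resampled fits
ranger_cv_metrics <- collect_metrics(ranger_cv)
ranger_cv_predictions <- collect_predictions(ranger_cv)

# ROC curve for each validation fold
ranger_cv_predictions %>%
  group_by(id) %>%
  roc_curve(truth = artist, .pred_Beyoncé) %>%
  autoplot()

# resampled confusion matrix as a heatmap
conf_mat_resampled(x = ranger_cv, tidy = FALSE) %>%
  autoplot(type = "heatmap")
```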
# Penalized regression
## Define the feature engineering recipe
Define the same feature engineering recipe as before, with two adjustments:
1. Calculate all possible 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams
1. Retain the 2000 most frequently occurring tokens.
```{r}
# redefine recipe to include multiple n-grams
glmnet_rec <- recipe(artist ~ lyrics, data = lyrics_train) %>%
...
glmnet_rec
```
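The adjusted recipe might look like this (a sketch; `step_ngram()` with `min_num_tokens = 1` and `num_tokens = 5` generates all 1-grams through 5-grams after stop word removal):

```r
glmnet_rec <- recipe(artist ~ lyrics, data = lyrics_train) %>%
  step_tokenize(lyrics) %>%
  step_stopwords(lyrics) %>%
  # generate all 1-grams through 5-grams
  step_ngram(lyrics, num_tokens = 5, min_num_tokens = 1) %>%
  # keep only the 2000 most frequent tokens
  step_tokenfilter(lyrics, max_tokens = 2000) %>%
  step_tfidf(lyrics) %>%
  step_downsample(artist)
```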
## Tune the penalized regression model
- Define the penalized regression model specification, including tuning placeholders for `penalty` and `mixture`
- Create the workflow object
- Define a tuning grid with every combination of:
- `penalty = 10^seq(-6, -1, length.out = 20)`
- `mixture = c(0, 0.2, 0.4, 0.6, 0.8, 1)`
- Tune the model using the cross-validation folds
- Evaluate the tuning procedure and identify the best performing models based on ROC AUC
```{r}
# define the penalized regression model specification
glmnet_spec <- ______
# define the new workflow
glmnet_workflow <- workflow() %>%
add_recipe(glmnet_rec) %>%
add_model(glmnet_spec)
# create the tuning grid
glmnet_grid <- tidyr::crossing(
penalty = 10^seq(-6, -1, length.out = 20),
mixture = c(0, 0.2, 0.4, 0.6, 0.8, 1)
)
# tune over the model hyperparameters
glmnet_tune <- ______
```
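A sketch of the remaining blanks (assuming the two-class outcome makes `logistic_reg()` appropriate, and reusing the `lyrics_folds` and `glmnet_grid` objects defined above):

```r
# penalized logistic regression with tunable penalty and mixture
glmnet_spec <- logistic_reg(penalty = tune(), mixture = tune()) %>%
  set_mode("classification") %>%
  set_engine("glmnet")

# tune across the grid using the cross-validation folds
glmnet_tune <- tune_grid(
  glmnet_workflow,
  resamples = lyrics_folds,
  grid = glmnet_grid
)
```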
```{r}
# evaluate results
collect_metrics(x = glmnet_tune)
autoplot(glmnet_tune)
# identify the five best hyperparameter combinations
show_best(x = glmnet_tune, metric = "roc_auc")
```
## Fit the best model
- Select the hyperparameter combination that achieves the highest ROC AUC
- Fit the penalized regression model using the best hyperparameters and the full training set. How well does the model perform on the test set?
```{r}
# select the best model's hyperparameters
glmnet_best <- select_best(glmnet_tune, metric = "roc_auc")
# fit a single model using the selected hyperparameters and the full training set
glmnet_final <- glmnet_workflow %>%
finalize_workflow(parameters = glmnet_best) %>%
last_fit(split = lyrics_split)
collect_metrics(glmnet_final)
```
## Variable importance
```{r}
# extract the parsnip model fit
glmnet_imp <- extract_fit_parsnip(glmnet_final) %>%
# calculate variable importance for the specific penalty parameter used
vi(lambda = glmnet_best$penalty)
# clean up the data frame for visualization
glmnet_imp %>%
mutate(
Sign = case_when(
Sign == "POS" ~ "More likely from Beyoncé",
Sign == "NEG" ~ "More likely from Taylor Swift"
),
Importance = abs(Importance)
) %>%
group_by(Sign) %>%
# extract 20 most important n-grams for each artist
slice_max(order_by = Importance, n = 20) %>%
ggplot(mapping = aes(
x = Importance,
y = fct_reorder(Variable, Importance),
fill = Sign
)) +
geom_col(show.legend = FALSE) +
scale_x_continuous(expand = c(0, 0)) +
scale_fill_brewer(type = "qual") +
facet_wrap(facets = vars(Sign), scales = "free") +
labs(
y = NULL,
title = "Variable importance for predicting the song artist",
subtitle = "These features are the most important in predicting\nwhether a song is by Beyoncé or Taylor Swift"
)
```