-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathanalysis.rmd
370 lines (253 loc) · 14.7 KB
/
analysis.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
---
title: "March Madness Tournament Round Predictions"
subtitle: "NCAA Men's Division I Basketball"
author: " - Raghav Maini\n
- Nick Caseria\n
- Gabriel Correa\n
- Katherine McElroy\n
- Taylor Noren\n"
date: "April 17, 2022"
output:
html_document:
highlight: tango
theme: united
toc: true
toc_depth: 2
toc_float: true
number_sections: false
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library (C50)
library(caret)
library(class)
library(rpart)
library(gmodels)
library(ggplot2)
library(kernlab)
library(lattice)
library(Metrics)
library(rpart.plot)
library(randomForest)
set.seed(656)
```
# Downloading and Prepping the Data
```{r}
#Downloading and Prepping the Data
basketball <- read.csv("cbb.csv", stringsAsFactors = TRUE)
basketball$TEAM <- NULL
basketball$CONF <- NULL
basketball$SEED <- NULL
summary(basketball)
str(basketball)
```
The data set we have chosen details college basketball statistics from 2013-2019. The data set is broken down by team (355 total). Each team is further categorized by conference. There are a number of statistics that are tracked for each team including the number of games they have played (G) and the number of wins they have (W). The rest of the categories in the data set essentially sum up a teams offensive and defensive efficiency, including turnover percentage, field goal percentage, etc. The goal of our analysis is to use the data set to train a model on the 2013-2019 data to successfully predict the outcomes of the upcoming 2022 march madness tournament. This poses a number of interesting business questions that we hope to answer. First and foremost, there is a rising popularity in sports betting. With college sports in particular, betters are only able to bet on team outcome rather than individual players unlike the NBA. This means our particular data set is a valuable means of assessing how teams might fare this year. Further, college basketball requires comprehensive investment. It is difficult for universities to predict whether or not their program will be successful, and oftentimes coaching contracts are arbitrarily extended based on the performance of one season. Our hope is that using this historical data will shed insight on how colleges should handle their programs in the future and if there are any aspects coaches should focus on when creating their rosters for the next year.
With this question in mind, our response variable for our analysis would be the "POSTSEASON" outcome which details where a team finished at the end of the season.
```{r}
#Making seed and year factors
#basketball$SEED <- as.factor(basketball$SEED)
#basketball$YEAR <- as.factor(basketball$YEAR)
#Create new columns that have the difference between 3P_O and 3P_D; 2P_O and 2P_D
basketball$Delta_2P <- basketball$X2P_O - basketball$X2P_D
basketball$Delta_3P <- basketball$X3P_O - basketball$X3P_D
summary(basketball)
```
As far as data cleaning goes, there isn't any substantial work to be done. We were able to source this data set on Kaggle which was already accompanied by in depth analysis. With this, the data was essentially cleaned and ready to use. However, we decided to make a couple of key changes. Firstly, we decided to make SEED and YEAR factors. Given that SEED and YEAR both have a limited number of outcomes, this provides us a stronger basis for considering how 1 seeds vs 2 seeds, etc. perform against each other as well as how our predictions change from year to year.
Further, we decided to create two new columns: Delta_3P and Delta_2P. These detail the differential between how many 3 point shots and 2 point shots they allow to score vs how many they actually score themselves. The goal of creating these columns is to assess how teams generally fare against their opposition in terms of these shots. A negative delta implies that they, on average, are worse than their competition while a positive delta implies the opposite.
## Creating numeric reponse variable column
```{r}
basketball$POSTSEASON_NUM <- ifelse(basketball$POSTSEASON == "R64", 1, ifelse(basketball$POSTSEASON == "R32", 2, ifelse(basketball$POSTSEASON == "S16", 3, ifelse(basketball$POSTSEASON == "E8", 4, ifelse(basketball$POSTSEASON == "F4", 5, ifelse(basketball$POSTSEASON == "2ND", 6, ifelse(basketball$POSTSEASON == "Champions", 7, 0)))))))
basketball$POSTSEASON_NUM[is.na(basketball$POSTSEASON_NUM)] <- 0
basketball$POSTSEASON <- NULL
summary(basketball)
success_eval <- data.frame(matrix(ncol=5, nrow=737))
colnames(success_eval) <- c("LR", "kNN", "SVM", "RF", "Actual")
```
## Split Data into Training Set and Validation Set
```{r}
#70 - 30 Split
train_set <- sample(1:nrow(basketball), 0.7*nrow(basketball))
#Training set
tr <- basketball[train_set, ]
x_tr <- basketball[train_set, -23] #x only
y_tr <- basketball[train_set, 23] # y only
#Validation set
val <- basketball[-train_set, ]
x_val <- basketball[-train_set, -23]
y_val <- basketball[-train_set, 23]
head(tr)
success_eval$Actual <- y_val
```
# Linear Regression
```{r}
if(!"caret" %in% installed.packages()){install.packages("caret")}
# Simple linear regression
m1 <- lm(POSTSEASON_NUM ~ G + ADJOE + ADJDE + BARTHAG + TOR + DRB + FTR + WAB + Delta_2P + Delta_3P, data = tr)
summary(m1)
# Step 2. Making predictions
pred4 <- predict(m1, x_val)
#Creating predict interval
pred5 <- predict(m1, x_val, interval ="prediction", level= 0.95)
# Step 3. Evaluating the model performance
# Put postResample result in the object, result
# Evaluating model performance
postResample(pred4, y_val)
md1_result <- postResample(pred4, y_val)
#Residual Plot
#attach(tr)
#require(gridExtra)
#library(gridExtra)
#plot1 <- ggplot(data = tr, aes(x_tr, resid(m1))) + geom_point() + geom_smooth()
#nrow(residuals(m1))
#nrow(x_tr)
#plot1
res <- resid(m1)
plot(fitted(m1), res)
abline(0, 0)
```
# Creating Prediction Variable
```{r}
```
# kNN
```{r}
trup <- upSample(x=tr[,-ncol(tr)],
y=as.factor(tr$POSTSEASON_NUM))
valup <- upSample(x=val[,-ncol(val)],
y=as.factor(val$POSTSEASON_NUM))
postseason_test_pred <- knn(train = trup, test = val,
cl = trup$Class,
k=43)
rmse(as.numeric(valup$Class), as.numeric(postseason_test_pred))
success_eval$kNN <- as.numeric(postseason_test_pred)
```
```{r}
#tr , val
ctrl <- trainControl(method="repeatedcv", repeats=3)
knnFit <- train(POSTSEASON_NUM ~ ., data = basketball,
method = "knn",
trControl=ctrl,
preProcess = c("center", "scale"),
tuneLength = 20)
knn_pred <- predict(knnFit, newdata = val)
```
# Support Vector Machine
```{r}
tr$POSTSEASON_NUM <- as.factor(tr$POSTSEASON_NUM)
# y_val <- as.factor(y_val)
rank_classifier_vanilla <- ksvm(POSTSEASON_NUM ~., data = tr,
kernel = "vanilladot", C = 1)
## look at basic information about the model
# rank_classifier_vanilla
rank_predictions_vanilla <- predict(rank_classifier_vanilla, newdata = val)
agreement_vanilla <- rank_predictions_vanilla == y_val
# table(agreement_vanilla)
prop.table(table(agreement_vanilla))
success_eval$SVM <- as.numeric(rank_predictions_vanilla)
rmse(as.numeric(rank_predictions_vanilla), as.numeric(y_val))
tr$POSTSEASON_NUM <- as.numeric(tr$POSTSEASON_NUM)
y_val <- as.numeric(y_val)
# success_eval
```
## Vanilla SVM Stats
RMSE: `r rmse(as.numeric(rank_predictions_vanilla), as.numeric(y_val))`
# Random Forest
```{r}
rfmodel <- randomForest(as.factor(POSTSEASON_NUM) ~ ., data = tr)
rfpredict <- predict(rfmodel, val)
rfpredict <- as.numeric(rfpredict)
# rfpredict
success_eval$RF = rfpredict - 1
rmse(success_eval$RF, as.numeric(y_val))
varImpPlot(rfmodel)
```
# Creating the Regression Tree
```{r}
# STEP 1: Create decision tree using regression
buckets <- rpart(POSTSEASON_NUM ~ G + W + ADJOE + ADJDE + BARTHAG + EFG_O + TOR + ORB + DRB + FTR + ADJ_T + WAB, method = "anova", data = basketball)
# STEP 2: Plot and Print out results for decision tree using regression from above
rpart.plot(buckets, uniform = TRUE,
main = "Predicting Postseason Breadth")
print(buckets)
# STEP 3: Create test data (we will use ours from above already created)
val <- basketball[-train_set, ]
x_val <- basketball[-train_set, -23]
y_val <- basketball[-train_set, 23]
# STEP 4: Predict Post Season Breadth off of our train data set
postseason_predictions <- predict(buckets, tr, method = "anova")
# STEP 5: Visualizations!
table(postseason_predictions,tr$POSTSEASON_NUM)
# STEP 6: Correlation!
cor(postseason_predictions,tr$POSTSEASON_NUM)
# STEP 7: Assign the Regression Tree output to success_eval column
# success_eval$RT <- postseason_predictions
# success_eval
```
## Understanding Regression Tree Output and Diagram
Reviewing our output, we re-coded our values to show that not making the tournament = 0, making it as far as the round of 64 = 1, making it as far as the round of 32 = 3, making it as far as the sweet 16 = 4, making it as far as the elite 8 = 5, making it as far as the final 4 = 6, and making the championship game = 7. Given this output review, our regression tree shows results predicted from "WAB" which is Wins Above Bubble, "W" which is Wins, and "G" which is Games.
- If a team's WAB is LESS than 1.6 games, LESS than 0.15 games, and their number of W is LESS than 23, their output in our 0-7 postseason breadth result is 0.5 according to the model. 79% of the data distribution falls in this branch.
- If a team's WAB is LESS than 1.6 games, LESS than 0.15 games, and their number of W is MORE than 23, their output in our 0-7 postseason breadth result is 0.5 according to the model. 7% of the data distribution falls in this branch.
- If a team's WAB is LESS than 1.6 games, but MORE than 0.15 games, their output in our 0-7 postseason breadth result is 1.1 according to the model. 4% of the data distribution falls in this branch.
- If a team's WAB is MORE than 1.6 games, LESS than 37 games played, and LESS than 35 games played, their output in our 0-7 postseason breadth result is 1.4 according to the model. 4% of the data distribution falls in this branch.
- If a team's WAB is MORE than 1.6 games, LESS than 37 games, but MORE than 35 games played, their output in our 0-7 postseason breadth result is 2.2 according to the model. 4% of the data distribution falls in this branch.
- If a team's WAB is MORE than 1.6 games, MORE than 37 games, but LESS than 39 games played, their output in our 0-7 postseason breadth result is 3.7 according to the model. 2% of the data distribution falls in this branch.
- If a team's WAB is MORE than 1.6 games, MORE than 37 games, and MORE than 39 games played, their output in our 0-7 postseason breadth result is 5.9 according to the model. 1% of the data distribution falls in this branch.
In summary, this shows that a teams post season breadth is predicted to be lowest and closest to zero if they have fewer than 0.15 Wins Above the Bubble and fewer than 23 wins. If these two things exist in a college team's season, it is unlikely they see the postseason. Alternatively, If a team's Wins Above Bubble is greater than 1.6, and their games played is over 39, that 1% is predicted to see the 5.9th round of the tournament. That equals them making it to the Elite 8, almost to the Final 4.
# Stacking the Models
```{r}
success_eval$Outcome <- ifelse(success_eval$kNN == success_eval$Actual, 1,
ifelse(success_eval$SVM == success_eval$Actual, 1,
ifelse(success_eval$RF == success_eval$Actual, 1,
0)))
# convert to factor for decision tree
success_eval$kNN <- as.factor(success_eval$kNN)
success_eval$SVM <- as.factor(success_eval$SVM)
success_eval$RF <- as.factor(success_eval$RF)
success_eval$Actual <- as.factor(success_eval$Actual)
success_eval$LR <- NULL
# success_eval$RT <- NULL
success_eval$Actual <- NULL
success_eval$Outcome <- as.factor(success_eval$Outcome)
# floor(nrow(success_eval) * 0.7)
success_train <- success_eval[1:515,]
success_test <- success_eval[516:730,]
stack_model <- C5.0(success_train[,-4], success_train[,4])
stack_pred <- predict(stack_model, success_test)
plot(stack_model)
```
# Validating Models against the 2021 Tournament Outcomes
```{r}
new_season <- read.csv("cbb21.csv", stringsAsFactors = TRUE)
# create df with 2021 season data
teams_21 <- data.frame(matrix(ncol=1, nrow=347))
colnames(teams_21) <- c("Teams")
teams_21$Teams <- new_season$TEAM
new_season$TEAM <- NULL
new_season$CONF <- NULL
new_season$SEED <- NULL
new_season$Delta_2P <- new_season$X2P_O - new_season$X2P_D
new_season$Delta_3P <- new_season$X3P_O - new_season$X3P_D
new_season$POSTSEASON_NUM <- ifelse(new_season$POSTSEASON == "R64", 1,
ifelse(new_season$POSTSEASON == "R32", 2,
ifelse(new_season$POSTSEASON == "S16", 3,
ifelse(new_season$POSTSEASON == "E8", 4,
ifelse(new_season$POSTSEASON == "F4", 5,
ifelse(new_season$POSTSEASON == "2ND", 6,
ifelse(new_season$POSTSEASON == "Champions", 7,
0)))))))
new_season$POSTSEASON_NUM[is.na(new_season$POSTSEASON_NUM)] <- 0
teams_21$Actual <- new_season$POSTSEASON_NUM
new_season$POSTSEASON <- NULL
summary(new_season)
str(new_season)
rfpredict21 <- predict(rfmodel, new_season)
rfpredict21 <- as.numeric(rfpredict21)
teams_21$Prediction = rfpredict21 - 1
rmse(teams_21$Actual, teams_21$Prediction)
str(rfpredict21)
```
# Conclusion
While our final prediction did not yield actionable results, models such as these offer many advantages over traditional intuition based decision making. While some traditional metrics may seem like common sense, it is the unlikely characteristics which will provide the most advantage.
<b> Enhanced odds making and risk management </b>
For firms in the sports betting industry, successfully predicting tournament outcomes can help to facilitate betting action on either side of the bet for each game. Offering odds which are determined through statistical analysis can also attract customers looking to try to "beat the house" with their own algorithm based strategies.
<b> Bonus calculation for team coaches </b>
Academic institutions must commit a large amount of capital to their athletic programs, and coaching salaries represent a substantial portion of these budgets. With bonus structures tied to tournament performance at many schools, financial planning can be augmented with insights gained from models similar to those created here.