---
title: "Analysis of Credit Card Default Against Cardholders' Background"
author: "Yu-Chen (Amber) Lu"
date: '`r paste("Date:",Sys.Date())`'
output: pdf_document
---
## Abstract
This project finds the best-fitting algorithm for the amount of given credit in NT dollars against the
other factors, selecting the most important variables with the Bayesian information criterion, and does
the same for whether a client defaults next month.
## 0. Introduction
A high credit card default rate can put a business in trouble or even drive it to bankruptcy. The purpose
of this project is to predict whether a client defaults on her/his credit card, so that the business can
underwrite credit cards more carefully for potential clients who are highly likely to be unable to pay
their bills. To get more insight into this dataset, I did exploratory data analysis using `ggplot2`.
Another perspective of this project is to predict the limit balance of a credit card. I used the
"Default of Credit Card Clients Dataset", which was downloaded from
https://archive.ics.uci.edu/ml/machine-learning-databases/00350/.
> ### 1) About the data
> This dataset contains information on default payments, demographic factors, credit data, history
of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.
> ### 2) About the variables
> (@) ID: ID of each client
(@) LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
(@) SEX: Gender (1=male, 2=female)
(@) EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
(@) MARRIAGE: Marital status (1=married, 2=single, 3=others)
(@) AGE: Age in years
(@) PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month,
2=payment delay for two months, ... 8=payment delay for eight months, 9=payment delay for nine months and above)
(@) PAY_2: Repayment status in August, 2005 (scale same as above)
(@) PAY_3: Repayment status in July, 2005 (scale same as above)
(@) PAY_4: Repayment status in June, 2005 (scale same as above)
(@) PAY_5: Repayment status in May, 2005 (scale same as above)
(@) PAY_6: Repayment status in April, 2005 (scale same as above)
(@) BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
(@) BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
(@) BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
(@) BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
(@) BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
(@) BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
(@) PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
(@) PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
(@) PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
(@) PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
(@) PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
(@) PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
(@) default.payment.next.month: Default payment (1=yes, 0=no)
> ### 3) Citation
> Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive
accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.
> ### 4) Library sources
```{r}
library(plyr)
library(dplyr)
library(ggplot2)
library(tidyr)
library(leaps)
library(glmnet)
library(tree)
library(randomForest)
library(gbm)
library(MASS)
library(class)
library(data.table)
library(gam)
set.seed(1)
```
## 1. Exploratory Data Analysis
```{r}
CreditCard = read.csv(file="UCI_Credit_Card.csv", header=TRUE)
# names(CreditCard) = sapply(names(CreditCard), tolower)
head(CreditCard)
glimpse(CreditCard)
# Get some plots of the data set
CreditCard_plot = CreditCard
CreditCard_plot$LIM_cut = cut(as.numeric(as.character(CreditCard_plot$LIMIT_BAL)),
c((0:8)*130000), right = FALSE,
labels = c("0-130K", "130K-260K", "260K-390K",
"390K-520K", "520K-650K", "650K-780K",
"780K-910K", "910K-1040K")) # Categorize LIMIT_BAL
CreditCard_plot$Age_cut = cut(as.numeric(as.character(CreditCard_plot$AGE)),
c(seq(20,80,10)), right = FALSE) # Categorize AGE
# Convert format
CreditCard_plot$default.payment.next.month =
as.character(CreditCard_plot$default.payment.next.month)
CreditCard_plot$EDUCATION = as.character(CreditCard_plot$EDUCATION)
# Plot 1 --------------------------------------------------------------------------
ggplot(data=CreditCard_plot, aes(LIM_cut, Age_cut)) +
geom_bar(stat = "identity", aes(fill = Age_cut), position = "dodge") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs( title = "Age and Limit Balance Categories", x = "Limit Balance Categories (NTD)",
y = "Age Categories (Years)")
## Comment: Middle-aged clients have higher amounts of given credit in NT dollars
# Plot 2 --------------------------------------------------------------------------
# The original dataset's default.payment.next.month column shows 0 and 1,
# which mean the client does not default and does default, respectively.
# This is not clear for someone who does not know the dataset,
# so I changed the labels of default.payment.next.month on the plot.
# First, define a list for converting the labels
default_names = list('0' ="No Default", '1'= "Default")
# Then, define a labeller to convert labels of default.payment.next.month
default_labeller = function(variable,value){
return(default_names[value])
}
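# (Note: this function-style labeller relies on an older ggplot2 interface;
# in current ggplot2 the same effect comes from
# labeller = as_labeller(c('0' = "No Default", '1' = "Default")).)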
ggplot(data=CreditCard_plot, aes(x=AGE)) +
geom_histogram(binwidth=.5, colour="black", fill="white") +
facet_grid(default.payment.next.month ~., labeller=default_labeller) +
geom_vline(data=CreditCard_plot, aes(xintercept=mean(AGE, na.rm=T)),
linetype="dashed", size=1, colour="red") +
labs(title = "Histogram of Age and Credit Card Default", x = "Age (Years)",
y = "Count", tag = "B")
## Comment: The bar charts show a lower percentage of credit card default
## for clients between ages 25 and 40. Also, most of the clients are
## between ages 25 and 35.
# Plot 3 --------------------------------------------------------------------------
CreditCard_plot2 = CreditCard_plot
CC_Default = CreditCard_plot$default.payment.next.month
CreditCard_plot2$default.payment.next.month[CC_Default=="1"] = "Default"
CreditCard_plot2$default.payment.next.month[CC_Default=="0"] = "No Default"
CreditCard_plot2$SEX[CreditCard_plot$SEX=="1"] = "Male"
CreditCard_plot2$SEX[CreditCard_plot$SEX=="2"] = "Female"
ggplot(data=CreditCard_plot2, aes(x=AGE)) +
geom_histogram(binwidth=.5, colour="black", fill="white") +
facet_grid(default.payment.next.month ~SEX) +
geom_vline(data=CreditCard_plot2, aes(xintercept=mean(AGE, na.rm=T)),
linetype="dashed", size=1, colour="red") +
labs(title = "Histogram of Age, Gender, and Credit Card Default", x = "Age (Years)",
y = "Count", tag = "C")
# Plot 4 --------------------------------------------------------------------------
ggplot(data=CreditCard_plot, aes(x=LIMIT_BAL, colour=default.payment.next.month)) +
stat_density(geom="line",position="identity") +
stat_density(geom="line", aes(color = "default.payment.next.month")) +
labs(title = "Density of Limit Balance and Credit Card Default",
x = "Limit Balance (NTD)", y = "Density", tag = "D") +
scale_colour_discrete(name="Default", breaks=c("0", "1", "default.payment.next.month"),
labels=c("No", "Yes", "All (Yes and No)"))
## Comment: The light blue line, which represents the density for credit card
## defaults, has a high peak at a limit balance of about 10,000 NTD. This
## might suggest that credit cards are too easy to approve without careful
## consideration of applicants' credit scores.
# Plot 5 --------------------------------------------------------------------------
ggplot(data=CreditCard_plot, aes(MARRIAGE, fill = default.payment.next.month)) +
labs(title = "Stacked Bar Chart of Marital Status and Credit Card Default",
subtitle = ("0: Missing Data; 1: Married; 2: Single; 3: Others"),
x = "Marital status", y = "Density", tag = "E") + geom_histogram(bins = 7) +
scale_fill_discrete(name="Default", breaks=c("0", "1"), labels=c("No", "Yes"))
## Comment: Most cardholders are married or single.
## Single clients have a lower percentage of credit card default.
# Plot 6 --------------------------------------------------------------------------
edu_count_table = as.data.frame(table(CreditCard_plot$EDUCATION))
edu_count_table$Prob = edu_count_table$Freq / sum(edu_count_table$Freq)
colnames(edu_count_table) = c("Edu", "Freq", "Prob" )
ggplot(data=edu_count_table, aes(x="", y=Prob, fill=Edu)) +
geom_bar(width = 1, stat = "identity") + coord_polar("y", start=0) +
labs(title = "Pie Chart of Eduction Level", y = "Probability", tag = "F") +
scale_fill_discrete(name="Education Level", breaks=seq(0,6),
labels=c("Missing Data", "Graduate", "University",
"High school", "Others", "Unknown", "Unknown"))
## Comment: More than two-thirds of cardholders have a bachelor's degree.
```
```{r}
# Get default data
default = CreditCard[CreditCard$default.payment.next.month == 1,]
summary(default)
# Define function to calculate default rate based on one factor
DefaultRate = function(tab){
names = colnames(tab) # Get column names of table
N = 2:6 # Get LIMIT_BAL, SEX, EDUCATION, MARRIAGE, AGE
DefaultRateList = list() # Initialize
for ( i in 1:length(N)){
factor = names[N[i]]
fre_cc = as.data.frame(table(CreditCard[factor]))
fre_de = as.data.frame(table(default[factor]))
# Left join
fre_table = merge(fre_cc, fre_de, by='Var1', all.x=TRUE)
fre_table[is.na(fre_table)] = 0 # Replace NA as 0
colnames(fre_table) = c(factor, 'AllData', 'Default')
# Get the default rate and count of no default
fre_table$NoDefault = fre_table$AllData - fre_table$Default
fre_table$Rate = fre_table$Default / fre_table$AllData
DefaultRateList[[i]] = as.matrix(fre_table)
}
return(DefaultRateList)
}
DefaultRateMat = DefaultRate(CreditCard)
# Plot 7 --------------------------------------------------------------------------
# Extract the LIMIT_BAL default rate
LimitBal_D_R = as.data.frame(DefaultRateMat[[1]])
LimitBal_D_R$LIM_cut = cut(as.numeric(as.character(LimitBal_D_R$LIMIT_BAL)),
c((0:8)*130000), right = FALSE,
labels = c("0-130K", "130K-260K", "260K-390K",
"390K-520K", "520K-650K", "650K-780K",
"780K-910K", "910K-1040K")) # Categorize LIMIT_BAL
LimitBal_D_R$Rate_cut = cut(as.numeric(as.character(LimitBal_D_R$Rate)),
c(seq(0,1.1,0.1)), right = FALSE) # Categorize Default Rate
# Visualize the default of LIMIT_BAL
LimitBal_D_R$Rate = as.numeric(as.character(LimitBal_D_R$Rate))
LimitBal_D_R$LIMIT_BAL = as.numeric(as.character(LimitBal_D_R$LIMIT_BAL))
ggplot(data=LimitBal_D_R, aes(x=LIMIT_BAL, y=Rate, color=Rate_cut)) +
geom_point() + labs(title = "Probability of Default Rate Based on Limit Balance",
x = "Limit Balance (NTD)", y = "Probability", tag = "G") +
scale_colour_discrete(name="Categorize Default Rate",
labels=c("Less than 10%", "10%-20%", "20%-30%",
"30%-40%", "40%-50%", "50%-60%","More than 60%"))
## Comment: The average default rate is relatively low
## when the limit balance is between 250,000 and 500,000.
## The volatility of the default rate is relatively high
## when the limit balance is between 500,000 and 750,000.
# Plot 8 --------------------------------------------------------------------------
Edu_D_R = as.data.frame(DefaultRateMat[[3]])
Edu_D_R$Rate = as.numeric(as.character(Edu_D_R$Rate))
Edu_D_R$EDUCATION = c("MissingData", "Graduate", "University",
"HighSchool", "Others", "Unknown1", "Unknown2")
Edu_D_R_new = Edu_D_R[,c(1,3,4,5)] %>% gather(DefaultYN, Count, Default:NoDefault)
Edu_D_R_new$Count = as.numeric(as.character(Edu_D_R_new$Count))
ggplot(data=Edu_D_R_new, aes(x=EDUCATION, y=Count, fill=DefaultYN)) +
geom_bar(colour="black", stat="identity", position=position_dodge()) +
geom_text(aes(label=Count), position=position_dodge(width=0.9), vjust=-0.25, size=3) +
labs(title = "Count of Default and Non-Default Based on Education Level",
x = "Education Level", y = "Count ", tag = "H") + theme_minimal() +
scale_fill_brewer(palette="Paired", name="Default Yes or No", labels=c("Yes", "No"))
## Comment: Most of the cardholders have a bachelor's or master's degree.
ggplot(data=Edu_D_R, aes(x=EDUCATION, y=Rate)) + geom_jitter(aes(x=EDUCATION, y=Rate)) +
labs(title = "Scatter Plot of Default Rate and Education Level",
x = "Education Level", y = "Count ", tag = "I")
## Comment: Clients with higher education levels have lower default rates.
# Plot 9 --------------------------------------------------------------------------
Age_D_R = as.data.frame(DefaultRateMat[[5]])
# Convert format to numeric
for (i in 1:length(Age_D_R)){
Age_D_R[,i] = as.numeric(as.character(Age_D_R[,i]))
}
ggplot(Age_D_R, aes(x=AGE, y=AllData, color=Rate)) +
geom_point(aes(size=Rate)) +
labs(title = "Level of Default Rate Baesd on Age",
x = "Age (Years)", y = "Count of All Data", tag = "J") + theme_minimal()
## Comment: More clients hold credit cards between ages 22 and 40.
## The variance of the default rate is higher for clients older than 60.
```
```{r}
ftable(CreditCard$SEX, CreditCard$EDUCATION, CreditCard$MARRIAGE)
## We can see some unknown values and 0s in this data set
##### Note: those missing values could be predicted as well,
##### but that is not included in this project.
# Data Cleaning
# Remove rows with Education=0,5,6 and MARRIAGE=0,3 and LIMIT_BAL,SEX,AGE=0
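# (Column indices within each row: x[2:6] are LIMIT_BAL, SEX, EDUCATION,
# MARRIAGE, AGE, so x[4] is EDUCATION and x[5] is MARRIAGE.)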
without0 = apply(CreditCard,1, function(x) all(x[2:6]!=0) && x[4]!=5 && x[4]!=6 && x[5]!=3)
CreditCard = CreditCard[without0,]
```
There are 24 factors against the amount of given credit. In order to avoid overfitting, I selected the
most important factors using forward stepwise selection.
**Algorithm: Forward Stepwise Selection**
a. Let $\mathcal{M_0}$ denote the null model, which contains no predictors.
b. For $k=0,1,\dots ,p-1$, where $p$ is the number of predictors:
b.1 Consider all $p-k$ models that augment the predictors in $\mathcal{M_k}$ with one additional predictor.
b.2 Choose the best among these $p-k$ models, and call it $\mathcal{M_{k+1}}$. Here best is defined as
having the smallest RSS, or equivalently the largest $R^2$.
c. Select a single best model from among $\mathcal{M_0}, \mathcal{M_1}, \dots,\mathcal{M_p}$ using
cross-validated prediction error, $C_p$, $AIC$, $BIC$, or adjusted $R^2$.
At step c, I chose the Bayesian information criterion (BIC) to estimate the prediction error. The BIC
gives unnecessary variables a much greater penalty, so it is more efficient at avoiding overfitting.
**Criteria: Bayesian Information Criterion**
For a least squares model with $d$ predictors, up to irrelevant constants,
$$BIC=\frac{1}{n}(RSS+d\hat{\sigma}^2\log n)$$
Since $\log n > 2$ for $n > 7$, the BIC places a heavier penalty on models with many variables.
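For concreteness, here is a minimal sketch of forward stepwise selection under BIC using base R's `step()`
on the built-in `mtcars` data (a stand-in dataset, not part of this analysis); the sections below use
`leaps::regsubsets()` instead.
```{r}
# Illustration only: forward stepwise selection with a BIC penalty.
# step() adds one predictor at a time, keeping the addition that most improves
# the criterion; setting k = log(n) turns its default AIC penalty into BIC.
null_model = lm(mpg ~ 1, data = mtcars)           # M_0: intercept only
full_scope = formula(lm(mpg ~ ., data = mtcars))  # all candidate predictors
step(null_model, scope = full_scope, direction = "forward",
     k = log(nrow(mtcars)), trace = FALSE)
```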
## 2. Quantitative factors as responses
Statistical learning on LIMIT_BAL against the other factors.
```{r}
# Use 70% of the data for the train set and the remaining 30% for the test set
train = round(nrow(CreditCard) * 0.7,0)
train = sample(nrow(CreditCard) ,train)
CreditCard.train = CreditCard[train, ]
CreditCard.test = CreditCard[-train, ]
# Determine which parameters are more important using forward selection
Forward = regsubsets(LIMIT_BAL ~., data = CreditCard.train, method="forward",
nvmax=length(CreditCard)-1)
Forward.summary = summary(Forward)
# Determine how many parameters will be used with the **Bayesian Information Criterion (BIC)**
plot(Forward.summary$bic,type='b',col="blue", pch=19, xlab = "Number of Variables",
ylab = "Cross-Validated Prediction Error",
main = "Forward Stepwise Selection using BIC")
points(which.min(Forward.summary$bic), Forward.summary$bic[which.min(Forward.summary$bic)],
col="red", pch=19)
# List all important parameters selected by BIC
GetColumn = t(Forward.summary$which)[,which.min(Forward.summary$bic)]
## Remove LIMIT_BAL from column names because LIMIT_BAL is the response of regression
NameCol = names(CreditCard)[-2]
NameCol[GetColumn[2:length(CreditCard)]]
## I picked 20 parameters, which give the minimum BIC
```
```{r}
# Set the formula for part 2 of this project
formulaQ2 =
as.formula(LIMIT_BAL ~ .-ID-PAY_4-BILL_AMT6-default.payment.next.month)
# Linear Regression
fit.lm = lm(formulaQ2, data = CreditCard.train)
# Predict
yhat.lm = predict(fit.lm, CreditCard.test)
# Test MSE
mse_lm = round(mean((yhat.lm - CreditCard.test$LIMIT_BAL)^2), 4)
paste("The test MSE using linear regession is", mse_lm)
##################################################################################
# Lasso and Elastic-Net Regularized Generalized Linear Models
tryalpha = seq(0,1,0.1)
x.train = model.matrix(formulaQ2, data = CreditCard.train)
x.test = model.matrix(formulaQ2, data = CreditCard.test)
y = CreditCard.train$LIMIT_BAL
mse_glmnet = rep(NA, length(tryalpha))
for (i in 1:length(tryalpha)){
fit.glmnet = glmnet(x.train, y, alpha = tryalpha[i])
pred.glmnet = predict(fit.glmnet, s = 0.01, newx = x.test)
mse_glmnet[i] = round(mean((pred.glmnet - CreditCard.test$LIMIT_BAL)^2), 4)
}
plot(tryalpha, mse_glmnet, xlab = "Alpha", ylab = "Test Mean-Squared Errors",
main = "Test MSE using Regularized Generalized Linear Models")
# Test MSE
# Lasso and Elastic-Net Regularized Generalized Linear Models (glmnet)
paste("The lowest test MSE using glmnet is",
min(mse_glmnet), "with alpha =",
tryalpha[which.min(mse_glmnet)], "as alpha is in [0, 1]")
##################################################################################
# Tree
# Fit a regression tree to the training set
fit.tree = tree(formulaQ2, data = CreditCard.train)
# Plot the tree
plot(fit.tree)
text(fit.tree, pretty = 0)
title("Decision Tree of Amount of Given Credit (LIMIT_BAL)")
# Test MSE
tree.pred = predict(fit.tree, newdata = CreditCard.test)
mse.tree = mean((tree.pred - CreditCard.test$LIMIT_BAL)^2)
print (paste("The test MSE using tree is", round(mse.tree,4) ))
##################################################################################
# Random Forest
# Use the bagging approach
trymtry = seq(10,16,2) # values of mtry to try, near (which.min(Forward.summary$bic)-1)
mse.bag = rep(NA, length(trymtry))
for (i in 1:length(trymtry)){
fit.bag = randomForest(formulaQ2, data = CreditCard.train, mtry = trymtry[i],
ntree = 500, importance = TRUE)
yhat.bag = predict(fit.bag, newdata = CreditCard.test)
mse.bag[i] = round(mean((yhat.bag - CreditCard.test$LIMIT_BAL)^2),4)
}
plot(trymtry, mse.bag, type = "b", xlab = "Number of variables", ylab = "Test MSE",
main = "Test MSE using Bagging Approach")
paste("The lowest test MSE using bagging is", min(mse.bag))
##################################################################################
# Generalized Boosted Regression Models
# Give shrinkage parameters 'lambda'
lambdas = 10^seq(0, 0.2, by = 0.001)
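# (Caveat: this grid gives shrinkage values between 1 and about 1.58;
# typical gbm shrinkage values are much smaller, e.g. 0.001-0.1.)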
test.err = rep(NA, length(lambdas))
for (i in 1:length(lambdas)) {
boost.Limitbal = gbm(formulaQ2, data = CreditCard.train, distribution = "gaussian",
n.trees = 1000, shrinkage = lambdas[i])
yhat = predict(boost.Limitbal, CreditCard.test, n.trees = 1000)
test.err[i] = mean((yhat - CreditCard.test$LIMIT_BAL)^2)
}
# Plot the shrinkage parameters on the x-axis
# against the corresponding test set MSE on the y-axis
plot(lambdas, test.err, type = "b", xlab = "Shrinkage values", ylab = "Test MSE",
main = "Test MSE using Boosting Algorithm")
mse_boosting = round(min(test.err), 4)
paste("The test MSE using boosting is", mse_boosting)
```
```{r}
P2_accuracy = data.frame("Test MSE"=c( mse_lm, min(mse_glmnet),
mse.tree, min(mse.bag), mse_boosting))
rownames(P2_accuracy) = c("lm", "glmnet", "tree", "bag", "boosting")
P2_accuracy
```
I did machine learning on this credit card dataset using five algorithms: linear regression (lm),
lasso and elastic-net regularized generalized linear models (glmnet), a regression tree, bagging, and boosting.
From the test MSE table, we can see that linear regression is not a good fit for predicting the limit balance
of our credit card dataset, even when varying the elastic-net mixing parameter $\alpha$ from 0 to 1. Among
these five algorithms, the bagging approach has the lowest test MSE. This is because bagging (bootstrap
aggregation) reduces the variance, and hence the prediction mean-squared error, of a statistical learning
method: it draws many training sets from the population, builds a separate prediction model on each, and
averages the resulting predictions. Thus, the test MSE is lower than that of a single tree.
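To see why averaging helps, here is a small self-contained sketch on simulated data (a stand-in, not the
credit card set): averaging $B$ bootstrap estimates of the same quantity shrinks the variance roughly by a
factor of $B$.
```{r}
# Illustration only: variance reduction from averaging bootstrap estimates
x = rnorm(1000, mean = 50, sd = 10)   # a stand-in "population"
# 500 single-sample estimates of the mean
single = replicate(500, mean(sample(x, 100, replace = TRUE)))
# 500 bagged estimates, each averaging 25 bootstrap estimates
bagged = replicate(500, mean(replicate(25,
                   mean(sample(x, 100, replace = TRUE)))))
c(var_single = var(single), var_bagged = var(bagged))
```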
## 3. Qualitative factors as responses
Then, I did the same process on whether a client defaults next month (default.payment.next.month)
against the other factors.
```{r}
# Determine which parameters are more important using forward selection
Forward = regsubsets(default.payment.next.month ~., data = CreditCard.train,
method="forward", nvmax=length(CreditCard)-1)
Forward.summary = summary(Forward)
# Determine how many parameters will be used with the **Bayesian Information Criterion (BIC)**
plot(Forward.summary$bic,type='b',col="blue", pch=19,
xlab = "Number of Variables",
ylab = "Cross-Validated Prediction Error",
main = "Forward Stepwise Selection using BIC")
points(which.min(Forward.summary$bic),
Forward.summary$bic[which.min(Forward.summary$bic)],
col="red", pch=19)
# List all important parameters selected by BIC
names(CreditCard) [t(Forward.summary$which)[,which.min(Forward.summary$bic)]
[2:length(CreditCard)]]
## I picked 6 parameters, which give the minimum BIC
```
```{r}
# Set formula for part 3 of this project
formulaQ3 =
as.formula(default.payment.next.month~MARRIAGE+PAY_0+PAY_2+PAY_3+
BILL_AMT1+PAY_AMT1)
######################################################################################
# Generalized Linear Model with Logistic Regression
fit.glm = glm(formulaQ3, data = CreditCard.train, family = binomial)
# Predict
pred.prob = predict(fit.glm, CreditCard.test, type = "response")
pred.glm = rep(0, length(pred.prob))
pred.glm[pred.prob > 0.5] = 1
pred.table = table(pred.glm, CreditCard.test$default.payment.next.month)
pred.table
# Sensitivity: TP/P = TPR (here the "positive" class is no-default, i.e. 0)
Sensitivity = pred.table[1,1] / sum(pred.table[,1])
# Specificity: TN/N = TNR
Specificity = pred.table[2,2] / sum(pred.table[,2])
# Accuracy: (TP + TN)/(P + N)
Accuracy = sum(pred.table[1,1], pred.table[2,2]) / sum(pred.table[,])
# Total Error Rate: (FP + FN)/(P + N)
TotalError = sum(pred.table[1,2],pred.table[2,1]) / sum(pred.table[,])
glm.Confusion = round(data.frame(Sensitivity, Specificity, Accuracy, TotalError),4)
row.names(glm.Confusion) = "GLM"
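# The same four metrics are recomputed for each model below; a small helper
# like the following (hypothetical, not part of the original analysis) would
# remove the repetition. It assumes a 2x2 table with predictions in rows and
# the truth in columns, as produced by table() above.
ConfusionStats = function(tab, name){
  out = round(data.frame(
    Sensitivity = tab[1,1] / sum(tab[,1]),    # TP/P, class 0 as positive
    Specificity = tab[2,2] / sum(tab[,2]),    # TN/N
    Accuracy    = sum(diag(tab)) / sum(tab),  # (TP + TN)/(P + N)
    TotalError  = (tab[1,2] + tab[2,1]) / sum(tab)), 4)
  row.names(out) = name
  out
}
# e.g. ConfusionStats(pred.table, "GLM") reproduces glm.Confusion above.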
paste("The accuracy using Logistic regression is", Accuracy,4)
######################################################################################
# Linear Discriminant Analysis (LDA)
fit.lda = lda(formulaQ3, data = CreditCard.train)
# Predict default.payment.next.month in the testing data set
pred.prob.lda = predict(fit.lda, CreditCard.test)
# Predict table
pred.table.lda = table(pred.prob.lda$class, CreditCard.test$default.payment.next.month)
pred.table.lda
# Sensitivity: TP/P = TPR
Sensitivity = pred.table.lda[1,1] / sum(pred.table.lda[,1])
# Specificity: TN/N = TNR
Specificity = pred.table.lda[2,2] / sum(pred.table.lda[,2])
# Accuracy: (TP + TN)/(P + N)
Accuracy = sum(pred.table.lda[1,1],pred.table.lda[2,2]) / sum(pred.table.lda[,])
# Total Error Rate: (FP + FN)/(P + N)
TotalError = sum(pred.table.lda[1,2],pred.table.lda[2,1]) / sum(pred.table.lda[,])
lda.Confusion = round(data.frame(Sensitivity, Specificity, Accuracy, TotalError),4)
row.names(lda.Confusion) = "LDA"
paste("The accuracy using LDA is", Accuracy)
######################################################################################
# Quadratic discriminant analysis (QDA)
fit.qda = qda(formulaQ3, data = CreditCard.train)
# Predict default.payment.next.month in the testing data set
pred.prob.qda = predict(fit.qda, CreditCard.test)
# Predict table
pred.table.qda = table(pred.prob.qda$class, CreditCard.test$default.payment.next.month)
pred.table.qda
# Sensitivity: TP/P = TPR
Sensitivity = round(pred.table.qda[1,1] / sum(pred.table.qda[,1]),4)
# Specificity: TN/N = TNR
Specificity = round(pred.table.qda[2,2] / sum(pred.table.qda[,2]),4)
# Accuracy: (TP + TN)/(P + N)
Accuracy = round(sum(pred.table.qda[1,1],
pred.table.qda[2,2]) / sum(pred.table.qda[,]),4)
# Total Error Rate: (FP + FN)/(P + N)
TotalError = round(sum(pred.table.qda[1,2],
pred.table.qda[2,1]) / sum(pred.table.qda[,]),4)
qda.Confusion = data.frame(Sensitivity, Specificity, Accuracy, TotalError)
row.names(qda.Confusion) = "QDA"
paste("The accuracy using QDA is", Accuracy)
######################################################################################
# K-nearest neighbors algorithm
CreditCard.bic = dplyr::select(CreditCard, MARRIAGE, AGE, PAY_0, PAY_2,
PAY_3, BILL_AMT1, PAY_AMT1, PAY_AMT2)
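# (Caveat: knn() uses Euclidean distance on the raw columns, so large-scale
# variables such as BILL_AMT1 and PAY_AMT1/PAY_AMT2 dominate the distance;
# applying scale() to the predictors first would arguably be preferable.)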
# Reuse the 70%/30% train/test split from part 2 so that the class labels
# (CreditCard.train$default.payment.next.month below) stay aligned with these rows
CreditCard.bic.train = CreditCard.bic[train, ]
CreditCard.bic.test = CreditCard.bic[-train, ]
pred.tables.knn = NULL    # start from NULL and rbind (table(NULL) would error)
Accuracy.knn.table = NULL
for(K in 6:25){
pred.knn = knn(CreditCard.bic.train, CreditCard.bic.test,
CreditCard.train$default.payment.next.month, k = K)
pred.table.knn = table(pred.knn, CreditCard.test$default.payment.next.month)
Accuracy = round(sum( pred.table.knn[1,1],
pred.table.knn[2,2] ) / sum(pred.table.knn[,]),4)
# rbind the Accuracy table and the Confusion table for each K
Accuracy.knn.table = rbind(Accuracy.knn.table, Accuracy)
pred.tables.knn = rbind(pred.tables.knn, pred.table.knn)
}
# Convert pred.tables.knn from **matrix** into **data.table**
predicts = rownames(pred.tables.knn)
pred.tables.knn = data.table(pred.tables.knn)
# Create two columns for the predictions and K, respectively
pred.tables.knn$Predicts = predicts
pred.tables.knn$K = rep(6:25, each=2)
# Switch the order of the columns
pred.tables.knn = pred.tables.knn[,c(4,3, 1:2)]
# Rename the rows and columns of the Accuracy table
rownames(Accuracy.knn.table) = paste0("K=", 6:25)
colnames(Accuracy.knn.table) = "Accuracy"
# Plot the accuracy as K = [6,25]
plot(x=seq(6,25), y=Accuracy.knn.table, xlab = "Number of K",
ylab = "Accuracy", main = "Accuracy using KNN")
paste("The highest accuracy using KNN is", max(Accuracy.knn.table), "with K =",
seq(6,25)[which.max(Accuracy.knn.table)])
######################################################################################
# Generalized Additive Model
fit.gam = gam(formulaQ3, data = CreditCard.train)
# Predict default.payment.next.month using the test set
pred.prob.gam = predict(fit.gam, CreditCard.test)
pred.gam = rep(0, length(pred.prob.gam))
pred.gam[pred.prob.gam > 0.5] = 1
pred.table.gam = table(pred.gam, CreditCard.test$default.payment.next.month)
pred.table.gam
pred.table.gam.Accuracy = round(sum(diag(pred.table.gam)) / length(pred.prob.gam),4)
paste("The accuracy of generalized additive model is", pred.table.gam.Accuracy)
######################################################################################
# Tree
# Fit a tree to the training set (the 0/1 response is numeric, so this is a
# regression tree; its predictions are thresholded at 0.5 below)
fit.tree = tree(formulaQ3, data = CreditCard.train)
# Plot the tree
plot(fit.tree)
text(fit.tree, pretty = 0)
title("Decision Tree of Credit Default (default.payment.next.month)")
# Predict default.payment.next.month in the testing data set
pred.prob.tree = predict(fit.tree, CreditCard.test)
# Use default.payment.next.month to test the model accuracy
pred.tree = rep(0, length(pred.prob.tree))
pred.tree[pred.prob.tree > 0.5] = 1
pred.table.tree = table(pred.tree, CreditCard.test$default.payment.next.month)
pred.table.tree
# Sensitivity: TP/P = TPR
Sensitivity = pred.table.tree[1,1] / sum(pred.table.tree[,1])
# Specificity: TN/N = TNR
Specificity = pred.table.tree[2,2] / sum(pred.table.tree[,2])
# Accuracy: (TP + TN)/(P + N)
Accuracy = sum(pred.table.tree[1,1], pred.table.tree[2,2]) / sum(pred.table.tree)
# Total Error Rate: (FP + FN)/(P + N)
TotalError = sum(pred.table.tree[1,2], pred.table.tree[2,1]) / sum(pred.table.tree[,])
tree.Confusion = round(data.frame(Sensitivity, Specificity, Accuracy, TotalError),4)
row.names(tree.Confusion) = "Tree"
paste("The accuracy of Tree is", Accuracy)
######################################################################################
# Random Forest:
# In random forests, data are resampled from the train set for as many trees
# as there are in the forest (the default is 500 in R).
# Since the response has only two unique values,
# it is not enough for the random forest to create distinct trees.
# Thus, I won't use Random Forest to do prediction.
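# (Side note, not pursued here: coding the response as a factor, e.g.
# randomForest(as.factor(default.payment.next.month) ~ ., ...), makes
# randomForest fit a classification forest instead.)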
######################################################################################
# Lasso and Elastic-Net Regularized Generalized Linear Models
# Build the model matrices for the training and test sets
x.train = model.matrix(formulaQ3, data = CreditCard.train)
x.test = model.matrix(formulaQ3, data = CreditCard.test)
tryalpha = seq(0,1,0.1)
acc_glmnet = rep(NA, length(tryalpha))
for (i in 1:length(tryalpha)){
fit.glmnet = glmnet(x.train, CreditCard.train$default.payment.next.month, alpha = tryalpha[i])
pred.prob.glmnet = predict(fit.glmnet, s = 0.01, newx = x.test)
# Use default.payment.next.month to test the model accuracy
pred.glmnet = rep(0, length(pred.prob.glmnet))
pred.glmnet[pred.prob.glmnet > 0.5] = 1
pred.table.glmnet = table(pred.glmnet, CreditCard.test$default.payment.next.month)
pred.table.glmnet
# Sensitivity: TP/P = TPR
Sensitivity = pred.table.glmnet[1,1] / sum(pred.table.glmnet[,1])
# Specificity: TN/N = TNR
Specificity = pred.table.glmnet[2,2] / sum(pred.table.glmnet[,2])
# Accuracy: (TP + TN)/(P + N)
Accuracy = sum(pred.table.glmnet[1,1], pred.table.glmnet[2,2]) / sum(pred.table.glmnet[,])
# Total Error Rate: (FP + FN)/(P + N)
TotalError = sum(pred.table.glmnet[1,2], pred.table.glmnet[2,1]) / sum(pred.table.glmnet[,])
glmnet.Confusion = data.frame(Sensitivity, Specificity, Accuracy, TotalError)
acc_glmnet[i] = round(glmnet.Confusion$Accuracy,4)
}
plot(tryalpha, acc_glmnet, xlab = "Alpha", ylab = "Accuracy",
main = "Accuracy using Regularized Generalized Linear Models")
paste("The highest accuracy using glmnet is",
max(acc_glmnet), "with alpha =",
tryalpha[which.max(acc_glmnet)], "as alpha is in [0, 1]")
```
```{r}
P3_accuracy = data.frame("Accuracy"= c( glm.Confusion$Accuracy, lda.Confusion$Accuracy,
qda.Confusion$Accuracy, max(Accuracy.knn.table),
pred.table.gam.Accuracy, tree.Confusion$Accuracy,
max(acc_glmnet)))
rownames(P3_accuracy) = c("glm", "lda", "qda", "KNN", "gam", "tree", "glmnet" )
P3_accuracy
# From these seven algorithms, Tree has the highest accuracy.
```
In part 3, I did machine learning on this credit card dataset using seven algorithms: the generalized
linear model (glm), linear and quadratic discriminant analysis, k-nearest neighbors, a generalized
additive model, a classification tree, and lasso and elastic-net regularized generalized linear models.
From the accuracy table, the high accuracy of the generalized linear model suggests that the probability
of credit card default next month is nearly linear in the other factors. Additionally, the repayment
status of the previous two months is a clear distinguishing feature. Thus, the classification tree has
the highest accuracy, and LDA is the second highest.
## Conclusion
1. There are more credit card defaults at a limit balance of about 10,000 NTD. This might mean that credit
cards are too easy to issue to people who have low credit scores. The variance of the default rate for
limit balances over 500,000 NTD is higher than for other ranges of limit balance.
2. The default rate is lower for cardholders with higher education levels. Moreover, the default rate for
clients over age 60 was higher than for middle-aged and young people.
3. The best-fitting algorithm for predicting the limit balance is the bagging approach.
4. The best-fitting algorithm for predicting whether a client defaults next month is the classification tree.
## Reference
1. https://bradzzz.gitbooks.io/ga-dsi-seattle/content/dsi/dsi_05_classification_databases/2.1-lesson/assets/datasets/DefaultCreditCardClients_yeh_2009.pdf
2. https://gerardnico.com/data_mining/stepwise_regression
3. http://www-math.mit.edu/~rmd/650/bic.pdf