---
title: "Imbalanced classification with deep learning - Used cars problem"
author: "Ahmet Zamanis"
format:
  gfm:
    toc: true
editor: visual
jupyter: python3
execute:
  warning: false
---
## Introduction
In this report, we'll build and test binary classification algorithms on a dataset of used cars up for sale. Our goal is to predict whether a car is a "kick": a car that is heavily damaged or unusable, and should not be purchased. The dataset was sourced from [OpenML](https://www.openml.org/search?type=data&sort=runs&id=41162&status=active), and was apparently used for a past [Kaggle competition](https://www.kaggle.com/competitions/DontGetKicked/overview), and possibly for a [Chalearn](http://www.chalearn.org/) competition.
We will test logistic regression, support vector machine and XGBoost classifiers with the scikit-learn package. As the dataset is quite large, we'll train our logistic regression & SVM models with Stochastic Gradient Descent. We'll also build a deep neural network model with PyTorch Lightning. We'll perform hyperparameter tuning for all models with the Optuna framework. We'll compare the performances of all models, both with classification performance metrics, and with a sensitivity analysis based on a hypothetical business scenario.
Our target variable is highly imbalanced, as positive observations are strongly in the minority. Training classifiers and evaluating their performance can be tricky in such cases. To try and address the imbalance, we'll train our scikit-learn models with class weights, use focal loss as our loss function for the neural network, and use suitable metrics & plots to correctly assess performance.
## Setup
```{python Imports}
#| code-fold: true
#| code-summary: "Show packages"
# Data handling
import pandas as pd
import numpy as np
from scipy.io.arff import loadarff
# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
# Categorical encoding
from feature_engine.encoding import OneHotEncoder
from feature_engine.creation import CyclicalFeatures
from category_encoders.target_encoder import TargetEncoder
# Preprocessing pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, StratifiedKFold
# Modeling
from sklearn.utils.class_weight import compute_class_weight
from sklearn.kernel_approximation import RBFSampler
from sklearn.calibration import CalibratedClassifierCV
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import SGDClassifier
from xgboost import XGBClassifier
# Neural network modeling
import torch, torchvision, torchmetrics
import lightning.pytorch as pl
from optuna.integration.pytorch_lightning import PyTorchLightningPruningCallback
# Hyperparameter tuning
import optuna
# Loss functions & performance metrics
from sklearn.metrics import log_loss
from sklearn.metrics import hinge_loss
from sklearn.metrics import average_precision_score, brier_score_loss
from sklearn.metrics import PrecisionRecallDisplay, precision_recall_curve
# Utilities
import warnings
from itertools import product
```
```{python Settings}
#| code-fold: true
#| code-summary: "Show settings"
# Set printing options
np.set_printoptions(suppress=True, precision=4)
pd.options.display.float_format = '{:.4f}'.format
pd.set_option('display.max_columns', None)
# Set plotting options
plt.rcParams['figure.dpi'] = 300
plt.rcParams['savefig.dpi'] = 300
plt.rcParams["figure.autolayout"] = True
sns.set_style("darkgrid")
# Set Torch settings
torch.set_default_dtype(torch.float32)
torch.set_float32_matmul_precision('high')
pl.seed_everything(1923, workers = True)
warnings.filterwarnings("ignore", ".*does not have many workers.*")
```
## Data cleaning
We load our dataset, which is in .arff format, and carry out some data cleaning operations. Most of the operations are explained in the code comments, so I'll keep the text to a minimum in this section.
```{python LoadData}
# Load raw data
raw_data = loadarff("./RawData/kick.arff")
# Convert to pandas dataframe
df = pd.DataFrame(raw_data[0])
# Print first & last 5 rows, all columns
print(df)
```
The categorical variables in the dataset have the bytes datatype, and need to be converted to strings. Also, some missing values are recorded as "?", which will be replaced with proper missing values.
```{python ConvertColumns}
# Convert object columns from bytes to string datatype
object_cols = df.select_dtypes(["object"]).columns
for column in object_cols:
    df[column] = df[column].apply(lambda x: x.decode("utf-8"))
del column

# Replace "?" string values with NAs
for column in object_cols:
    df.loc[df[column] == "?", column] = np.nan
del column
```
We have missing values in several columns. Two columns, PRIMEUNIT and AUCGUART, are missing to a high degree. We'll inspect and handle each column one by one.
```{python MissingValues}
# Print n. of missing values in each column
pd.isnull(df).sum()
```
The target variable is highly imbalanced.
```{python TargetBalance}
print("Target class (im)balance: ")
df["IsBadBuy"].value_counts(normalize = True)
```
```{python PurchaseDate}
# Purchase date is in UNIX timestamp format. Convert to datetime
df["PurchDate"]
df["PurchDate"] = pd.to_datetime(df["PurchDate"], unit = "s")
```
```{python Auction}
# 3 unique auctioneers. ADESA, MANHEIM and other. Mostly MANHEIM.
df["Auction"].value_counts(normalize = True)
```
```{python VehYearAge}
# Vehicle years range from 2001 to 2010
print(
df[["VehYear", "VehicleAge"]].describe()
)
# Vehicle age almost always matches purchase year - vehicle year.
purch_year = df["PurchDate"].dt.year
veh_year = df["VehYear"]
print("\nCases where purchase year - vehicle year = vehicle age: " +
str(((purch_year - veh_year) == df["VehicleAge"]).sum())
)
```
The "Make" column records the car's brand, and has many categorical levels (high cardinality), though there are columns with many more.
- Some of these brands are sub-brands of others. For example, Scion is a sub-brand of Toyota, and Infiniti is a sub-brand of Nissan.
- We could recode these to reduce cardinality, but most of these sub-brands offer cars of higher segments compared to the main brands (mainly in the US), so keeping them separate could be more useful.
```{python Make}
# There are brands with very few observations.
print(
df["Make"].value_counts()
)
# Recode TOYOTA SCION into SCION
df.loc[df["Make"] == "TOYOTA SCION", "Make"] = "SCION"
```
The "Model" column records the car model. There are more than 1000 models, so if we want to use this column as a feature, we'll likely use target encoding. More on this in the feature engineering & preprocessing sections.
```{python Model}
# There are 1063 unique models, many with only 1 observation.
df["Model"].value_counts()
```
Trim refers to slightly different versions / packages of a car model. These values are meaningless, and possibly misleading, without the associated model value.
```{python Trim}
# 134 trims, many with only 1 observation. 2360 missing values.
df["Trim"].value_counts()
```
We also have submodels, which are likewise likely meaningless or misleading without the main model value. This column offers information such as the car's drivetrain, engine type or chassis type.
```{python SubModel}
# 863 submodels, many with only 1 observation. 8 missing values.
df["SubModel"].value_counts()
```
```{python ModelSubModel}
# 2784 unique model & submodel combinations, many with only 1 observation.
(df["Model"] + " " + df["SubModel"]).value_counts()
```
```{python Cars}
# 3000+ unique cars in dataset (some could be different namings / spellings of the same car)
(df["Model"] + " " + df["Trim"] + " " + df["SubModel"]).value_counts()
```
```{python Color}
# 8 missing values for color, recode them as NOT AVAIL
print(df["Color"].value_counts())
df.loc[pd.isna(df["Color"]), "Color"] = "NOT AVAIL"
```
Some of the missing values in our data can be manually worked out from the car model, or other information. Transmission is one such column.
```{python Transmission}
# 9 missing values for transmission. Try to work them out from car model. 1 occurrence of manual spelled differently.
print(df["Transmission"].value_counts())
# Replace "Manual" with MANUAL in transmission
df.loc[df["Transmission"] == "Manual", "Transmission"] = "MANUAL"
```
We'll print the model information for cars with missing Transmission values. We can then fill these in with manual search (or personal interest & knowledge in my case). We'll apply this method to several other columns with missing values.
```{python TransmissionNAs}
# Work out & impute transmission NAs from car model
print("Rows with Transmission values missing: ")
print(df.loc[pd.isna(df["Transmission"]), ["VehYear", "Make", "Model", "Trim", "SubModel"]])
transmission_nas = ["AUTO", "MANUAL", "MANUAL", "MANUAL", "AUTO", "MANUAL", "MANUAL", "AUTO", "AUTO"]
df.loc[pd.isna(df["Transmission"]), "Transmission"] = transmission_nas
```
There are two columns that record essentially the same information: WheelType and WheelTypeID. We'll keep only one, and recode the missing values simply as "Other".
```{python Wheel}
# 3169 missing values in WheelTypeID, 3174 in WheelType. Crosscheck these columns.
print(df["WheelTypeID"].value_counts())
print("\n")
print(df["WheelType"].value_counts())
```
```{python WheelNAs}
print("N. of WheelTypeID NAs that are also WheelType NAs: " +
str(pd.isnull(df.loc[pd.isnull(df["WheelTypeID"]), "WheelType"]).sum())
)
print("\nRemaining 5 rows with WheelType NAs are WheelTypeID = 0: \n" +
str(df.loc[df["WheelTypeID"] == "0", "WheelType"])
)
```
```{python WheelIDvsType}
print("Cases where WheelTypeID 1 = WheelType Alloy: " +
str((df.loc[df["WheelTypeID"] == "1", "WheelType"] == "Alloy").sum())
)
print("Cases where WheelTypeID 2 = WheelType Covers: " +
str((df.loc[df["WheelTypeID"] == "2", "WheelType"] == "Covers").sum())
)
print("Cases where WheelTypeID 3 = WheelType Special: " +
str((df.loc[df["WheelTypeID"] == "3", "WheelType"] == "Special").sum())
)
```
```{python DropWheelID}
# Recode WheelType NAs as Other, drop WheelTypeID column
df.loc[pd.isnull(df["WheelType"]), "WheelType"] = "Other"
df = df.drop("WheelTypeID", axis = 1)
```
```{python Nationality}
# 5 missing values in Nationality. Work these out from car make.
df["Nationality"].value_counts()
```
```{python NationalityNAs}
# Work out the 5 missing Nationality values from make
print("Makes of rows with missing Nationality values: ")
print(df.loc[pd.isnull(df["Nationality"]), "Make"])
nationality_nas = ["AMERICAN", "AMERICAN", "OTHER ASIAN", "AMERICAN", "AMERICAN"]
df.loc[pd.isnull(df["Nationality"]), "Nationality"] = nationality_nas
```
```{python Size}
# 5 missing values in size. Work these out from the model.
df["Size"].value_counts()
```
```{python SizeNAs}
# Work out the 5 missing Size values from make & model
print("Rows with Size values missing: ")
print(df.loc[pd.isnull(df["Size"]), ["VehYear", "Make", "Model"]])
print("\nSize values of other rows with the same models: ")
print(
df.loc[df["Model"].str.contains("SIERRA"), "Size"].iloc[0] + "\n" +
df.loc[df["Model"].str.contains("NITRO 4WD"), "Size"].iloc[0] + "\n" +
df.loc[df["Model"].str.contains("ELANTRA"), "Size"].iloc[0] + "\n" +
df.loc[df["Model"].str.contains("PATRIOT 2WD"), "Size"].iloc[0]
)
size_nas = ["LARGE TRUCK", "MEDIUM SUV", "MEDIUM", "SMALL SUV", "SMALL SUV"]
df.loc[pd.isnull(df["Size"]), "Size"] = size_nas
```
We have a column recording whether a car's make is GM, Chrysler, Ford, or another brand. I doubt we need this, as this information is already built into the Make column.
```{python Top3}
# Unnecessary column, information already incorporated in Make. Drop it
print(df["TopThreeAmericanName"].value_counts())
df = df.drop("TopThreeAmericanName", axis = 1)
```
There are several columns that record the MMR values of the cars. MMR stands for Manheim Market Report, a valuation offered by car auction company Manheim.
- The MMR columns titled "Acquisition" record the MMR valuations of the car at time of purchase. There are several different valuations available, such as auction & retail prices, each for average & clean condition cars respectively.
- The MMR columns titled "Current" are the car's current MMR valuation prices. For sake of realism, we'll assume we're performing this analysis before purchasing the cars, so we'll drop the Current MMR columns.
```{python MMR}
print("Missing values in MMR columns:")
print(pd.isnull(
df[['MMRAcquisitionAuctionAveragePrice', 'MMRAcquisitionAuctionCleanPrice',
'MMRAcquisitionRetailAveragePrice', 'MMRAcquisitonRetailCleanPrice',
'MMRCurrentAuctionAveragePrice', 'MMRCurrentAuctionCleanPrice',
'MMRCurrentRetailAveragePrice', 'MMRCurrentRetailCleanPrice']]
).sum()
)
# Drop current MMR prices to make the exercise more realistic
df = df.drop([
'MMRCurrentAuctionAveragePrice', 'MMRCurrentAuctionCleanPrice',
'MMRCurrentRetailAveragePrice', 'MMRCurrentRetailCleanPrice'], axis = 1)
```
```{python MMRZerosOnes}
# Some MMR values are zero
print("N. of rows with at least one MMR value of 0: " +
str(
len(
df.loc[
(df["MMRAcquisitionAuctionAveragePrice"] == 0) |
(df["MMRAcquisitionAuctionCleanPrice"] == 0) |
(df["MMRAcquisitionRetailAveragePrice"] == 0) |
(df["MMRAcquisitonRetailCleanPrice"] == 0)
]
)
)
)
# Some MMR values are one
print("N. of rows with at least one MMR value of 1: " +
str(
len(
df.loc[
(df["MMRAcquisitionAuctionAveragePrice"] == 1) |
(df["MMRAcquisitionAuctionCleanPrice"] == 1) |
(df["MMRAcquisitionRetailAveragePrice"] == 1) |
(df["MMRAcquisitonRetailCleanPrice"] == 1)
]
)
)
)
# Drop rows with NAs in MMR
df = df.dropna(subset = [
'MMRAcquisitionAuctionAveragePrice', 'MMRAcquisitionAuctionCleanPrice',
'MMRAcquisitionRetailAveragePrice', 'MMRAcquisitonRetailCleanPrice'])
# Drop rows with 0s in MMR
df = df.loc[
(df["MMRAcquisitionAuctionAveragePrice"] > 0) &
(df["MMRAcquisitionAuctionCleanPrice"] > 0) &
(df["MMRAcquisitionRetailAveragePrice"] > 0) &
(df["MMRAcquisitonRetailCleanPrice"] > 0)].copy()
```
The PRIMEUNIT and AUCGUART columns mostly consist of missing values. But for rows with known values, they could be important predictors.
- PRIMEUNIT records whether a car attracted unusually high demand. "YES" values are very rare even among non-missing rows. The missing values could simply mean "NO".
- AUCGUART records the level of inspection passed by a car before being auctioned. "GREEN" means fully inspected, "YELLOW" means partially inspected, "RED" means "you buy what you see". There are no "YELLOW" values in the data, which means the missing values could simply mean "YELLOW".
- For both columns, we'll recode missing values as "UNKNOWN". Later we'll have options to handle these based on the categorical encoding method we use.
```{python PRIMEUNIT}
# 95% missing column
print(df["PRIMEUNIT"].value_counts(normalize = True))
# Fill NAs in PRIMEUNIT with UNKNOWN.
df.loc[pd.isnull(df["PRIMEUNIT"]), "PRIMEUNIT"] = "UNKNOWN"
```
```{python AUCGUART}
# 95% missing column
print(df["AUCGUART"].value_counts(normalize = True))
# Fill NAs in AUCGUART with UNKNOWN.
df.loc[pd.isnull(df["AUCGUART"]), "AUCGUART"] = "UNKNOWN"
```
```{python BYRNO}
# BYRNO is buyer no. 74 unique buyers, some with only 1 observation.
df["BYRNO"].value_counts()
```
```{python VNZIP1}
# VNZIP1 is zipcode of purchase location, 153 locations, some with only 1 obs.
df["VNZIP1"].value_counts()
```
```{python VNST}
# VNST is purchase state.
df["VNST"].value_counts()
```
Vehicle purchase price is likely an important predictor, which is why we'll drop the few rows with missing purchase prices. There is one car with a purchase price of 1, which may mean it was a symbolic transaction. We'll keep this row for now.
```{python Price}
# VehBCost is purchase price. 68 missing values, drop these rows. One car has a purchase price of 1.
print(df["VehBCost"].describe())
df = df.dropna(subset = "VehBCost")
```
Warranty cost is paid for 36 months' worth of warranty, or until the car racks up 36k miles. Likely another important predictor, especially relative to the purchase price.
```{python Warranty}
# Warranty cost is for 36 months, or until 36k miles
df["WarrantyCost"].describe()
```
## Feature engineering
We can create some new features from our existing ones. We can also explicitly map some interaction terms, to make them easier to discover for some models.
```{python TimeFeatures}
# Time features from date: Purchase year, month, day of week
df["PurchaseYear"] = df["PurchDate"].dt.year
df["PurchaseMonth"] = df["PurchDate"].dt.month
df["PurchaseDay"] = df["PurchDate"].dt.weekday
df = df.drop("PurchDate", axis = 1)
```
The model and submodel names incorporate plenty of technical information on the cars. We'll create numerous binary columns that flag various traits of a car: engine type, drivetrain type, chassis type and number of doors. I also thought about retrieving engine displacement in liters as a numeric feature, but this information is not openly stated for many cars, and would introduce many missing values.
```{python EngineDrivetrain}
# Engine type features from Model: V6, V8, I4/I-4, 4C, 6C
df["EngineV6"] = df["Model"].str.contains("V6").astype(int)
df["EngineV8"] = df["Model"].str.contains("V8").astype(int)
df["EngineI4"] = df["Model"].str.contains("I4|I-4", regex = True).astype(int)
df["Engine4C"] = df["Model"].str.contains("4C").astype(int)
df["Engine6C"] = df["Model"].str.contains("6C").astype(int)
# Drivetrain type features from Model: 2WD, 4WD, AWD, FWD, RWD
df["2WD"] = df["Model"].str.contains("2WD").astype(int)
df["4WD"] = df["Model"].str.contains("4WD").astype(int)
df["AWD"] = df["Model"].str.contains("AWD").astype(int)
df["FWD"] = df["Model"].str.contains("FWD").astype(int)
df["RWD"] = df["Model"].str.contains("RWD").astype(int)
```
```{python Chassis}
# Chassis type features from SubModel: WAGON, SEDAN, COUPE, HATCHBACK, CONVERTIBLE
# Work out and recode these features manually for rows where SubModel is missing
print("Rows where SubModel is missing:")
print(df.loc[pd.isnull(df["SubModel"]), "Model"])
df["ChassisWagon"] = df["SubModel"].str.contains("WAGON")
df.loc[pd.isnull(df["ChassisWagon"]), "ChassisWagon"] = [
False, False, False, False, False, False, False, True]
df["ChassisWagon"] = df["ChassisWagon"].astype(int)
df["ChassisSedan"] = df["SubModel"].str.contains("SEDAN")
df.loc[pd.isnull(df["ChassisSedan"]), "ChassisSedan"] = [
True, True, False, True, True, True, False, False]
df["ChassisSedan"] = df["ChassisSedan"].astype(int)
df["ChassisCoupe"] = df["SubModel"].str.contains("COUPE")
df.loc[pd.isnull(df["ChassisCoupe"]), "ChassisCoupe"] = False
df["ChassisCoupe"] = df["ChassisCoupe"].astype(int)
df["ChassisHatch"] = df["SubModel"].str.contains("HATCHBACK")
df.loc[pd.isnull(df["ChassisHatch"]), "ChassisHatch"] = False
df["ChassisHatch"] = df["ChassisHatch"].astype(int)
df["ChassisConvertible"] = df["SubModel"].str.contains("CONVERTIBLE")
df.loc[pd.isnull(df["ChassisConvertible"]), "ChassisConvertible"] = False
df["ChassisConvertible"] = df["ChassisConvertible"].astype(int)
```
```{python Doors}
# Door type feature from SubModel: 4D
df["FourDoors"] = df["SubModel"].str.contains("4D")
# Work out and recode this feature manually for rows where SubModel is missing (displayed in previous cell)
df.loc[pd.isnull(df["FourDoors"]), "FourDoors"] = [
True, True, False, True, True, True, False, False]
df["FourDoors"] = df["FourDoors"].astype(int)
```
I decided to combine the Model & SubModel columns to have one "car" feature. This column will have very high cardinality, but we'll use target encoding, and models with this feature performed slightly better than models without it. I did not include Trim in this feature, though it could be worth a try.
```{python CombineModelSubModel}
# Recode SubModel NAs into empty string
df.loc[pd.isnull(df["SubModel"]), "SubModel"] = ""
# Combine Model & SubModel as one feature
df["ModelSubModel"] = df["Model"] + " " + df["SubModel"]
# Drop trim, submodel
df = df.drop(["Trim", "SubModel"], axis = 1)
```
Some obvious interaction features we can add are miles traveled per year, the premium / discount paid on the MMR prices, and the warranty cost to purchase price ratio.
```{python MilesPerYear}
# Miles per year feature
df["MilesPerYear"] = df["VehOdo"] / df["VehicleAge"]
df.loc[df["MilesPerYear"] == np.inf, "MilesPerYear"] = df["VehOdo"] # Replace inf values raised when vehicle age = 0
```
```{python Premium}
# Premiums / discounts paid on MMR prices: VehBCost - MMR price / MMR price
df["PremiumAuctionAvg"] = (df["VehBCost"] - df["MMRAcquisitionAuctionAveragePrice"]) / df["MMRAcquisitionAuctionAveragePrice"]
df["PremiumAuctionClean"] = (df["VehBCost"] - df["MMRAcquisitionAuctionCleanPrice"]) / df["MMRAcquisitionAuctionCleanPrice"]
df["PremiumRetailAvg"] = (df["VehBCost"] - df["MMRAcquisitionRetailAveragePrice"]) / df["MMRAcquisitionRetailAveragePrice"]
df["PremiumRetailClean"] = (df["VehBCost"] - df["MMRAcquisitonRetailCleanPrice"]) / df["MMRAcquisitonRetailCleanPrice"]
```
The row with a purchase price of 1 results in an astronomically high value for the warranty cost / purchase price ratio, so we drop it after all. We have plenty of data, and it's likely better to keep it accurate.
```{python WarrantyRatio}
# Warranty ratio to purchase price
df["WarrantyRatio"] = df["WarrantyCost"] / df["VehBCost"]
# The observation with purchase price = 1 skews the WarrantyRatio feature greatly. Drop it.
print(df[["VehBCost", "WarrantyRatio"]].sort_values(by = "WarrantyRatio", ascending = False).iloc[0])
df = df.loc[df["VehBCost"] != 1].copy()
```
Most machine learning algorithms require all features to be numeric. We'll use one-hot encoding for most of our categorical columns.
- This can be done before splitting training & testing data, as it won't leak information from the testing set. However, a fitted one-hot encoder can't handle future prediction sets that contain categorical levels unseen in our training data.
- Adding too many columns may cause our models to suffer from the curse of dimensionality, which affects different algorithms to different extents. We'll one-hot encode only the columns with relatively fewer levels, and leave the high-cardinality columns to target encoding.
- A categorical column with N levels can be encoded with N-1 binary columns, as the Nth value is reflected by all columns taking the value of 0. We'll take advantage of this to keep our column count low (see the small illustration after this list).
- We'll also create binary columns only for known values of PRIMEUNIT and AUCGUART, effectively encoding the missing values as 0 for all binary columns.
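As a quick illustration of the N-1 scheme, here is a toy sketch using plain pandas; `pd.get_dummies` stands in for the feature_engine encoder we actually use below (it drops the first level, while feature_engine's `drop_last` drops the last, but the idea is the same):

```python
# Toy sketch: N-1 one-hot encoding with plain pandas
import pandas as pd

toy = pd.DataFrame({"Color": ["RED", "BLUE", "GREEN", "RED"]})

# 3 levels -> 2 binary columns; the dropped level is implied when all columns are 0
print(pd.get_dummies(toy["Color"], drop_first = True))
```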
```{python OneHotEncode}
# One hot encode some categoricals in-place
encoder_onehot = OneHotEncoder(
drop_last = True, # Create N-1 binary columns to encode N categorical levels
drop_last_binary = True,
variables = ['Auction', 'VehYear', 'Color', 'Transmission', 'WheelType',
'Nationality', 'Size', 'PurchaseYear'],
ignore_format = True)
df = encoder_onehot.fit_transform(df)
# One hot encode PRIMEUNIT and AUCGUART only using known values (effectively making "unknown" the default / 0 case)
df["PRIMEUNIT_YES"] = (df["PRIMEUNIT"] == "YES").astype(int)
df["PRIMEUNIT_NO"] = (df["PRIMEUNIT"] == "NO").astype(int)
df["AUCGUART_GREEN"] = (df["AUCGUART"] == "GREEN").astype(int)
df["AUCGUART_RED"] = (df["AUCGUART"] == "RED").astype(int)
df = df.drop(["PRIMEUNIT", "AUCGUART"], axis = 1)
```
Month and day of week are cyclical features: The "distance" between month 1 and month 12 is not eleven, it is one. Simply label encoding the months as 1-12, or weekdays as 1-7, would mislead our models.
- Instead, we'll use cyclical encoding, which encodes each feature using a sine and cosine pair, conveying their cyclical nature to our algorithms (a toy sketch of the math follows this list).
- Another option is one-hot encoding, but it creates considerably more columns than cyclical encoding. It could also be advantageous to avoid "splitting" related information into many columns, to ensure they are evaluated together.
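For intuition, the sine/cosine pair can be computed directly as below. This is roughly what `CyclicalFeatures` does, using (by default, to my understanding) each variable's maximum value as the period:

```python
# Toy sketch of cyclical encoding: values are mapped onto a circle,
# so month 12 ends up adjacent to month 1
import numpy as np
import pandas as pd

months = pd.Series(range(1, 13))
period = months.max()
month_sin = np.sin(2 * np.pi * months / period)
month_cos = np.cos(2 * np.pi * months / period)

# Months 1 and 12 are close in (sin, cos) space, unlike in raw label encoding
print(pd.DataFrame({"month": months, "sin": month_sin, "cos": month_cos}).round(3))
```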
```{python CyclicalEncode}
# Cyclical encode month and day of week features
encoder_cyclical = CyclicalFeatures(
variables = ["PurchaseMonth", "PurchaseDay"], drop_original = True)
df = encoder_cyclical.fit_transform(df)
```
The remaining, high-cardinality categorical features will be target encoded, which won't change the number of columns.
```{python ViewFeatures}
# View final feature set before preprocessing
print(df.head())
```
## Preprocessing pipeline
The rest of our preprocessing will take place in a scikit-learn pipeline, to ensure we avoid any target leakage, and to keep our modeling workflow clean.
- First, we'll set aside 20% of our data for testing. We'll ensure the split is performed with stratification: We want to maintain the target class balance as much as possible.
- We'll use the remaining 80% for hyperparameter tuning and validation. We'll test the final models with the best hyperparameters on the testing data.
```{python TrainTestSplit}
# Split target and features
y = df["IsBadBuy"].astype(int)
x = df.drop("IsBadBuy", axis = 1)
# Perform train-test split
x_train, x_test, y_train, y_test = train_test_split(
x, y, test_size = 0.2, random_state = 1923, stratify = y
)
```
We will use several target encoders to encode the remaining, high-cardinality categorical features.
- Target encoders derive a numeric value based on the target values for each categorical level. This brings the obvious risk of leaking the target values from the testing data if not applied properly.
- Another potential issue is overfitting: If a categorical level has very few observations, the value derived from the target will be unreliable.
- The target encoder applies a form of regularization to alleviate this: The encoded values for infrequent categorical levels are blended with the values computed over all observations. The degree of this regularization can be controlled by the parameters `min_samples_leaf` and `smoothing` in [target_encoder](https://contrib.scikit-learn.org/category_encoders/targetencoder.html). We'll use the default settings (a toy sketch of this blending follows the list).
- The target encoder can also make use of a hierarchy of columns: Categorical levels of one column may be subsets of levels of another column. We have two such hierarchies in our data: The make-model-submodel hierarchy, and the states-zipcodes hierarchy.
- We'll map the target encoders for these columns with their respective higher hierarchies, so regularization can be applied by blending in the higher hierarchy values instead of values for all observations.
- Aside from these risks and considerations, target encoding brings several benefits:
- It does not create any new columns, and it does not split the information from one feature into several columns.
- It is also able to handle any missing values, or previously unseen categorical levels in prediction data, by applying the encoded values for all observations, or the values for the hierarchy, as done in regularization.
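To make the blending idea concrete, here is a minimal sketch of smoothed target encoding for a single column. The exact weighting in `category_encoders` differs (to my understanding, it uses a sigmoid of the level count, shaped by `min_samples_leaf` and `smoothing`), but the principle is the same:

```python
# Toy sketch: blend each level's target mean with the global mean,
# weighting by how many observations the level has
import pandas as pd

def smoothed_target_encode(cat, target, smoothing = 10.0):
    prior = target.mean()  # Global positive rate
    stats = target.groupby(cat).agg(["mean", "count"])
    weight = stats["count"] / (stats["count"] + smoothing)  # Rare levels get low weight
    encoding = weight * stats["mean"] + (1 - weight) * prior
    return cat.map(encoding)

# A level with few observations is pulled strongly towards the global mean
cat = pd.Series(["A", "A", "A", "B"])
target = pd.Series([1, 0, 0, 1])
print(smoothed_target_encode(cat, target))
```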
```{python TargetEncoders}
# Target encoder: ModelSubModel encoded with Model, Make hierarchy
hier_submodels = pd.DataFrame(x[["Make", "Model"]]).rename({
"Make": "HIER_ModelSubModel_1", # Make as first (top) level of hierarchy
"Model": "HIER_ModelSubModel_2"}, # Model as second level of hierarchy
axis = 1)
encode_target_submodel = TargetEncoder(
cols = ["ModelSubModel"], # Column to encode (bottom level of hierarchy)
hierarchy = hier_submodels)
# Target encoder: Model encoded with Make hierarchy
hier_models = pd.DataFrame(x["Make"]).rename({"Make": "HIER_Model_1"}, axis = 1)
encode_target_model = TargetEncoder(cols = ["Model"], hierarchy = hier_models)
# Target encoder: Zipcodes encoded with states hierarchy
hier_states = pd.DataFrame(x["VNST"]).rename({"VNST": "HIER_VNZIP1_1"}, axis = 1)
encode_target_zip = TargetEncoder(cols = ["VNZIP1"], hierarchy = hier_states)
# Target encoder: Make, buyer, state encoded without hierarchy (apply last):
encode_target = TargetEncoder(cols = ["Make", "BYRNO", "VNST"])
```
Many machine learning algorithms can be sensitive to the value scales of features.
- Those with very large absolute values may dominate training, and prevent smaller-scale values from having much effect.
- All our models except the tree-based XGBoost algorithm can be sensitive to scale. Neural networks especially can suffer from various numerical stability issues in training.
- We'll scale all features between 0 and 1 as our final preprocessing step.
For hyperparameter validation, we will use 3-fold crossvalidation:
- The 80% training data will be split into 3 equal folds. Each hyperparameter configuration we evaluate will be trained on two folds, and validated on the third, for a total of three different combinations. Stratification will be applied just like in the train-test split.
- We could have more robust validation scores with more folds, but the neural network tuning can be computationally expensive, and I wanted to compare each model on "equal terms".
- An even more reliable tuning & testing scheme is nested crossvalidation: Splitting the full data into N training-testing folds (outer resampling), and splitting each training fold into K folds for parameter tuning (inner resampling). This is likely not necessary with such a big dataset.
- Since we'll build a custom model validation function for each model, we'll retrieve the train-test indices from the 3-fold CV splitter, for a total of three train-test index pairs.
```{python Pipeline}
# Scaling method
scaler_minmax = MinMaxScaler()
# Full preprocessing pipeline
pipe_process = Pipeline(steps = [
("target_encoder_zipcode", encode_target_zip),
("target_encoder_submodel", encode_target_submodel),
("target_encoder_model", encode_target_model),
("target_encoder", encode_target),
("minmax_scaler", scaler_minmax)
])
# Inner validation method for hyperparameter tuning
cv_kfold = StratifiedKFold(n_splits = 3)
# Get training / inner validation split indices (3 pairs) from k-fold splitter
cv_indices = list(cv_kfold.split(x_train, y_train))
```
## Modeling & hyperparameter optimization
The hyperparameter optimization will follow 3 main steps for every model:
- A **validation function** will be defined, to perform 3-fold crossvalidation with a given hyperparameter set.
- An **Optuna objective function** will be defined, which will suggest hyperparameter values from a predefined range, validate the parameter set (with our validation function), and report the average validation score of the parameter set.
- One iteration of the objective function for one set of hyperparameters is called an **Optuna trial.**
- An **Optuna study** will be created and performed. The study performs a given number of trials and records their results.
- The study uses a **sampler** algorithm to intelligently suggest parameter values for trials, and a **pruner** algorithm to stop unpromising trials early on and save time.
Since the workflow is very similar for all models, we'll go over it in full only for the first model, logistic regression. For the other models, only the differences will be explained. Of course, the actual tuning process takes hours, so I'll just display the code I used without running it, and load previously saved results.
### Logistic regression with SGD
Logistic regression predicts the log-odds of an event occurring (also called logits) with a linear model. It's a simple, robust method that can perform very well compared to more advanced models.
- The main assumption of logistic regression is a linear relationship between the predictors and the logits. The less linear the relationships are, the less accurate logistic regression will be.
- Logistic regression won't find and model interaction effects unless they are explicitly mapped as new features (we did this for some features).
- Plain linear regression models can suffer from multicollinearity and the curse of dimensionality in the feature set, but **regularization** is almost always applied to the model coefficients to avoid overfitting.
- **L1 (lasso)** regularization can potentially shrink model coefficients to zero, effectively dropping predictors from the model and performing feature selection.
- **L2 (ridge)** regularization can't shrink coefficients to zero, but just reduces their contribution to the model.
- **Elastic net** regularization is a blend of L1 and L2 regularization, controlled by a ratio parameter. This is what we will use; its form is sketched just below.
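Concretely, the penalty added to the training loss takes roughly the following form in scikit-learn's SGD implementation, with $\alpha$ the regularization strength and $\lambda$ the L1 ratio:

$$\text{penalty} = \alpha \left( \lambda \, \lVert w \rVert_1 + \frac{1 - \lambda}{2} \, \lVert w \rVert_2^2 \right)$$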
Logistic regression (implemented with [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) in scikit-learn) is normally fit on all training observations at once: The model coefficients are all estimated in one go using a solver algorithm. This calculation can be a bit slow with a large dataset like ours, especially since we'll fit hundreds of models for hyperparameter tuning. To save time, we'll train our logistic regression (and SVM) models with **stochastic gradient descent:**
- In SGD, the model is trained only on one observation in **one training step**. The **training error (loss)** is calculated and the model parameters are updated at every training step. When the model sees each training observation once, we complete a **training epoch.**
- We perform multiple epochs of training. Training is usually **early stopped** when the validation score stops improving, typically assessed after each epoch.
- This approach is similar to how neural networks are trained. A key difference is that NNs use **minibatch** SGD: A training step consists of a batch of observations, which can range from 16-32 to 1024-2048 observations (usually a power of 2). This is faster than single-observation SGD. It's possible to implement this manually in scikit-learn (a sketch follows this list), but it likely won't bring much improvement for logistic regression or SVM.
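For illustration, a manual minibatch loop could look roughly like the sketch below. This is a standalone toy example on synthetic data, not part of our pipeline; our actual validation function uses full-epoch `partial_fit` calls instead:

```python
# Toy sketch: minibatch SGD with scikit-learn's partial_fit
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(1923)
x_toy = rng.normal(size = (1000, 5))
y_toy = (x_toy[:, 0] + rng.normal(size = 1000) > 0).astype(int)

model = SGDClassifier(loss = "log_loss", random_state = 1923)
batch_size = 256
for epoch in range(5):
    order = rng.permutation(len(x_toy))  # Reshuffle observations each epoch
    for start in range(0, len(x_toy), batch_size):
        batch = order[start:start + batch_size]
        # Each partial_fit call is one training step on one minibatch
        model.partial_fit(x_toy[batch], y_toy[batch], classes = [0, 1])
```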
We can train linear classification models with SGD in scikit-learn using [SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html). Below, we will define a validation function that takes a set of hyperparameters, builds a logistic regression model with SGDClassifier and validates the performance of the parameter set.
- In one instance of the function, a model will be created, trained & validated for each CV fold, for a total of three times.
- The model will be trained with the `.partial_fit()` method, which trains the SGDClassifier for only one epoch. We'll call the method repeatedly to train the model for several epochs. Our function will handle early stopping manually by recording each epoch's score. By default, 10 consecutive epochs with less than 0.0001 improvement in the validation error will terminate the model training.
- **Class weights** will be applied to both training and validation error calculations: The errors for the minority class will be penalized more heavily, forcing our models to pay more attention to them.
- The class weight for each class will be inversely proportional to its class frequency, calculated automatically by a scikit-learn utility (only using the training data to avoid leakage). Using these class weights, we will create a sample weights vector for every training & validation set observation, and pass these to the model training and scoring functions respectively. A quick sketch of this computation follows the list.
- The function will return the average CV score of the parameter set, and the number of training epochs that achieved it. It will also perform pruning if directed by the Optuna pruner:
- Unpromising trials (determined according to the validation scores of the first epochs) will be stopped early by the Optuna pruner. To allow pruning, the score of each epoch for the first CV fold will be reported to the Optuna study.
- The best number of epochs is not a hyperparameter to be suggested & tuned by Optuna, but we will observe the value it takes during a trial, and report it back.
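For reference, the "balanced" heuristic of `compute_class_weight` boils down to weights inversely proportional to class counts, roughly as sketched here:

```python
# Toy sketch of "balanced" class weights: inversely proportional to class frequency
import numpy as np

y_toy = np.array([0] * 88 + [1] * 12)           # ~12% positive class
classes, counts = np.unique(y_toy, return_counts = True)
weights = len(y_toy) / (len(classes) * counts)  # n_samples / (n_classes * class count)
print(dict(zip(classes, weights)))              # Minority class gets the larger weight

# Per-sample weight vector, as passed to .partial_fit() and the scoring function
sample_weight = np.where(y_toy == 1, weights[1], weights[0])
```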
The SGDClassifier is a general method for training a linear model with SGD. Its parameters determine what type of linear model is trained:
- The `loss` parameter sets the training loss function. For probabilistic logistic regression, we will use **log loss**, which compares the probability predictions (between 0-1) to the actual target labels (0 or 1); its formula is given after this list. We will also use log loss to score our predictions on the validation set.
- The `penalty` parameter sets the regularization type. We'll use elastic net regularization.
- `learning_rate` is another key parameter in SGD training (especially for neural networks). It determines the magnitude of the coefficient updates in each training step.
- With a higher LR, the coefficients will be updated more aggressively in each training step. The models will train faster, may be more likely to find optimal solutions, but also more likely to overfit and suffer from erratic model updates.
- With a lower LR, the coefficients will be updated more conservatively in each training step. The models will train slower, may be less likely to reach optimal solutions, but will also be more robust against overfitting.
- In neural networks, a **learning rate scheduler** is often used to dynamically adjust the learning rate. We'll explore this in more detail for our NN model, but for SGDClassifier, we'll use a default heuristic based on the regularization strength.
- We have two hyperparameters to tune for the logistic regression model: `alpha` and `l1_ratio`.
- Alpha is the regularization strength. A higher value means the model coefficients will be penalized more heavily to combat overfitting. Generally, we'd need a higher alpha value with more redundancy in the feature set. In practice, the optimal value is usually a small fraction, and too high regularization leads to the opposite problem of underfitting.
- L1 ratio will determine what fraction of our elastic net regularization will be L1 regularization.
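For reference, the sample-weighted log loss takes the following form, where $p_i$ is the predicted probability of the positive class and $w_i$ is the sample weight; this should match what scikit-learn's `log_loss` computes when passed sample weights:

$$\mathcal{L} = -\frac{1}{\sum_i w_i} \sum_i w_i \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$$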
```{python ValLogistic}
#| code-fold: true
#| code-summary: "Show validation function definition"
#| eval: false
# Define model validation function for logistic regression trained with SGD
def validate_logistic(
    alpha, # Regularization strength hyperparameter
    l1_ratio, # Ratio of L1 regularization hyperparameter
    trial, # Pass the Optuna trial of the parameter set
    tol = 1e-4, # Min. required change in validation score to count as improvement
    verbose = 0 # Set to 1 to print model training progress
):
    # Record the validation scores of the parameter set on each CV fold
    cv_scores = []

    # Record best epoch number for each CV fold
    best_epochs = []

    # For each CV fold,
    for i, (train_index, val_index) in enumerate(cv_indices):

        # Split training-validation data
        x_tr = x_train.iloc[train_index, ]
        y_tr = y_train.iloc[train_index, ]
        x_val = x_train.iloc[val_index, ]
        y_val = y_train.iloc[val_index, ]

        # Perform preprocessing
        x_tr = pipe_process.fit_transform(x_tr, y_tr)
        x_val = pipe_process.transform(x_val)

        # Compute class weight & sample weight vectors
        classes = list(set(y_tr))
        class_weight = compute_class_weight("balanced", classes = classes, y = y_tr)
        sample_weight = np.where(y_tr == 1, class_weight[1], class_weight[0])
        sample_weight_val = np.where(y_val == 1, class_weight[1], class_weight[0])

        # Define Logistic Regression classifier with SGD
        model_logistic = SGDClassifier(
            loss = "log_loss", # Log loss metric for probabilistic logistic regression
            penalty = "elasticnet", # Elastic net regularization: Blend of L1 and L2
            learning_rate = "optimal", # Dynamically adjusted learning rate, based on a heuristic that uses regularization strength
            random_state = 1923, # Set random seed for replicable results
            verbose = verbose, # Whether to print training progress or not
            n_jobs = -1, # Use all CPU cores to parallelize
            alpha = alpha, # Hyperparameters to validate
            l1_ratio = l1_ratio
        )

        # Perform epoch by epoch training with early stopping & pruning
        epoch_scores = [] # Record val. score of each epoch
        n_iter_no_change = 0 # Record epochs with no improvement

        for epoch in range(1000):

            # Train model for 1 epoch
            model_logistic.partial_fit(
                x_tr, y_tr, classes = classes, sample_weight = sample_weight)

            # Predict validation set & score predictions
            y_pred = model_logistic.predict_proba(x_val)
            epoch_score = log_loss(y_val, y_pred, sample_weight = sample_weight_val)

            # For first CV fold, report intermediate score of Optuna trial
            if i == 0:
                trial.report(epoch_score, epoch)

                # Prune trial if necessary
                if trial.should_prune():
                    raise optuna.TrialPruned()

            # Count epochs with no improvement after first 10 epochs
            if (epoch > 9) and (epoch_score > min(epoch_scores) - tol):
                n_iter_no_change += 1

            # Reset epochs with no improvement if an improvement took place
            if (epoch > 9) and (epoch_score <= min(epoch_scores) - tol):
                n_iter_no_change = 0

            # Append epoch score to list of epoch scores
            epoch_scores.append(epoch_score)

            # Early stop training if necessary
            if n_iter_no_change >= 10:
                print("\nEarly stopping at epoch " + str(epoch) + "\n")
                break

        # Append best epoch score to list of CV scores
        cv_scores.append(min(epoch_scores))

        # Append best epoch number to list of best epochs
        best_epochs.append(epoch_scores.index(min(epoch_scores)) + 1)

    # Return the average CV score & median best epochs
    return np.mean(cv_scores), np.median(best_epochs)
```
Next, we define the Optuna objective function for logistic regression. Here, we decide the search space for each hyperparameter we want to tune.
- We'll tune `alpha` between a very small float value of 5e-5 and a relatively large value of 0.5, though the optimal value is likely close to 0. Therefore, we will instruct Optuna to sample from the log domain:
- The logarithm of the search space will be taken, values will be sampled uniformly from this space, and the exponent will be taken to return the original value. This makes lower values more likely to be suggested (a quick demonstration follows this list). We'll also do this for other hyperparameters in other models that usually take an optimal value closer to 0.
- `l1_ratio` is constrained between 0 and 1, so we will use that as our search space. A value of 0 will result in pure L2 regularization, and a value of 1 will result in pure L1 regularization. Anything in between is elastic net regularization, a blend of the two.
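To see the effect, here is a standalone sketch comparing uniform and log-uniform sampling over our `alpha` range:

```python
# Toy sketch: uniform vs. log-uniform sampling over [5e-5, 0.5]
import numpy as np

rng = np.random.default_rng(1923)
low, high = 5e-5, 0.5

uniform = rng.uniform(low, high, size = 100_000)
log_uniform = np.exp(rng.uniform(np.log(low), np.log(high), size = 100_000))

# Log-uniform concentrates samples near the low end of the range
print("P(x < 0.005), uniform:    ", np.mean(uniform < 0.005))
print("P(x < 0.005), log-uniform:", np.mean(log_uniform < 0.005))
```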
```{python ObjLogistic}
#| eval: false
# Define objective function for hyperparameter tuning with Optuna
def objective_logistic(trial):

    # Define hyperparameter ranges to tune over, and suggest a value for
    # each hyperparameter
    alpha = trial.suggest_float(
        "reg_strength", # Parameter name in trials result table
        5e-5, # Minimum value to try
        0.5, # Maximum value to try
        log = True # Sampling from the log domain makes lower values more likely
    )
    l1_ratio = trial.suggest_float("l1_ratio", 0, 1)

    # Validate the parameter set
    score, epoch = validate_logistic(
        alpha = alpha, l1_ratio = l1_ratio, trial = trial)

    # Report best n. of epochs to Optuna
    trial.set_user_attr("n_epochs", epoch)

    return score
```
Now we create the Optuna study. This object will manage the tuning process and store the information from trials.
- We set a sampler algorithm: The sampler suggests hyperparameter values for trials intelligently, often based on the results of previous trials.
- In our case, we use the suggested default, the **Tree-structured Parzen Estimator** algorithm (see the [documentation](https://optuna.readthedocs.io/en/stable/reference/samplers/generated/optuna.samplers.TPESampler.html) for a list of papers related to TPE). In short, the TPE algorithm fits Gaussian Mixture Models to estimate the optimal parameters for each trial.
- We set a pruner algorithm: The pruner decides whether to complete a trial, or prune it based on its intermediate validation scores. This saves us a great deal of time by stopping unpromising trials quickly.
- We use the **Hyperband** pruner algorithm, which is suggested for the TPE sampler. In short, hyperband allocates a finite budget equally to each hyperparameter configuration to be compared. There is the option to increase the total budget, at the expense of eliminating candidate configurations more quickly (see the [documentation](https://optuna.readthedocs.io/en/stable/reference/generated/optuna.pruners.HyperbandPruner.html) and [original paper](https://www.jmlr.org/papers/volume18/16-558/16-558.pdf) for more detail).
- We name our study, and ensure we set the correct objective: Since a lower log loss value is better, we aim to minimize the reported metric.
```{python StudyLogistic}
#| eval: false
# Create Optuna study
study_logistic = optuna.create_study(
sampler = optuna.samplers.TPESampler(seed = 1923), # Sampling algorithm
pruner = optuna.pruners.HyperbandPruner(), # Pruning algorithm
study_name = "tune_logistic",
direction = "minimize" # Objective is to minimize the reported metric
)
```
We can finally run the hyperparameter optimization. It is recommended to run the TPE algorithm with 100 to 1000 trials. The models in the final edition of this report were tuned with 500 or 1000 trials.
- With pruning in place, even if improvement stops after a few trials, the remaining trials will mostly be pruned.
- For logistic regression, one study of 500 trials usually took around 10-15 minutes. The SGDClassifier was parallelized with `n_jobs = -1` and 12 CPU cores available.
```{python OptimizeLogistic}
#| eval: false
# Perform Optuna study