---
title: "Performance of Predictive Models - The Interpretability and Explainability"
subtitle: "Authors: Leona Hasani, Leona Hoxha, Nanmanat Disayakamonpan, Nastaran Mesgari"
format:
html:
standalone: true
embed-resources: true
code-fold: false
number-sections: true
toc: true
highlight-style: github
abstract: "Our project explores three diverse datasets sourced from Kaggle across different industries: health, environment, and business sectors, namely the Cardiovascular Dataset, Weather in Australia, and Hotel Reservation, respectively. Our primary focus is on evaluating the performance of supervised learning algorithms in predicting binary target variables. Key questions guiding the project include assessing the impact of sophisticated modeling methods, model transferability across datasets, and the effects of standardization techniques. Addressing issues such as imbalanced datasets and feature selection, the project delves into identifying optimal hyperparameters and mitigating overfitting issues. Through preprocessing, exploratory data analysis, and modeling phases, the project aims to provide insights into algorithm performance, generalization, and the trade-offs involved. Our results are presented through performance metrics tables and learning curve analyses, shedding light on algorithm behaviors and guiding future model selection."
---
# Project Overview
## Introduction
Our project covers the analysis of three distinct datasets sourced from Kaggle, each representing a different industry: **the Cardiovascular Dataset from the health sector, Weather in Australia from the environmental domain, and Hotel Reservation from the business field.**
The primary objective of our project is ***to evaluate the performance of various supervised learning algorithms in predicting binary target variables.*** In the following section, we outline the key questions guiding our project, which we will address throughout our analysis and present in our results and key findings.
Furthermore, we aim to explore how the most effective supervised machine learning algorithm adapts within a given dataset. To accomplish this, we will utilize learning curves, which provide insights into the algorithm's performance as it processes more training data. Additionally, significant attention will be devoted to hyperparameter tuning to optimize model performance. By adjusting these parameters, we aim to identify and mitigate any potential overfitting issues within the datasets. This analysis will involve visualizations showcasing the training and testing performance metrics across various hyperparameter settings.
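To make the learning-curve idea concrete, the non-evaluated sketch below shows one way such a curve can be drawn with scikit-learn's `LearningCurveDisplay`; here `X` and `y` are placeholders for a preprocessed feature matrix and binary target rather than any specific dataset from this project.
```{python}
#| label: illustrative learning curve sketch
#| eval: false
# Sketch only (not evaluated in this report): training and validation accuracy as a
# classifier sees increasing fractions of the training data; X and y are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LearningCurveDisplay, ShuffleSplit

LearningCurveDisplay.from_estimator(
    RandomForestClassifier(random_state=42),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # 10% up to 100% of the training data
    cv=ShuffleSplit(n_splits=5, test_size=0.2, random_state=42),
    score_type="both",                     # plot training and validation scores together
    scoring="accuracy",
)
```
A large, persistent gap between the two curves as the training size grows is one of the overfitting signals we look for later in the report.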
The primary goal of this project is ***to enhance our understanding of supervised predictive models, with particular emphasis on overfitting.*** Overfitting, being a complex concept, can often lead to misconceptions. By delving into this topic, we aim to clarify its nuances and implications within the context of machine learning models. Through thorough examination and visualization of performance metrics, we aim to shed light on the factors contributing to overfitting and strategies for mitigating its effects.
## Questions and Problems
In this project, we tackle critical challenges aimed at enhancing both the performance and interpretability of our model. These questions are prioritized based on their significance and relevance:
<span style="color: slategrey">**1.** *Can the implementation of more sophisticated modeling methods within our dataset lead to enhanced model performance, and how can we interpret such improvements?* </span>
<span style="color: slategrey">**2.** *Does it mean that if one model performs the best in one particular dataset, it would be the same for another dataset with the same method?* </span>
<span style="color: slategrey">**3.** *What is the impact of standardization and normalization techniques on the performance scores of our models?*</span>
<span style="color: slategrey">**4.** *Do we have any imbalanced dataset? If yes, what approach could we use to balance the data?*</span>
<span style="color: slategrey">**5.** *How can we analyze the trade-off dynamics between including all available features and employing feature selection techniques?*</span>
<span style="color: slategrey">**6.** *What approach can be employed to identify the optimal hyperparameters of specific models?*</span>
<span style="color: slategrey">**7.** *Is there a risk of overfitting within our datasets, and what measures can be taken to assess and mitigate this risk effectively?* </span>
Following the preprocessing, exploratory data analysis, and modeling phases, our results and conclusions will address each of these research questions comprehensively.
```{python}
#| label: importing the libraries, packages-data
#| echo: false
#| message: false
#| include: false
# Importing all libraries and packages needed throughout this report
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots
import time
import warnings
warnings.filterwarnings('ignore')
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report, roc_curve, confusion_matrix
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV, LearningCurveDisplay, ShuffleSplit
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.utils import resample
from itertools import cycle
from scipy.stats import randint
import math
```
```{python}
#| label: Loading the datasets
#| echo: false
#| message: false
#| include: false
weather = pd.read_csv('Datasets/weatherAUS.csv', sep=",", header=0, index_col=False)
cardio = pd.read_csv('Datasets/CVD_cleaned.csv', sep=",", header=0, index_col=False)
hotel = pd.read_csv('Datasets/Hotel Reservations.csv', sep=",", header=0, index_col=False)
```
# Datasets Overview
This section provides a brief overview of each dataset's structure and the number of features it contains. All three datasets used in this project were obtained from the Kaggle website *(for further information, please refer to the appendix, sections 2.1, 3.1, and 4.1).*
## *Business Sector: Hotel Reservation Dataset*
```{python}
#| label: hotel head description
#| echo: false
#| message: true
#| include: false
hotel.head(5)
```
The Hotel Reservation Dataset spans from July 2017 to December 2018, comprising 36,275 observations, each representing a unique booking. It encompasses 19 attributes offering insights into booking patterns, guest preferences, and hotel operations. The binary target variable indicates whether a given reservation is eventually canceled, which is what our models aim to predict.
## *Environmental Sector: Weather in Australia Dataset*
```{python}
#| label: weather data head five
#| echo: false
#| message: true
#| include: false
weather.head(5)
```
The Weather in Australia Dataset contains 145,460 daily weather observations and 19 variables related to weather conditions, 14 of which are numerical features and the remainder categorical or date types. The binary target variable, *RainTomorrow*, indicates whether rain occurs the following day and is predicted from the other meteorological features.
## *Health Sector: Cardiovascular Dataset*
```{python}
#| label: cardio head description
#| echo: false
#| message: true
#| include: false
cardio.head(5)
```
The Cardiovascular Dataset focuses on analyzing healthcare data to predict the presence of heart disease. It comprises 308,854 observations and 19 features, encompassing lifestyle factors, personal details, habits, and disease indicators. Among these features, 12 are categorical and 7 are numerical. The binary target variable indicates whether a patient suffers from cardiovascular disease, which is what our models aim to predict.
# Preprocessing steps
```{python}
#| label: hotel head descriptionn
#| echo: false
#| message: true
#| include: false
hotel.head(5)
```
```{python}
#| label: hotel nunique
#| echo: false
#| message: true
#| include: false
hotel.nunique()
```
```{python}
#| label: hotel info
#| echo: false
#| message: true
#| include: false
hotel.info()
```
```{python}
#| label: hotel describe
#| echo: false
#| message: true
#| include: false
hotel.describe()
```
```{python}
#| label: hotel - dropping the 'Booking ID' column
#| echo: false
#| message: true
#| include: false
hotel.drop(columns=['Booking_ID'], inplace=True)
```
```{python}
#| label: hotel - checking for missing values
#| echo: false
#| include: false
hotel.isna().sum()
```
```{python}
#| label: hotel - label encoding
#| echo: false
#| message: false
#| include: false
meal_plan_mapping = {
"Not Selected": 0,
"Meal Plan 1": 1,
"Meal Plan 2": 2,
"Meal Plan 3": 3
}
room_reserved_mapping = {
"Room_Type 1": 1,
"Room_Type 2": 2,
"Room_Type 3": 3,
"Room_Type 4": 4,
"Room_Type 5": 5,
"Room_Type 6": 6,
"Room_Type 7": 7
}
market_segment_mapping = {
"Offline": 0,
"Online": 1,
"Corporate": 2,
"Aviation": 3,
"Complementary": 4
}
booking_status_mapping = {
"Not_Canceled": 0,
"Canceled": 1,
}
# mapping the values of the columns
hotel['type_of_meal_plan'] = hotel['type_of_meal_plan'].map(meal_plan_mapping)
hotel['room_type_reserved'] = hotel['room_type_reserved'].map(room_reserved_mapping)
hotel['market_segment_type'] = hotel['market_segment_type'].map(market_segment_mapping)
hotel['booking_status'] = hotel['booking_status'].map(booking_status_mapping)
# printing the updated unique values to verify the label encoding
print("Unique Values of type_of_meal_plan:")
print(hotel['type_of_meal_plan'].unique())
print("Unique Values of room_type_reserved:")
print(hotel['room_type_reserved'].unique())
print("Unique Values of market_segment_type:")
print(hotel['market_segment_type'].unique())
print("Unique Values of booking_status:")
print(hotel['booking_status'].unique())
```
```{python}
#| label: hotel - info numerical
#| echo: false
#| message: false
#| include: false
hotel.info()
```
```{python}
#| label: hotel - date to string and then remove date
#| echo: false
#| message: true
#| include: false
# converting 'arrival_year', 'arrival_month', and 'arrival_date' to string and concatenate them
date_str = hotel['arrival_date'].astype(str) + '-' + hotel['arrival_month'].astype(str) + '-' + hotel['arrival_year'].astype(str)
# errors='coerce' replaces invalid dates with NaT (Not a Time)
hotel['arrival_date_full'] = pd.to_datetime(date_str, format='%d-%m-%Y', errors='coerce')
# dropping the original date columns
hotel.drop(columns=['arrival_year', 'arrival_month', 'arrival_date'], inplace=True)
```
```{python}
#| label: hotel - head to 5
#| echo: false
#| include: false
hotel.head(5)
```
```{python}
#| label: weather data unique
#| echo: false
#| message: true
#| include: false
weather.nunique()
```
```{python}
#| label: weather data info
#| echo: false
#| message: true
#| include: false
weather.info()
```
```{python}
#| label: weather data describe
#| echo: false
#| message: true
#| include: false
weather.describe()
```
```{python}
#| label: weather data pr steps
#| echo: false
#| message: true
#| include: false
weather.isna().sum()
```
```{python}
#| label: weather data heatmap missing
#| echo: false
#| include: false
plt.figure(figsize=(10, 6))
sns.heatmap(weather.isnull(), cmap='viridis', yticklabels=False, cbar=False)
plt.title('Missing Values in the Weather Dataset')
plt.show()
```
```{python}
#| label: weather data cleaning
#| echo: false
#| include: false
weather.duplicated().sum()
weather.drop_duplicates(inplace=True)
weather.nunique()
```
```{python}
#| label: weather data column remove them
#| echo: false
#| message: true
#| include: false
columns_to_remove = ['Evaporation', 'Sunshine', 'Cloud9am', 'Cloud3pm']
# Remove the specified columns from the DataFrame
weather.drop(columns=columns_to_remove, inplace=True)
```
```{python}
#| label: weather data pre
#| echo: false
#| message: true
#| include: false
weather.isna().sum()
```
```{python}
#| label: weather data prepro
#| echo: false
#| message: true
#| include: false
weather.head(5)
weather.dropna(axis=0, inplace=True)
```
```{python}
#| label: weather data infooo
#| echo: false
#| message: false
#| include: false
weather.info()
```
```{python}
#| label: cardio checking for missing values
#| echo: false
#| include: false
cardio.isnull().sum()
```
```{python}
#| label: cardio data duplicate
#| echo: false
#| include: false
cardio.duplicated().sum()
```
```{python}
#| label: cardio data remove duplicate
#| echo: false
#| include: false
cardio.drop_duplicates(inplace=True)
```
```{python}
#| label: cardio data unique
#| echo: false
#| include: false
cardio.nunique()
```
```{python}
#| label: cardio - handle outliers
#| echo: false
#| include: false
cardio = cardio.drop(cardio[(cardio['Height_(cm)'] < 140) | (cardio['Height_(cm)'] > 205)].index)
cardio = cardio.drop(cardio[(cardio['Weight_(kg)'] > 225)].index)
cardio = cardio.drop(cardio[(cardio['BMI'] < 13.4) | (cardio['BMI'] > 53.4)].index)
```
```{python}
#| label: cardio - data describe
#| echo: false
#| include: false
cardio.describe()
```
```{python}
#| label: cardio - data shape
#| echo: false
#| include: false
cardio.shape
```
## *Hotel Reservation Dataset*
In the Hotel dataset, we initially inspected the data and identified several categorical attributes, which we transformed into numerical values. Additionally, we removed the *'Booking_ID'* attribute as it was deemed non-essential. Subsequently, we checked for any missing data but found none. Next, we consolidated the columns related to arrival dates into a single date column for better organization *(for further information, please refer to the appendix, section 2.2).*
## *Weather in Australia Dataset*
For the Weather dataset, our preprocessing began with an examination for missing data, revealing notable missing values in four features. We decided to remove these attributes as they appeared less crucial for our analysis. Additionally, we addressed any duplicate entries to ensure data integrity. Furthermore, we identified missing values in the target variable *'RainTomorrow'* and excluded them from further analysis to avoid bias. Despite these exclusions, we retained a substantial amount of data *(for further information, please refer to the appendix, section 3.2).*
## *Cardiovascular Dataset*
In the Cardiovascular dataset, our preprocessing started with a check for missing data, followed by the removal of duplicate entries. We then assessed numerical attributes such as *'Height', 'Weight', and 'BMI'* for outliers and removed them to prevent skewing the analysis. After these steps, we still retained a substantial portion of the dataset for analysis *(for further information, please refer to the appendix, section 4.3).*
# Exploratory Data Analysis
```{python}
#| label: hotel - checking class proportions
#| message: false
#| echo: false
#| include: false
class_distribution = hotel['booking_status'].value_counts()
class_proportions = hotel['booking_status'].value_counts(normalize=True)
imbalance_ratio = class_distribution[1] / class_distribution[0]
print("Class Distribution:")
print(class_distribution)
print("\nClass Proportions:")
print(class_proportions)
print("\nImbalance Ratio (Class 1 / Class 0):", imbalance_ratio)
```
```{python}
#| label: hotel - date to string to visualize
#| echo: false
#| include: false
# extracting month from the arrival date and convert to string
hotel['arrival_month'] = hotel['arrival_date_full'].dt.strftime('%Y-%m')
# creating a dataframe with arrival month and booking status
booking_status_df = hotel[['arrival_month', 'booking_status']].copy()
# grouping by arrival month and booking status, counting occurrences, and unstacking to separate booking statuses
booking_status_count = booking_status_df.groupby(['arrival_month', 'booking_status']).size().unstack(fill_value=0)
# calculating total bookings (sum of bookings and cancellations) for each month
booking_status_count['Total Bookings'] = booking_status_count.sum(axis=1)
# plotting
fig = px.line(booking_status_count, x=booking_status_count.index, y=booking_status_count.columns,
title='Booking Status Over Time', labels={'arrival_month': 'Month', 'value': 'Count'},
template='plotly_dark')
# adding a line for the total bookings per month
fig.add_scatter(x=booking_status_count.index, y=booking_status_count['Total Bookings'],
mode='lines', name='Total Bookings', line_color='green')
# removing the duplicate legend entry for "Total Bookings"
fig.update_traces(showlegend=False, selector=dict(name='Total Bookings'))
# adding annotation to explain the green line
fig.add_annotation(xref='paper', yref='paper', x=0.95, y=0.05,
text='Total Bookings (Green line) = Sum of bookings and cancellations per month',
showarrow=False, font=dict(color='black', size=12), align='right',)
# the layout
fig.update_layout(xaxis_title='Month', yaxis_title='Count', legend_title='Booking Status',
width=1000, height=600, xaxis={'tickmode': 'array', 'tickvals': booking_status_count.index})
fig.show()
```
```{python}
#| label: hotel - correlation heatmap
#| echo: false
#| include: false
# calculating correlation matrix
correlation = hotel.corr().round(2)
# creating heatmap
fig = go.Figure(data=go.Heatmap(
z=correlation.values,
x=correlation.index,
y=correlation.columns,
colorscale='RdBu',
colorbar=dict(title='Correlation', tickvals=[-1, -0.5, 0, 0.5, 1]), # adjusting colorbar ticks for better readability
zmin=-1, # setting minimum value of the color range
zmax=1, # setting maximum value of the color range
))
# the layout
fig.update_layout(
title='Correlation Heatmap for the Hotel Dataset',
width=800,
height=700,
xaxis=dict(title='Features'),
yaxis=dict(title='Features'),
margin=dict(l=100, r=100, t=100, b=100),
)
fig.show()
```
```{python}
#| label: hotel - visualizing the boxplots for the numerical variables of the hotel dataset
#| echo: false
#| include: false
numerical_columns_hotel = hotel.select_dtypes(include=['int64', 'float64']).columns
num_plots_per_row = 3
num_rows = -(-len(numerical_columns_hotel) // num_plots_per_row)
plt.figure(figsize=(20, 4 * num_rows))
for i, column in enumerate(numerical_columns_hotel, start=1):
plt.subplot(num_rows, num_plots_per_row, i)
sns.boxplot(x=hotel[column], palette='Set3')
plt.title(f"Boxplot for {column}")
plt.tight_layout()
plt.show()
```
```{python}
#| label: hotel - removing rows where no_of_children equals 9 or 10
#| echo: false
#| include: false
hotel = hotel[(hotel['no_of_children'] != 9) & (hotel['no_of_children'] != 10)]
```
```{python}
#| label: hotel - histograms of numerical variables
#| echo: false
#| include: false
plt.figure(figsize=(20, 4 * num_rows))
for i, column in enumerate(numerical_columns_hotel, start=1):
plt.subplot(num_rows, num_plots_per_row, i)
sns.histplot(x=hotel[column], palette='Set3', kde=True)
plt.title(f"Histogram for {column}")
plt.tight_layout()
plt.show()
```
```{python}
#| label: hotel - lead_time log
#| echo: false
#| include: false
hotel['lead_time_log'] = np.log1p(hotel['lead_time'])
```
```{python}
#| label: hotel - lead_time log histogram
#| echo: false
#| include: false
plt.figure(figsize=(8, 6))
sns.histplot(hotel['lead_time_log'], kde=True, color='skyblue')
plt.title('Histogram of lead_time_log')
plt.xlabel('Lead Time (Log Transformed)')
plt.ylabel('Frequency')
plt.show()
```
```{python}
#| label: hotel - lead_time drop
#| echo: false
#| include: false
hotel.drop(columns=['lead_time'], inplace=True)
```
```{python}
#| label: weather data imbalanced or not
#| message: false
#| include: false
# Calculate class distribution
class_distribution = weather['RainTomorrow'].value_counts()
# Calculate class proportions
class_proportions = weather['RainTomorrow'].value_counts(normalize=True) * 100
# Create a bar plot
fig = go.Figure([go.Bar(x=class_distribution.index, y=class_distribution.values,
text=class_proportions.round(2), textposition='auto',
marker_color=['blue', 'orange'])])
# Update layout
fig.update_layout(title='Class Distribution of RainTomorrow',
                  xaxis=dict(title='RainTomorrow Class'),
                  yaxis=dict(title='Count'))
# Show plot
fig.show()
```
```{python}
#| label: weather - correlation heatmap
#| echo: false
#| include: false
correlation2 = weather.corr().round(2)  # rounding to 2 decimals
# Plotting with the Plotly library
fig = px.imshow(correlation2, x=correlation2.index, y=correlation2.columns,
                color_continuous_scale='YlOrBr', labels={'color': 'Correlation'})
fig.update_layout(title='Correlation Heatmap for the Weather Dataset', width=600, height=550)
fig.show()
```
```{python}
#| label: weather data infoo
#| echo: false
#| message: false
#| include: false
weather.info()
```
```{python}
#| label: weather headd
#| echo: false
#| message: false
#| include: false
weather.head(5)
```
```{python}
#| label: Visualizing the boxplots for the numerical variables of the weather's dataset
#| echo: false
#| include: false
numerical_columns2 = weather.select_dtypes(include=['int64', 'float64']).columns
num_plots_per_row = 3
num_rows = -(-len(numerical_columns2) // num_plots_per_row)
plt.figure(figsize=(20, 4 * num_rows))
for i, column in enumerate(numerical_columns2, start=1):
plt.subplot(num_rows, num_plots_per_row, i)
sns.boxplot(x=weather[column], palette='Set3')
plt.title(f"Boxplot for {column}")
plt.tight_layout()
plt.show()
```
```{python}
#| label: weather data histogram
#| echo: false
#| include: false
plt.figure(figsize=(20, 4 * num_rows))
for i, column in enumerate(numerical_columns2, start=1):
plt.subplot(num_rows, num_plots_per_row, i)
sns.histplot(x=weather[column], palette='Set3')
plt.title(f"Boxplot for {column}")
plt.tight_layout()
plt.show()
```
```{python}
#| label: cardio boxplot
#| echo: false
#| include: false
# Select only numerical columns
numeric_columns = cardio.select_dtypes(include=[np.number]).columns[~cardio.select_dtypes(include=[np.number]).columns.str.contains('Unnamed')]
# Calculate the number of rows and columns for subplots
num_columns = len(numeric_columns)
num_rows = num_columns + 2  # a couple of spare rows; unused axes are removed below
num_cols = 1  # one boxplot per row
# Set up the matplotlib figure and axes
fig, axs = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(15 * num_cols, 5 * num_rows))
# Flatten the axes array for easy iteration
axs = axs.flatten()
# Loop through each numerical column and plot a boxplot
for i, column in enumerate(numeric_columns):
sns.boxplot(x=cardio[column], ax=axs[i], width=0.3)
axs[i].set_title(f'Boxplot of {column}')
axs[i].set_xlabel('')
# Remove empty subplots
for i in range(num_columns, num_rows * num_cols):
fig.delaxes(axs[i])
# Adjust layout
plt.tight_layout()
plt.show()
```
```{python}
#| label: cardio histogram only numerical
#| echo: false
#| include: false
numeric_columns = cardio.select_dtypes(include=['int64', 'float64']).columns
# Calculate the number of rows and columns for subplots
num_columns = len(numeric_columns)
num_rows = math.ceil(num_columns / 2) # Use ceil to round up and ensure enough rows
# Set up the matplotlib figure and axes
fig, axs = plt.subplots(nrows=num_rows, ncols=2, figsize=(30, 6 * num_rows))
# Flatten the axes array for easy iteration
axs = axs.flatten()
# Loop through each numerical column and plot a histogram
for i, column in enumerate(numeric_columns):
if i < len(axs): # Ensure we don't go out of bounds
sns.histplot(cardio[column], ax=axs[i], bins=50, kde=True, color='skyblue', edgecolor='black')
axs[i].set_title(f'Histogram of {column}')
axs[i].set_xlabel('Value')
axs[i].set_ylabel('Frequency')
else: # If there are more columns than subplots, break the loop
break
# Hide any unused axes if the number of columns is odd
if num_columns % 2 != 0:
axs[-1].set_visible(False) # Hide the last subplot if unused
# Adjust layout to prevent overlapping
plt.tight_layout()
plt.show()
```
```{python}
#| label: cardio histogram - target variable
#| echo: false
#| include: false
sns.histplot(cardio['Heart_Disease'], bins = 50, kde=False, color='skyblue', edgecolor='black', linewidth=1.2, alpha=0.7)
plt.title('Histogram of Heart Disease')
plt.ylabel('Count')
plt.xlabel('Heart Disease')
# Annotate each bar with its count
for rect in plt.gca().patches:
x = rect.get_x() + rect.get_width() / 2
y = rect.get_height()
plt.gca().annotate(f'{int(y)}', (x, y), xytext=(0, 5), textcoords='offset points', ha='center', color='black')
plt.tight_layout() # Adjust layout to prevent overlapping
plt.show()
```
```{python}
#| label: cardio label encoding
#| echo: false
#| include: false
# Create a copy of the DataFrame to avoid modifying the original
cardio_encoded = cardio.copy()
# Create a label encoder object
label_encoder = LabelEncoder()
# Iterate through each object column and encode its values
for column in cardio_encoded.select_dtypes(include='object'):
cardio_encoded[column] = label_encoder.fit_transform(cardio_encoded[column])
# Now, df_encoded contains the label-encoded categorical columns
cardio_encoded.head()
```
```{python}
#| label: cardio visualization - correlation matrix 1
#| echo: false
#| include: false
# Calculate the correlation matrix for Data
correlation_matrix = cardio_encoded.corr()
# Create a heatmap
plt.figure(figsize=(12, 10))
heatmap = sns.heatmap(correlation_matrix, annot=False, cmap='viridis') # Turn off automatic annotations
plt.title("Correlation Heatmap")
# Annotate each cell with the numeric value using matplotlib's `text` function
for i in range(correlation_matrix.shape[0]):
for j in range(correlation_matrix.shape[1]):
plt.text(j + 0.5, i + 0.5, f"{correlation_matrix.iloc[i, j]:.2f}",
ha='center', va='center', color='white')
plt.show()
```
```{python}
#| label: cardio visualization - correlation with target variable
#| echo: false
#| include: false
# Compute the correlation with 'Heart_Disease' for each numerical column
correlation_HD = cardio_encoded.corr()['Heart_Disease'].sort_values(ascending=False)
correlation_HD
# Plot the correlations
plt.figure(figsize=(14, 7))
correlation_HD.plot(kind='bar', color='skyblue')
plt.xlabel('Variables')
plt.ylabel('Correlation with Heart Disease')
plt.title('Correlation of Variables with Heart Disease')
plt.show()
```
```{python}
#| label: calculate the imbalance ratio
#| echo: false
#| include: false
#| eval: false
class_distribution1 = cardio['Heart_Disease'].value_counts()
class_proportions1 = cardio['Heart_Disease'].value_counts(normalize=True)
imbalance_ratio1 = class_distribution1.iloc[1] / class_distribution1.iloc[0]  # minority / majority (value_counts lists the majority class first)
# Print imbalance ratio
#imbalance_ratio1
# Plotting the bar chart
plt.figure(figsize=(8, 6))
bars = class_distribution1.plot(kind='bar', color=['blue', 'orange'])
plt.title('Class Distribution')
plt.xlabel('Heart Disease')
plt.ylabel('Number of Samples')
plt.xticks(rotation=0)
# Adding numbers above the bars
for bar in bars.patches:
plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 20, str(int(bar.get_height())), ha='center', va='bottom')
plt.show()
```
```{python}
#| label: improve imbalance - resampling (undersampling)
#| echo: false
#| include: false
#| eval: false
majority = cardio_encoded[cardio_encoded['Heart_Disease'] == 0]
minority = cardio_encoded[cardio_encoded['Heart_Disease'] == 1]
# Undersample majority class with 80:20 ratio
majority_undersampled = resample(majority,
                                 replace=False,                     # sample without replacement
                                 n_samples=int(len(minority) * 4),  # keep the majority class at 4x the minority (80:20 ratio)
                                 random_state=42)
# Combine minority class with undersampled majority class
undersampled = pd.concat([majority_undersampled, minority])
# Class distribution after undersampling
undersampled_counts = undersampled['Heart_Disease'].value_counts()
# Plotting the bar chart
plt.figure(figsize=(8, 6))
bars = undersampled_counts.plot(kind='bar', color=['blue', 'orange'])
plt.title('Class Distribution after Undersampling')
plt.xlabel('Heart Disease')
plt.ylabel('Number of Samples')
plt.xticks(rotation=0)
# Adding numbers above the bars
for bar in bars.patches:
plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 20, str(int(bar.get_height())), ha='center', va='bottom')
plt.show()
```
In the preliminary exploration of all three datasets, we examined class balance to ensure an unbiased analysis. In the Hotel and Weather in Australia datasets, the class proportions were reasonably balanced.
However, the Cardiovascular dataset was clearly imbalanced, with the heart disease class accounting for less than 8% of observations. In response, we employed undersampling to mitigate the disproportionate representation of individuals with and without heart disease. We first partitioned the dataset into majority (no heart disease) and minority (heart disease present) classes. A random subset of the majority class was then retained so that the majority and minority classes stand in an 80:20 ratio. As a result, the majority class, representing individuals without heart disease, comprises 99,204 observations, while the minority class, representing individuals with heart disease, comprises 24,801 observations *(for further information, please refer to the appendix, sections 1.3.1 and 4.3.4).*
After addressing the class imbalance, boxplots were used to identify potential outliers in the numerical features. While the Hotel dataset and Weather in Australia Dataset showed outliers in some features, they were retained due to their potential significance in predicting cancellation or weather patterns. However, in the Cardiovascular Dataset, outliers, particularly extreme values in height, weight, and BMI, were identified and removed to maintain data integrity.
Following the initial analysis, we examined the distribution of numerical features. While the Weather and Cardiovascular datasets generally exhibited a normal distribution, the Hotel Dataset required log-transformation specifically for the 'lead_time' feature to achieve normality. Subsequently, we visualized correlation matrices to explore relationships among numerical variables within each dataset.
# Modelling
During the modeling phase, we employed six supervised machine learning algorithms—logistic regression, decision trees, random forest, AdaBoost, gradient boosting, and KNN classifier—across three distinct datasets: Hotel, Weather, and Cardiovascular. Here's a comprehensive summary of the modeling process:
## Data Preprocessing before modelling
- Each dataset was split into training and testing sets with an 80/20 ratio.
- We applied StandardScaler to standardize the features, giving each a mean of zero and a standard deviation of one.
- Feature selection was performed with the SelectKBest method to identify the top ten features for modeling; a pipeline-style sketch of these steps is shown below (*for further information, please refer to the appendix, section 1.3.2*).
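As an illustration only, the non-evaluated chunk below chains the same three steps (split, scale, SelectKBest) in a single scikit-learn `Pipeline` around a placeholder classifier; `X` and `y` stand in for any of the three feature matrices and binary targets, and the actual per-dataset code is given in the appendix.
```{python}
#| label: illustrative preprocessing pipeline sketch
#| eval: false
# Sketch only (not evaluated in this report): split, scale, and select-k-best wrapped
# in one Pipeline; X and y are placeholders for a feature matrix and binary target.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),                          # mean 0, standard deviation 1
    ("select", SelectKBest(score_func=f_classif, k=10)),  # keep the ten highest-scoring features
    ("model", LogisticRegression(max_iter=1000)),         # placeholder estimator
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))  # accuracy on the held-out 20%
```
Fitting the scaler and the selector inside a pipeline keeps them fitted on the training split only, which also avoids leakage if the same object is later passed to cross-validation.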
## Model Application
- Supervised machine learning algorithms were applied to each dataset using various combinations:
- Original dataset with all features.
- Scaled dataset with all features.
- Original dataset with the top 10 features selected by SelectKBest.
- Scaled dataset with the top 10 features selected by SelectKBest.
- For the Cardiovascular dataset, models were trained and tested on both the imbalanced dataset and the dataset where undersampling was used to address the imbalance.
### Evaluation of Model Performance
Performance metrics such as accuracy, precision, recall, F1-score, and ROC AUC score were computed for each combination of dataset and algorithm *(for further information, please refer to the appendix, section 1.3.3).*
Models were then compared across these metrics to identify the best-performing one for each dataset; a condensed sketch of this evaluation loop follows below.
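The non-evaluated sketch below outlines the evaluation loop; `variants` is a hypothetical dictionary mapping each dataset combination described above to its `(X_train, X_test, y_train, y_test)` split, and the model list mirrors the six classifiers used in this project.
```{python}
#| label: illustrative model evaluation sketch
#| eval: false
# Sketch only (not evaluated in this report): computing the five performance metrics for
# every model / dataset-variant combination. `variants` is a hypothetical dict mapping a
# variant name (e.g. "scaled, top 10 features") to (X_train, X_test, y_train, y_test).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
}

rows = []
for variant_name, (X_tr, X_te, y_tr, y_te) in variants.items():
    for model_name, model in models.items():
        model.fit(X_tr, y_tr)
        y_pred = model.predict(X_te)
        y_prob = model.predict_proba(X_te)[:, 1]  # probability of the positive class
        rows.append({
            "variant": variant_name,
            "model": model_name,
            "accuracy": accuracy_score(y_te, y_pred),
            "precision": precision_score(y_te, y_pred),
            "recall": recall_score(y_te, y_pred),
            "f1": f1_score(y_te, y_pred),
            "roc_auc": roc_auc_score(y_te, y_prob),
        })

results = pd.DataFrame(rows)  # one row per model and dataset variant
```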
```{python}
#| label: hotel data all results table save it
#| echo: false
#| message: true
#| include: false
results_hotel = pd.read_csv("photoDF/results_hotel_randomforest.csv")
results_hotel
```
```{python}
#| label: hotel - modelling with all models
#| echo: false
#| include: false
#| eval: false
# CODE:
X = hotel.drop(columns=['booking_status', 'arrival_date_full', 'arrival_month'])
y = hotel['booking_status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# initializing the StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# converting scaled data back to DataFrames
X_train_scaled_hotel = pd.DataFrame(X_train_scaled, columns=X.columns)
X_test_scaled_hotel = pd.DataFrame(X_test_scaled, columns=X.columns)
```
```{python}
#| label: hotel - initializing seleckbest
#| echo: false
#| include: false
#| eval: false
target_variable = 'booking_status'
k = 10
X_feature_hotel = hotel.drop(columns=['booking_status', 'arrival_date_full', 'arrival_month'])
y_feature_hotel = hotel[target_variable]
selector = SelectKBest(score_func=f_classif, k=k)
X_selected = selector.fit_transform(X_feature_hotel, y_feature_hotel)
selected_feature_indices = selector.get_support(indices=True)
selected_feature_names = X_feature_hotel.columns[selected_feature_indices].tolist()