# %% [markdown]
# # Background
# %% [markdown]
# ## Introduction to Data Science - DS UA 112 Capstone Project
# The purpose of this capstone project is to tie everything we learned in this class
# together. This might be challenging in the short term, but is consistently rated as being
# extremely valuable in the long run. The cover story is that you are working as a Data
# Scientist for an auction house that works with a major art gallery. You want to better
# understand this space with the help of Data Science. Historically, this domain was
# dominated by art historians, but is increasingly informed by data. This is where you – a
# budding Data Scientist – come in. Can you provide the value that justifies your rather
# high salary?
#
# __Mission command preamble:__ In general, we won't tell you *how* to do something. That
# is up to you and your creative problem solving skills. However, we will pose the questions
# that you should answer by interrogating the data. Importantly, we do expect you to do this
# work yourself, so it reflects your intellectual contribution – not that of third parties.
# By doing this assignment, you certify that it indeed reflects your individual intellectual
# work.
#
# __Format:__ The project consists of your answers to 10 (equally weighted, grade-wise)
# questions. Each answer *must* include some text (describing both what you did and what you
# found, i.e. the answer to the question), a figure that illustrates the findings and some
# numbers (e.g. test statistics, confidence intervals, p-values or the like). Please save it
# as a pdf document. This document should be 4-6 pages long (arbitrary font size and
# margins). About half a page per question is reasonable. In addition, open your document
# with a brief statement as to how you handled preprocessing (e.g. dimension reduction, data
# cleaning and data transformations), as this will apply to all answers. Include your name.
#
# __Academic integrity:__ You are expected to do this project by yourself, individually, so
# that we are able to determine a grade for you personally. There are enough degrees of
# freedom (e.g. how to clean the data, what variables to compare, aesthetic choices in the
# figures, etc.) that no two reports will be alike. We’ll be on the lookout for suspicious
# similarities, so please refrain from collaborating. To prevent cheating (please don’t do
# this – it is easily detected), it is very important that you – at the beginning of the
# code file – seed the random number generator with your N-number. That way, the correct
# answers will be keyed to your own solution (as this matters, e.g. for the specific train/
# test split or bootstrapping). As N-numbers are unique, this will also protect your work
# from plagiarism. __Failure to seed the RNG in this way will also result in the loss of
# grade points.__
#
#
# __Deliverables:__ Upload two files to the Brightspace portal by the due date in the
# syllabus:
#
# * A pdf (the “project report”) that contains your answers to the 10 questions, as well as
# an introductory paragraph about preprocessing.
#
# * A .py file with the code that performed the data analysis and created the figures.
#
# We do wish you all the best in executing on these instructions. We aimed at an optimal
# balance between specificity and implementation leeway, while still allowing us to grade
# the projects in a consistent and fair manner.
#
# Everything should be doable from what was covered in this course. If you take this project
# seriously and do a quality job, you can easily use it as an item in your DS portfolio.
# Former students told us that they secured internships and even jobs by well executed
# capstone projects that impressed recruiters and interviewers.
# ___
# %% [markdown]
# ## Dataset Description
# Description of dataset: This dataset consists of data from 300 users and 91 art pieces.
# The art pieces are described in the file `theArt.csv`. Here is how this file is structured:
#
# **Rows**
# * 1st row: Headers (e.g. number, title, etc.)
# * Rows 2-92: Information about the 91 individual art pieces
#
# **Columns**
# * Column 1: “Number” (the ID number of the art piece)
# * Column 2: Artist
# * Column 3: Title
# * Column 4: Artistic style
# * Column 5: Year of completion
# * Column 6: Type code – 1 = classical art, 2 = modern art, 3 = non-human art
# * Column 7: computer or animal code – 0 = human, 1 = computer generated art, 2 = animal art
# * Column 8: Intentionally created? 0 = no, 1 = yes
#
# You can also take a look at the actual art by looking at the files in the `artPieces`
# folder on Brightspace.
#
# The user data is contained in the file `theData.csv`. Here is how this file is structured:
#
# **Rows**
# * Rows 1-300: Responses from individual users
#
# **Columns**
# * Columns 1-91: Preference ratings (liking) of the 91 art pieces.
#
# The column number in this file corresponds to the number of the art piece in column 1 of the “theArt.csv” file
# described above. For instance, ratings of art piece 27 (“the woman at the window”) are in column 27. Numbers
# represent preference ratings from 1 (“hate it”) to 7 (“love it”).
#
# * Columns 92-182: “Energy” ratings of the same 91 art pieces (in the same order as the
# preference ratings above).
#
# Numbers represent ratings from 1 (“it calms me down a lot”) to
# 7 (“it agitates me a lot”).
#
# * Columns 183-194: “Dark” personality traits.
#
# Numbers represent how much a user agrees with a statement, from 1 (strongly disagree) to 5 (strongly agree). Here are
# the 12 statements, in column order:
#
# 1. I tend to manipulate others to get my way
# 2. I have used deceit or lied to get my way
# 3. I have used flattery to get my way
# 4. I tend to exploit others towards my own end
# 5. I tend to lack remorse
# 6. I tend to be unconcerned with the morality of my actions
# 7. I can be callous or insensitive
# 8. I tend to be cynical
# 9. I tend to want others to admire me
# 10. I tend to want others to pay attention to me
# 11. I tend to seek prestige and status
# 12. I tend to expect favors from others
#
# * Columns 195-205: Action preferences. Numbers represent how much a user agrees with a
# statement, from 1 (strongly disagree) to 5 (strongly agree). Here are the 11 actions, in
# column order:
#
# 1. I like to play board games
# 2. I like to play role playing (e.g. D&D) games
# 3. I like to play video games
# 4. I like to do yoga
# 5. I like to meditate
# 6. I like to take walks in the forest
# 7. I like to take walks on the beach
# 8. I like to hike
# 9. I like to ski
# 10. I like to do paintball
# 11. I like amusement parks
#
#
# * Columns 206-215: Self-image/self-esteem. Numbers represent how much a user agrees with a
# statement. Note that if a statement has “reverse polarity”, e.g. statement 2 “at times I
# feel like I am no good at all”, it has already been re-coded/inverted by the professor
# such that higher numbers represent higher self-esteem. Here are the 10 items, in column
# order:
#
# 1. On the whole, I am satisfied with myself
# 2. At times I think I am no good at all
# 3. I feel that I have a number of good qualities
# 4. I am able to do things as well as most other people
# 5. I feel I do not have much to be proud of
# 6. I certainly feel useless at times
# 7. I feel that I'm a person of worth, at least on an equal plane with others
# 8. I wish I could have more respect for myself
# 9. All in all, I am inclined to feel that I am a failure
# 10. I take a positive attitude toward myself
#
# * Column 216: User age
# * Column 217: User gender (1 = male, 2 = female, 3 = non-binary)
# * Column 218: Political orientation (1 = progressive, 2 = liberal, 3 = moderate, 4 =
# conservative, 5 = libertarian, 6 = independent)
#
# * Column 219: Art education (the higher the number, the more art education: 0 = none,
# 3 = years of art education)
#
# * Column 220: General sophistication (the higher, the more sophisticated: 0 = not going to
# the opera, etc., 3 = doing everything – opera, art galleries, etc.)
#
# * Column 221: Being somewhat of an artist myself? (0 = no, 1 = sort of, 2 = yes, I see
# myself as an artist)
#
# Note that we did most of the data munging and coding for you already, but you still need
# to handle missing data in some way (e.g. row-wise removal, element-wise removal,
# imputation). Extreme values might also have to be handled.
# ___
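# %% [markdown]
# As a concrete illustration of the two missing-data options above (removal vs. imputation),
# here is a minimal, hypothetical pandas sketch on toy data; it is not part of the analyses
# below:
# %%
import pandas as pd  # imported here so this illustrative cell is self-contained
import numpy as np
_toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})
print(_toy.dropna(axis=0))       # row-wise removal: keep fully observed rows only
print(_toy.fillna(_toy.mean()))  # mean imputation: fill NaNs with column means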
# %% [markdown]
# ## Questions Management Would Like You to Answer
# 1. Is classical art more well liked than modern art?
# 2. Is there a difference in the preference ratings for modern art vs. non-human (animals and computers) generated
# art?
# 3. Do women give higher art preference ratings than men?
# 4. Is there a difference in the preference ratings of users with some art background (some art education) vs. none?
# 5. Build a regression model to predict art preference ratings from energy ratings only. Make sure to use
# cross-validation methods to avoid overfitting and characterize how well your model predicts art preference ratings.
#
# 6. Build a regression model to predict art preference ratings from energy ratings and demographic information. Make
# sure to use cross-validation methods to avoid overfitting and comment on how well your model predicts relative to the
# “energy ratings only” model.
#
# 7. Considering the 2D space of average preference ratings vs. average energy rating (that contains the 91 art pieces
# as elements), how many clusters can you – algorithmically - identify in this space? Make sure to comment on the
# identity of the clusters – do they correspond to particular types of art?
#
# 8. Considering only the first principal component of the self-image ratings as inputs to a regression model – how
# well can you predict art preference ratings from that factor alone?
#
# 9. Consider the first 3 principal components of the “dark personality” traits – use these as inputs to a regression
# model to predict art preference ratings. Which of these components significantly predict art preference ratings?
# Comment on the likely identity of these factors (e.g. narcissism, manipulativeness, callousness, etc.).
#
# 10. Can you determine the political orientation of the users (to simplify things and avoid gross class imbalance
# issues, you can consider just 2 classes: “left” (progressive & liberal) vs. “non-left” (everyone else)) from all the
# other information available, using any classification model of your choice? Make sure to comment on the
# classification quality of this model.
#
# __Extra credit:__ Tell us something interesting about this dataset that is not trivial and not already part of
# an answer (implied or explicitly) to these enumerated questions.
#
# __Hints:__
# * Beware of off-by-one errors. This document and the csv data files index from 1, but
# Python indexes from 0. Make sure to keep track of this (see the indexing sketch after
# this list).
# * In order to answer some of these questions, you might have to apply a dimension reduction method first. For
# instance, “dark personality traits” and “self-image” are characterized by 10-12 variables each. Similarly, you might
# have to reduce variables to their summary statistics.
#
# * In order to do some analyses, you will have to clean the data first, either by removing or imputing missing data
# (either is fine, but explain and justify what you did)
#
# * If you encounter skewed data, you might want to transform the data first, e.g. by z-scoring
# * To clarify: When talking about “principal components” above, we mean the transformed data, rotated into the new
# coordinate system by the PCA.
#
# * Avoid overfitting with cross-validation methods.
#
# * How well your model predicts can be assessed with RMSE or $R^2$ for regression models, or AUC for classification
# models.
#
# * You can use conventional choices of alpha (e.g. 0.05) or confidence intervals (e.g. 95%) throughout
# ___
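# %% [markdown]
# Indexing sketch referenced in the hints above (a hypothetical illustration): mapping a
# 1-indexed column number from the dataset description to a 0-indexed pandas column, given
# that the data file is read with `header=None` below:
# %%
# Hypothetical example: "column 217" (gender) in the 1-indexed description
# corresponds to pandas column label 216, as used in the code below.
gender_col_1_indexed = 217
gender_col_0_indexed = gender_col_1_indexed - 1
print(gender_col_0_indexed)  # -> 216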
# %% [markdown]
# # Coding Portion
# %% [markdown]
# ## Initial Setup
# %% [markdown]
# ### Imports
#
# %%
import math
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import requests
import random
import warnings
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression, ElasticNet
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split, cross_val_score, KFold, StratifiedKFold
from sklearn.metrics import mean_squared_error, accuracy_score, precision_score, recall_score, f1_score, plot_confusion_matrix, roc_auc_score, roc_curve, auc, silhouette_samples, r2_score, silhouette_score, explained_variance_score
from scipy.spatial import Voronoi, voronoi_plot_2d
from scipy import stats
from scipy import stats as st
from mlxtend.plotting import plot_decision_regions
from sklearn.manifold import TSNE
import xgboost as xgb
import os
from tqdm import tqdm
from mpl_toolkits.axes_grid1 import make_axes_locatable
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.colors import LinearSegmentedColormap
from xgboost import XGBRegressor
warnings.filterwarnings("ignore")
# %% [markdown]
# ### Seeding the random number generator with my student ID
# %%
N_NUMBER = 0  # placeholder N-number; set to your own N-number before running
random.seed(N_NUMBER)
np.random.seed(N_NUMBER)  # also seed NumPy, which numpy- and sklearn-based draws rely on
# %% [markdown]
# ### Loading the data & Helper Functions
# %%
user_ratings = pd.read_csv('Data/theData.csv', sep=',',
                           header=None, engine='python')
art_ratings = pd.read_csv('Data/theArt.csv', sep=',')


def select_df(df, columns):
    return df[columns].dropna()


# Helper function to join two column groups, drop NA rows jointly, then separate the ratings
def join_drop_na_separate_ratings(df, indicies1, indicies2):
    # Combine both groups of column indices
    indicies = np.concatenate((indicies1, indicies2))
    # Get ratings for both groups of columns
    ratings = select_df(df, indicies)
    # Drop rows with NA in either group, so both groups keep the same users
    ratings_no_na = ratings.dropna()
    # Select ratings for the first group
    ratings1 = select_df(ratings_no_na, indicies1)
    # Select ratings for the second group
    ratings2 = select_df(ratings_no_na, indicies2)
    return ratings1, ratings2
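# %% [markdown]
# A quick usage sketch for the helper above (hypothetical column indices, illustrative only;
# the real group indices are derived from `theArt.csv` in each question below):
# %%
_demo_a, _demo_b = join_drop_na_separate_ratings(
    user_ratings, np.array([0, 1, 2]), np.array([3, 4]))
print(_demo_a.shape, _demo_b.shape)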
# %%
# Plotting functions
def plot_ratings_histogram_sbs(ratings, title, subplot_id):
    plt.subplot(1, 2, subplot_id)
    plt.hist(ratings.values.flatten(), bins=7,
             density=False, histtype='bar', ec='black')
    plt.title(title)
    plt.xlabel('Rating')
    plt.ylabel('Count')


def plot_ratings_histogram(title1, title2, ratings1, ratings2):
    common_str = "\n(#users = {}, #art pieces = {})"
    plot_ratings_histogram_sbs(
        ratings1, title1 + common_str.format(ratings1.shape[0], ratings1.shape[1]), 1)
    plot_ratings_histogram_sbs(
        ratings2, title2 + common_str.format(ratings2.shape[0], ratings2.shape[1]), 2)
    plt.tight_layout()
    plt.show()


def plot_ratings_histogram_wl(title, ratings1, label1, ratings2, label2):
    x = [ratings1.values.flatten(), ratings2.values.flatten()]
    colors = ['blue', 'red']
    plt.figure(figsize=(10, 5))  # was plt.plot(figsize=...), which ignores the figure size
    plt.hist(x, density=True, alpha=0.75, color=colors)
    # Overlay kernel density estimates for each group
    g_kde_1 = st.gaussian_kde(x[0])
    g_kde_2 = st.gaussian_kde(x[1])
    # linspace over the rating range for the KDE curves
    x_kde = np.linspace(0, 7, 1000)
    plt.plot(x_kde, g_kde_1(x_kde), color='blue', alpha=0.5)
    plt.plot(x_kde, g_kde_2(x_kde), color='red', alpha=0.5)
    plt.title(title)
    plt.xlabel('Rating')
    plt.ylabel('Frequency')
    # Each label appears twice: once for the histogram bars, once for the KDE curve
    plt.legend([label1, label2, label1, label2])
    plt.tight_layout()
    plt.show()
# %% [markdown]
# #### 3d Plotting Function
# %%
def plot_3d(ax, x_test, y_test, y_pred, num_range, user_num):
    num_ = [x for x in range(0, num_range)]
    ax.scatter(num_, x_test[:num_range, user_num],
               y_test[:num_range, user_num], c='r', marker='o', label='Actual')
    ax.scatter(num_, x_test[:num_range, user_num],
               y_pred[:num_range, user_num], c='b', marker='o', label='Predicted')
    ax.set_xlabel('Art Piece Number')
    ax.set_ylabel('Energy Rating')
    ax.set_zlabel('Preference Rating')
    # Draw lines between the predicted and actual values
    for i in range(0, num_range):
        ax.plot([num_[i], num_[i]],
                [x_test[i, user_num], x_test[i, user_num]],
                [y_test[i, user_num], y_pred[i, user_num]],
                c='g', label='residual' if i == 0 else "")
    ax.legend(loc='upper right')
    ax.set_title('User {}'.format(user_num))


def plot_3d_prf(x_test, y_test, y_pred, rmse, tmp_str, num_range=90, user_num=0):
    fig = plt.figure(figsize=(30, 15))
    fig.suptitle('Actual vs. Predicted Preference Ratings for {} Art Pieces in test dataset.\n'
                 .format(num_range) + tmp_str + '\nRMSE={:.3f}'.format(rmse))
    axes = []
    for i in range(2):
        for j in range(4):
            # Create an Axes3D object for each subplot
            ax = fig.add_subplot(2, 4, i*4 + j + 1, projection='3d')
            axes.append(ax)
    for i in range(2):
        for j in range(4):
            plot_3d(axes[i*4 + j], x_test, y_test, y_pred, num_range, i*4 + j)
    plt.tight_layout()
    plt.show()
# %%
# Plot importances for extreme gradient boosting
def plot_importances(importances, num_x, title):
    indices = np.argsort(importances)[::-1]
    plt.figure(figsize=(10, 5))
    plt.title(title)
    plt.bar(range(num_x), importances[indices])
    plt.xticks(range(num_x), indices+1, rotation=90, fontsize=6)
    plt.xlim([-1, num_x])
    plt.xlabel('Art Piece Number')
    plt.ylabel('Importance')
    plt.show()
# %% [markdown]
# #### Data Cleaning
# %%
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
sns.heatmap(user_ratings.isnull(), ax=ax[0])
ax[0].set_title('User Ratings Dataset')
ax[0].set_xlabel('Response Information Column')
ax[0].set_ylabel('User ID')
sns.heatmap(art_ratings.isnull(), ax=ax[1])
ax[1].set_title('Art Information Dataset')
ax[1].set_xlabel('Art Information Column')
ax[1].set_xticklabels([i for i in range(art_ratings.shape[1])])
ax[1].set_ylabel('Art ID')
plt.tight_layout()
plt.show()
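# %% [markdown]
# To complement the heatmaps with numbers, an added summary of how many entries are missing
# in each dataset (an editorial addition, not part of the original analysis):
# %%
print('Missing entries in user data:', int(user_ratings.isnull().sum().sum()))
print('Missing entries in art data:', int(art_ratings.isnull().sum().sum()))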
# %% [markdown]
# # 1. Is classical art more well liked than modern art?
# %% [markdown]
# ## Data Loading & Cleaning
# %%
# modern art indices (0-indexed rows of theArt.csv match the 0-indexed rating columns)
modern_indicies = np.where(
    art_ratings['Source (1 = classical, 2 = modern, 3 = nonhuman)'] == 2)[0]
# classical art indices
classical_indicies = np.where(
    art_ratings['Source (1 = classical, 2 = modern, 3 = nonhuman)'] == 1)[0]
# Get ratings for modern and classical art
modern_art_ratings, classical_art_ratings = join_drop_na_separate_ratings(
    user_ratings, modern_indicies, classical_indicies)
# %% [markdown]
# ## Data Visualization
# %%
# PLOTS
plot_ratings_histogram('Modern Art Ratings', 'Classical Art Ratings',
                       modern_art_ratings, classical_art_ratings)
# %%
plot_ratings_histogram_wl('Modern Art Ratings vs Classical Art Ratings', modern_art_ratings,
                          'Modern Art Ratings', classical_art_ratings, 'Classical Art Ratings')
# %% [markdown]
# ## Significance Testing
# %%
# Mann-Whitney U test on classical vs. modern art ratings (is classical more well liked?)
# H0: classical art is less than or equally liked as modern art
# H1: classical art is more liked than modern art
# alpha = 0.05
u, p = stats.mannwhitneyu(classical_art_ratings.values.flatten(),
                          modern_art_ratings.values.flatten(), alternative='greater')
print('p-value = ' + str(p) + ' u = ' + str(u))
if p < 0.05:
    print('Reject H0, evidence suggests that classical art is more liked than modern art')
else:
    print('Fail to reject H0, evidence suggests that classical art is less or equally liked as modern art')
# %%
# Two-sample KS test: are modern and classical ratings drawn from the same distribution?
# H0: the two rating distributions are the same
# H1: the two rating distributions differ
# alpha = 0.05
ks, p = stats.ks_2samp(classical_art_ratings.values.flatten(),
                       modern_art_ratings.values.flatten())
print('p-value = ' + str(p) + ' ks = ' + str(ks))
if p < 0.05:
    print('Reject H0, evidence suggests that the classical and modern rating distributions differ')
else:
    print('Fail to reject H0, no evidence that the classical and modern rating distributions differ')
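# %% [markdown]
# The hints allow confidence intervals as reported numbers; as an added complement to the
# tests above, a minimal bootstrap sketch for a 95% CI on the difference in mean ratings
# (classical minus modern; 2000 resamples is an arbitrary choice):
# %%
rng = np.random.default_rng(N_NUMBER)  # seeded with the N-number, per the instructions
classical_flat = classical_art_ratings.values.flatten()
modern_flat = modern_art_ratings.values.flatten()
boot_diffs = [rng.choice(classical_flat, classical_flat.size).mean()
              - rng.choice(modern_flat, modern_flat.size).mean()
              for _ in range(2000)]
print('95% bootstrap CI for mean(classical) - mean(modern):',
      np.percentile(boot_diffs, [2.5, 97.5]))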
# %% [markdown]
# # 2. Is there a difference in the preference ratings for modern art vs. non-human (animals and computers) generated art?
# %% [markdown]
# ## Data Loading & Cleaning
# %%
# Modern art indices
modern_art_indicies = np.where(
    art_ratings['Source (1 = classical, 2 = modern, 3 = nonhuman)'] == 2)[0]
# Non-human art indices
non_human_indicies = np.where(
    art_ratings['Source (1 = classical, 2 = modern, 3 = nonhuman)'] == 3)[0]
# Get ratings for modern and non-human art
ratings_for_modern_art, ratings_for_non_human_art = join_drop_na_separate_ratings(
    user_ratings, modern_art_indicies, non_human_indicies)
# %% [markdown]
# ## Data Visualization
# %%
plot_ratings_histogram('Modern Art Ratings', 'Non-Human Art Ratings',
                       ratings_for_modern_art, ratings_for_non_human_art)
# %%
plot_ratings_histogram_wl('Modern Art Ratings vs Non-Human Art Ratings', ratings_for_modern_art,
                          'Modern Art Ratings', ratings_for_non_human_art, 'Non-Human Art Ratings')
# %% [markdown]
# ## Significance Testing
# %%
# The distributions look quite different, and the non-human group has < 35 art pieces, so we
# rely on distribution-free tests. First, a Mann-Whitney U test:
# H0: There is no difference in the preferences for modern art vs. non-human art
# H1: There is a difference in the preferences for modern art vs. non-human art
u, p = stats.mannwhitneyu(ratings_for_modern_art.values.flatten(),
                          ratings_for_non_human_art.values.flatten(), alternative='two-sided')
print('u = ' + str(u) + ', p = ' + str(p))
if p < 0.05:
    print('Reject the null hypothesis, evidence suggests that there is a difference in the preferences for modern art vs. non-human art')
else:
    print('Fail to reject the null hypothesis, evidence suggests that there is no difference in the preferences for modern art vs. non-human art')
# %%
# Two-sample KS test for the same hypotheses
# H0: There is no difference in the preferences for modern art vs. non-human art
# H1: There is a difference in the preferences for modern art vs. non-human art
ks, p = stats.ks_2samp(ratings_for_modern_art.values.flatten(),
                       ratings_for_non_human_art.values.flatten())
print('ks = ' + str(ks) + ', p = ' + str(p))
if p < 0.05:
    print('Reject the null hypothesis, evidence suggests that there is a difference in the preferences for modern art vs. non-human art')
else:
    print('Fail to reject the null hypothesis, evidence suggests that there is no difference in the preferences for modern art vs. non-human art')
# %% [markdown]
# # 3. Do women give higher art preference ratings than men?
# %% [markdown]
# ## Data Loading & Cleaning
# %%
# Is the difference in ratings between women and men positive, i.e. (women - men) > 0?
# Gender is 0-indexed column 216 (column 217 in the 1-indexed description): 1 = male, 2 = female
# Male ratings (preference ratings are in columns 0-90)
male_ratings = user_ratings.loc[user_ratings[216] == 1,
                                user_ratings.columns[0:91]].dropna()
# Female ratings
female_ratings = user_ratings.loc[user_ratings[216] == 2,
                                  user_ratings.columns[0:91]].dropna()
# %% [markdown]
# ## Data Visualization
# %%
# Male vs. female ratings
plot_ratings_histogram('Male Ratings', 'Female Ratings',
                       male_ratings, female_ratings)
# %%
plot_ratings_histogram_wl('Male Ratings vs Female Ratings for Art',
                          male_ratings, 'Male Ratings', female_ratings, 'Female Ratings')
# %% [markdown]
# ## Significance Testing
# %%
# H0: The difference in art preference ratings between women and men (women - men) is less than or equal to 0
# HA: The difference in art preference ratings between women and men (women - men) is greater than 0
# Use the Mann-Whitney U test to test the null hypothesis
u, p = stats.mannwhitneyu(female_ratings.values.flatten(),
                          male_ratings.values.flatten(), alternative='greater')
print('u = ' + str(u) + ', p = ' + str(p))
if p < 0.05:
    print('Reject the null hypothesis, evidence suggests that the difference in art preference ratings between women and men (women-men) is greater than 0')
else:
    print('Fail to reject the null hypothesis, evidence suggests that the difference in art preference ratings between women and men (women-men) is less than or equal to 0')
# %%
# Two-sample KS test (two-sided, so it tests whether the two distributions differ at all)
ks, p = stats.ks_2samp(female_ratings.values.flatten(),
                       male_ratings.values.flatten(), alternative='two-sided')
print('ks = ' + str(ks) + ', p = ' + str(p))
if p < 0.05:
    print('Reject the null hypothesis, evidence suggests that the rating distributions of women and men differ')
else:
    print('Fail to reject the null hypothesis, no evidence that the rating distributions of women and men differ')
# %% [markdown]
# # 4. Is there a difference in the preference ratings of users with some art background (some art education) vs. none?
# %% [markdown]
# ## Data Loading & Cleaning
# %%
# Art education is 0-indexed column 218 (column 219 in the 1-indexed description)
no_art_edu_ratings = user_ratings.loc[user_ratings[218] == 0,
                                      user_ratings.columns[0:91]].dropna()
# != 0 would also match NaN rows (NaN != 0 is True), so we use > 0 instead
some_art_edu_ratings = user_ratings.loc[user_ratings[218] > 0,
                                        user_ratings.columns[0:91]].dropna()
# %% [markdown]
# ## Data Visualization
# %%
plot_ratings_histogram('No Art Education Ratings',
                       'Some Art Education Ratings', no_art_edu_ratings, some_art_edu_ratings)
# %%
plot_ratings_histogram_wl('No Art Education Ratings vs Some Art Education Ratings', no_art_edu_ratings,
                          'No Art Education Ratings', some_art_edu_ratings, 'Some Art Education Ratings')
# %% [markdown]
# ## Significance Testing
# %%
# H0: There is no difference in the preference ratings of users with some art background vs. none
# HA: There is a difference in the preference ratings of users with some art background vs. none
# Use the Mann-Whitney U test
u, p = stats.mannwhitneyu(no_art_edu_ratings.values.flatten(),
                          some_art_edu_ratings.values.flatten(), alternative='two-sided')
print('u = ' + str(u) + ', p = ' + str(p))
if p < 0.05:
    print('Reject the null hypothesis, evidence suggests that there is a difference in the preference ratings of users with some art background vs none')
else:
    print('Fail to reject the null hypothesis, evidence suggests that there is no difference in the preference ratings of users with some art background vs none')
# %%
# Two-sample KS test for the same hypotheses
ks, p = stats.ks_2samp(no_art_edu_ratings.values.flatten(),
                       some_art_edu_ratings.values.flatten(), alternative='two-sided')
print('ks = ' + str(ks) + ', p = ' + str(p))
if p < 0.05:
    print('Reject the null hypothesis, evidence suggests that there is a difference in the preference ratings of users with some art background vs none')
else:
    print('Fail to reject the null hypothesis, evidence suggests that there is no difference in the preference ratings of users with some art background vs none')
# %% [markdown]
# # 5. Build a regression model to predict art preference ratings from energy ratings only. Make sure to use cross-validation methods to avoid overfitting and characterize how well your model predicts art preference ratings.
# %% [markdown]
# ## Data Loading & Cleaning
# %%
# X -> energy ratings of the art pieces
# Y -> preference ratings of the art pieces
both_ratings = user_ratings[user_ratings.columns[0:182]].dropna()
# Get X (energy rating of each art piece for all users)
energy_ratings = both_ratings[both_ratings.columns[91:182]]
# Realign the column labels 91-181 to 0-90 so they match the preference-rating columns
energy_ratings.columns = energy_ratings.columns - 91
# Get Y (preference rating of each art piece for all users)
preference_ratings = both_ratings[both_ratings.columns[0:91]]
x = energy_ratings.values
y = preference_ratings.values
# Split
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=N_NUMBER)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
# %% [markdown]
# ## Elastic-Net Regularized Regression
# %%
es_net = ElasticNet(alpha=0.002, normalize=True)  # note: `normalize` was removed in sklearn >= 1.2; use a StandardScaler pipeline there
es_net.fit(x_train, y_train)
# Predict
y_pred = es_net.predict(x_test)
slope = es_net.coef_  # B1 (slopes)
intercept = es_net.intercept_  # B0 (intercept)
# Print summary statistics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print('RMSE: ', rmse)
tmp_str = 'predicted preference = '
# Build the regression equation as a string
for i in range(0, len(slope)):
    tmp_str += str(slope[i]) + ' * x' + str(i) + ' + '
tmp_str += str(intercept)
# print(tmp_str)
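# %% [markdown]
# The question asks for cross-validation, so as an added sketch (mirroring the CV cell in
# the XGBoost section below), here is a 5-fold CV estimate of the elastic net's RMSE
# (`normalize` is omitted here, since that parameter was removed in newer sklearn):
# %%
k_fold = KFold(n_splits=5, shuffle=True, random_state=N_NUMBER)
es_cv_rmse = -np.mean(cross_val_score(ElasticNet(alpha=0.002), x, y, cv=k_fold,
                                      scoring='neg_root_mean_squared_error'))
print('Elastic-Net CV-RMSE:', es_cv_rmse)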
# %% [markdown]
# ### Visualizing the Results
# %%
plot_3d_prf(x_test, y_test, y_pred, rmse, 'Elastic-Net (Energy Ratings)')
# %% [markdown]
# ## Extreme Gradient Boosting
# %%
x = energy_ratings.values
y = preference_ratings.values
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=N_NUMBER)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
# %%
xgbr_model = XGBRegressor(
    booster='gbtree', objective='reg:squarederror',
    seed=N_NUMBER, missing=0, n_estimators=200, max_depth=5)
# Note: early stopping on the test set leaks information into model selection;
# a separate validation split would be cleaner.
xgbr_model = xgbr_model.fit(x_train, y_train, early_stopping_rounds=5,
                            eval_metric='rmse', eval_set=[(x_test, y_test)], verbose=True)
y_hat = xgbr_model.predict(x_test)
# %%
k_fold = KFold(n_splits=5, shuffle=True, random_state=N_NUMBER)
cv_rmse = -np.mean(cross_val_score(xgbr_model, x, y, cv=k_fold,
                                   scoring='neg_root_mean_squared_error'))
print('CV-RMSE:', cv_rmse)
# %% [markdown]
# ### Visualizing the Results
# %%
importances = xgbr_model.feature_importances_
plot_importances(
    importances, x.shape[1], 'Feature importances of Energy Ratings on predicted preferences')
# %%
rmse = np.sqrt(mean_squared_error(y_test, y_hat))  # recompute RMSE for the XGBoost predictions
plot_3d_prf(x_test, y_test, y_hat, rmse,
            'XGBoost Regression with Energy Ratings')
# %% [markdown]
# # 6. Build a regression model to predict art preference ratings from energy ratings and demographic information. Make sure to use cross-validation methods to avoid overfitting and comment on how well your model predicts relative to the “energy ratings only” model.
# %% [markdown]
# ## Data Loading & Cleaning
# %%
# X -> energy ratings and demographic information
# Y -> preference ratings of the art pieces
all_columns = user_ratings[user_ratings.columns[0:219]]
# Drop the personality/action/self-image columns (0-indexed 182-214), keep demographics
all_columns = all_columns.drop(all_columns.columns[182:215], axis=1).dropna()
# Get X1 (energy rating of each art piece for all users)
energy_ratings = all_columns[all_columns.columns[91:182]]
# Get X2 (demographic data: age, gender, political orientation, art education)
demographic_data = all_columns[all_columns.columns[182:]]
# Get Y (preference rating of each art piece for all users)
preference_ratings = all_columns[all_columns.columns[0:91]]
x = np.concatenate((energy_ratings, demographic_data), axis=1)
y = preference_ratings.values
# Split
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=N_NUMBER)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
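# %% [markdown]
# A caveat worth noting: gender and political orientation are nominal codes, so feeding them
# in as plain numbers imposes an ordering. A hypothetical one-hot alternative (not used in
# the models below) could look like this with pandas:
# %%
# Hypothetical sketch only: one-hot encode the two nominal demographic columns.
_demo_onehot = pd.get_dummies(demographic_data,
                              columns=[demographic_data.columns[1],   # gender
                                       demographic_data.columns[2]])  # political orientation
print(_demo_onehot.shape)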
# %% [markdown]
# ## Multi-Regression
# %%
es_net = ElasticNet(alpha=0.06, random_state=N_NUMBER)
es_net.fit(x_train, y_train)
y_pred = es_net.predict(x_test)
slope = es_net.coef_  # B1 (slopes)
intercept = es_net.intercept_  # B0 (intercept)
# Print summary statistics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print('RMSE: ', rmse)
# %%
tmp_str = 'predicted preference = '
for i in range(0, len(slope) - 4):
    tmp_str += str(slope[i]) + ' * x' + str(i) + ' + '
tmp_h = {
    len(slope)-4: 'age',
    len(slope)-3: 'gender',
    len(slope)-2: 'political-orientation',
    len(slope)-1: 'art-education'
}
for i in range(len(slope) - 4, len(slope)):
    tmp_str += str(slope[i]) + ' * ' + tmp_h[i] + ' + '
tmp_str += str(intercept)
# print(tmp_str)
# %% [markdown]
# ### Visualizing the Results
# %%
plot_3d_prf(x_test, y_test, y_pred, rmse,
            'Elastic-Net Regression with Energy Ratings and Demographic Data', num_range=84)
# %% [markdown]
# ## Extreme Gradient Boosting Regression
# %%
x = np.concatenate((energy_ratings, demographic_data), axis=1)
y = preference_ratings.values
# Split
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=N_NUMBER)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
# %%
xgbr_model = XGBRegressor(
    booster='gbtree', objective='reg:squarederror',
    seed=N_NUMBER, missing=0, n_estimators=200, max_depth=5)
# As above, early stopping on the test set leaks information into model selection.
xgbr_model = xgbr_model.fit(x_train, y_train, early_stopping_rounds=5,
                            eval_metric='rmse', eval_set=[(x_test, y_test)], verbose=True)
y_hat = xgbr_model.predict(x_test)
# %%
k_fold = KFold(n_splits=5, shuffle=True, random_state=N_NUMBER)
cv_rmse = -np.mean(cross_val_score(xgbr_model, x, y, cv=k_fold,
                                   scoring='neg_root_mean_squared_error'))
print('CV-RMSE:', cv_rmse)
# %% [markdown]
# ### Visualizing the Results
# %%
importances = xgbr_model.feature_importances_
plot_importances(
    importances, x.shape[1], 'Feature importances of Energy Ratings and Demographic info. on predicted preferences')
# %%
rmse = np.sqrt(mean_squared_error(y_test, y_hat))  # recompute RMSE for the XGBoost predictions
plot_3d_prf(x_test, y_test, y_hat, rmse,
            'XGBoost Regression with Energy Ratings and Demographic Data', num_range=84)
# %% [markdown]
# # 7. Considering the 2D space of average preference ratings vs. average energy rating (that contains the 91 art pieces as elements), how many clusters can you – algorithmically - identify in this space? Make sure to comment on the identity of the clusters – do they correspond to particular types of art?
# %% [markdown]
# ## Data Preprocessing & Visualization
# %%
both_ratings = user_ratings[user_ratings.columns[0:182]].dropna()
# X1 (energy rating of each art piece for all users)
energy_ratings = both_ratings[both_ratings.columns[91:182]]
average_energy_ratings = energy_ratings.mean(axis=0)
average_energy_ratings.index -= 91
# Y (preference rating of each art piece for all users)
preference_ratings = both_ratings[both_ratings.columns[0:91]]
average_preference_ratings = preference_ratings.mean(axis=0)
# %%
# plot all the data including the outliers
plt.plot(average_preference_ratings, average_energy_ratings,
         color='r', marker='o', markersize=1, linestyle='None')
plt.xlabel('Average Preference Rating')
plt.xlim(0, 6)
plt.ylabel('Average Energy Rating')
plt.ylim(0, 6)
plt.title(
    'Average Preference Rating vs. Average Energy Rating\nfor All Art Pieces')
plt.show()
# %%
# Remove painting 46 (index 45 after realignment), a clear outlier
average_energy_ratings = average_energy_ratings.drop(45)
average_preference_ratings = average_preference_ratings.drop(45)
# Format data: stack the two averages into a (90, 2) array
data = np.column_stack((average_preference_ratings, average_energy_ratings))
x = data  # same array; kept under the name the clustering code below uses
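# %% [markdown]
# ## Choosing the number of clusters algorithmically
# One standard way to pick the number of clusters is to fit KMeans for a range of k and keep
# the k with the highest silhouette score. A minimal added sketch (the k range 2-9 is an
# assumption, not prescribed by the assignment):
# %%
best_k, best_score = None, -1.0
for k in range(2, 10):
    km_labels = KMeans(n_clusters=k, random_state=N_NUMBER).fit_predict(x)
    sil = silhouette_score(x, km_labels)
    if sil > best_score:
        best_k, best_score = k, sil
print('Best k by silhouette:', best_k, 'score:', round(best_score, 3))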
# %% [markdown]
# ## Attempt with DBSCAN
# %%
# Fit the model to our data (sklearn defaults are eps=0.5, min_samples=5):
dbscanModel = DBSCAN(min_samples=3, eps=0.18).fit(x)
# Get the cluster labels for each data point (-1 marks noise points):
labels = dbscanModel.labels_
# Plot the color-coded data:
numClusters = len(set(labels)) - (1 if -1 in labels else 0)
for ii in range(numClusters):
    labelIndex = np.argwhere(labels == ii)
    plt.plot(data[labelIndex, 0], data[labelIndex, 1], 'o',
             markersize=2, label='Cluster {}'.format(ii+1))