---
title: "Customer Behavioural Analytics in the Retail sector"
output:
  rmarkdown::html_document:
    toc: true
    toc_depth: 3
    number_sections: false
    theme: lumen
    df_print: paged
    toc_float:
      collapsed: true
      smooth_scroll: true
bibliography: bibliography.bib
---
<style>
body {
  text-align: justify;
}
</style>
#Customer Behavioural Analytics in the Retail Sector
<br/>
<font color="black"><h4><b>Team CuBA</b></h4>
Rutuja Shivraj Pawar (220051, rutuja.pawar@ovgu.de) <br />
Nadiia Honcharenko (220681, nadiia.honcharenko@st.ovgu.de) <br />
Shivani Jadhav (223856, shivani.jadhav@st.ovgu.de) <br />
Sumit Kundu (217453, sumit.kundu@st.ovgu.de)
<h4><b>Under the Guidance of</b></h4>
M.Sc. Uli Niemann
#Overview and Motivation
<font color ="black"> A customer is a key-centric factor for any business to be successful. Conventional wisdom tells us that the cost of retaining an existing customer is far less than acquiring a new one. In order that a business has a sustainable growth, the retention of its old customer base and expansion of the new customer base is very critical. This demands from a business to understand the behaviour of its customers in relation to the business. Therefore obtaining a 360° view of its customers is crucial for a business looking for a competitive edge in the market. In such a scenario, Customer Behavioural Analytics plays an important role in leveraging data analytics to find meaningful behavioural patterns in the customer-specific business data. <br/>
Consequently, this project aims to understand the consumer behaviour in the retail sector. Decoding the consumer behaviour will be based on understanding how consumers make purchase decisions and what factors influence those decisions. This project also aims to discover the existence of dependencies between customers, products and shops to highlight further insights about their behaviour. These meaningful insights will further help a business to implement strategies leading to an increased revenue through customer satisfaction.
#Project Objective
This project aims to address the problem of understanding the behaviour of customers of the Italian retail distribution company _Coop_ in a single Italian city. The project intends to discover different analytical insights about the purchase behaviour of the customers by answering the formulated Research Questions (RQs).
#Data Source
Supermarket aggr.Customer^[https://bigml.com/user/czuriaga/gallery/dataset/5559c2c6200d5a6570000084] <br />
The dataset to be used is the retail market data of one of the largest Italian retail distribution companies, _Coop_, for a single Italian city.<br />
The Supermarket aggr.Customer dataset used for the analysis contains customer data aggregated with information from shops^[http://www.michelecoscia.com/?page_id=379] [@pennacchioli2013explaining] and pivoted to new columns. The dataset contains 40 features with 60,366 instances and is approximately 14.0 MB in size.
#Related Work
The paper [@pennacchioli2013explaining] studies and highlights the product range effect, illustrating that customers focus on the needs satisfied by a product: the higher the satisfaction, the longer the distance customers are ready to travel to purchase the product, thereby paying more than the price of the product itself. The study is based on a data mining framework and introduces a new measure called "product sophistication" to study the range effect. The paper also studies the effect of the introduced measure on the accuracy of predicting the shop a customer selects to buy a particular product. Hence the paper studies customer behaviour in the retail sector mainly through this range effect.<br />
Our analysis, in contrast, is not carried out on the same datasets as used in the paper; our dataset contains customer data aggregated with shop information and pivoted to new columns. As we have new columns in our dataset, the perspective of our analysis differs. Based on our data and an understanding of its features, we framed different Research Questions to be answered from the data. These research questions do not focus on a single aspect (such as the range effect in the paper) but attempt to decode customer behaviour in several ways. Unlike the methodology used in the paper, this project obtains customer behavioural insights through the application of machine learning algorithms on the data. The project thereby studies whether customers are ready to travel long distances in spite of a high average price in a shop and which factors contribute to their long-distance travel. Additionally, the likelihood of a customer selecting a particular shop, the different customer segments and the top-100 most profitable customers are also studied within the scope of the project.
#Initial Questions
Below are the RQs formulated in the initial stages of the project based on a primary understanding of the data, before a detailed Exploratory Data Analysis:
__1. Are customers willing to travel long distances to purchase products?__ <br />
_Relevance:_ This will help to understand whether price is an important factor affecting the majority of customers' purchase decisions. <br />
__2. Which are the products for which customers are ready to travel long distances and for which products they select the closest shop?__ <br />
_Relevance:_ This will help to understand the nature of the products in the context of proximity. It is assumed that customers select the closest shops to buy daily products such as groceries but may travel long distances to buy one-time-purchase products such as kitchen equipment. As Data Science is results-driven and not based solely on intuition, this question can help to verify this assumption.<br />
__3. What is the maximum likelihood of a customer to select a particular shop to purchase a particular product?__ <br />
_Relevance:_ This will help to understand which shops in the retail chain are in demand for a particular product. This can further facilitate better stock management to meet customer demands.<br />
__4. What is the ranking of the shops in terms of attracting more customers and thus generating more revenue and what is their individual customer base?__<br />
_Relevance:_ This will help to understand the most popular shops in the retail chain and target different shop-level marketing schemes to the appropriate customers through customer segmentation. <br />
__5. Which are the customers that are most profitable in terms of revenue generation?__<br />
_Relevance:_ This will help to understand the customers with potential high Customer Lifetime Value and target appropriate loyalty programs to generate satisfied loyal customers as advocates for the business.
#Data Wrangling
Data Wrangling consists of the different steps that transform raw data into a clean form which is appropriate and accurate for data analysis. Below are the steps carried out as part of Data Wrangling:
###Examination of Input Dataset
__Visualize the input dataset__
```{r echo=TRUE, message=FALSE}
library(tidyverse)
library(DataExplorer)
# Read data from the input csv file
filepath<- "Input Dataset/Supermarket aggr.Customer.csv"
supermarket_data <- read_csv(filepath)
# Convert the data to the tbl class, as tbls are easier to examine than plain data frames and display in a spreadsheet-like view
supermarket_tbl<-tbl_df(supermarket_data)
```
```{r echo=TRUE, message=FALSE}
# Check the dimension of the input dataset and the variables through a plot
plot_str(supermarket_tbl)
```
__Generate the input dataset statistics__
```{r echo=TRUE, message=FALSE}
# Data Statistics
gather(introduce(supermarket_tbl))
```
__Generate the input dataset summary__
```{r echo=TRUE, message=FALSE}
# Data Summary, p0 = min value, p100 = max value
library(skimr)
skim_with(integer = list(hist = NULL, p25 = NULL, p50 = NULL, p75 = NULL))
skim_with(numeric = list(hist = NULL, p25 = NULL, p50 = NULL, p75 = NULL))
skim(supermarket_tbl) %>% skimr::kable()
```
###Dataset Cleaning and Processing
__Eliminate the missing values in the input dataset__
```{r echo=TRUE, message=FALSE}
# Eliminate the missing values in the dataset
supermarket_tbl_Clean1<-na.omit(supermarket_tbl)
na.action(supermarket_tbl_Clean1)
# Percentage of data missing (Should be 0%)
missing_data_count = sum(!complete.cases(supermarket_tbl_Clean1))
total_data = dim(supermarket_tbl_Clean1)[1]
missing_data_percent = (missing_data_count/total_data) * 100
missing_data_percent
```
__Eliminate the duplicate rows in the input dataset__
```{r echo=TRUE, message=FALSE}
# Eliminate duplicate rows (assign the result back, otherwise the deduplication is lost)
supermarket_tbl_Clean1 <- distinct(supermarket_tbl_Clean1)
```
__Round the decimal values in the input dataset__
```{r echo=TRUE, message=FALSE}
# Round the decimal value columns up to 4 decimal places
is.num <- sapply(supermarket_tbl_Clean1, is.numeric)
supermarket_tbl_Clean1[is.num] <- lapply(supermarket_tbl_Clean1[is.num], round, 4)
```
__Rename the column names in the input dataset__
```{r echo=TRUE, message=FALSE}
# Rename columns to more descriptive names (new_name = old_name)
supermarket_tbl_Clean1 <- supermarket_tbl_Clean1 %>%
  rename(
    products_purchased_total = products_purchased,
    shops_used_total = shops_used,
    amount_purchased_total = amount_purchased,
    min_dist_to_custSel_shops = min_distance_to_shops,
    max_dist_to_custSel_shops = max_distance_to_shops,
    unique_products_purchased_total_exclCommon = unique_products_purchased,
    avg_distance_to_all_shops = avg_distance_to_shops,
    avg_product_price_shop_1 = avg_price_shop_1,
    avg_product_price_shop_2 = avg_price_shop_2,
    avg_product_price_shop_3 = avg_price_shop_3,
    avg_product_price_shop_4 = avg_price_shop_4,
    avg_product_price_shop_5 = avg_price_shop_5,
    avg_purchased_product_price_allShops = avg_price,
    avg_purchase_amount_shop_1 = avg_purchase_shop_1,
    avg_purchase_amount_shop_2 = avg_purchase_shop_2,
    avg_purchase_amount_shop_3 = avg_purchase_shop_3,
    avg_purchase_amount_shop_4 = avg_purchase_shop_4,
    avg_purchase_amount_shop_5 = avg_purchase_shop_5,
    avg_purchase_amount_allShops = avg_purchase
  )
```
__Reorder the columns in the input dataset__
```{r echo=TRUE, message=FALSE}
# Reorder Columns
supermarket_tbl_Clean1 <- supermarket_tbl_Clean1[c(1,10,11,12,13,14,15,3,4,2,16,17,18,19,20,5,21,22,23,24,25,6,36,37,38,39,40,9,26,27,28,29,30,7,31,32,33,34,35,8)]
```
__Write the cleaned dataset tbl to a CSV file__
```{r echo=TRUE, message=FALSE}
# Write the cleaned data tbl to csv
clean_filepath = "~/R GitHub/Data-Science-with-R/Input Dataset/Cleaned Dataset/Supermarket_DataCleaned.csv"
write.csv(supermarket_tbl_Clean1, file = clean_filepath, row.names = FALSE)
```
###Exploration of Cleaned Dataset
__Visualize the cleaned dataset__
```{r echo=TRUE}
# Check the dimension of the cleaned dataset and the variables
plot_str(supermarket_tbl_Clean1)
```
__Generate the cleaned dataset statistics__
```{r echo=TRUE, message=FALSE}
# Cleaned data Statistics
gather(introduce(supermarket_tbl_Clean1))
```
__Analyze the Continuous Variables in the cleaned dataset as a Histogram__
```{r echo=TRUE, message=FALSE}
# Analyze Continuous Variables in the cleaned dataset (Univariate Analysis)
plot_histogram(supermarket_tbl_Clean1)
```
_Interpretation:_ The plotted histograms depict the distribution of each continuous variable in the dataset. These plots show the spread of the data and whether it is symmetric or skewed, graphically summarizing each univariate distribution.
__Examine the correlated features in the cleaned dataset through Correlation Analysis__
```{r echo=TRUE, message=FALSE}
# Correlation analysis (Multivariate Analysis, on continuous features only) to examine correlated features in the cleaned dataset
library(ggcorrplot)
corr <- round(cor(supermarket_tbl_Clean1), 1)
ggcorrplot(corr, outline.col = "white") +geom_tile(height=1.8, width=1.8) +
scale_fill_gradient2(low="blue", mid="white", high="red") +
theme_minimal() +
coord_equal() +
labs(x="",y="",fill="Correlation coefficient") +
theme(axis.text.x=element_text(size=7, angle=90, vjust=1, hjust=1,
margin=margin(-3,0,0,0)),
axis.text.y=element_text(size=7, margin=margin(0,-3,0,0)),
panel.grid.major=element_blank())
```
_Interpretation:_ The Correlation plot depicts the association between the variables in the dataset and the degree of association between them is displayed by the variation in the correlation coefficient color. The plot helps to understand the relationship between the different variables in the dataset.
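To make the plot easier to act on, the short sketch below (reusing the `corr` matrix computed above) lists the most strongly correlated feature pairs; the 0.9 cutoff is an arbitrary illustrative threshold, not a fixed rule.
```{r echo=TRUE, message=FALSE}
# List the most strongly correlated feature pairs from the corr matrix above
corr_pairs <- as.data.frame(as.table(corr)) %>%
  filter(as.character(Var1) < as.character(Var2), abs(Freq) >= 0.9) %>%
  arrange(desc(abs(Freq)))
head(corr_pairs, 10)
```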
__Visualize the deviation from a specific probability distribution in the cleaned dataset through Quantile-Quantile plot__
```{r echo=TRUE, message=FALSE}
# Quantile-Quantile plot to visualize the deviation from a specific probability distribution in the cleaned dataset
plot_qq(supermarket_tbl_Clean1)
```
_Interpretation:_ The Quantile-Quantile plot compares two probability distributions through plotting their quantiles against each other. This plot depicts if both sets of quantiles are from the same distribution (points form a roughly straight line), helping to further understand the data distribution.
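As a complement, a single variable can be checked against the normal distribution with base R's `qqnorm()`/`qqline()`; `amount_purchased_total` is used here purely as a representative column.
```{r echo=TRUE, message=FALSE}
# Q-Q plot of one representative column against the normal distribution
qqnorm(supermarket_tbl_Clean1$amount_purchased_total)
qqline(supermarket_tbl_Clean1$amount_purchased_total)
```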
###Data Preparation
__Data Preparation for RQ1__
The unimportant and redundant features providing no meaningful information in the given context are first removed from the dataset. The dataset has no ground truth associated with it; since RQ1 requires one, a class label was created based on certain assumptions. It is assumed that the shop a customer visits the most is their most preferred shop. New columns were generated to find the closest shop (a), the shop from which most products are purchased (b), the shop from which most unique products are purchased (c), the shop offering the least average product price (d) and the shop at which the maximum amount of money is spent by a customer (e). The most frequent shop amongst (b), (c), (d) and (e) is assigned as the most preferred shop (f). The customers were then categorized by comparing (a) with (f). If the values match, the class 'no' is assigned, based on the assumption that such customers have chosen the closest shop as their most preferred shop because they do not like to travel long distances; the class 'yes' is assigned otherwise. Finally, the dataset is re-arranged and stored along with the newly created class label.
```{r echo=TRUE, message=FALSE}
library(tidyverse)
library(dplyr)
## Set file path
file_path <- "Input Dataset/Cleaned Dataset/Supermarket_DataCleaned.csv"
## Read data from a file
supermarket_data_clean <- read_csv(file_path)
## List of all features present in the data frame
all_features <- colnames(supermarket_data_clean)
## List of selected features
sel_features <- all_features[-c(1,2,8,9,10,16,22,28,34,35,36,37,38,39,40)]
## Create a data frame with only the selected features
supermarket_data_model <- supermarket_data_clean %>% select(sel_features)
## Generate new columns to find the closest shop, the shop from which most products are purchased, the shop from which most unique products are purchased, the shop offering the least average product price and the shop at which the maximum amount of money is spent by a customer.
supermarket_data_model$min_dist <- str_sub(colnames(supermarket_data_model[,1:5]),-1,-1)[apply(supermarket_data_model[,1:5],1,which.min)]
supermarket_data_model$most_prod_purch_from <- str_sub(names(supermarket_data_model[,6:10]),-1,-1)[max.col(supermarket_data_model[,6:10], "last")]
supermarket_data_model$most_uni_prod_purch_from <- str_sub(names(supermarket_data_model[,11:15]),-1,-1)[max.col(supermarket_data_model[,11:15], "last")]
supermarket_data_model$least_avg_prod_pri <- str_sub(colnames(supermarket_data_model[,16:20]),-1,-1)[apply(supermarket_data_model[,16:20],1,which.min)]
supermarket_data_model$max_amt_purch <- str_sub(names(supermarket_data_model[,21:25]),-1,-1)[max.col(supermarket_data_model[,21:25], "last")]
## Create a data frame having only the newly generated columns
test <- supermarket_data_model[,26:30]
## Generate new columns - most preferred shop and class to which the customer belongs
for (row in 1:nrow(test)){
## Create a vector for each row
vec <- c(test[row, "most_prod_purch_from"], test[row, "most_uni_prod_purch_from"], test[row, "least_avg_prod_pri"], test[row, "max_amt_purch"])
## Sort and find the most preferred shop
supermarket_data_model[row, "most_pref"] <- names(sort(summary(as.factor(unlist(vec))), decreasing=T)[1:1])
## Assign labels to customers (0 or 'no' - 'Not willing to travel far for shopping' and 1 or 'yes' - 'Willing to travel far for shopping')
if (supermarket_data_model[row, "min_dist"] == supermarket_data_model[row, "most_pref"]){
supermarket_data_model[row, "class"] <- 'No'
} else{
supermarket_data_model[row, "class"] <- 'Yes'
}
}
## Re-order columns
supermarket_data_class <- supermarket_data_model[c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,32)]
## Set file path
clean_filepath = "~/R GitHub/Data-Science-with-R/Input Dataset/Cleaned Dataset/Supermarket_Data_Classification.csv"
## Write the dataframe to csv file
write.csv(supermarket_data_class, file = clean_filepath, row.names = FALSE)
```
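As a quick sanity check on the generated ground truth, the class distribution can be tabulated; a heavily skewed split would already hint at the answer to RQ1.
```{r echo=TRUE, message=FALSE}
# Distribution of the newly created class label
table(supermarket_data_class$class)
```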
__Data Preparation for RQ3__
The unimportant and redundant features providing no meaningful information in the given context are first removed from the dataset. The most preferred shop for each customer, based on the same assumption used for RQ1, is generated and used as the class label. New columns are generated to find the average distance travelled (a), the average number of products purchased (b), the average number of unique products purchased (c), the average product price (d) and the average amount spent (e) for each customer. Further, a dataset using these newly created columns - (a), (b), (c), (d) and (e) - is generated, to be used for creating a model that predicts the shop most likely to be selected by a customer as their most preferred shop.
```{r echo=TRUE, message=FALSE}
library(tidyverse)
library(dplyr)
## Set file path
file_path <- "Input Dataset/Cleaned Dataset/Supermarket_DataCleaned.csv"
## Read data from a file
supermarket_data_clean <- read_csv(file_path)
## Generate columns to find the average distance travelled, average number of products purchased, average number of unique products purchased, average product price
## and average amount spent by each customer
supermarket_data_clean$distance_avg <- with(supermarket_data_clean, avg_distance_to_all_shops)
supermarket_data_clean$products_purchased_avg <- with(supermarket_data_clean, round(products_purchased_total/shops_used_total))
supermarket_data_clean$unique_products_purchased_avg <- with(supermarket_data_clean, round(unique_products_purchased_total_exclCommon/shops_used_total))
supermarket_data_clean$product_price_avg<- with(supermarket_data_clean, avg_purchased_product_price_allShops)
supermarket_data_clean$amount_purchased_avg<- with(supermarket_data_clean, avg_purchase_amount_allShops)
## Generate new columns to find the shop from which most products are purchased, the shop from which most unique products are purchased,
## the shop offering the least average product price and the shop at which the maximum amount of money is spent by a customer.
supermarket_data_clean$most_prod_purch_from <- str_sub(names(supermarket_data_clean[,11:15]),-6,-1)[max.col(supermarket_data_clean[,11:15], "random")]
supermarket_data_clean$most_uni_prod_purch_from <- str_sub(names(supermarket_data_clean[,17:21]),-6,-1)[max.col(supermarket_data_clean[,17:21], "random")]
supermarket_data_clean$least_avg_prod_pri <- str_sub(colnames(supermarket_data_clean[,23:27]),-6,-1)[apply(supermarket_data_clean[,23:27],1,which.min)]
supermarket_data_clean$max_amt_purch <- str_sub(names(supermarket_data_clean[,29:33]),-6,-1)[max.col(supermarket_data_clean[,29:33], "random")]
## Create a data frame having only the newly generated columns
gen_df <- supermarket_data_clean[,46:49]
## Generate a new column - the most preferred shop for each customer
for (row in 1:nrow(gen_df)){
## Create a vector for each row
vec <- c(gen_df[row, "most_prod_purch_from"], gen_df[row, "most_uni_prod_purch_from"], gen_df[row, "max_amt_purch"])
## Sort and find the most preferred shop
supermarket_data_clean[row, "most_pref_shop"] <- names(sort(summary(as.factor(unlist(vec))), decreasing=T)[1:1])
}
## Re-order columns
supermarket_data_predict <- supermarket_data_clean[c(41,42,43,44,45,50)]
## Set file path
clean_filepath = "~/R GitHub/Data-Science-with-R/Input Dataset/Cleaned Dataset/Supermarket_Data_Prediction.csv"
## Write the dataframe to csv file
write.csv(supermarket_data_predict, file = clean_filepath, row.names = FALSE)
```
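Analogously, a quick tabulation of the generated label shows how the most preferred shops are distributed across customers before any modelling.
```{r echo=TRUE, message=FALSE}
# Distribution of the most preferred shop label
table(supermarket_data_predict$most_pref_shop)
```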
#Exploratory Data Analysis
Exploratory Data Analysis (EDA) is the process of visualizing the main characteristics of the data before formal modelling, to discover data patterns and verify the initial assumptions made about the data. Below are the visualizations from the EDA carried out:
__1) Visualize the highest revenue generating shops, shops selling the highest number of products, shops selling the highest number of unique products and the relation between them__
```{r echo=TRUE, message=FALSE}
library(ggplot2)
library(RColorBrewer)
library(scales)
file_path<- "Input Dataset/Cleaned Dataset/Supermarket_DataCleaned.csv"
supermarket_data_clean <- read_csv(file_path)
Shop<- c(1,2,3,4,5)
# Revenue generation by Shops 1-5
# columns selected: amount_purchased_shop_1, 2, 3, 4, 5
slice1<-select(supermarket_data_clean, 29,30,31,32,33)
amountS1<-sum(slice1$amount_purchased_shop_1)
amountS2<-sum(slice1$amount_purchased_shop_2)
amountS3<-sum(slice1$amount_purchased_shop_3)
amountS4<-sum(slice1$amount_purchased_shop_4)
amountS5<-sum(slice1$amount_purchased_shop_5)
# create data frame for Revenue generation
Revenue_Generated<- c(amountS1,amountS2,amountS3,amountS4,amountS5)
Revenue<- data.frame(Shop, Revenue_Generated)
rownames(Revenue) <- NULL
# Products Sold by Shops 1-5
# columns selected: products_purchased_shop_1, 2, 3, 4, 5
slice2<-select(supermarket_data_clean, 11,12,13,14,15)
productsS1<-sum(slice2$products_purchased_shop_1)
productsS2<-sum(slice2$products_purchased_shop_2)
productsS3<-sum(slice2$products_purchased_shop_3)
productsS4<-sum(slice2$products_purchased_shop_4)
productsS5<-sum(slice2$products_purchased_shop_5)
# create data frame for Products Sold
Products_Purchased<- c(productsS1,productsS2,productsS3,productsS4,productsS5)
ProductsSold<- data.frame(Shop, Products_Purchased)
rownames(ProductsSold) <- NULL
# Unique products Sold by Shops 1-5
# columns selected: unique_products_purchased_shop_1,2,3,4,5
slice3<-select(supermarket_data_clean, 17,18,19,20,21)
uproductsS1<-sum(slice3$unique_products_purchased_shop_1)
uproductsS2<-sum(slice3$unique_products_purchased_shop_2)
uproductsS3<-sum(slice3$unique_products_purchased_shop_3)
uproductsS4<-sum(slice3$unique_products_purchased_shop_4)
uproductsS5<-sum(slice3$unique_products_purchased_shop_5)
# create data frame for Unique products Sold
UProducts_Purchased<- c(uproductsS1,uproductsS2,uproductsS3,uproductsS4,uproductsS5)
UProductsSold<- data.frame(Shop, UProducts_Purchased)
rownames(UProductsSold) <- NULL
# Plot a Bar graph to depict the above calculated data
Legends <-c(rep("Revenue Generated", 5), rep("Products Sold", 5), rep("Unique Products Sold", 5))
values <-c(Revenue_Generated, Products_Purchased, UProducts_Purchased)
mydata <-data.frame(Shop, values)
p <-ggplot(mydata, aes(Shop, values))
p +geom_bar(stat = "identity", aes(fill = Legends), position = "dodge") +
xlab("Shop") + ylab("Total") +
ggtitle("Relation between Revenue and Products Sold") +
theme_bw() + scale_y_continuous(labels = scales::comma)
```
_Analysis:_ As visualized, the shops ordered by revenue generated (highest first) are Shops 1, 2, 3, 5, 4.
The shops ordered by number of products sold are likewise Shops 1, 2, 3, 5, 4.
Ordered by number of unique products sold, the leading shops are 1, 2 and 3, with Shops 4 and 5 on the same level.
The relation between these measures, as visualized by the shop ordering, is that the shop generating the highest revenue also sells the most products (unique included). So in this dataset, revenue generated and products sold are directly related, and the ordering of the shops remains mostly stable at Shops 1, 2, 3, 5, 4.
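The claimed direct relationship can also be checked numerically; the sketch below correlates the per-shop totals computed above (with only five shops, this is indicative rather than conclusive).
```{r echo=TRUE, message=FALSE}
# Correlation between per-shop revenue and products sold (n = 5 shops)
cor(Revenue_Generated, Products_Purchased)
```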
__2) Visualize the approximate customer base count for the different shops__
```{r echo=TRUE, message=FALSE}
# Approximate Customer Base for Shops 1-5
C1<-slice2$products_purchased_shop_1
custS1<-length(which(C1 !=0))
C2<-slice2$products_purchased_shop_2
custS2<-length(which(C2 !=0))
C3<-slice2$products_purchased_shop_3
custS3<-length(which(C3 !=0))
C4<-slice2$products_purchased_shop_4
custS4<-length(which(C4 !=0))
C5<-slice2$products_purchased_shop_5
custS5<-length(which(C5 !=0))
# create data frame for Approximate Customer Base
Customers<- c(custS1,custS2,custS3,custS4,custS5)
TotalCustomers<- data.frame(Shop, Customers)
rownames(TotalCustomers) <- NULL
# Plot a Bar graph to depict the approximate Customer Base
values <-c(Customers)
mydata <-data.frame(Shop, values)
p <-ggplot(mydata, aes(Shop, values))
p +geom_bar(stat = "identity", fill = "gray" , position = "dodge", color = "black") +
xlab("Shop") + ylab("Total") +
ggtitle("Customer Base") +
theme_bw()
```
_Analysis:_ As visualized, the shops ordered by approximate customer base (highest first) are Shops 1, 2, 3, 5, 4. A shop's approximate customer base indicates its popularity in terms of customers purchasing products from it, and this ordering matches the revenue ranking seen above.
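To put the bar heights into perspective, each shop's approximate customer base can also be expressed as a share of all customers; this is a minimal sketch reusing the counts computed above.
```{r echo=TRUE, message=FALSE}
# Approximate customer base per shop as a percentage of all customers
round(100 * Customers / nrow(supermarket_data_clean), 1)
```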
__3) Visualize the relationship between average prices and distances to the shop__
```{r echo=TRUE, message=FALSE, warning=FALSE}
library(modelr)
library(gridExtra)
cleared_supermarket_data<-read_csv(file_path)
cleared_supermarked_tbl <- tbl_df(cleared_supermarket_data)
shop_ordered_slice <- select(cleared_supermarked_tbl, 3,23,4,24,5,25,6,26,7,27) %>%
bind_cols(cleared_supermarked_tbl[,8:10], cleared_supermarked_tbl[,28])
# Splitting the data
get_slice_for_shop <- function(col1, col2){
shop_slice <- shop_ordered_slice[,col1:col2]
colnames(shop_slice) <- c("distance","price")
return(shop_slice)
}
shop_1_data <- get_slice_for_shop(1,2)
shop_2_data <- get_slice_for_shop(3,4)
shop_3_data <- get_slice_for_shop(5,6)
shop_4_data <- get_slice_for_shop(7,8)
shop_5_data <- get_slice_for_shop(9,10)
shop_avg_data <- get_slice_for_shop(13,14)
shop_agg_min_data <- get_slice_for_shop(11,14)
shop_agg_max_data <- get_slice_for_shop(12,14)
# Combine data to the one mutated table to show all shops at the one graph
joined_shops_data <- mutate(shop_1_data, Shop="1") %>%
union_all(mutate(shop_2_data, Shop="2")) %>%
union_all(mutate(shop_3_data, Shop="3")) %>%
union_all(mutate(shop_4_data, Shop="4")) %>%
union_all(mutate(shop_5_data, Shop="5"))
# Create base for plots
get_base_for_plot <- function(dataset, caption){
plot_base <- ggplot(data = dataset, mapping = aes(x = distance, y = price)) + ggtitle(caption)
return(plot_base)
}
# Colours list
colours_shema <- c("Red", "Green", "Yellow", "Pink", "Blue", "Purple", "steelblue1", "tomato1")
#create scatter plot
add_geom_point <- function(colorNum){
geom_p <- geom_point(colour=colours_shema[colorNum], alpha=0.3)
return(geom_p)
}
#draw scatter plot with plot base
draw_cov_point_plot <- function(dataset, colorNum, caption){
cov_geom_plot <- get_base_for_plot(dataset, caption) + add_geom_point(colorNum) +
scale_y_continuous(trans="log2")+
geom_smooth(stat = 'smooth', color = 'Black', method = 'gam', formula = y ~ s(x, bs = "cs"))
return(cov_geom_plot)
}
p1_1 <- draw_cov_point_plot(shop_1_data, 1, "Shop 1") + theme_bw()
p2_1 <- draw_cov_point_plot(shop_2_data, 2, "Shop 2") + theme_bw()
p3_1 <- draw_cov_point_plot(shop_3_data, 3, "Shop 3") + theme_bw()
p4_1 <- draw_cov_point_plot(shop_4_data, 4, "Shop 4") + theme_bw()
p5_1 <- draw_cov_point_plot(shop_5_data, 5, "Shop 5") + theme_bw()
pavg_1 <- draw_cov_point_plot(shop_avg_data, 6, "Average price with average distance") + theme_bw()
pmin_1 <- draw_cov_point_plot(shop_agg_min_data, 7, "Average price with min distance") + theme_bw()
pmax_1 <- draw_cov_point_plot(shop_agg_max_data, 8, "Average price with max distance") + theme_bw()
pall_1 <- get_base_for_plot(joined_shops_data, "All shops") + geom_point(mapping = aes(colour = Shop), alpha=0.3) + theme_bw()
comb_cov_shops <- grid.arrange(p1_1, p2_1, p3_1, p4_1, p5_1,
nrow=2, ncol=3,
top="Covariation between distances and average prices")
comb_cov_aggrs <- grid.arrange(pmin_1, pmax_1,
nrow=2,
top= "Covariation between min/max distances and average prices")
comb_cov_avg <- grid.arrange(pall_1, pavg_1,
nrow=2,
top= "Covariation between distances and average prices (aggregated)")
```
_Analysis:_ There is a strong dependency between the average distance and the average price in a shop. Also, many average price values lie close to zero, so it makes sense to check for zero price values in the current dataset and their influence.
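To quantify this visual impression, the rank correlation between average distance and average price can be computed on the aggregated slice; Spearman's method is used here because the relationship does not look linear.
```{r echo=TRUE, message=FALSE}
# Rank correlation between average distance and average price
cor(shop_avg_data$distance, shop_avg_data$price, method = "spearman")
```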
__4) Visualize data gap for the average price in each shop__
```{r echo=TRUE, message=FALSE, warning=FALSE}
prices <- shop_ordered_slice[,seq(2, 10 ,2)]
names(prices) <- c("Shop 1", "Shop 2", "Shop 3", "Shop 4", "Shop 5")
dataset_with_na <- data.frame(sapply(prices, function(x) {
na_if(x,0)
} ))
plot_missing(dataset_with_na)
```
_Analysis:_ As visualized, there is a data gap in the average price values for each shop. The average price of a shop is only filled in for a customer if that customer actually prefers the shop; otherwise it is left as zero. This can be considered a data gap, but at the same time it is information that is never utilized during the analysis, since it simply means the customer does not choose that particular shop in the first place. This data gap therefore does not affect the tendency of the relationship between price and distance.
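The extent of this gap can be summarized per shop; the sketch below reports the share of customers with a zero average price in each shop, reusing the `prices` slice from the previous chunk.
```{r echo=TRUE, message=FALSE}
# Share of customers with a zero (i.e. missing) average price, per shop
sapply(prices, function(x) round(mean(x == 0), 3))
```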
__5) Visualize patterns for the average price__
```{r echo=TRUE, message=FALSE}
joined_shops_without_null <- filter(joined_shops_data, price != 0)
mod <- lm(log(price) ~ distance, data = joined_shops_without_null)
joined_shops_data2 <- joined_shops_without_null %>%
add_residuals(mod) %>%
mutate(resid = exp(resid))
pall_4 <- ggplot(data = joined_shops_data2, aes(x = Shop, y = resid)) +
geom_bar(stat = "identity", fill = colours_shema[6]) + ggtitle("Average price pattern") +
theme_bw()
pall_4
```
_Analysis:_ As visualized, the residuals give a view of the average price after removing the distance effect. Once the strong relationship between distance and price has been removed, relationships to other external factors become noticeable.
__6) Visualize the revenue generated by each shop__
```{r echo=TRUE, message=FALSE}
## Call the packages
library(tidyverse)
library(ggplot2)
library("RColorBrewer")
library(plotly)
file_path <- "Input Dataset/Cleaned Dataset/Supermarket_DataCleaned.csv"
supermarket_data_clean <- read_csv(file_path)
### Loyalty Score based on Revenue ###
## Calculate the revenue generated for each shop
revenue_shop_1 <- sum(supermarket_data_clean$amount_purchased_shop_1)
revenue_shop_2 <- sum(supermarket_data_clean$amount_purchased_shop_2)
revenue_shop_3 <- sum(supermarket_data_clean$amount_purchased_shop_3)
revenue_shop_4 <- sum(supermarket_data_clean$amount_purchased_shop_4)
revenue_shop_5 <- sum(supermarket_data_clean$amount_purchased_shop_5)
## Create a vector
revenue <- c(revenue_shop_1, revenue_shop_2, revenue_shop_3, revenue_shop_4, revenue_shop_5)
shops <- c("shop 1", "shop 2", "shop 3", "shop 4", "shop 5")
## Create a data frame to store the vectors
revenue_per_shop <- data.frame(shops, revenue)
## Generate a plot
rps_plot <- ggplot(revenue_per_shop, aes(shops, revenue))
## Add features to the plot
rps_plot + geom_bar(stat = "identity", width = 0.6, position = "dodge2") +
xlab("Shops") + ylab("Revenue") +
ggtitle("Revenue Generated") +
theme_bw() + scale_y_continuous(labels = scales::comma)
```
_Analysis:_ As visualized, the revenue generated by each shop can be calculated and used to generate a list of the top N customers based on their contribution, as sketched below.
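As an illustrative sketch of how such a top-N list could be derived, the snippet below ranks customers by their spend in shop 1, assuming the first column of the cleaned dataset holds the customer identifier; the same pattern applies to the other shops and to a full top-100 list.
```{r echo=TRUE, message=FALSE}
# Sketch: top 10 customers by amount spent in shop 1
supermarket_data_clean %>%
  select(1, amount_purchased_shop_1) %>%
  arrange(desc(amount_purchased_shop_1)) %>%
  head(10)
```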
__7) Visualize the most preferred shop by the customer with respect to average unique products purchased and average products purchased__
```{r echo=TRUE, message=FALSE}
library(tidyverse)
library(ggplot2)
library("RColorBrewer")
library(readr)
library(dplyr)
file_path<- "Input Dataset/Cleaned Dataset/Supermarket_Data_Prediction.csv"
supermarket_data_predict <- read_csv(file_path)
ggplot(data = supermarket_data_predict,
       aes(x = products_purchased_avg, y = unique_products_purchased_avg)) +
  geom_point(aes(colour = factor(most_pref_shop))) +
  labs(colour = "Most_Pref_Shop") +
  theme_bw() +
  xlab("Avg products purchased") +
  ylab("Avg unique products purchased")
```
_Analysis:_ As visualized, it can be inferred that, given the data, it is possible to predict which shop a customer would select.
#Final Research Questions
Exploratory Data Analysis provided a feasibility check on the initial RQs formulated. The EDA phase helped to obtain a better understanding of the data in relation to the project objective, leading to the modification and removal of RQs as well as the addition of new ones. Below is the final set of RQs to be answered through this project:
__1. Are customers willing to travel long distances to purchase products?__ <br />
_Relevance:_ This will help to understand the majority of the customer trends towards long distance travel to purchase products.<br />
__2. What are the factors that contribute towards the long distance travel of the customer to purchase products?__ <br />
_Relevance:_ This will help to understand the important factors that contribute towards the majority of the customers willing to travel long distances to purchase products, in turn better understanding the purchase behaviour of the customers.<br />
__3. What is the maximum likelihood of a customer to select a particular shop?__ <br />
_Relevance:_ This will help to understand which shops in the retail chain are most likely to be preferred by new customers. This can further facilitate better stock management to meet increasing customer demands.<br />
__4. What are the different customer segments based on their purchase behaviour?__<br />
_Relevance:_ This will help to understand the groups of similar customers based on their purchase behaviour and to target different shop-level marketing schemes to the appropriate customers.<br />
__5. Which are the Top 100 customers that are most profitable in terms of revenue generation for each shop?__<br />
_Relevance:_ This will help to understand the top profitable customers for the business and help to target appropriate loyalty programs to generate satisfied loyal customers as advocates for the business.
#Final Analysis
__1. Are customers willing to travel long distances to purchase products?__ <br />
__Algorithms selected:__ Support Vector Machine (SVM), K-nearest neighbor (k-NN), Random Forest
__Reason for Algorithm Selections:__ Given that the relationship between the variables may be non-linear and that we are handling a two-class classification problem, SVM is an accurate choice [@hsu2003practical].
The k-NN algorithm generally runs slower and has lower accuracy than SVM, but exhibits certain practical qualities. As a lazy algorithm, k-NN is easy to train; consequently, it is easy to use and its results are easy to understand [@soucy2001simple].
In comparison with k-NN, Random Forest is a great algorithm to train early in the model development process, to see how it performs. Regarding the difference between tree-based and vector-based algorithms, it should be mentioned that Random Forest can use the data as it is, whereas SVM maximizes the "margin" and thus relies on a concept of "distance" between points. This tree-based algorithm is also very hard to beat in terms of performance. Moreover, in contrast to SVM and k-NN, Random Forest does not demand parameter tuning to reach high accuracy [@liaw2002classification].
__Features Selected:__ distance_shop_1, distance_shop_2, distance_shop_3, distance_shop_4, distance_shop_5, products_purchased_shop_1, products_purchased_shop_2, products_purchased_shop_3, products_purchased_shop_4, products_purchased_shop_5, unique_products_purchased_shop_1, unique_products_purchased_shop_2, unique_products_purchased_shop_3, unique_products_purchased_shop_4, unique_products_purchased_shop_5, avg_product_price_shop_1, avg_product_price_shop_2, avg_product_price_shop_3, avg_product_price_shop_4, avg_product_price_shop_5, amount_purchased_shop_1, amount_purchased_shop_2, amount_purchased_shop_3, amount_purchased_shop_4, amount_purchased_shop_5, class are of importance and are selected for this RQ.
__Analysis__
__1) Data preparation for classification__
```{r echo=TRUE, message=FALSE}
library(caret)
library(randomForest)
library(e1071)
library(ggplot2)
library(dplyr)
library(tidyverse)
# Creating of useful functions
create_conf_matrix <- function(refLabels, predictLabels, positiveLabel){
conf_matrix <- confusionMatrix(
refLabels, # reference labels
predictLabels, # predicted labels
positive = positiveLabel, # label that corresponds to a "positive" results (optional)
dnn = c("actual", "predicted") # names of the confusion matrix dimensions (optional)
)
return (conf_matrix)
}
get_evaluation <- function(refLabels, predictLabels, positiveLabel){
  conf_matrix <- create_conf_matrix(refLabels, predictLabels, positiveLabel)
  # print() is required inside a function body; a bare expression would not be displayed
  print(conf_matrix$table)
  print(conf_matrix$overall["Accuracy"])
  print(conf_matrix$byClass["Sensitivity"])
  print(conf_matrix$byClass["Specificity"])
}
# Data preparation
cleared_supermarket_data <- read_csv("Input Dataset/Cleaned Dataset/Supermarket_Data_Classification.csv")
cleared_supermarked_tbl <- tbl_df(cleared_supermarket_data)
cleared_supermarked_tbl$class <- as.factor(cleared_supermarked_tbl$class)
```
__2) k-fold cross validation__
```{r echo=TRUE, message=FALSE}
# createFolds() with list = FALSE returns a stratified fold index (1-5) per row
flds <- createFolds(factor(cleared_supermarked_tbl$class), k = 5, list = FALSE)
comb_factor <- tbl_df(cbind(cleared_supermarked_tbl, flds))
# Folds 1, 3 and 4 form the training set (~60%); folds 2 and 5 form the test set (~40%)
train_folders <- c(1,3,4)
test_folders <- c(2,5)
train_data <- cleared_supermarked_tbl[comb_factor$flds %in% train_folders,]
test_data <- cleared_supermarked_tbl[comb_factor$flds %in% test_folders,]
# splitting data to test and training
train_ds <- train_data[, -26]
y_train <- train_data %>%
pull(class)
y_test <- test_data %>%
pull(class)
test_ds <- test_data[, -26]
```
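Since `createFolds()` stratifies on the class label, the resulting split should preserve the class proportions; a quick check compares the training and test sets.
```{r echo=TRUE, message=FALSE}
# Verify that the stratified split preserves the class proportions
round(prop.table(table(y_train)), 3)
round(prop.table(table(y_test)), 3)
```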
__3) Functions for classifying and plotting__
```{r echo=TRUE, message=FALSE}
classify_with_fit <- function(fit, title){
train_predicted <- predict(fit, train_ds, type = "class")
print("Evaluation for the training")
get_evaluation(y_train, train_predicted, "Yes")
predicted <- predict(fit, test_ds, type = "class")
print("Evaluation for the tests")
get_evaluation(y_test, predicted, "Yes")
draw_plot_for_classes(test_ds, predicted, title)
}
draw_plot_for_classes <- function(data, predicted, title){
plot_data <- cbind(data, predicted)
ggplot(plot_data, aes(x = predicted, fill = predicted)) +
geom_bar() +
xlab("Prediction") + ylab("Customer count") +
theme_bw() +
theme(legend.title = element_blank()) +
ggtitle(title)
}
```
__Classification__
__4) Classification with SVM __
```{r echo=TRUE, message=FALSE}
# tune SVM over a grid of gamma and cost values (both belong in `ranges` to be tuned)
tuneSvm <- tune(svm, class ~ ., data = train_data,
                ranges = list(gamma = 2^(-1:1), cost = 2^(2:4)),
                tunecontrol = tune.control(sampling = "fix"))
summary(tuneSvm)
plot(tuneSvm)
# classify with the selected parameters
svmFit <- svm(class ~ ., data = train_data, kernel = "radial",
cost = 1, gamma = 0.5,
scale=TRUE, cachesize=95)
#plot(svmFit, train_data)
classify_with_fit(svmFit, "SVM classification")
```
__5) Classification with k-NN __
```{r echo=TRUE, message=FALSE}
# train knn classifier
set.seed(400)
ctrl <- trainControl(method="repeatedcv",repeats = 3)
knnTrain <- train(class ~ ., data = train_data, method = "knn",
trControl = ctrl, preProcess = c("center","scale"), tuneLength = 20)
plot(knnTrain)
# classify with the selected k (based on the tuning curve above)
knnFit <- knn3(train_ds, y_train, k = 7)
classify_with_fit(knnFit, "k-NN classification")
```
__6) Classification with Random Forest __
```{r echo=TRUE, message=FALSE}
randomFit <- randomForest(class ~ ., train_data, ntree=500)
classify_with_fit(randomFit, "Random Forest")
```
__Observations:__ It was found that, yes, the majority of the customers are ready to travel long distances to purchase products, and that this willingness is affected by certain factors. Binary classification makes it possible to divide the data into two separate classes, which helps to understand whether a customer will travel a long distance to the shop or not, based on those factors. Thus, the research question is answered by classifying the customers on the selected features to identify whether or not the majority of them are ready to travel long distances to purchase products.
__Applications:__ These insights can be further utilized by the business to understand the behaviour trend of the majority of its customers related to long-distance travel to purchase products. This further paves the way for the business to understand the reasons behind such majority trends. Eventually, this helps the business to devise strategies for store locations, coupled with enhancing the factors influencing such trends, thus generating more revenue for the business with increased customer satisfaction.
__2. What are the factors that contribute towards the long distance travel of the customer to purchase products?__ <br />
__Algorithms selected:__ Custom Algorithm and as a follow-up investigation for RQ1
__Features Selected:__ Same as those selected for RQ1
__Analysis__
__1) Determine the responsible factors__
Out of the 5 shops in the chain, every customer has a shop which they prefer the most. To find out the factors responsible for customers selecting a particular shop as their most preferred shop, 7 new columns were generated: the shop closest to each customer (a), the shop from which they buy the most products (b), the shop from which they buy the most unique products (c), the shop at which they have the least average price (d), the shop at which they spend the most money (e), the most preferred shop (f) and the factor based on which they choose their most preferred shop (g). To generate the values for column (f), the most frequent value among columns (b), (c), (d) and (e) is selected and assigned to the customer as the most preferred shop. Based on the most preferred shop for each customer, the different factors are assigned. Customers whose most preferred shop is the closest shop, the shop with the least average price or the shop with the most unique products purchased are assigned 'dist', 'price' or 'satisf' respectively as the factor. If a combination of these three factors applies, they are assigned 'dist_price', 'dist_satisf', 'price_satisf' or 'dist_price_satisf' accordingly.
Every product has a sophistication value attached to it and is meant to satisfy the needs of a customer. The higher the sophistication value of a product, the higher the satisfaction it provides to the buyer. Products needed for daily use, such as bread or water, are bought in bulk and are considered less sophisticated than products bought in comparatively smaller quantities, such as a DVD player or a coffee flask. So, in this RQ it is assumed that the higher the number of unique products bought, the higher the satisfaction of the customer and the more cost they will be ready to pay, either in terms of distance travelled or in terms of the price of the items.
```{r echo=TRUE, message=FALSE}
library(tidyverse)
library(dplyr)
## Set file path
file_path <- "Input Dataset/Cleaned Dataset/Supermarket_DataCleaned.csv"
## Read data from a file
supermarket_data_clean <- read_csv(file_path)
## List of all features present in the data frame
all_features <- colnames(supermarket_data_clean)
## List of selected features
sel_features <- all_features[-c(1,2,8,9,10,16,22,28,34,35,36,37,38,39,40)]
## Create a data frame with only the selected features
supermarket_data_model <- supermarket_data_clean %>% select(sel_features)
## Generate new columns to find the closest shop, the shop from which most products are purchased, the shop from which most unique products are purchased, the shop offering the least average product price and the shop at which the maximum amount of money is spent by a customer.
supermarket_data_model$min_dist <- str_sub(colnames(supermarket_data_model[,1:5]),-1,-1)[apply(supermarket_data_model[,1:5],1,which.min)]
supermarket_data_model$most_prod_purch_from <- str_sub(names(supermarket_data_model[,6:10]),-1,-1)[max.col(supermarket_data_model[,6:10], "last")]
supermarket_data_model$most_uni_prod_purch_from <- str_sub(names(supermarket_data_model[,11:15]),-1,-1)[max.col(supermarket_data_model[,11:15], "last")]
supermarket_data_model$least_avg_prod_pri <- str_sub(colnames(supermarket_data_model[,16:20]),-1,-1)[apply(supermarket_data_model[,16:20],1,which.min)]
supermarket_data_model$max_amt_purch <- str_sub(names(supermarket_data_model[,21:25]),-1,-1)[max.col(supermarket_data_model[,21:25], "last")]
## Create a data frame having only the newly generated columns
test <- supermarket_data_model[,26:30]
## Generate new columns - most preferred shop and the customer's category based on the factors
for (row in 1:nrow(test)){
## Create a vector for each row
vec <- c(test[row, "most_prod_purch_from"], test[row, "most_uni_prod_purch_from"], test[row, "least_avg_prod_pri"], test[row, "max_amt_purch"])
## Sort and find the most preferred shop
supermarket_data_model[row, "most_pref_shop"] <- names(sort(summary(as.factor(unlist(vec))), decreasing=T)[1:1])
## Assign labels to customers (0 or 'dist' - 'distance', 1 or 'price' - 'price', 2 or 'satisf' - 'satisfaction', 3 or 'dist_price' - 'distance and price', 4 or 'dist_satisf' - 'distance and satisfaction', 5 or 'price_satisf' - 'price and satisfaction' and 6 or 'dist_price_satisf' - 'distance, price and satisfaction')
if (supermarket_data_model[row, "min_dist"] == supermarket_data_model[row, "most_pref_shop"] && supermarket_data_model[row, "least_avg_prod_pri"] == supermarket_data_model[row, "most_pref_shop"] && supermarket_data_model[row, "most_uni_prod_purch_from"] == supermarket_data_model[row, "most_pref_shop"]){
supermarket_data_model[row, "factor"] <- 6
}
else if (supermarket_data_model[row, "min_dist"] == supermarket_data_model[row, "most_pref_shop"] && supermarket_data_model[row, "least_avg_prod_pri"] == supermarket_data_model[row, "most_pref_shop"]){
supermarket_data_model[row, "factor"] <- 3
}
else if (supermarket_data_model[row, "min_dist"] == supermarket_data_model[row, "most_pref_shop"] && supermarket_data_model[row, "most_uni_prod_purch_from"] == supermarket_data_model[row, "most_pref_shop"]){
supermarket_data_model[row, "factor"] <- 4
}
else if (supermarket_data_model[row, "least_avg_prod_pri"] == supermarket_data_model[row, "most_pref_shop"] && supermarket_data_model[row, "most_uni_prod_purch_from"] == supermarket_data_model[row, "most_pref_shop"]){
supermarket_data_model[row, "factor"] <- 5
}
else if (supermarket_data_model[row, "min_dist"] == supermarket_data_model[row, "most_pref_shop"]){
supermarket_data_model[row, "factor"] <- 0
}
else if (supermarket_data_model[row, "least_avg_prod_pri"] == supermarket_data_model[row, "most_pref_shop"]){
supermarket_data_model[row, "factor"] <- 1
}
else if (supermarket_data_model[row, "most_uni_prod_purch_from"] == supermarket_data_model[row, "most_pref_shop"]){
supermarket_data_model[row, "factor"] <- 2
}
else{
supermarket_data_model[row, "factor"] <- 7
}
}
## Re-order columns
supermarket_data_clus <- supermarket_data_model[c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,32)]
```
__2) Visualize the responsible factors__
```{r echo=TRUE, message=FALSE}
## Create the legend for the plot
Legend <- c("Distance", "Price", "Distance and Price", "Satisfaction", "Distance and Satisfaction", "Price and Satisfaction", "Distance, Price and Satisfaction", "Others")
## Generate the count of columns assigned to each factor
dist <- length(which(supermarket_data_model$factor == 0))
price <- length(which(supermarket_data_model$factor == 1))
dist_price <- length(which(supermarket_data_model$factor == 3))
satisf <- length(which(supermarket_data_model$factor == 2))
dist_satisf <- length(which(supermarket_data_model$factor == 4))
price_satisf <- length(which(supermarket_data_model$factor == 5))
dist_price_satisf <- length(which(supermarket_data_model$factor == 6))
others <- length(which(supermarket_data_model$factor == 7))
## Create a vector of values to be shown in the plot
values <- c(dist, price, dist_price, satisf, dist_satisf, price_satisf, dist_price_satisf, others)
## Create a vector of class labels; c("no", "yes") is recycled across the 8 factor counts,
## so distance-based factors fall under class 'no' and the remaining factors under class 'yes'
labels <- c("no", "yes")
## Create a data frame to store the vectors
factor_count <- data.frame(labels, values)
## Generate a plot
fc_plot <- ggplot(factor_count, aes(labels, values))
## Add features to the plot
fc_plot + geom_bar(stat = "identity", aes(fill = Legend)) +
xlab("Class") + ylab("Total") +
ggtitle("Customer Classification") +
theme_bw() + scale_y_continuous(labels = scales::comma)
```
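A plain tabulation of the factor codes backs the counts discussed below (0 = distance, 1 = price, 2 = satisfaction, 3-6 = combinations, 7 = others).
```{r echo=TRUE, message=FALSE}
# Raw counts per factor code underlying the stacked bars above
table(supermarket_data_model$factor)
```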
__Observations:__ It was observed that out of the 42499 customers who are willing to travel, 42001 choose their most preferred shop based on their 'Satisfaction' or 'Shopping experience', whereas the others decide based on 'Price' (4), 'Price and Satisfaction' (19) and 'Others' (475). Out of the 17866 customers who do not like to travel long distances and select the closest shop as their most preferred shop, 17221 opted based on the factor 'Distance and Satisfaction', whereas the others decided based on 'Distance' (620), 'Distance and Price' (0) and 'Distance, Price and Satisfaction' (25). Thus, the research question is answered by determining the factors responsible for the majority trend of customers towards long-distance travel. Additionally, it was found that 'Satisfaction' is a key factor affecting a customer's decision-making process.
__Applications:__ These insights can be further utilized by the business to devise strategies that enhance the most important observed factors, generating customer satisfaction and retention and eventually steady growth for the business.
__3. What is the maximum likelihood of a customer to select a particular shop?__ <br />
__Algorithms selected:__ Naive Bayes, Decision tree
__Reason for Algorithm Selections:__ Naive Bayes is a supervised classifier which works on the assumption that all attributes are independent of each other. Because of this, all attributes can be learned separately, which results in faster performance. Its accuracy rate, however, is lower than that of a Decision tree [@john1995estimating].
A Decision tree is a type of supervised learning algorithm that can be used for both regression and classification problems. A small change in the data can cause a large change in the final estimated tree. However, Decision trees are intuitively very easy to explain, as they closely mirror human decision-making compared to other regression and classification approaches [@safavian1991survey].
__Features Selected:__ distance_avg, products_purchased_avg, unique_products_purchased_avg, product_price_avg, amount_purchased_avg.
__Analysis__
__1) Data preparation for classification__
```{r echo=TRUE, message=FALSE}
library(caret)
library(dplyr) # Used by caret
library(e1071)
library(rpart)
library(readr)
library(tidyverse)
library(rpart.plot)
file_path<- "Input Dataset/Cleaned Dataset/Supermarket_Data_Prediction.csv"
supermarket_data_predict <- read_csv(file_path)
supermarket_data_predict$most_pref_shop <- factor(supermarket_data_predict$most_pref_shop)
```
__2) Stratified k-fold cross validation__
```{r echo=TRUE, message=FALSE}
#stratified k-fold(5)
set.seed(123)