---
title: "Customer Behavioural Analytics in the Retail sector"
output:
  rmarkdown::html_document:
    toc: true
    toc_depth: 3
    number_sections: false
    theme: lumen
    df_print: paged
    toc_float:
      collapsed: true
      smooth_scroll: true
bibliography: bibliography.bib
---
<style>
body {
  text-align: justify;
}
</style>
#Customer Behavioural Analytics in the Retail Sector
<br/>
<font color="black"><h4><b>Team CuBA</b></h4>
Rutuja Shivraj Pawar (220051, rutuja.pawar@ovgu.de) <br />
Nadiia Honcharenko (220681, nadiia.honcharenko@st.ovgu.de) <br />
Shivani Jadhav (223856, shivani.jadhav@st.ovgu.de) <br />
Sumit Kundu (217453, sumit.kundu@st.ovgu.de)
<h4><b>Under the Guidance of</b></h4>
M.Sc. Uli Niemann
#Overview and Motivation
<font color ="black"> A customer is a key-centric factor for any business to be successful. Conventional wisdom tells us that the cost of retaining an existing customer is far less than acquiring a new one. In order that a business has a sustainable growth, the retention of its old customer base and expansion of the new customer base is very critical. This demands from a business to understand the behaviour of its customers in relation to the business. Therefore obtaining a 360° view of its customers is crucial for a business looking for a competitive edge in the market. In such a scenario, Customer Behavioural Analytics plays an important role in leveraging data analytics to find meaningful behavioural patterns in the customer-specific business data. <br/>
Consequently, this project aims to understand the consumer behaviour in the retail sector. Decoding the consumer behaviour will be based on understanding how consumers make purchase decisions and what factors influence those decisions. This project also aims to discover the existence of dependencies between customers, products and shops to highlight further insights about their behaviour. These meaningful insights will further help a business to implement strategies leading to an increased revenue through customer satisfaction.
#Project Objective
This project aims to address the problem of understanding the behaviour of customers of the Italian retail distribution company _Coop_ in a single Italian city. The project intends to discover different analytical insights about the purchase behaviour of the customers by answering the formulated Research Questions (RQs).
#Data Source
Supermarket aggr.Customer^[https://bigml.com/user/czuriaga/gallery/dataset/5559c2c6200d5a6570000084] <br />
The dataset to be used is the retail market data of one of the largest Italian retail distribution companies, _Coop_, for a single Italian city.<br />
The Supermarket aggr.Customer dataset used for the analysis contains customer data aggregated with information from shops^[http://www.michelecoscia.com/?page_id=379] [@pennacchioli2013explaining] and pivoted to new columns. The dataset contains 40 features with 60,366 instances and is approximately 14.0 MB in size.
#Related Work
The paper [@pennacchioli2013explaining] studies and highlights the product range effect, illustrating that customers focus on the needs satisfied by a product: the higher the satisfaction, the longer the distance customers are ready to travel to purchase the product, thereby paying more than the price of the product itself. The study is based on a data mining framework and introduces a new measure called "product sophistication" to study the range effect. The paper also studies the effect of the introduced measure on the accuracy of predicting the shop a customer selects to buy a particular product. Hence the paper studies customer behaviour in the retail sector mainly through this range effect.<br />
Our analysis, in contrast, is not carried out on the same datasets as used in the paper; our dataset contains customer data aggregated with shop information and pivoted to new columns. As we have new columns in our dataset, the perspective of our analysis differs. Based on our data and an understanding of its features, we framed different Research Questions to be answered from the data. These research questions do not focus on a single aspect (such as the range effect in the paper) but attempt to decode customer behaviour in several ways. Unlike the methodology used in the paper, this project obtains customer behavioural insights through the application of machine learning algorithms on the data. The project thereby studies whether customers are ready to travel long distances in spite of a high average price in a shop and which factors contribute to their long-distance travel. Additionally, the likelihood of a customer selecting a particular shop, the different customer segments and the top-100 most profitable customers are also studied within the scope of the project.
#Initial Questions
Below are the RQs formulated in the initial stages of the project based on a primary understanding of the data, before a detailed Exploratory Data Analysis:
__1. Are customers willing to travel long distances to purchase products?__ <br />
_Relevance:_ This will help to understand whether price is an important factor affecting the majority of customers' purchase decisions. <br />
__2. Which are the products for which customers are ready to travel long distances and for which products they select the closest shop?__ <br />
_Relevance:_ This will help to understand the nature of the products in the context of proximity. It is assumed that customers select the closest shops to buy daily products such as groceries but may travel long distances to buy one-time-purchase products such as kitchen equipment. As Data Science is results-driven and not based solely on intuition, this question can help to verify this assumption.<br />
__3. What is the maximum likelihood of a customer to select a particular shop to purchase a particular product?__ <br />
_Relevance:_ This will help to understand which shops in the retail chain are in demand for a particular product. This can further facilitate better stock management to meet customer demands.<br />
__4. What is the ranking of the shops in terms of attracting more customers and thus generating more revenue and what is their individual customer base?__<br />
_Relevance:_ This will help to understand the most popular shops in the retail chain and target different shop-level marketing schemes to the appropriate customers through customer segmentation. <br />
__5. Which are the customers that are most profitable in terms of revenue generation?__<br />
_Relevance:_ This will help to understand the customers with potential high Customer Lifetime Value and target appropriate loyalty programs to generate satisfied loyal customers as advocates for the business.
#Data Wrangling
Data Wrangling consists of the different steps that transform raw data into a clean form which is appropriate and accurate for data analysis. Below are the steps carried out as part of Data Wrangling:
###Examination of Input Dataset
__Visualize the input dataset__
```{r echo=TRUE, message=FALSE}
library(tidyverse)
library(DataExplorer)
# Read data from the input csv file
filepath<- "Input Dataset/Supermarket aggr.Customer.csv"
supermarket_data <- read_csv(filepath)
# Convert the data to the tbl class, as tbls are easier to examine than plain data frames and display in a spreadsheet-like view
supermarket_tbl<-tbl_df(supermarket_data)
```
```{r echo=TRUE, message=FALSE}
# Check the dimension of the input dataset and the variables through a plot
plot_str(supermarket_tbl)
```
__Generate the input dataset statistics__
```{r echo=TRUE, message=FALSE}
# Data Statistics
gather(introduce(supermarket_tbl))
```
__Generate the input dataset summary__
```{r echo=TRUE, message=FALSE}
# Data Summary, p0 = min value, p100 = max value
library(skimr)
skim_with(integer = list(hist = NULL, p25 = NULL, p50 = NULL, p75 = NULL))
skim_with(numeric = list(hist = NULL, p25 = NULL, p50 = NULL, p75 = NULL))
skim(supermarket_tbl) %>% skimr::kable()
```
###Dataset Cleaning and Processing
__Eliminate the missing values in the input dataset__
```{r echo=TRUE, message=FALSE}
# Eliminate the missing values in the dataset
supermarket_tbl_Clean1<-na.omit(supermarket_tbl)
na.action(supermarket_tbl_Clean1)
# Percentage of data missing (Should be 0%)
missing_data_count = sum(!complete.cases(supermarket_tbl_Clean1))
total_data = dim(supermarket_tbl_Clean1)[1]
missing_data_percent = (missing_data_count/total_data) * 100
missing_data_percent
```
__Eliminate the duplicate rows in the input dataset__
```{r echo=TRUE, message=FALSE}
# Eliminate duplicate rows (assign the result back, otherwise the deduplication is lost)
supermarket_tbl_Clean1 <- distinct(supermarket_tbl_Clean1)
```
__Round the decimal values in the input dataset__
```{r echo=TRUE, message=FALSE}
# Round the decimal value columns up to 4 decimal places
is.num <- sapply(supermarket_tbl_Clean1, is.numeric)
supermarket_tbl_Clean1[is.num] <- lapply(supermarket_tbl_Clean1[is.num], round, 4)
```
__Rename the column names in the input dataset__
```{r echo=TRUE, message=FALSE}
# Rename columns to more descriptive names (new_name = old_name)
supermarket_tbl_Clean1 <- supermarket_tbl_Clean1 %>%
  rename(
    products_purchased_total = products_purchased,
    shops_used_total = shops_used,
    amount_purchased_total = amount_purchased,
    min_dist_to_custSel_shops = min_distance_to_shops,
    max_dist_to_custSel_shops = max_distance_to_shops,
    unique_products_purchased_total_exclCommon = unique_products_purchased,
    avg_distance_to_all_shops = avg_distance_to_shops,
    avg_product_price_shop_1 = avg_price_shop_1,
    avg_product_price_shop_2 = avg_price_shop_2,
    avg_product_price_shop_3 = avg_price_shop_3,
    avg_product_price_shop_4 = avg_price_shop_4,
    avg_product_price_shop_5 = avg_price_shop_5,
    avg_purchased_product_price_allShops = avg_price,
    avg_purchase_amount_shop_1 = avg_purchase_shop_1,
    avg_purchase_amount_shop_2 = avg_purchase_shop_2,
    avg_purchase_amount_shop_3 = avg_purchase_shop_3,
    avg_purchase_amount_shop_4 = avg_purchase_shop_4,
    avg_purchase_amount_shop_5 = avg_purchase_shop_5,
    avg_purchase_amount_allShops = avg_purchase
  )
```
__Reorder the columns in the input dataset__
```{r echo=TRUE, message=FALSE}
# Reorder Columns
supermarket_tbl_Clean1 <- supermarket_tbl_Clean1[c(1,10,11,12,13,14,15,3,4,2,16,17,18,19,20,5,21,22,23,24,25,6,36,37,38,39,40,9,26,27,28,29,30,7,31,32,33,34,35,8)]
```
__Write the cleaned dataset tbl to a CSV file__
```{r echo=TRUE, message=FALSE}
# Write the cleaned data tbl to csv
clean_filepath = "~/R GitHub/Data-Science-with-R/Input Dataset/Cleaned Dataset/Supermarket_DataCleaned.csv"
write.csv(supermarket_tbl_Clean1, file = clean_filepath, row.names = FALSE)
```
###Exploration of Cleaned Dataset
__Visualize the cleaned dataset__
```{r echo=TRUE}
# Check the dimension of the cleaned dataset and the variables
plot_str(supermarket_tbl_Clean1)
```
__Generate the cleaned dataset statistics__
```{r echo=TRUE, message=FALSE}
# Cleaned data Statistics
gather(introduce(supermarket_tbl_Clean1))
```
__Analyze the Continuous Variables in the cleaned dataset as a Histogram__
```{r echo=TRUE, message=FALSE}
# Analyze Continuous Variables in the cleaned dataset (Univariate Analysis)
plot_histogram(supermarket_tbl_Clean1)
```
_Interpretation:_ The plotted histograms depict the distribution of each continuous variable in the dataset. These plots show the spread of the data and whether it is symmetric or skewed, graphically summarizing each univariate distribution.
__Examine the correlated features in the cleaned dataset through Correlation Analysis__
```{r echo=TRUE, message=FALSE}
# Correlation analysis (Multivariate Analysis, on continuous features only) to examine correlated features in the cleaned dataset
library(ggcorrplot)
corr <- round(cor(supermarket_tbl_Clean1), 1)
ggcorrplot(corr, outline.col = "white") +geom_tile(height=1.8, width=1.8) +
scale_fill_gradient2(low="blue", mid="white", high="red") +
theme_minimal() +
coord_equal() +
labs(x="",y="",fill="Correlation coefficient") +
theme(axis.text.x=element_text(size=7, angle=90, vjust=1, hjust=1,
margin=margin(-3,0,0,0)),
axis.text.y=element_text(size=7, margin=margin(0,-3,0,0)),
panel.grid.major=element_blank())
```
_Interpretation:_ The Correlation plot depicts the association between the variables in the dataset and the degree of association between them is displayed by the variation in the correlation coefficient color. The plot helps to understand the relationship between the different variables in the dataset.
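To make the plot easier to act on, the short sketch below (reusing the `corr` matrix computed above) lists the most strongly correlated feature pairs; the 0.9 cutoff is an arbitrary illustrative threshold, not a fixed rule.
```{r echo=TRUE, message=FALSE}
# List the most strongly correlated feature pairs from the corr matrix above
corr_pairs <- as.data.frame(as.table(corr)) %>%
  filter(as.character(Var1) < as.character(Var2), abs(Freq) >= 0.9) %>%
  arrange(desc(abs(Freq)))
head(corr_pairs, 10)
```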
__Visualize the deviation from a specific probability distribution in the cleaned dataset through Quantile-Quantile plot__
```{r echo=TRUE, message=FALSE}
# Quantile-Quantile plot to visualize the deviation from a specific probability distribution in the cleaned dataset
plot_qq(supermarket_tbl_Clean1)
```
_Interpretation:_ The Quantile-Quantile plot compares two probability distributions through plotting their quantiles against each other. This plot depicts if both sets of quantiles are from the same distribution (points form a roughly straight line), helping to further understand the data distribution.
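As a complement, a single variable can be checked against the normal distribution with base R's `qqnorm()`/`qqline()`; `amount_purchased_total` is used here purely as a representative column.
```{r echo=TRUE, message=FALSE}
# Q-Q plot of one representative column against the normal distribution
qqnorm(supermarket_tbl_Clean1$amount_purchased_total)
qqline(supermarket_tbl_Clean1$amount_purchased_total)
```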
###Data Preparation
__Data Preparation for RQ1__
The unimportant and redundant features providing no meaningful information in the given context are first removed from the dataset. The dataset has no ground truth associated with it; since RQ1 requires one, a class label was created based on certain assumptions. It is assumed that the shop a customer visits the most is their most preferred shop. New columns were generated to find the closest shop (a), the shop from which most products are purchased (b), the shop from which most unique products are purchased (c), the shop offering the least average product price (d) and the shop at which the maximum amount of money is spent by a customer (e). The most frequent shop amongst (b), (c), (d) and (e) is assigned as the most preferred shop (f). The customers were then categorized by comparing (a) with (f). If the values match, the class 'no' is assigned, based on the assumption that such customers have chosen the closest shop as their most preferred shop because they do not like to travel long distances; the class 'yes' is assigned otherwise. Finally, the dataset is re-arranged and stored along with the newly created class label.
```{r echo=TRUE, message=FALSE}
library(tidyverse)
library(dplyr)
## Set file path
file_path <- "Input Dataset/Cleaned Dataset/Supermarket_DataCleaned.csv"
## Read data from a file
supermarket_data_clean <- read_csv(file_path)
## List of all features present in the data frame
all_features <- colnames(supermarket_data_clean)
## List of selected features
sel_features <- all_features[-c(1,2,8,9,10,16,22,28,34,35,36,37,38,39,40)]
## Create a data frame with only the selected features
supermarket_data_model <- supermarket_data_clean %>% select(sel_features)
## Generate new columns to find the closest shop, the shop from which most products are purchased, the shop from which most unique products are purchased, the shop offering the least average product price and the shop at which the maximum amount of money is spent by a customer.
supermarket_data_model$min_dist <- str_sub(colnames(supermarket_data_model[,1:5]),-1,-1)[apply(supermarket_data_model[,1:5],1,which.min)]
supermarket_data_model$most_prod_purch_from <- str_sub(names(supermarket_data_model[,6:10]),-1,-1)[max.col(supermarket_data_model[,6:10], "last")]
supermarket_data_model$most_uni_prod_purch_from <- str_sub(names(supermarket_data_model[,11:15]),-1,-1)[max.col(supermarket_data_model[,11:15], "last")]
supermarket_data_model$least_avg_prod_pri <- str_sub(colnames(supermarket_data_model[,16:20]),-1,-1)[apply(supermarket_data_model[,16:20],1,which.min)]
supermarket_data_model$max_amt_purch <- str_sub(names(supermarket_data_model[,21:25]),-1,-1)[max.col(supermarket_data_model[,21:25], "last")]
## Create a data frame having only the newly generated columns
test <- supermarket_data_model[,26:30]
## Generate new columns - most preferred shop and class to which the customer belongs
for (row in 1:nrow(test)){
## Create a vector for each row
vec <- c(test[row, "most_prod_purch_from"], test[row, "most_uni_prod_purch_from"], test[row, "least_avg_prod_pri"], test[row, "max_amt_purch"])
## Sort and find the most preferred shop
supermarket_data_model[row, "most_pref"] <- names(sort(summary(as.factor(unlist(vec))), decreasing=T)[1:1])
## Assign labels to customers (0 or 'no' - 'Not willing to travel far for shopping' and 1 or 'yes' - 'Willing to travel far for shopping')
if (supermarket_data_model[row, "min_dist"] == supermarket_data_model[row, "most_pref"]){
supermarket_data_model[row, "class"] <- 'No'
} else{
supermarket_data_model[row, "class"] <- 'Yes'
}
}
## Re-order columns
supermarket_data_class <- supermarket_data_model[c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,32)]
## Set file path
clean_filepath = "~/R GitHub/Data-Science-with-R/Input Dataset/Cleaned Dataset/Supermarket_Data_Classification.csv"
## Write the dataframe to csv file
write.csv(supermarket_data_class, file = clean_filepath, row.names = FALSE)
```
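As a quick sanity check on the generated ground truth, the class distribution can be tabulated; a heavily skewed split would already hint at the answer to RQ1.
```{r echo=TRUE, message=FALSE}
# Distribution of the newly created class label
table(supermarket_data_class$class)
```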
__Data Preparation for RQ3__
The unimportant and redundant features providing no meaningful information in the given context are first removed from the dataset. The most preferred shop for each customer, based on the same assumption used for RQ1, is generated and used as the class label. New columns are generated to find the average distance travelled (a), the average number of products purchased (b), the average number of unique products purchased (c), the average product price (d) and the average amount spent (e) for each customer. Further, a dataset using these newly created columns - (a), (b), (c), (d) and (e) - is generated, to be used for creating a model that predicts the shop most likely to be selected by a customer as their most preferred shop.
```{r echo=TRUE, message=FALSE}
library(tidyverse)
library(dplyr)
## Set file path
file_path <- "Input Dataset/Cleaned Dataset/Supermarket_DataCleaned.csv"
## Read data from a file
supermarket_data_clean <- read_csv(file_path)
## Generate columns to find the average distance travelled, average number of products purchased, average number of unique products purchased, average product price
## and average amount spent by each customer
supermarket_data_clean$distance_avg <- with(supermarket_data_clean, avg_distance_to_all_shops)
supermarket_data_clean$products_purchased_avg <- with(supermarket_data_clean, round(products_purchased_total/shops_used_total))
supermarket_data_clean$unique_products_purchased_avg <- with(supermarket_data_clean, round(unique_products_purchased_total_exclCommon/shops_used_total))
supermarket_data_clean$product_price_avg<- with(supermarket_data_clean, avg_purchased_product_price_allShops)
supermarket_data_clean$amount_purchased_avg<- with(supermarket_data_clean, avg_purchase_amount_allShops)
## Generate new columns to find the shop from which most products are purchased, the shop from which most unique products are purchased,
## the shop offering the least average product price and the shop at which the maximum amount of money is spent by a customer.
supermarket_data_clean$most_prod_purch_from <- str_sub(names(supermarket_data_clean[,11:15]),-6,-1)[max.col(supermarket_data_clean[,11:15], "random")]
supermarket_data_clean$most_uni_prod_purch_from <- str_sub(names(supermarket_data_clean[,17:21]),-6,-1)[max.col(supermarket_data_clean[,17:21], "random")]
supermarket_data_clean$least_avg_prod_pri <- str_sub(colnames(supermarket_data_clean[,23:27]),-6,-1)[apply(supermarket_data_clean[,23:27],1,which.min)]
supermarket_data_clean$max_amt_purch <- str_sub(names(supermarket_data_clean[,29:33]),-6,-1)[max.col(supermarket_data_clean[,29:33], "random")]
## Create a data frame having only the newly generated columns
gen_df <- supermarket_data_clean[,46:49]
## Generate a new column - the most preferred shop for each customer
for (row in 1:nrow(gen_df)){
## Create a vector for each row
vec <- c(gen_df[row, "most_prod_purch_from"], gen_df[row, "most_uni_prod_purch_from"], gen_df[row, "max_amt_purch"])
## Sort and find the most preferred shop
supermarket_data_clean[row, "most_pref_shop"] <- names(sort(summary(as.factor(unlist(vec))), decreasing=T)[1:1])
}
## Re-order columns
supermarket_data_predict <- supermarket_data_clean[c(41,42,43,44,45,50)]
## Set file path
clean_filepath = "~/R GitHub/Data-Science-with-R/Input Dataset/Cleaned Dataset/Supermarket_Data_Prediction.csv"
## Write the dataframe to csv file
write.csv(supermarket_data_predict, file = clean_filepath, row.names = FALSE)
```
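Analogously, a quick tabulation of the generated label shows how the most preferred shops are distributed across customers before any modelling.
```{r echo=TRUE, message=FALSE}
# Distribution of the most preferred shop label
table(supermarket_data_predict$most_pref_shop)
```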
#Exploratory Data Analysis
Exploratory Data Analysis (EDA) is the process of visualizing the main characteristics of the data before formal modelling, to discover data patterns and verify the initial assumptions made about the data. Below are the visualizations from the EDA carried out:
__1) Visualize the highest revenue generating shops, shops selling the highest number of products, shops selling the highest number of unique products and the relation between them__
```{r echo=TRUE, message=FALSE}
library(ggplot2)
library(RColorBrewer)
library(scales)
file_path<- "Input Dataset/Cleaned Dataset/Supermarket_DataCleaned.csv"
supermarket_data_clean <- read_csv(file_path)
Shop<- c(1,2,3,4,5)
# Revenue generation by Shops 1-5
# columns selected: amount_purchased_shop_1, 2, 3, 4, 5
slice1<-select(supermarket_data_clean, 29,30,31,32,33)
amountS1<-sum(slice1$amount_purchased_shop_1)
amountS2<-sum(slice1$amount_purchased_shop_2)
amountS3<-sum(slice1$amount_purchased_shop_3)
amountS4<-sum(slice1$amount_purchased_shop_4)
amountS5<-sum(slice1$amount_purchased_shop_5)
# create data frame for Revenue generation
Revenue_Generated<- c(amountS1,amountS2,amountS3,amountS4,amountS5)
Revenue<- data.frame(Shop, Revenue_Generated)
rownames(Revenue) <- NULL
# Products Sold by Shops 1-5
# columns selected: products_purchased_shop_1, 2, 3, 4, 5
slice2<-select(supermarket_data_clean, 11,12,13,14,15)
productsS1<-sum(slice2$products_purchased_shop_1)
productsS2<-sum(slice2$products_purchased_shop_2)
productsS3<-sum(slice2$products_purchased_shop_3)
productsS4<-sum(slice2$products_purchased_shop_4)
productsS5<-sum(slice2$products_purchased_shop_5)
# create data frame for Products Sold
Products_Purchased<- c(productsS1,productsS2,productsS3,productsS4,productsS5)
ProductsSold<- data.frame(Shop, Products_Purchased)
rownames(ProductsSold) <- NULL
# Unique products Sold by Shops 1-5
# columns selected: unique_products_purchased_shop_1,2,3,4,5
slice3<-select(supermarket_data_clean, 17,18,19,20,21)
uproductsS1<-sum(slice3$unique_products_purchased_shop_1)
uproductsS2<-sum(slice3$unique_products_purchased_shop_2)
uproductsS3<-sum(slice3$unique_products_purchased_shop_3)
uproductsS4<-sum(slice3$unique_products_purchased_shop_4)
uproductsS5<-sum(slice3$unique_products_purchased_shop_5)
# create data frame for Unique products Sold
UProducts_Purchased<- c(uproductsS1,uproductsS2,uproductsS3,uproductsS4,uproductsS5)
UProductsSold<- data.frame(Shop, UProducts_Purchased)
rownames(UProductsSold) <- NULL
# Plot a Bar graph to depict the above calculated data
Legends <-c(rep("Revenue Generated", 5), rep("Products Sold", 5), rep("Unique Products Sold", 5))
values <-c(Revenue_Generated, Products_Purchased, UProducts_Purchased)
mydata <-data.frame(Shop, values)
p <-ggplot(mydata, aes(Shop, values))
p +geom_bar(stat = "identity", aes(fill = Legends), position = "dodge") +
xlab("Shop") + ylab("Total") +
ggtitle("Relation between Revenue and Products Sold") +
theme_bw() + scale_y_continuous(labels = scales::comma)
```
_Analysis:_ As visualized, the shops ordered by revenue generated (highest first) are Shops 1, 2, 3, 5, 4.
The shops ordered by number of products sold are likewise Shops 1, 2, 3, 5, 4.
Ordered by number of unique products sold, the leading shops are 1, 2 and 3, with Shops 4 and 5 on the same level.
The relation between these measures, as visualized by the shop ordering, is that the shop generating the highest revenue also sells the most products (unique included). So in this dataset, revenue generated and products sold are directly related, and the ordering of the shops remains mostly stable at Shops 1, 2, 3, 5, 4.
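The claimed direct relationship can also be checked numerically; the sketch below correlates the per-shop totals computed above (with only five shops, this is indicative rather than conclusive).
```{r echo=TRUE, message=FALSE}
# Correlation between per-shop revenue and products sold (n = 5 shops)
cor(Revenue_Generated, Products_Purchased)
```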
__2) Visualize the approximate customer base count for the different shops__
```{r echo=TRUE, message=FALSE}
# Approximate Customer Base for Shops 1-5
C1<-slice2$products_purchased_shop_1
custS1<-length(which(C1 !=0))
C2<-slice2$products_purchased_shop_2
custS2<-length(which(C2 !=0))
C3<-slice2$products_purchased_shop_3
custS3<-length(which(C3 !=0))
C4<-slice2$products_purchased_shop_4
custS4<-length(which(C4 !=0))
C5<-slice2$products_purchased_shop_5
custS5<-length(which(C5 !=0))
# create data frame for Approximate Customer Base
Customers<- c(custS1,custS2,custS3,custS4,custS5)
TotalCustomers<- data.frame(Shop, Customers)
rownames(TotalCustomers) <- NULL
# Plot a Bar graph to depict the approximate Customer Base
values <-c(Customers)
mydata <-data.frame(Shop, values)
p <-ggplot(mydata, aes(Shop, values))
p +geom_bar(stat = "identity", fill = "gray" , position = "dodge", color = "black") +
xlab("Shop") + ylab("Total") +
ggtitle("Customer Base") +
theme_bw()
```
_Analysis:_ As visualized, the shops ordered by approximate customer base (highest first) are Shops 1, 2, 3, 5, 4. A shop's approximate customer base indicates its popularity in terms of customers purchasing products from it, and this ordering matches the revenue ranking seen above.
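To put the bar heights into perspective, each shop's approximate customer base can also be expressed as a share of all customers; this is a minimal sketch reusing the counts computed above.
```{r echo=TRUE, message=FALSE}
# Approximate customer base per shop as a percentage of all customers
round(100 * Customers / nrow(supermarket_data_clean), 1)
```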
__3) Visualize the relationship between average prices and distances to the shop__
```{r echo=TRUE, message=FALSE, warning=FALSE}
library(modelr)
library(gridExtra)
cleared_supermarket_data<-read_csv(file_path)
cleared_supermarked_tbl <- tbl_df(cleared_supermarket_data)
shop_ordered_slice <- select(cleared_supermarked_tbl, 3,23,4,24,5,25,6,26,7,27) %>%
bind_cols(cleared_supermarked_tbl[,8:10], cleared_supermarked_tbl[,28])
# Splitting the data
get_slice_for_shop <- function(col1, col2){
shop_slice <- shop_ordered_slice[,col1:col2]
colnames(shop_slice) <- c("distance","price")
return(shop_slice)
}
shop_1_data <- get_slice_for_shop(1,2)
shop_2_data <- get_slice_for_shop(3,4)
shop_3_data <- get_slice_for_shop(5,6)
shop_4_data <- get_slice_for_shop(7,8)
shop_5_data <- get_slice_for_shop(9,10)
shop_avg_data <- get_slice_for_shop(13,14)
shop_agg_min_data <- get_slice_for_shop(11,14)
shop_agg_max_data <- get_slice_for_shop(12,14)
# Combine data to the one mutated table to show all shops at the one graph
joined_shops_data <- mutate(shop_1_data, Shop="1") %>%
union_all(mutate(shop_2_data, Shop="2")) %>%
union_all(mutate(shop_3_data, Shop="3")) %>%
union_all(mutate(shop_4_data, Shop="4")) %>%
union_all(mutate(shop_5_data, Shop="5"))
# Create base for plots
get_base_for_plot <- function(dataset, caption){
plot_base <- ggplot(data = dataset, mapping = aes(x = distance, y = price)) + ggtitle(caption)
return(plot_base)
}
# Colours list
colours_shema <- c("Red", "Green", "Yellow", "Pink", "Blue", "Purple", "steelblue1", "tomato1")
#create scatter plot
add_geom_point <- function(colorNum){
geom_p <- geom_point(colour=colours_shema[colorNum], alpha=0.3)
return(geom_p)
}
#draw scatter plot with plot base
draw_cov_point_plot <- function(dataset, colorNum, caption){
cov_geom_plot <- get_base_for_plot(dataset, caption) + add_geom_point(colorNum) +
scale_y_continuous(trans="log2")+
geom_smooth(stat = 'smooth', color = 'Black', method = 'gam', formula = y ~ s(x, bs = "cs"))
return(cov_geom_plot)
}
p1_1 <- draw_cov_point_plot(shop_1_data, 1, "Shop 1") + theme_bw()
p2_1 <- draw_cov_point_plot(shop_2_data, 2, "Shop 2") + theme_bw()
p3_1 <- draw_cov_point_plot(shop_3_data, 3, "Shop 3") + theme_bw()
p4_1 <- draw_cov_point_plot(shop_4_data, 4, "Shop 4") + theme_bw()
p5_1 <- draw_cov_point_plot(shop_5_data, 5, "Shop 5") + theme_bw()
pavg_1 <- draw_cov_point_plot(shop_avg_data, 6, "Average price with average distance") + theme_bw()
pmin_1 <- draw_cov_point_plot(shop_agg_min_data, 7, "Average price with min distance") + theme_bw()
pmax_1 <- draw_cov_point_plot(shop_agg_max_data, 8, "Average price with max distance") + theme_bw()
pall_1 <- get_base_for_plot(joined_shops_data, "All shops") + geom_point(mapping = aes(colour = Shop), alpha=0.3) + theme_bw()
comb_cov_shops <- grid.arrange(p1_1, p2_1, p3_1, p4_1, p5_1,
nrow=2, ncol=3,
top="Covariation between distances and average prices")
comb_cov_aggrs <- grid.arrange(pmin_1, pmax_1,
nrow=2,
top= "Covariation between min/max distances and average prices")
comb_cov_avg <- grid.arrange(pall_1, pavg_1,
nrow=2,
top= "Covariation between distances and average prices (aggregated)")
```
_Analysis:_ There is a strong dependency between the average distance and the average price in a shop. Also, many average price values lie close to zero, so it makes sense to check for zero price values in the current dataset and their influence.
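To quantify this visual impression, the rank correlation between average distance and average price can be computed on the aggregated slice; Spearman's method is used here because the relationship does not look linear.
```{r echo=TRUE, message=FALSE}
# Rank correlation between average distance and average price
cor(shop_avg_data$distance, shop_avg_data$price, method = "spearman")
```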
__4) Visualize data gap for the average price in each shop__
```{r echo=TRUE, message=FALSE, warning=FALSE}
prices <- shop_ordered_slice[,seq(2, 10 ,2)]
names(prices) <- c("Shop 1", "Shop 2", "Shop 3", "Shop 4", "Shop 5")
dataset_with_na <- data.frame(sapply(prices, function(x) {
na_if(x,0)
} ))
plot_missing(dataset_with_na)
```
_Analysis:_ As visualized, there is a data gap in the average price values for each shop. The average price of a shop is only filled in for a customer if that customer actually prefers the shop; otherwise it is left as zero. This can be considered a data gap, but at the same time it is information that is never utilized during the analysis, since it simply means the customer does not choose that particular shop in the first place. This data gap therefore does not affect the tendency of the relationship between price and distance.
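The extent of this gap can be summarized per shop; the sketch below reports the share of customers with a zero average price in each shop, reusing the `prices` slice from the previous chunk.
```{r echo=TRUE, message=FALSE}
# Share of customers with a zero (i.e. missing) average price, per shop
sapply(prices, function(x) round(mean(x == 0), 3))
```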
__5) Visualize patterns for the average price__
```{r echo=TRUE, message=FALSE}
joined_shops_without_null <- filter(joined_shops_data, price != 0)
mod <- lm(log(price) ~ distance, data = joined_shops_without_null)
joined_shops_data2 <- joined_shops_without_null %>%
add_residuals(mod) %>%
mutate(resid = exp(resid))
pall_4 <- ggplot(data = joined_shops_data2, aes(x = Shop, y = resid)) +
geom_bar(stat = "identity", fill = colours_shema[6]) + ggtitle("Average price pattern") +
theme_bw()
pall_4
```
_Analysis:_ As visualized, the residuals give a view of the average price after removing the distance effect. Once the strong relationship between distance and price has been removed, relationships to other external factors become noticeable.
__6) Visualize the revenue generated by each shop__
```{r echo=TRUE, message=FALSE}
## Call the packages
library(tidyverse)
library(ggplot2)
library("RColorBrewer")
library(plotly)
file_path <- "Input Dataset/Cleaned Dataset/Supermarket_DataCleaned.csv"
supermarket_data_clean <- read_csv(file_path)
### Loyalty Score based on Revenue ###
## Calculate the revenue generated for each shop
revenue_shop_1 <- sum(supermarket_data_clean$amount_purchased_shop_1)
revenue_shop_2 <- sum(supermarket_data_clean$amount_purchased_shop_2)
revenue_shop_3 <- sum(supermarket_data_clean$amount_purchased_shop_3)
revenue_shop_4 <- sum(supermarket_data_clean$amount_purchased_shop_4)
revenue_shop_5 <- sum(supermarket_data_clean$amount_purchased_shop_5)
## Create a vector
revenue <- c(revenue_shop_1, revenue_shop_2, revenue_shop_3, revenue_shop_4, revenue_shop_5)
shops <- c("shop 1", "shop 2", "shop 3", "shop 4", "shop 5")
## Create a data frame to store the vectors
revenue_per_shop <- data.frame(shops, revenue)
## Generate a plot
rps_plot <- ggplot(revenue_per_shop, aes(shops, revenue))
## Add features to the plot
rps_plot + geom_bar(stat = "identity", width = 0.6, position = "dodge2") +
xlab("Shops") + ylab("Revenue") +
ggtitle("Revenue Generated") +
theme_bw() + scale_y_continuous(labels = scales::comma)
```
_Analysis:_ As visualized, the revenue generated by each shop can be calculated and used to generate a list of the top N customers based on their contribution, as sketched below.
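As an illustrative sketch of how such a top-N list could be derived, the snippet below ranks customers by their spend in shop 1, assuming the first column of the cleaned dataset holds the customer identifier; the same pattern applies to the other shops and to a full top-100 list.
```{r echo=TRUE, message=FALSE}
# Sketch: top 10 customers by amount spent in shop 1
supermarket_data_clean %>%
  select(1, amount_purchased_shop_1) %>%
  arrange(desc(amount_purchased_shop_1)) %>%
  head(10)
```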
__7) Visualize the most preferred shop by the customer with respect to average unique products purchased and average products purchased__
```{r echo=TRUE, message=FALSE}
library(tidyverse)
library(ggplot2)
library("RColorBrewer")
library(readr)
library(dplyr)
file_path<- "Input Dataset/Cleaned Dataset/Supermarket_Data_Prediction.csv"
supermarket_data_predict <- read_csv(file_path)
ggplot(data = supermarket_data_predict,
       aes(x = products_purchased_avg, y = unique_products_purchased_avg)) +
  geom_point(aes(colour = factor(most_pref_shop))) +
  labs(colour = "Most_Pref_Shop") +
  theme_bw() +
  xlab("Avg products purchased") +
  ylab("Avg unique products purchased")
```
_Analysis:_ As visualized, it can be inferred that, given the data, it is possible to predict which shop a customer would select.
#Final Research Questions
Exploratory Data Analysis provided a feasibility check on the initial RQs formulated. The EDA phase helped to obtain a better understanding of the data in relation to the project objective, leading to the modification and removal of RQs as well as the addition of new ones. Below is the final set of RQs to be answered through this project:
__1. Are customers willing to travel long distances to purchase products?__ <br />
_Relevance:_ This will help to understand the majority of the customer trends towards long distance travel to purchase products.<br />
__2. What are the factors that contribute towards the long distance travel of the customer to purchase products?__ <br />
_Relevance:_ This will help to understand the important factors that contribute towards the majority of the customers willing to travel long distances to purchase products, in turn better understanding the purchase behaviour of the customers.<br />
__3. What is the maximum likelihood of a customer to select a particular shop?__ <br />
_Relevance:_ This will help to understand which shops in the retail chain are most likely to be preferred by new customers. This can further facilitate better stock management to meet increasing customer demands.<br />
__4. What are the different customer segments based on their purchase behaviour?__<br />
_Relevance:_ This will help to understand the groups of similar customers based on their purchase behaviour and to target different shop-level marketing schemes to the appropriate customers.<br />
__5. Which are the Top 100 customers that are most profitable in terms of revenue generation for each shop?__<br />
_Relevance:_ This will help to understand the top profitable customers for the business and help to target appropriate loyalty programs to generate satisfied loyal customers as advocates for the business.
#Final Analysis
__1. Are customers willing to travel long distances to purchase products?__ <br />
__Algorithms selected:__ Support Vector Machine (SVM), K-nearest neighbor (k-NN), Random Forest
__Reason for Algorithm Selections:__ Given that the relationship between the variables may be non-linear and that we are handling a two-class classification problem, SVM is an accurate choice [@hsu2003practical].
The k-NN algorithm generally runs slower and has lower accuracy than SVM, but exhibits certain practical qualities. As a lazy algorithm, k-NN is easy to train; consequently, it is easy to use and its results are easy to understand [@soucy2001simple].
In comparison with k-NN, Random Forest is a great algorithm to train early in the model development process, to see how it performs. Regarding the difference between tree-based and vector-based algorithms, it should be mentioned that Random Forest can use the data as it is, whereas SVM maximizes the "margin" and thus relies on a concept of "distance" between points. This tree-based algorithm is also very hard to beat in terms of performance. Moreover, in contrast to SVM and k-NN, Random Forest does not demand parameter tuning to reach high accuracy [@liaw2002classification].
__Features Selected:__ distance_shop_1, distance_shop_2, distance_shop_3, distance_shop_4, distance_shop_5, products_purchased_shop_1, products_purchased_shop_2, products_purchased_shop_3, products_purchased_shop_4, products_purchased_shop_5, unique_products_purchased_shop_1, unique_products_purchased_shop_2, unique_products_purchased_shop_3, unique_products_purchased_shop_4, unique_products_purchased_shop_5, avg_product_price_shop_1, avg_product_price_shop_2, avg_product_price_shop_3, avg_product_price_shop_4, avg_product_price_shop_5, amount_purchased_shop_1, amount_purchased_shop_2, amount_purchased_shop_3, amount_purchased_shop_4, amount_purchased_shop_5, class are of importance and are selected for this RQ.
__Analysis__
__1) Data preparation for classification__
```{r echo=TRUE, message=FALSE}
library(caret)
library(randomForest)
library(e1071)
library(ggplot2)
library(dplyr)
library(tidyverse)
# Creating of useful functions
create_conf_matrix <- function(refLabels, predictLabels, positiveLabel){
conf_matrix <- confusionMatrix(
refLabels, # reference labels
predictLabels, # predicted labels
positive = positiveLabel, # label that corresponds to a "positive" results (optional)
dnn = c("actual", "predicted") # names of the confusion matrix dimensions (optional)
)
return (conf_matrix)
}
get_evaluation <- function(refLabels, predictLabels, positiveLabel){
  conf_matrix <- create_conf_matrix(refLabels, predictLabels, positiveLabel)
  # print() is required inside a function body; a bare expression would not be displayed
  print(conf_matrix$table)
  print(conf_matrix$overall["Accuracy"])
  print(conf_matrix$byClass["Sensitivity"])
  print(conf_matrix$byClass["Specificity"])
}
# Data preparation
cleared_supermarket_data <- read_csv("Input Dataset/Cleaned Dataset/Supermarket_Data_Classification.csv")
cleared_supermarked_tbl <- tbl_df(cleared_supermarket_data)
cleared_supermarked_tbl$class <- as.factor(cleared_supermarked_tbl$class)
```
__2) k-fold cross validation__
```{r echo=TRUE, message=FALSE}
# createFolds() with list = FALSE returns a stratified fold index (1-5) per row
flds <- createFolds(factor(cleared_supermarked_tbl$class), k = 5, list = FALSE)
comb_factor <- tbl_df(cbind(cleared_supermarked_tbl, flds))
# Folds 1, 3 and 4 form the training set (~60%); folds 2 and 5 form the test set (~40%)
train_folders <- c(1,3,4)
test_folders <- c(2,5)
train_data <- cleared_supermarked_tbl[comb_factor$flds %in% train_folders,]
test_data <- cleared_supermarked_tbl[comb_factor$flds %in% test_folders,]
# splitting data to test and training
train_ds <- train_data[, -26]
y_train <- train_data %>%
pull(class)
y_test <- test_data %>%
pull(class)
test_ds <- test_data[, -26]
```
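Since `createFolds()` stratifies on the class label, the resulting split should preserve the class proportions; a quick check compares the training and test sets.
```{r echo=TRUE, message=FALSE}
# Verify that the stratified split preserves the class proportions
round(prop.table(table(y_train)), 3)
round(prop.table(table(y_test)), 3)
```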
__3) Functions for classifying and plotting__
```{r echo=TRUE, message=FALSE}
classify_with_fit <- function(fit, title){
train_predicted <- predict(fit, train_ds, type = "class")
print("Evaluation for the training")
get_evaluation(y_train, train_predicted, "Yes")
predicted <- predict(fit, test_ds, type = "class")
print("Evaluation for the tests")
get_evaluation(y_test, predicted, "Yes")
draw_plot_for_classes(test_ds, predicted, title)
}
draw_plot_for_classes <- function(data, predicted, title){
plot_data <- cbind(data, predicted)
ggplot(plot_data, aes(x = predicted, fill = predicted)) +
geom_bar() +
xlab("Prediction") + ylab("Customer count") +
theme_bw() +
theme(legend.title = element_blank()) +
ggtitle(title)
}
```
__Classification__
__4) Classification with SVM __
```{r echo=TRUE, message=FALSE}
# tune SVM over a grid of gamma and cost values (both belong in `ranges` to be tuned)
tuneSvm <- tune(svm, class ~ ., data = train_data,
                ranges = list(gamma = 2^(-1:1), cost = 2^(2:4)),
                tunecontrol = tune.control(sampling = "fix"))
summary(tuneSvm)
plot(tuneSvm)
# classify with the selected parameters
svmFit <- svm(class ~ ., data = train_data, kernel = "radial",
cost = 1, gamma = 0.5,
scale=TRUE, cachesize=95)
#plot(svmFit, train_data)
classify_with_fit(svmFit, "SVM classification")
```
__5) Classification with k-NN __
```{r echo=TRUE, message=FALSE}
# train knn classifier
set.seed(400)
ctrl <- trainControl(method="repeatedcv",repeats = 3)
knnTrain <- train(class ~ ., data = train_data, method = "knn",
trControl = ctrl, preProcess = c("center","scale"), tuneLength = 20)
plot(knnTrain)
# classify with the selected k (based on the tuning curve above)
knnFit <- knn3(train_ds, y_train, k = 7)
classify_with_fit(knnFit, "k-NN classification")
```
__6) Classification with Random Forest __
```{r echo=TRUE, message=FALSE}
randomFit <- randomForest(class ~ ., train_data, ntree=500)
classify_with_fit(randomFit, "Random Forest")
```
__Observations:__ It was found that, yes, the majority of the customers are ready to travel long distances to purchase products, and that this willingness is affected by certain factors. Binary classification makes it possible to divide the data into two separate classes, which helps to understand whether a customer will travel a long distance to the shop or not, based on those factors. Thus, the research question is answered by classifying the customers on the selected features to identify whether or not the majority of them are ready to travel long distances to purchase products.
__Applications:__ These insights can be further utilized by the business to understand the behaviour trend of the majority of its customers related to long-distance travel to purchase products. This further paves the way for the business to understand the reasons behind such majority trends. Eventually, this helps the business to devise strategies for store locations, coupled with enhancing the factors influencing such trends, thus generating more revenue for the business with increased customer satisfaction.
__2. What are the factors that contribute towards the long distance travel of the customer to purchase products?__ <br />
__Algorithms selected:__ Custom Algorithm and as a follow-up investigation for RQ1
__Features Selected:__ Same as those selected for RQ1
__Analysis__
__1) Determine the responsible factors__
Out of the 5 shops in the chain, every customer has a shop which they prefer the most. To find out the factors responsible for customers selecting a particular shop as their most preferred shop, 7 new columns were generated: the shop closest to each customer (a), the shop from which they buy the most products (b), the shop from which they buy the most unique products (c), the shop at which they have the least average price (d), the shop at which they spend the most money (e), the most preferred shop (f) and the factor based on which they choose their most preferred shop (g). To generate the values for column (f), the most frequent value among columns (b), (c), (d) and (e) is selected and assigned to the customer as the most preferred shop. Based on the most preferred shop for each customer, the different factors are assigned. Customers whose most preferred shop is the closest shop, the shop with the least average price or the shop with the most unique products purchased are assigned 'dist', 'price' or 'satisf' respectively as the factor. If a combination of these three factors applies, they are assigned 'dist_price', 'dist_satisf', 'price_satisf' or 'dist_price_satisf' accordingly.
Every product has a sophistication value attached to it and is meant to satisfy the needs of a customer. The higher the sophistication value of a product, the higher the satisfaction it provides to the buyer. Products needed for daily use, such as bread or water, are bought in bulk and are considered less sophisticated than products bought in comparatively smaller quantities, such as a DVD player or a coffee flask. So, in this RQ it is assumed that the higher the number of unique products bought, the higher the satisfaction of the customer and the more cost they will be ready to pay, either in terms of distance travelled or in terms of the price of the items.
```{r echo=TRUE, message=FALSE}
library(tidyverse)
library(dplyr)
## Set file path
file_path <- "Input Dataset/Cleaned Dataset/Supermarket_DataCleaned.csv"
## Read data from a file
supermarket_data_clean <- read_csv(file_path)
## List of all features present in the data frame
all_features <- colnames(supermarket_data_clean)
## List of selected features
sel_features <- all_features[-c(1,2,8,9,10,16,22,28,34,35,36,37,38,39,40)]
## Create a data frame with only the selected features
supermarket_data_model <- supermarket_data_clean %>% select(sel_features)
## Generate new columns to find the closest shop, the shop from which most products are purchased, the shop from which most unique products are purchased, the shop offering the least average product price and the shop at which the maximum amount of money is spent by a customer.
supermarket_data_model$min_dist <- str_sub(colnames(supermarket_data_model[,1:5]),-1,-1)[apply(supermarket_data_model[,1:5],1,which.min)]
supermarket_data_model$most_prod_purch_from <- str_sub(names(supermarket_data_model[,6:10]),-1,-1)[max.col(supermarket_data_model[,6:10], "last")]
supermarket_data_model$most_uni_prod_purch_from <- str_sub(names(supermarket_data_model[,11:15]),-1,-1)[max.col(supermarket_data_model[,11:15], "last")]
supermarket_data_model$least_avg_prod_pri <- str_sub(colnames(supermarket_data_model[,16:20]),-1,-1)[apply(supermarket_data_model[,16:20],1,which.min)]
supermarket_data_model$max_amt_purch <- str_sub(names(supermarket_data_model[,21:25]),-1,-1)[max.col(supermarket_data_model[,21:25], "last")]
## Create a data frame having only the newly generated columns
test <- supermarket_data_model[,26:30]
## Generate new columns - most preferred shop and the customer's category based on the factors
for (row in 1:nrow(test)){
## Create a vector for each row
vec <- c(test[row, "most_prod_purch_from"], test[row, "most_uni_prod_purch_from"], test[row, "least_avg_prod_pri"], test[row, "max_amt_purch"])
## Sort and find the most preferred shop
supermarket_data_model[row, "most_pref_shop"] <- names(sort(summary(as.factor(unlist(vec))), decreasing=T)[1:1])
## Assign labels to customers (0 or 'dist' - 'distance', 1 or 'price' - 'price', 2 or 'satisf' - 'satisfaction', 3 or 'dist_price' - 'distance and price', 4 or 'dist_satisf' - 'distance and satisfaction', 5 or 'price_satisf' - 'price and satisfaction' and 6 or 'dist_price_satisf' - 'distance, price and satisfaction')
if (supermarket_data_model[row, "min_dist"] == supermarket_data_model[row, "most_pref_shop"] && supermarket_data_model[row, "least_avg_prod_pri"] == supermarket_data_model[row, "most_pref_shop"] && supermarket_data_model[row, "most_uni_prod_purch_from"] == supermarket_data_model[row, "most_pref_shop"]){
supermarket_data_model[row, "factor"] <- 6
}
else if (supermarket_data_model[row, "min_dist"] == supermarket_data_model[row, "most_pref_shop"] && supermarket_data_model[row, "least_avg_prod_pri"] == supermarket_data_model[row, "most_pref_shop"]){
supermarket_data_model[row, "factor"] <- 3
}
else if (supermarket_data_model[row, "min_dist"] == supermarket_data_model[row, "most_pref_shop"] && supermarket_data_model[row, "most_uni_prod_purch_from"] == supermarket_data_model[row, "most_pref_shop"]){
supermarket_data_model[row, "factor"] <- 4
}
else if (supermarket_data_model[row, "least_avg_prod_pri"] == supermarket_data_model[row, "most_pref_shop"] && supermarket_data_model[row, "most_uni_prod_purch_from"] == supermarket_data_model[row, "most_pref_shop"]){
supermarket_data_model[row, "factor"] <- 5
}
else if (supermarket_data_model[row, "min_dist"] == supermarket_data_model[row, "most_pref_shop"]){
supermarket_data_model[row, "factor"] <- 0
}
else if (supermarket_data_model[row, "least_avg_prod_pri"] == supermarket_data_model[row, "most_pref_shop"]){
supermarket_data_model[row, "factor"] <- 1
}
else if (supermarket_data_model[row, "most_uni_prod_purch_from"] == supermarket_data_model[row, "most_pref_shop"]){
supermarket_data_model[row, "factor"] <- 2
}
else{
supermarket_data_model[row, "factor"] <- 7
}
}
## Re-order columns
supermarket_data_clus <- supermarket_data_model[c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,32)]
```
__2) Visualize the responsible factors__
```{r echo=TRUE, message=FALSE}
## Create the legend for the plot
Legend <- c("Distance", "Price", "Distance and Price", "Satisfaction", "Distance and Satisfaction", "Price and Satisfaction", "Distance, Price and Satisfaction", "Others")
## Generate the count of columns assigned to each factor
dist <- length(which(supermarket_data_model$factor == 0))
price <- length(which(supermarket_data_model$factor == 1))
dist_price <- length(which(supermarket_data_model$factor == 3))
satisf <- length(which(supermarket_data_model$factor == 2))
dist_satisf <- length(which(supermarket_data_model$factor == 4))
price_satisf <- length(which(supermarket_data_model$factor == 5))
dist_price_satisf <- length(which(supermarket_data_model$factor == 6))
others <- length(which(supermarket_data_model$factor == 7))
## Create a vector of values to be shown in the plot
values <- c(dist, price, dist_price, satisf, dist_satisf, price_satisf, dist_price_satisf, others)
## Create a vector of class labels; c("no", "yes") is recycled across the 8 factor counts,
## so distance-based factors fall under class 'no' and the remaining factors under class 'yes'
labels <- c("no", "yes")
## Create a data frame to store the vectors
factor_count <- data.frame(labels, values)
## Generate a plot
fc_plot <- ggplot(factor_count, aes(labels, values))
## Add features to the plot
fc_plot + geom_bar(stat = "identity", aes(fill = Legend)) +
xlab("Class") + ylab("Total") +
ggtitle("Customer Classification") +
theme_bw() + scale_y_continuous(labels = scales::comma)
```
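A plain tabulation of the factor codes backs the counts discussed below (0 = distance, 1 = price, 2 = satisfaction, 3-6 = combinations, 7 = others).
```{r echo=TRUE, message=FALSE}
# Raw counts per factor code underlying the stacked bars above
table(supermarket_data_model$factor)
```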
__Observations:__ It was observed that out of the 42499 customers who are willing to travel, 42001 choose their most preferred shop based on their 'Satisfaction' or 'Shopping experience', whereas the others decide based on 'Price' (4), 'Price and Satisfaction' (19) and 'Others' (475). Out of the 17866 customers who do not like to travel long distances and select the closest shop as their most preferred shop, 17221 opted based on the factor 'Distance and Satisfaction', whereas the others decided based on 'Distance' (620), 'Distance and Price' (0) and 'Distance, Price and Satisfaction' (25). Thus, the research question is answered by determining the factors responsible for the majority trend of customers towards long-distance travel. Additionally, it was found that 'Satisfaction' is a key factor affecting a customer's decision-making process.
__Applications:__ These insights can be further utilized by the business to devise strategies that enhance the most important observed factors, generating customer satisfaction and retention and eventually steady growth for the business.
__3. What is the maximum likelihood of a customer to select a particular shop?__ <br />
__Algorithms selected:__ Naive Bayes, Decision tree
__Reason for Algorithm Selections:__ Naive Bayes is a supervised classifier which works on the assumption that all attributes are independent of each other. Because of this, all attributes can be learned separately, which results in faster performance. Its accuracy rate, however, is lower than that of a Decision tree [@john1995estimating].
A Decision tree is a type of supervised learning algorithm that can be used for both regression and classification problems. A small change in the data can cause a large change in the final estimated tree. However, Decision trees are intuitively very easy to explain, as they closely mirror human decision-making compared to other regression and classification approaches [@safavian1991survey].
__Features Selected:__ distance_avg, products_purchased_avg, unique_products_purchased_avg, product_price_avg, amount_purchased_avg.
__Analysis__
__1) Data preparation for classification__
```{r echo=TRUE, message=FALSE}
library(caret)
library(dplyr) # Used by caret
library(e1071)
library(rpart)
library(readr)
library(tidyverse)
library(rpart.plot)
file_path<- "Input Dataset/Cleaned Dataset/Supermarket_Data_Prediction.csv"
supermarket_data_predict <- read_csv(file_path)
supermarket_data_predict$most_pref_shop <- factor(supermarket_data_predict$most_pref_shop)
```
__2) Stratified k-fold cross validation__
```{r echo=TRUE, message=FALSE}
#stratified k-fold(5)
set.seed(123)