Elevating Retail Analytics: Predicting Ecuador's Grocery Sales - The Favorita Forecasting Challenge!
Welcome to the exciting world of time series forecasting, where we embark on a journey to predict store sales for Favorita, one of Ecuador's largest and most prominent grocery retailers.
In this data-driven project, we delve deep into the intricate art of predictive analytics, armed with historical sales data, cutting-edge machine learning techniques, and the drive to optimize the future of retail.
This project follows the CRISP-DM (Cross-Industry Standard Process for Data Mining) framework to explore and analyze store sales for Favorita. The aim is to identify models that accurately predict the dependent variable (sales) from the independent variables, such as store, product family, and promotions.
- Project Overview
- Project Structure
- Data Dictionary
- Project Highlights
- Summary
- Hypothesis Investigated
- Exploratory Data Analysis (EDA)
- Model Selection
- Recommendations
- Getting Started
- License
- Author
- code/ : Contains the dataset used for analysis and the Jupyter notebook detailing the data exploration, preprocessing, and model building steps.
- article/ : Holds the project-related article.
- LICENSE : Project license.
- README.md : Project overview, links, highlights, and information.
Dataset | Description |
---|---|
train.csv | Training data containing time series of the features store_nbr, family, and onpromotion, as well as the target sales. |
test.csv | Test data with the same features as the training data. Predict target sales for these dates. |
transaction.csv | Contains date, store_nbr, and transactions made on specific dates. |
sample_submission.csv | Sample submission file in the correct format. |
stores.csv | Store metadata, including city, state, type, and cluster. |
oil.csv | Daily oil price data, including values during both the train and test data timeframes. |
holidays_events.csv | Holidays and events data, with metadata. |

Column | File | Description |
---|---|---|
store_nbr | train.csv | Identifies the store where the products are sold. |
family | train.csv | Identifies the type of product sold. |
sales | train.csv | Total sales for a product family at a specific store on a given date (can be fractional). |
onpromotion | train.csv | Total number of items in a product family that were being promoted at a store on a given date. |
cluster | stores.csv | Grouping of similar stores. |
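As a rough illustration of how these files relate, the sketch below joins training rows with store metadata on store_nbr and totals sales per city. It uses only the Python standard library and tiny made-up samples in place of the real CSV files; the sample values and the helper names (`read_rows`, `join_train_with_stores`, `total_sales_by`) are assumptions for this example, not code from the project notebook.

```python
import csv
import io
from collections import defaultdict

# Made-up samples that mimic the documented columns; not real Favorita data.
TRAIN_CSV = """date,store_nbr,family,sales,onpromotion
2017-01-01,1,GROCERY I,100.5,3
2017-01-01,2,GROCERY I,80.0,0
2017-01-02,1,BEVERAGES,40.25,1
"""

STORES_CSV = """store_nbr,city,state,type,cluster
1,Quito,Pichincha,D,13
2,Guayaquil,Guayas,A,6
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def join_train_with_stores(train_rows, store_rows):
    """Left-join training rows with store metadata on store_nbr."""
    stores_by_nbr = {row["store_nbr"]: row for row in store_rows}
    joined = []
    for row in train_rows:
        meta = stores_by_nbr.get(row["store_nbr"], {})
        joined.append({**row, **meta})
    return joined


def total_sales_by(rows, key):
    """Sum the sales column grouped by the given key (e.g. city or cluster)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row.get(key, "unknown")] += float(row["sales"])
    return dict(totals)


joined = join_train_with_stores(read_rows(TRAIN_CSV), read_rows(STORES_CSV))
sales_by_city = total_sales_by(joined, "city")
print(sales_by_city)
```

With the real files, `read_rows` would be replaced by reading train.csv and stores.csv from disk; the same grouping works for cluster, state, or type.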
- Employed a holistic approach, embracing the CRISP-DM framework, to gain a deep understanding of retail dynamics.
- Mined invaluable insights from extensive exploratory data analysis, unveiling hidden trends and patterns within the dataset.
- Built predictive models, including XGBoost, which achieved the lowest RMSLE (0.0054) among the models evaluated.
- Tuned hyperparameters to improve the predictive performance of the models.
- Wrote an article describing the project's approach, results, and potential applications in retail forecasting.
Code | Name | Published Article | Deployed Dashboard |
---|---|---|---|
LP3 | Sales Time Series Forecasting (Prediction) | Read Article | View Dashboard |
Null Hypothesis (H0) : The number of products under promotion does not influence sales in supermarkets.
Alternate Hypothesis (H1) : The number of products under promotion significantly influences sales in supermarkets.
The rationale for testing these hypotheses is to determine whether there is empirical evidence to support the idea that promotions have a meaningful impact on sales in supermarkets.
By testing these hypotheses and examining the correlation between promotions and sales, businesses can gain valuable insights into the dynamics of supermarket sales and make evidence-based decisions regarding their promotional strategies.
Test Conducted | Pearson Correlation | P-Value |
---|---|---|
Pearson Correlation Test | 0.4180 | < 0.0001 |
In conclusion, the Pearson correlation coefficient between the number of products under promotion (the "onpromotion" column) and sales is approximately 0.4180. The corresponding p-value from the correlation analysis rounds to zero at four decimal places (P-value: 0.0000), so we reject the null hypothesis.
There is a statistically significant, moderate positive correlation (Pearson Correlation Coefficient = 0.4180) between the number of products under promotion and sales in supermarkets. As the number of products under promotion increases, sales tend to increase as well. This is consistent with promotions influencing sales, although a correlation on its own does not establish causation.
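For readers who want to see what the coefficient measures, here is a minimal standard-library sketch of the Pearson correlation and the t-statistic commonly used to test it. The function names and the sample numbers are assumptions made for illustration; this is not the notebook's code, which may rely on a statistics library instead.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two equal-length samples with at least 3 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t-statistic for H0 'no linear correlation', with n - 2 degrees of freedom."""
    if abs(r) >= 1:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1 - r * r))


# Made-up values standing in for the onpromotion and sales columns.
onpromotion = [0, 1, 2, 3, 4, 5]
sales = [10.0, 12.0, 11.0, 15.0, 14.0, 18.0]

r = pearson_r(onpromotion, sales)
t = t_statistic(r, len(sales))
print(f"r = {r:.4f}, t = {t:.2f}")
```

Because the t-statistic grows with the square root of the sample size, a moderate coefficient computed over a large number of rows yields a p-value small enough to print as 0.0000.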
A snapshot of the conducted exploratory data analysis, aimed at addressing pivotal business inquiries during the analysis process.
After carefully assessing the performance of our models using key evaluation metrics, it is evident that the XGBoost model stands out as the most effective choice for our dataset. The RMSLE (Root Mean Squared Logarithmic Error) serves as a crucial indicator, and the XGBoost model achieved the lowest RMSLE of 0.0054 among all models evaluated. This indicates that the XGBoost model provides the most accurate and precise predictions when compared to ARIMA, SARIMA, and ETS models.
Therefore, for this specific forecasting task, we are adopting the XGBoost model for its superior predictive accuracy.
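The metric used for this comparison can be sketched as follows. This is a minimal standard-library version of RMSLE, assuming the usual definition based on log(1 + x); the model names and numbers are hypothetical and do not reproduce the project's results.

```python
import math


def rmsle(actual, predicted):
    """Root Mean Squared Logarithmic Error, using log(1 + x) on both series."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(v < 0 for v in list(actual) + list(predicted)):
        raise ValueError("RMSLE requires non-negative values")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Hypothetical hold-out sales and forecasts from two placeholder models.
actual = [0.0, 10.0, 100.0]
forecasts = {
    "model_a": [0.0, 11.0, 95.0],
    "model_b": [5.0, 20.0, 50.0],
}

scores = {name: rmsle(actual, preds) for name, preds in forecasts.items()}
best_model = min(scores, key=scores.get)  # lower RMSLE is better
print(scores, best_model)
```

Because the error is taken on a log scale, RMSLE penalizes relative rather than absolute differences, which suits sales series whose magnitudes vary widely across stores and product families.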
- Promotion Optimization: Based on the analysis of the impact of promotions on sales, consider optimizing promotion strategies. Identify which types of promotions (e.g., discounts, BOGO offers) have the most significant influence on sales and tailor promotional campaigns accordingly. By focusing promotional efforts on what truly drives sales, you can maximize the return on investment.
- Focus on High-Performing Cities: The top-performing city, "Quito," stands out with the highest sales. It's essential to allocate additional resources and marketing efforts to maintain and potentially increase sales in Quito. Additionally, cities like "Guayaquil," "Cuenca," "Ambato," and "Santo Domingo" have also shown strong sales performance. Consider developing city-specific strategies to capitalize on these markets.
- Cluster-Centric Approach: The analysis reveals that certain clusters, such as "Cluster 14," "Cluster 6," and "Cluster 8," exhibit remarkable sales figures. Invest in understanding the unique characteristics of these clusters and tailor product assortments, promotions, and inventory management strategies to maximize sales potential in these areas.
- Cross-Analysis Opportunities: Explore opportunities to combine the strengths of high-performing cities, clusters, store types, and states. For example, consider aligning promotions with holidays and events in top cities and clusters to maximize sales impact. Additionally, assess whether specific store types thrive in particular cities or clusters, and use this information to refine expansion plans.
- Clone this repository:
git clone https://github.com/fantastic-rambo/LP3-Super-Store-Sales-Forecasting-ML.git
- Navigate to the project directory:
cd LP3-Super-Store-Sales-Forecasting-ML
- Explore the Jupyter notebooks for detailed steps and code execution.
- Read the published article for a comprehensive understanding of the project.
This project is licensed under the MIT License.
Isaac Agbogah Rambo
Connect with me on LinkedIn: LinkedIn Profile
Feel free to star ⭐ this repository if you find it helpful!