Rossmann Sales Predictions

Predicting the future sales of Rossmann Drugstore

obs: Still the dataset and the company are real, the business problem is ficticious. The dataset was collected from [Kaggle]

Business Problem

Rossmann is a drugstore chain that operates over 3000 stores in 7 European country. Rossmann's CFO required from the managers a forecast of the sales of each store, because he needs to assure that will have enough budget to make the renovation of them.

The Data Science project is focused on solving the following:

Delevop a sales prediction for next 6 weeks for each store to ensure taht budget
Deploy the model of predictions on a Telegram Bot

Business Assumption

Data overview

Click to see the description of the columns

Feature	Definition
Id	an Id that represents a (Store, Date) duple within the dataset.
Store	a unique Id for each store.
Sales	the turnover for any given day.
DayOfWeek	day of week on which the sale was made (e.g. DayOfWeek=1 -> monday, DayOfWeek=2 -> tuesday, etc).
Date	date on which the sale was made.
Customers	the number of customers on a given day.
Open	an indicator for whether the store was open: 0 = closed, 1 = open.
StateHoliday	indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None.
SchoolHoliday	indicates if the (Store, Date) was affected by the closure of public schools.
StoreType	differentiates between 4 different store models: a, b, c, d.
Assortment	describes an assortment level: a = basic, b = extra, c = extended.
CompetitionDistance	distance in meters to the nearest competitor store.
CompetitionOpenSince(Month/Year)	gives the approximate year and month of the time the nearest competitor was opened.
Promo	indicates whether a store is running a promo on that day.
Promo2	Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating.
Promo2Since(Year/Week)	describes the year and calendar week when the store started participating in Promo2.
PromoInterval	describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store.

Assumptions

For missing values in the column "competition_distance" we assume that there is no competitor nearby, and then it was replaced for a greater value than the maximum value of "competition_distance"

Column "customer", which represents the number of customers, was dropped when training the model since there is no information about this number in the future

Click to see the new features created

New Feature	Definition
year/month/day/week_of_year/year_week	year/month/day/week_of_year/year_week extraced from the column 'date'
day_sin/day_cos/month_sin/month_cos/week_of_year_sin/week_of_year_cos/day_of_week_sin/day_of_week_cos	features derived in sin/cos to capture their ciclycal atribute
competition_since	date since the competition was opened
competiton_time_month	period in months since the competition started
promo_since	concatenation of 'promo2_since_year' and 'promo2_since_week'
promo_time_week	time in weeks from when the promotion was active.
state_holiday(christmas/easter_holiday/public_holiday/regular_day)	indicates wheter the sale was made in christmas, easter, public holiday or regular day.
is_promo2	whether the purchase occurred during an active promo2 (1) or not (0)

Solution Strategy

Step 01. Data Description. The goal is to use statistcs metrics to identify data outside the scope of the business

Step 02. Feature Engineering. Derive new attributes based on the original variables to better describe the phenomenon that will be modeled

Step 03. Data Filtering. Filter rows and select columns that do not contain information for modeling or that do not match the scope of the business

Step 04. Exploratory Data Analysis. Explore the data to find insights and better understand the impact of variables on model learning

Step 05. Data Preparation. Prepare the data so that the Machine Learning models can learn specific behavior.

Step 06. Feature Selection. Selection of the most significant attributes for training the model

Step 07. Machine Learning Modeling. Machine Learning model training

Step 08. Hyperparameter Fine Tuning. Choose the best values for each of the parameters of the model selected from the previous step

Step 09. Convert Model Performance to Business Values. Convert the performance of the Machine Learning model into a business result

Step 10. Deploy Model to Production. Publish the model in a cloud environment so that other people or services can use the results to improve the business decision

Top 3 business insights

1. Closer competition does not implicate reduced sales

2. Older competition has a negative correlation with a lowering in sales

3. Extended promotions do not implicate greater sales

Machine Learning Models Applied

At this project, six models was trained:

Average Model (used as baseline model)
Linear Regression
Lasso Regression (Regularized Linear Regression)
Random Forest Regressor
XGBoost Regressor

Machine Learning Models Performance

After time series cross-validation (except for the average model) it obtained the following results:

Model	MAE	MAPE	RMSE
Linear Regression	2094.24+/-301.77	0.3+/-0.02	2094.24+/-301.77
Lasso Regression	2388.68+/-398.48	0.34+/-0.01	2388.68+/-398.48
Random Forest Regressor	842.21+/-236.42	0.12+/-0.03	842.21+/-236.42
XGBoost Regressor	1675.93+/-176.76	0.23+/-0.01	1675.93+/-176.76

Though Random Forest Regressor obtained the lower RMSE, XGBoost was the chosen model, because of its time of execution, and an RMSE low enough for the purpose of this project. Its parameter was tuned and the final performance on the test data obtained was:

Model	MAE	MAPE	RMSE
XGBoost Regressor	1084.09	0.16	1575.98

Metrics definition and interpretation

MAE: Mean Absolute Error
MAPE: Mean Absolute Percentage Error
RMSE: Root Mean Squared Error

RMSE is a metric key to check statistical performance and is used as a standard for measuring performance models.

Both MAE and MAPE, indicate the average error in absolute and relative terms, respectively, and are useful for explaining to the business teams the model's errors

Business Results

Recapping the business problem, the CFO needs to have 6 six weeks of prediction sales, in order to have the necessary budget to reform the stores. The answer to this question is presented below in an aggregate value (total sum of sales), as well as the best and worst scenarios, calculated from the MAE.

Scenario	Values
Predictions	$ 276,156,800.00
Worst Scenario	$ 275,390,761.93
Best Scenario	$ 276,922,824.34

Considering the 1115 stores, it gives a mean of $ 247674/store in the next 6 weeks.

Model Deployment

The prediction for each store for the next 6 weeks can be accessed in a Telegram Bot. The ideia behind this is that the results can be acessed at any time with any device with a proper internet connection.

To use this bot, you have to inform the id store, after passing a / (e.g. /12, /34, etc).

Link to chat with Telegram Bot

Next Steps

Improve the EDA to find out more insights
Test other machine learning algorithms
Forecast of the number of customers, since this feature has a high correlation with sales

Conclusion

This project could address the requisition of a CFO and make predictions of 6 six weeks of sales for each store. Then it will be possible to estimate the budget under disposal to make the renovation of stores and a Telegram Bot to access this information with a device connected to the internet

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Rossmann Sales Predictions

Predicting the future sales of Rossmann Drugstore

Business Problem

Business Assumption

Data overview

Assumptions

Solution Strategy

Top 3 business insights

Machine Learning Models Applied

Machine Learning Models Performance

Metrics definition and interpretation

Business Results

Model Deployment

Next Steps

Conclusion

Files

README.md

Latest commit

History

README.md

File metadata and controls

Rossmann Sales Predictions

Predicting the future sales of Rossmann Drugstore

Business Problem

Business Assumption

Data overview

Assumptions

Solution Strategy

Top 3 business insights

Machine Learning Models Applied

Machine Learning Models Performance

Metrics definition and interpretation

Business Results

Model Deployment

Next Steps

Conclusion