Predict future consumption of customers using online trading customer log data
- Binary classification: does the customer's purchase amount for December 2011 exceed 300? Yes (1) / No (0)
Model Structure
Three machine learning models and one deep learning model are used for binary classification:
- XGBoost
- LightGBM
- CatBoost
- TabNet
Structured (tabular) data is the foundation of data analysis. Working with it gives practice in EDA and preprocessing, and builds overall data analysis skills through machine learning modeling.
As more customers shop online, the volume of customer log data keeps growing. This project uses past online customer log data to predict and analyze customers' future consumption.
A check of the total monthly purchase amounts showed that consumption is concentrated at the end of the year, so the goal is a model that supports marketing through promotions to customers in December. Online transaction log data are provided from December 2009 to November 2011. Using the data up to November 2011, the task is to predict whether each customer's purchase amount for December 2011 exceeds 300.
Each purchase generates one record. The training dataset covers December 2009 to November 2011 and has the following fields:
- Information about the customer (customer_id / country)
- Product Information (product_id / description / price)
- Transaction Information (order_id / order_date / quantity)
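As a rough illustration of how the binary target could be derived from these fields, here is a minimal pandas sketch. It assumes the purchase amount is `price * quantity` and that `order_date` is already a datetime column; the function name `make_label` and its parameters are hypothetical and not taken from this repo.

```python
import pandas as pd


def make_label(logs, year_month, threshold=300):
    """Return a 0/1 Series indexed by customer_id.

    1 means the customer's total for `year_month` (e.g. "2011-11") exceeds `threshold`.
    Assumption: purchase amount = price * quantity.
    """
    logs = logs.assign(total=logs["price"] * logs["quantity"])

    # Sum each customer's purchases inside the target month.
    in_month = logs["order_date"].dt.strftime("%Y-%m") == year_month
    monthly_total = logs[in_month].groupby("customer_id")["total"].sum()

    # Customers with no purchase in the target month get a total of 0.
    customers = pd.Index(logs["customer_id"].unique(), name="customer_id")
    monthly_total = monthly_total.reindex(customers, fill_value=0)

    return (monthly_total > threshold).astype(int)
```

Since December 2011 itself is not in the training data, a function like this would presumably be pointed at an earlier month that is covered (for example `"2011-11"`, using only logs before that month as features) to build training labels.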
- Aggregation with Time Series: Create features over separate time periods so that models can learn trends and cycles, rather than considering all months at once.
  - feature: price, quantity, total, unique order_id
  - aggregation: 'max', 'min', 'sum', 'mean', 'count', 'std', 'skew', mean_ewm, last_ewm, avg_diff
- Customer purchasing capabilities
  - Number of months with spending of $300 or more
  - Number of months the customer has used the online shopping mall
  - Percentage of months with spending of $300 or more during the customer's time on the shopping mall
  - Average of total over the time the customer has used the online shopping mall
- Feature Selection with Permutation Importance
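The feature engineering steps above (windowed aggregation, purchasing-capability counts, permutation-importance selection) might look roughly like the sketch below. Several things here are assumptions rather than what `preprocess.py` does: a precomputed `total` column, the window sizes, aggregating only `total` (the EWM and `avg_diff` features are left out), reading "months the customer has used the shopping mall" as months with at least one purchase, and scikit-learn's `GradientBoostingClassifier` standing in for the boosting models listed earlier. All function names are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split


def window_aggregates(logs, cutoff, months):
    """Aggregate `total` per customer over the `months` months before `cutoff`."""
    cutoff = pd.Timestamp(cutoff)
    start = cutoff - pd.DateOffset(months=months)
    in_window = (logs["order_date"] >= start) & (logs["order_date"] < cutoff)
    agg = logs[in_window].groupby("customer_id")["total"].agg(
        ["max", "min", "sum", "mean", "count", "std", "skew"]
    )
    return agg.add_prefix("total_").add_suffix(f"_last{months}m")


def monthly_capability(logs, threshold=300):
    """Per-customer counts of active months and months at or above `threshold`."""
    month = logs["order_date"].dt.to_period("M")
    monthly = logs.groupby(["customer_id", month])["total"].sum()
    per_customer = monthly.groupby(level="customer_id")
    out = pd.DataFrame({
        "active_months": per_customer.size(),
        "months_over": (monthly >= threshold).groupby(level="customer_id").sum(),
        "avg_monthly_total": per_customer.mean(),
    })
    out["over_ratio"] = out["months_over"] / out["active_months"]
    return out


def build_features(logs, cutoff, windows=(3, 6, 12), threshold=300):
    """Join windowed aggregates and capability features on customer_id."""
    history = logs[logs["order_date"] < pd.Timestamp(cutoff)]
    parts = [window_aggregates(history, cutoff, m) for m in windows]
    parts.append(monthly_capability(history, threshold))
    # Customers missing from a window have no purchases there; fill with 0.
    return pd.concat(parts, axis=1).fillna(0)


def select_by_permutation(X, y, n_repeats=5, seed=0):
    """Keep features whose shuffling lowers validation AUC (positive importance)."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    model = GradientBoostingClassifier(random_state=seed).fit(X_tr, y_tr)
    result = permutation_importance(
        model, X_val, y_val, scoring="roc_auc", n_repeats=n_repeats, random_state=seed
    )
    importance = pd.Series(result.importances_mean, index=X.columns)
    importance = importance.sort_values(ascending=False)
    return importance[importance > 0]
```

Measuring importance on a held-out split rather than on the training rows is a design choice of this sketch; it avoids rewarding features the model has merely memorized.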
- preprocess.py: feature engineering on the train dataset
- Catboost_train.py: trains the CatBoost model
- LightGBM_train.py: trains the LightGBM model
- XGBoost_train.py: trains the XGBoost model
- TabNet.ipynb: trains the TabNet model
- Ensemble.ipynb: builds the ensemble result
- PyCaret.ipynb: uses PyCaret to select the model that best fits the task
- utils.py: config.json parser, score logging
- paramter_tuning_optuna.ipynb: hyperparameter tuning by Bayesian optimization using Optuna
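The README does not say how the model outputs are combined, so the following is only a sketch of one common option: a (weighted) average of each model's predicted probabilities, scored with AUC. The function name `blend_and_score` and the dict-of-arrays input are assumptions for illustration, not the contents of `Ensemble.ipynb`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def blend_and_score(y_true, predictions, weights=None):
    """Average per-model probabilities and report the AUC of the blend.

    predictions: dict mapping a model name to a 1-D array of P(label == 1),
                 all aligned to the same rows as y_true.
    weights:     optional per-model weights, in the dict's insertion order.
    """
    # One column per model, one row per customer.
    stacked = np.column_stack(
        [np.asarray(p, dtype=float) for p in predictions.values()]
    )
    blended = np.average(stacked, axis=1, weights=weights)
    return blended, roc_auc_score(y_true, blended)
```

Calling it with validation-set predictions from each model, for example `blend_and_score(y_val, {"xgboost": p_xgb, "lightgbm": p_lgbm})`, would return the blended probabilities together with their validation AUC.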
Final Leaderboard
- 10th / 100
- AUC: 0.8627