LendingClub is a US peer-to-peer lending company, headquartered in San Francisco, California. It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC), and to offer loan trading on a secondary market. Lending Club operates an online lending platform that enables borrowers to obtain a loan, and investors to purchase notes backed by payments made on loans. Lending Club is the world's largest peer-to-peer lending platform. (from Wikipedia).
The goal of this project is to analyze and model Lending Club's issued loans. A summary of the whole project can be found in the corresponding Jupyter notebook: 0. Summary.ipynb.
The loan data is available through multiple sources, including Kaggle Lending Club Loan Data, All Lending Club Loan Data, or Lending Club Statistics. In this project, I use the data from Kaggle Lending Club Loan Data, which contains the issued loan data from 2007 to 2015. In addition, I also use the issued loan data from 2016 from Lending Club Statistics.
The data collection and concatenation process can be found in the corresponding notebook: 1. Data Collection and Concatenation.ipynb.
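As a minimal sketch of the concatenation step, the downloaded files can be stacked with pandas. The file names below are hypothetical placeholders, not the actual paths used in the notebook:

```python
import pandas as pd

# Hypothetical file names -- substitute the actual downloads from
# Kaggle and Lending Club Statistics.
files = ["loan_2007_2015.csv", "loan_2016Q1.csv", "loan_2016Q2.csv",
         "loan_2016Q3.csv", "loan_2016Q4.csv"]

# Read each file and stack the rows into a single DataFrame; sort=False
# keeps the original column order even if the files differ slightly.
frames = [pd.read_csv(f, low_memory=False) for f in files]
loans = pd.concat(frames, ignore_index=True, sort=False)
loans.to_csv("loan_combined.csv", index=False)
```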
- Notebook: 2. Data Cleaning.ipynb
- Notebook: 3. Feature Engineering.ipynb
- Categorical and discrete features: 4. Data Visualization - Discrete Variable.ipynb
- Numerical features: 4. Data Visualization - Numerical Variable.ipynb
- Summary of influential features: 4. Data Visualization Summary.ipynb
Since the above notebooks have relatively large file sizes, there are two suggested ways to view them.
- Download the corresponding HTML files from the folder `./htmls/`
- View the notebooks in nbviewer: https://nbviewer.jupyter.org/
The corresponding nbviewer pages are as follows:
- Categorical and discrete features: 4. Data Visualization - Discrete Variable.ipynb
- Numerical features: 4. Data Visualization - Numerical Variable.ipynb
- Summary of influential features: 4. Data Visualization Summary.ipynb
For binary classification problems, there are several commonly used algorithms, ranging from the widely used Logistic Regression to tree-based ensemble models such as Random Forest and [Boosting](https://en.wikipedia.org/wiki/Boosting_(machine_learning)) algorithms.
For imbalanced classification problems, besides the naive approach of training on the data as-is, there are several re-sampling based methods. The strategies considered here include:
- Without Sampling
- Under-Sampling
- Over-Sampling
- Synthetic Minority Oversampling Technique (SMOTE)
- Adaptive Synthetic (ADASYN) sampling
Here, the performance of several commonly used algorithms is compared under two conditions: without sampling and with over-sampling. The metric used is AUC, the Area Under the ROC Curve.
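As an illustration of how such a comparison can be set up, here is a minimal sketch using the imbalanced-learn package on a synthetic dataset. Logistic Regression stands in for the full set of models, and all names below are illustrative, not the project's actual code:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN

# Synthetic imbalanced data standing in for the loan features/labels.
X, y = make_classification(n_samples=10000, weights=[0.9, 0.1],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

samplers = {"without sampling": None,
            "over-sampling": RandomOverSampler(random_state=42),
            "SMOTE": SMOTE(random_state=42),
            "ADASYN": ADASYN(random_state=42)}

for name, sampler in samplers.items():
    # Re-sample only the training set; the test set stays untouched.
    if sampler is None:
        X_res, y_res = X_train, y_train
    else:
        X_res, y_res = sampler.fit_resample(X_train, y_train)
    model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: AUC = {auc:.4f}")
```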
While the popular scikit-learn package has been widely used for many problems, it requires manual transformation of categorical variables into numerical format, which is not always a good choice. Several newer packages natively support categorical features, including H2O, LightGBM, and CatBoost.
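For instance, LightGBM can consume pandas `category` columns directly, with no manual encoding. A minimal sketch on toy data (the column names are illustrative):

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

# Toy data: one numeric and one categorical feature (illustrative names).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "loan_amnt": rng.uniform(1000, 35000, size=1000),
    "grade": pd.Categorical(rng.choice(list("ABCDEFG"), size=1000)),
})
y = rng.integers(0, 2, size=1000)

# Columns with dtype 'category' are picked up natively by LightGBM;
# no one-hot or label encoding is required.
train_set = lgb.Dataset(df, label=y)
params = {"objective": "binary", "metric": "auc", "verbose": -1}
booster = lgb.train(params, train_set, num_boost_round=50)
```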
In this project, several widely used algorithms are explored, including:
- Logistic Regression
- Random Forest
- Boosting
- Stacked Models (see the sketch after this list)
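As a rough sketch of the stacked-model idea (using scikit-learn's `StackingClassifier` for brevity; the project's actual stacking setup may differ):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the loan features/labels.
X, y = make_classification(n_samples=5000, weights=[0.85, 0.15],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Base learners mirror the model families above; a logistic regression
# meta-learner combines their out-of-fold predictions (cv=5).
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100,
                                              random_state=0)),
                ("gbm", GradientBoostingClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5)
stack.fit(X_train, y_train)
print("Test AUC:", roc_auc_score(y_test, stack.predict_proba(X_test)[:, 1]))
```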
Model | Package | Without oversampling AUC | With oversampling AUC |
---|---|---|---|
Logistic Regression | H2O | 0.6982 | 0.6982 |
Random Forest | H2O | 0.7007 | 0.7008 |
Random Forest | LightGBM | 0.6882 | 0.6893 |
Boosting | LightGBM | 0.7204 | 0.7195 |
Boosting | CatBoost | 0.7222 | 0.6814 |
As a comparison, I also use DataRobot, an automated machine learning platform for predictive modeling, to run the classification. Below is its performance:
Model | Package | Test AUC |
---|---|---|
GBM | H2O | 0.7155 |
GBM | LightGBM | 0.7133 |
GBM | LightGBM | 0.7147 |
GBM | XGBoost | 0.7113 |
A detailed analysis can be found on my blog. Feel free to read through it.
Copyright © Jifu Zhao 2018