Fraudulent transaction detection on a highly imbalanced dataset using three machine learning models: an ANN, a Random Forest classifier, and an XGBoost classifier
- The dataset is highly imbalanced, with only 0.129% of observations being fraudulent.
- There is no missing data in the dataset
- The dataset consists of 11 features which needed to be transformed
- oldbalanceOrg and newbalanceOrig are perfectly correlated because these two columns hold the sender's account balance before and after the transaction.
- oldbalanceDest and newbalanceDest are also perfectly correlated because these two columns hold the recipient's account balance before and after the transaction.
- nameOrig and nameDest are high-cardinality categorical variables
- Removing newbalanceOrig and newbalanceDest to avoid multicollinearity
- Removing nameOrig and nameDest because of irrelevance
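The correlation check and column removal described above can be sketched as follows. This is a minimal, dependency-free illustration, not the project's actual code: the toy rows and the `pearson` helper are hypothetical, and only the column names come from the list above. With pandas, `DataFrame.corr` and `DataFrame.drop` would typically do the same job.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy rows; the real dataset is far larger.
rows = [
    {"nameOrig": "C1", "oldbalanceOrg": 1000.0, "newbalanceOrig": 900.0,
     "nameDest": "M1", "oldbalanceDest": 0.0, "newbalanceDest": 100.0},
    {"nameOrig": "C2", "oldbalanceOrg": 5000.0, "newbalanceOrig": 4500.0,
     "nameDest": "C7", "oldbalanceDest": 2000.0, "newbalanceDest": 2500.0},
    {"nameOrig": "C3", "oldbalanceOrg": 200.0, "newbalanceOrig": 0.0,
     "nameDest": "C8", "oldbalanceDest": 300.0, "newbalanceDest": 500.0},
    {"nameOrig": "C4", "oldbalanceOrg": 80000.0, "newbalanceOrig": 79000.0,
     "nameDest": "M2", "oldbalanceDest": 50000.0, "newbalanceDest": 51000.0},
    {"nameOrig": "C5", "oldbalanceOrg": 15000.0, "newbalanceOrig": 13500.0,
     "nameDest": "C9", "oldbalanceDest": 7000.0, "newbalanceDest": 8500.0},
]

# Correlation between the "before" and "after" balance columns.
r_orig = pearson([r["oldbalanceOrg"] for r in rows],
                 [r["newbalanceOrig"] for r in rows])
r_dest = pearson([r["oldbalanceDest"] for r in rows],
                 [r["newbalanceDest"] for r in rows])
print(f"sender balances: r = {r_orig:.4f}")
print(f"recipient balances: r = {r_dest:.4f}")

# Drop the redundant balance columns and the account-name columns.
DROPPED = {"newbalanceOrig", "newbalanceDest", "nameOrig", "nameDest"}
reduced = [{k: v for k, v in row.items() if k not in DROPPED} for row in rows]
print(sorted(reduced[0]))
```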
After applying undersampling and then oversampling, the class weights (proportions) of the new dataset are:
- Fraudulent transaction weight: 0.3335339444434781
- Non-fraudulent transaction weight: 0.6664660555565219
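One way to arrive at a roughly 1:2 class ratio like the one above is to randomly undersample the majority class and then randomly oversample the minority class. The sketch below is an assumption about the general approach, not the project's actual pipeline: the `isFraud` label name, the `majority_size` and `minority_ratio` parameters, and the use of plain random duplication (rather than, say, SMOTE) are all hypothetical choices made for illustration.

```python
import random

def undersample_then_oversample(rows, label_key="isFraud",
                                majority_size=2000, minority_ratio=0.5, seed=42):
    """Shrink the majority class, then grow the minority class by duplication.

    Assumes at least one fraudulent row is present.
    """
    rng = random.Random(seed)
    fraud = [r for r in rows if r[label_key] == 1]
    legit = [r for r in rows if r[label_key] == 0]

    # Undersampling: keep a random subset of the majority class (no replacement).
    legit_kept = rng.sample(legit, min(majority_size, len(legit)))

    # Oversampling: duplicate random minority rows (with replacement)
    # until the minority class reaches minority_ratio * majority size.
    target = int(len(legit_kept) * minority_ratio)
    extra = max(0, target - len(fraud))
    fraud_resampled = fraud + rng.choices(fraud, k=extra)

    combined = legit_kept + fraud_resampled
    rng.shuffle(combined)
    return combined

# Hypothetical toy data with 0.13% fraud, close to the imbalance described earlier.
data = [{"amount": float(i), "isFraud": 1 if i < 13 else 0} for i in range(10_000)]
balanced = undersample_then_oversample(data)

fraud_share = sum(r["isFraud"] for r in balanced) / len(balanced)
print(f"rows: {len(balanced)}, fraud share: {fraud_share:.4f}")
```

Resampling of this kind is normally applied to the training split only; if duplicates created by oversampling end up in the test set, test scores can look better than they are.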
ANN_model (Artificial Neural Network):
- F1-score on the training set: 0.9500
- F1-score on the test set: 0.9493
Random Forest:
- F1-score on the training set: 1.0 (perfect score)
- F1-score on the test set: 0.9992
XGBoost:
- F1-score on the training set: 0.9967
- F1-score on the test set: 0.9963
The Random Forest model works best, with the highest F1-score on the test set (0.9992).
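For reference, the F1-score used above is the harmonic mean of precision and recall for the fraud class. The sketch below is a plain-Python illustration with hypothetical labels, not the evaluation code used for the reported numbers. It also shows why F1 is more informative than accuracy on imbalanced data.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Hypothetical labels: 1 = fraud, 0 = non-fraud.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
f1_small = f1_score(y_true, y_pred)
print(f"F1: {f1_small:.4f}, accuracy: {accuracy(y_true, y_pred):.4f}")

# A model that never predicts fraud on a 0.1%-fraud sample:
# accuracy is very high, yet F1 for the fraud class is zero.
y_true_imb = [1] + [0] * 999
y_pred_imb = [0] * 1000
acc_imb = accuracy(y_true_imb, y_pred_imb)
f1_imb = f1_score(y_true_imb, y_pred_imb)
print(f"F1: {f1_imb:.4f}, accuracy: {acc_imb:.4f}")
```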