
Identify Fraud from Enron Email

Adam McCarthy

Problem posed in Udacity Intro to machine learning

Getting Started

Following the biggest corporate scandal in American history, can emails and financial information be used to predict persons of interest in the subsequent criminal investigations?

To test results:

$ python tester.py

To re-run and store the classifier and the processed data set:

$ python poi_id.py

View report on GitHub pages.

Introduction

The question and data set are provided in Udacity's Introduction to Machine Learning. The task is to identify a label (person of interest) using a predictive model: to predict those in the Enron scandal who were under some form of investigation and were given the title person of interest.

Machine learning is used here to predict if a person is of interest or not based on a large number of variables. Machine learning can work through the high number of variables (high dimensionality) in ways that a human manually interpreting and assessing the data will not be able to achieve.

docs/images/Screen_Shot.png

Screenshot from the Udacity Intro to Machine Learning course.

This report will work through the stages of a machine learning investigation. It will begin by giving an overview of the dataset and some insights from exploratory data analysis.

It will then move on to feature selection, scaling and engineering.

It will discuss the approach taken to validate and tune the algorithm and which metrics are used to evaluate the quality of the model.

Following this will be a review of the different approaches taken and their results.

The objective is to find a methodology which can achieve 0.3 or greater in both precision and recall.

Overview of data

The analysis of the data includes:

Jupyter Notebook Exploring the dataset

Jupyter Notebook Exploring the email data

Report for Exploratory data analysis using R markdown

Person of interest - Label to be predicted

The predicted label is the person of interest (POI). A person of interest reflects those in the Enron case who have been indicted, settled without admitting guilt or testified in exchange for immunity.

The list of POIs has been generated by Udacity. The list was hand-drafted from various sources, so it could contain errors.

There are 35 persons of interest in total, 30 of whom worked for Enron.

Jeffrey Skilling was the CEO during the fraud period.

Kenneth Lay was chairman.

Andrew Fastow was CFO.

Email dataset

The email dataset is from here

The email dataset consists of 150 directories, each corresponding to a person, named as the last name followed by the first letter of the first name.

There are 86 people with email data.

Within poi_names.txt, a yes (y) / no (n) column indicates whether each POI has an email directory in the dataset.

Financial dataset

The financial information is sourced from the Enron insider pay PDF (enron61702insiderpay.pdf), which is from Case No. 01-16034.

There are POIs who have email information but do not have financial information.

Non-POIs all come from the financial spreadsheet.

Only POIs or non-POIs who have financial information are used; including POIs without any financial information (i.e. all NaN values for financial data) would cause issues with the machine learning process.

An alternative approach would be to use only email information, which would expand the POI and non-POI dataset, but that is not taken further here.

There are anomalous values in the dataset.

One entry is TOTAL, which is a sum of values rather than a person. It is removed during the data processing pipeline.

Enron Final Project dataset

The dataset created by Udacity is aggregated to contain email and financial information.

It is set up as key-value pairs, where each key is a person and the value is a dictionary of all that person's features.

There are 146 persons within the dataset. For each person there are 21 variables.

The dataset contains data on 18 of the POIs.

Note that when missing values occur, featureFormat() and targetFeatureSplit() will replace them with 0.
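For reference, a minimal sketch of how these course helper functions are typically called (assuming the Udacity feature_format.py module is available; exact keyword arguments may differ by course version):

from feature_format import featureFormat, targetFeatureSplit

# Convert the person -> features dictionary into a numpy array,
# replacing "NaN" entries with 0 along the way.
data = featureFormat(data_dict, features_list, sort_keys=True)

# Split off the first column (the 'poi' label) from the remaining features.
labels, features = targetFeatureSplit(data)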

Most of the variables have missing values, see the table below.

Dataset Variables

| Variable | Missing Values |
| --- | --- |
| bonus | 64 |
| deferral_payments | 107 |
| deferred_income | 97 |
| director_fees | 129 |
| email_address | 35 |
| exercised_stock_options | 44 |
| expenses | 51 |
| from_messages | 60 |
| from_poi_to_this_person | 60 |
| from_this_person_to_poi | 60 |
| loan_advances | 142 |
| long_term_incentive | 80 |
| other | 53 |
| poi | 0 |
| restricted_stock | 36 |
| restricted_stock_deferred | 128 |
| salary | 51 |
| shared_receipt_with_poi | 60 |
| to_messages | 60 |
| total_payments | 21 |
| total_stock_value | 20 |

This will be challenging for the machine learning process. A feature selection step will be useful to remove variables that are not informative; for example, director fees has 129 missing values, so it is unlikely to be well suited to a predictive model.

The TOTAL key relates to an erroneous input; it is an order of magnitude larger than other values. It is the sum of all people in the dataset and is removed during the outlier removal step (see below).

Other large values have been checked and are associated with real people. See enron61702insiderpay.pdf for evidence.

Email Variables

The variables are:

  • Email address
  • From messages
  • From poi to this person
  • From this person to poi
  • Shared receipt with poi
  • To messages

Email address is a string of the person's email address; it is not a useful variable for a predictive model, so it is not included in the machine learning.

See feature engineering for more information on email variables.

Financial variables

  • Bonus
  • Deferral payments
  • Deferred income
  • Director fees
  • Exercised stock options
  • Expenses
  • Loan advances
  • Long term incentive
  • Other
  • Restricted stock
  • Restricted stock deferred
  • Salary
  • Total payments
  • Total stock value

Bonuses are highly skewed with top bonuses being exceedingly high.

docs/images/Top_Bonuses.png

95 people have salary information. The minimum is $477, which seems a strangely low figure for a salary.

docs/images/Top_Salaries.png

Salary can be compared to bonus as these are two variables that may be correlated.

docs/images/salary_bonus.png

The plot also splits the data into two sets to view how a linear regression model would behave. The data has a large spread with a couple of key outliers. These outliers mean that a linear model is only useful for the cluster of values associated with lower salaries and smaller bonuses; the outliers drag the regression line, as the blue trend line shows.

All outliers are interesting data points. High salary, high bonus pairs are the top paid in the company.

docs/images/bivariate_finacial.png

Using frequency polygons on each of the variables and splitting them into groups of POI and non-POI gives a quick way to see if any of the variables stand out as important.

In this case, few variables stand out. Loan advances only appears to because so few people have a value for it.

Restricted stock deferred has no members in POI which will limit the use of this variable.

docs/images/financial_1.png

Using multivariate analysis to try and separate POIs from non-POIs is challenging with the financial variables. An initial assumption may be that salary, bonus and total payments are important, those committing crimes may have been receiving more money.

The plot shows a few of these cases as extreme outliers away from the main cluster, like Kenneth Lay and Jeffrey Skilling, but there are also a number of POIs within the main cluster of people.

Some of the figures here are astonishing. The high values and skewed distributions suggest a number of these variables are over-dispersed.

There are also some suspicious low values like the minimum salary.

A different feature engineering approach could be to bin the values, for example using log spacing between bins. This will not be attempted during this first pass.

Outlier removal

TOTAL is removed as this is a sum of all people.

THE TRAVEL AGENCY IN THE PARK is removed as this is not a valid person.

These are removed from the dataset at the start of the data processing pipeline.

# ro: flag controlling whether outliers are removed
if ro:
    data_dict.pop("TOTAL", None)
    data_dict.pop("THE TRAVEL AGENCY IN THE PARK", None)

It can be turned off by setting ro to False.

Feature selection

Four ensemble or tree classifiers are run to investigate feature importance. This uses the entire dataset and all variables apart from email address and the name of the person.

The prediction is for the target, POI.

docs/images/DT_feature_importance.png

docs/images/RF_feature_importance.png

docs/images/AB_feature_importance.png

docs/images/GB_feature_importance.png

Exercised stock options is the most important feature in three of the classifiers.

In AdaBoost, deferred income followed by bonus are the most important.

Decision tree does not use many of the variables.

Director fees consistently has low (almost no) importance.

Loan advances is of low importance, with only minor impact.

restricted_stock_deferred is either of no importance or of minor importance. Similarly, deferral_payments is of little importance.

This gives four variables with very little importance: director fees, loan advances, restricted stock deferred and deferral payments.

These variables can be selected out using a limit on importance; for example, an AdaBoost feature importance below 0.02 would remove the weakest four variables. Upon implementation a default cut-off of 0.01 is used, based on the four graphs and on experimenting with different values.
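As an illustration, a hedged sketch of this importance cut-off (the helper name is hypothetical; it assumes features_list starts with the 'poi' label, as in the starter code):

from sklearn.ensemble import AdaBoostClassifier

def select_by_importance(features, labels, features_list, cut_off=0.01):
    """Keep only features whose AdaBoost importance is at least cut_off."""
    clf = AdaBoostClassifier(random_state=42)
    clf.fit(features, labels)
    # features_list[0] is 'poi', so importances line up with features_list[1:]
    kept = [name for name, importance in
            zip(features_list[1:], clf.feature_importances_)
            if importance >= cut_off]
    return ["poi"] + kept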

Algorithm comparison
| Algorithm | Accuracy | Precision | Recall | F1 | F2 | Tot. pred. | True pos. | False pos. | False neg. | True neg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Logistic Regression with feature selection | 0.84 | 0.33 | 0.17 | 0.23 | 0.19 | 15000 | 364 | 712 | 1654 | 12288 |
| Logistic Regression without feature selection | 0.84 | 0.35 | 0.26 | 0.3 | 0.27 | 15000 | 520 | 950 | 1480 | 12050 |

The better scores without feature selection show this is not the best approach to feature selection. See further down in the report for information about the logistic regression implementation; this version uses the total dataset.

After initial trials using this feature selection approach, an iterative univariate ANOVA feature selection step is added to the final pipeline. This uses SelectKBest and iterates through a range of values for k during GridSearchCV to find the best number of features to use. See further in the report for more information.
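For illustration, a minimal sketch of the SelectKBest step on its own (k=8 is just an example value; in the final pipeline k is exposed to GridSearchCV as a tunable parameter):

from sklearn.feature_selection import SelectKBest, f_classif

# Score each feature with a one-way ANOVA F-test against the POI label
# and keep the k highest-scoring features.
anova = SelectKBest(score_func=f_classif, k=8)
features_selected = anova.fit_transform(features_train, labels_train)
print(anova.get_support())  # boolean mask of which features were kept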

Feature engineering

Within the email data, there are five variables.

docs/images/email_poi.png

The bubble chart highlights all five variables by combining two in ratios along x and y. These ratios seem suitable candidates for feature engineering.

One takes the ratio of emails from a POI compared to the total number of emails to that person.

The second takes the ratio of emails to a POI compared to the total number of emails that person has sent.

The idea being that this will highlight persons of interest better than the two variables separately.

When using these ratios the input variables are removed, so from_messages, to_messages, from_poi_to_this_person and from_this_person_to_poi are not used when feature engineering is enabled.
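A minimal sketch of how these two ratios could be added to the data dictionary (the function name is illustrative; the ratio names match those used below, and "NaN" placeholders map to 0):

def add_email_ratios(data_dict):
    """Add ratio_from_poi and ratio_to_poi to every person in the dictionary."""
    for person, features in data_dict.items():
        from_poi = features.get("from_poi_to_this_person")
        to_msgs = features.get("to_messages")
        to_poi = features.get("from_this_person_to_poi")
        from_msgs = features.get("from_messages")

        # Emails received from POIs as a fraction of all emails received.
        if from_poi not in ("NaN", None) and to_msgs not in ("NaN", None, 0):
            features["ratio_from_poi"] = float(from_poi) / float(to_msgs)
        else:
            features["ratio_from_poi"] = 0.0

        # Emails sent to POIs as a fraction of all emails sent.
        if to_poi not in ("NaN", None) and from_msgs not in ("NaN", None, 0):
            features["ratio_to_poi"] = float(to_poi) / float(from_msgs)
        else:
            features["ratio_to_poi"] = 0.0
    return data_dict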

To check the effect of implementing feature engineering, the final estimator is evaluated with and without the engineered features.

Three tests are run. The first is with feature engineering:

fe = True  # Feature engineering

With parameters:

"ratio_to_poi", "ratio_from_poi"

The second is without feature engineering, with the original variables included:

fe = False  # Feature engineering

with parameters:

"from_messages", "to_messages", "from_poi_to_this_person", "from_this_person_to_poi"

The third is with feature engineering but with the original features also included. Note that these will likely be removed by the ANOVA feature selection layer in the pipeline. A grid search will optimize each of these estimators, and each will have different parameter combinations.

fe = True  # Feature engineering

with parameters:

"ratio_to_poi", "ratio_from_poi", "from_messages", "to_messages", "from_poi_to_this_person", "from_this_person_to_poi"

Each of these uses the full data set for training.

Algorithm comparison
| Algorithm | Accuracy | Precision | Recall | F1 | F2 | Tot. pred. | True pos. | False pos. | False neg. | True neg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Logistic Regression with feature engineering | 0.81 | 0.315 | 0.392 | 0.349 | 0.374 | 15000 | 784 | 1705 | 1216 | 11295 |
| Logistic Regression without feature engineering | 0.806 | 0.318 | 0.396 | 0.352 | 0.377 | 15000 | 791 | 1700 | 1209 | 11300 |
| Logistic Regression with feature engineering and original variables | 0.84 | 0.425 | 0.565 | 0.485 | 0.53 | 15000 | 1129 | 1526 | 871 | 11474 |

With feature engineering, feature selection uses k=8.

Without feature engineering, feature selection also uses k=8. There is little performance difference between these two options. This may relate to the overall importance of the email variables compared to the financial variables.

With feature engineering and all variables, feature selection uses k='all'. This led to a surprise: an increase was achieved in both precision and recall, giving an F1 score of 0.46. This approach leads to the best estimators; further information can be found in the results section, under the pipeline with all features.

NOTE: with feature engineering and all variables, each run of GridSearchCV appears to lead to a different estimator.

Feature Scaling

Feature scaling is often a requirement for effective machine learning.

Exploratory data analysis has shown that even after removing the extreme outlier, TOTAL, a number of the variables have over-dispersed data.

A robust scaler can be used for datasets with many outliers. This will use more robust estimates for central tendency and dispersion before scaling the dataset.
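A minimal sketch of applying sklearn's RobustScaler, which centres on the median and scales by the interquartile range so that extreme values have less influence:

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()  # median / IQR based scaling, resistant to outliers
features_scaled = scaler.fit_transform(features)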

Cross-validation and optimization

To make a classifier that works well on new or unseen data, cross-validation helps keep the algorithm from overfitting to the training data.

Firstly the data is separated into a train and test set using train_test_split, with 30% held back for testing. This gives 29 people for testing and 65 people for training.
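A sketch of that initial hold-out split (assuming features and labels have already been produced from the data dictionary; the random_state is illustrative):

from sklearn.model_selection import train_test_split

# Hold back 30% of the people as an untouched test set.
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.3, random_state=42)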

The training data is then used further.

By splitting up the available data (e.g. only the training data) into separate groups, these can be used to cross-validate the performance of a classifier.

In sklearn one useful approach is GridSearchCV, which combines cross-validation and parameter optimization.

Each classifier will have a range of parameters that are not learnt when the classifier is fitted to the data. Each of these is passed as arguments. These can have a large impact on the performance of the classifier and fundamentally change how it approaches making predictions using this dataset.

Parameter optimization can be undertaken manually, running different combinations of parameters to see which performs best but GridSearchCV will compare combinations of the classifier parameters and see which performs the best during cross validation.

The cross validation method can be selected, for this use case stratified K fold is used to maintain an even proportion of labels across the folds of data.

Note that with 3 folds the 65 training persons become around 22 per fold, and with 2 folds around 32. This means this problem is always working with a very small dataset, and having a large number of variables will not be a good idea with so few records.
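A minimal sketch of the stratified fold object that GridSearchCV can be handed via its cv argument (fold count follows the note above):

from sklearn.model_selection import StratifiedKFold

# Each fold keeps roughly the same POI / non-POI proportion, which matters
# when so few of the labels are positive.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
for train_idx, test_idx in cv.split(features_train, labels_train):
    print(len(train_idx), len(test_idx))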

Validation is undertaken using poi_id.py, which outputs the accuracy.

After a good estimator is found, it is compared against a test set using tester.py, which gives precision, recall and F1 and provides a better review of the performance.

A limitation of this approach is that, by using performance on the test set, information can leak into the evaluation and lead to overfitting on new data.

Evaluation metrics

This problem is a skewed binary classification; therefore accuracy is not the best metric to judge the reliability of the evaluation.

There is an asymmetry in this problem: we can optimize towards treating more people as innocent, or towards flagging more people as guilty, or aim for a balance between the two.

Recall: True Positive / (True Positive + False Negative). Out of all the items that are truly positive, how many were correctly classified as positive. Or simply, how many positive items were 'recalled' from the dataset.

Precision: True Positive / (True Positive + False Positive). Out of all the items labelled as positive, how many truly belong to the positive class.
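As a worked example using the gradient boosting counts reported later in this report (373 true positives, 448 false positives, 1627 false negatives): precision = 373 / (373 + 448) ≈ 0.45 and recall = 373 / (373 + 1627) ≈ 0.19.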

A high-precision, low-recall model would give greater confidence that flagged POIs are truly POIs but may miss some POIs. This would be suitable if avoiding flagging innocent people is the most important concern.

A high-recall, low-precision model would find nearly all POIs but also flag innocent people as involved. This would be useful when screening a large number of people to quickly decide who to focus on for further investigation.

A high F1 score with balanced precision and recall is the best of both settings.

The preference here is to achieve a respectable F1 score and precision but focus on recall. We can live with innocent people being flagged, as this model gives an overview of all those who may be POIs; further investigation can then check these predictions. This would work well as a screening tool to quickly evaluate a range of people.

Testing classifiers

Default setting

Using the default setting of one label and one feature, we can take an initial review of the prediction.

features_list = ['poi', 'salary']

The output of the initial algorithm (Gaussian Naive Bayes) is compared to three other algorithms.

AdaBoost performs considerably slower.

KMeans gives warning about predicted labels not equal to 0 or 1.

Naive Bayes gives a very high recall value (0.798).

Gradient Boosting Classifier

After completing a version of the machine learning pipeline, including outlier removal, feature selection, feature engineering and feature scaling, a gradient boosting classifier is used with GridSearchCV. This means that parameters can be optimized across cross-validation folds (in this run 2 folds using stratified k-fold). The score optimized is weighted F1.

This does not remove any zeros and uses all features as input apart from email address and those duplicated by the ratio feature engineering.

This evaluation uses a broad parameter grid.

parameters = [{
               "loss": ["deviance", "exponential"],
               "n_estimators": [120, 300, 500, 800, 1200],
               "max_depth": [3, 5, 7, 9, 12, 15, 17, 25],
               "min_samples_split": [2, 5, 10, 15, 100],
               "min_samples_leaf": [2, 5, 10],
               "subsample": [0.6, 0.7, 0.8, 0.9, 1],
               "max_features": ["sqrt", "log2", None]
               }]
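A hedged sketch of how this grid might be passed to GridSearchCV (the 'deviance' and 'exponential' loss names belong to the older scikit-learn releases used here; the fold setup follows the description above):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      parameters, scoring="f1_weighted", cv=cv, n_jobs=-1)
search.fit(features_train, labels_train)
print(search.best_score_)
print(search.best_params_)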

The grid gives 18000 combinations to try in an exhaustive grid search. This is useful to get an overview of which parameter combinations perform well, but it comes at a computational cost: it takes a number of hours to fit the classifier. This resulted in:

Best classifier score: 0.894907227728 :

{'subsample': 0.8, 'n_estimators': 120, 'max_depth': 25, 'loss':'deviance', 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt'}

When applying this method using the testing function the results are:

Algorithm comparison
| Algorithm | Accuracy | Precision | Recall | F1 | F2 | Tot. pred. | True pos. | False pos. | False neg. | True neg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gradient Boosting | 0.862 | 0.454 | 0.186 | 0.264 | 0.211 | 15000 | 373 | 448 | 1627 | 12552 |

This method has improved on the original methods but still does not achieve 0.3 for precision and recall.

The 0.45 precision compared to the 0.19 recall suggests that when the model does flag someone as a POI it is right a little under half the time, but it misses the large majority of actual POIs.

Further feature optimization

Removing features with a high number of NaNs means dropping restricted_stock_deferred, loan_advances, director_fees, deferral_payments, and deferred_income. These variables have over 100 missing values (apart from deferred_income, with 97). The current features passing feature selection are shown here:

['poi', 'deferred_income', 'exercised_stock_options', 'expenses', 'long_term_incentive', 'other', 'restricted_stock', 'salary', 'shared_receipt_with_poi', 'total_payments', 'total_stock_value', 'ratio_to_poi', 'ratio_from_poi']

Of these five high-NaN variables, only deferred_income is currently passing through the feature selection process. Note that bonus has also been dropped. It is suspected that bonus is dropped because it is correlated with a number of other variables, as seen in the pair plot during EDA.

Increasing the cutoff to 0.03 drops total_stock_value and shared_receipt_with_poi. This does not improve the results using the current classifier.

The current classifier is likely overfitting the dataset and is giving more precision than recall.

Logistic Regression

Ensemble methods like gradient boosting can be prone to overfitting so trying a different model type may lead to different results.

Instead of the default, this uses a cut-off of 0.03:

features_list = feature_selection.selection(
                                             data_dict,
                                             features_list,
                                             clf_fs,
                                             cut_off=0.03
                                             )
Algorithm comparison
| Algorithm | Accuracy | Precision | Recall | F1 | F2 | Tot. pred. | True pos. | False pos. | False neg. | True neg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Logistic Regression | 0.85 | 0.368 | 0.177 | 0.239 | 0.197 | 15000 | 354 | 609 | 1646 | 12391 |

Similar problems occur as with the previous classifier, with higher precision than recall.

Further approaches like PCA and more advanced feature selection can be undertaken to see if this improves performance.

Pipeline - Anova Feature Selection > PCA > Logistic Regression

To expand the classifier, sklearn's Pipeline module can be used to increase the number of steps within the classifier. The main purpose of this is to allow GridSearchCV to explore different combinations automatically rather than performing manual adjustments.

Feature selection will select a fixed number of features based on a classification ANOVA (analysis of variance) statistical test. The grid search can iterate over different numbers of features (k) to explore how many features to keep.

Principal component analysis can reduce the dimensionality of the dataset and reduce the number of features used for machine learning further. This is beneficial in this case as there are few training data points and a high variance to the results. The standard PCA method is applied, with the number of components iterated through the grid search.
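A minimal sketch of how this pipeline could be assembled, using the step names (anova, r_dim, clf) and the parameters that appear in the result below; the candidate values in the grid are illustrative:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("anova", SelectKBest(score_func=f_classif)),  # univariate ANOVA selection
    ("r_dim", PCA()),                              # dimensionality reduction
    ("clf", LogisticRegression()),
])

param_grid = {
    "anova__k": [6, 8, 10, "all"],
    "r_dim__n_components": [2, 3, 4],
    "r_dim__whiten": [True, False],
    "clf__C": [0.1, 1, 10, 100],
    "clf__class_weight": [None, "balanced"],
}

search = GridSearchCV(pipe, param_grid, scoring="f1_weighted")
search.fit(features_train, labels_train)
print(search.best_score_, search.best_params_)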

The plan is to get better performance by reducing the number of features used in a machine learning algorithm like logistic regression. The results are:

Best classifier score: 0.847349475383 : {'r_dim__n_components': 2, 'r_dim__whiten': True, 'clf__C': 0.1, 'anova__k': 8, 'clf__class_weight': 'balanced'}

Algorithm comparison
| Algorithm | Accuracy | Precision | Recall | F1 | F2 | Tot. pred. | True pos. | False pos. | False neg. | True neg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Logistic Regression | 0.80 | 0.315 | 0.392 | 0.349 | 0.374 | 15000 | 784 | 1705 | 1216 | 11295 |
| Logistic Regression full data set | 0.81 | 0.324 | 0.391 | 0.354 | 0.376 | 15000 | 782 | 1630 | 1218 | 11370 |

This just achieves the goal of being above 0.3 for both precision and recall. Note that the method uses just 2 components derived from only 8 selected features. This suggests that a pipeline is a good approach for this problem.

The F1 score here is 0.35, with a higher recall than precision. This suggests that more POIs are being found than with the previous classifiers, but a significant proportion of POIs are still not identified.

Following creation of the estimators, poi_id.py is changed to use all of the data for training, while tester.py is used to compare the results. These give similar results, as seen in the table above.

Pipeline - With all features

Following testing of the impact of feature engineering, all features were tested, including the engineered features and the original features used to create them.

This led to a surprising result: higher performance was achieved using the pipeline, and it would often select k='all'. So it is using all features after the initial static feature selection, even though a number of them are linearly correlated or sparsely populated.

Furthermore, the pipeline is now unstable during GridSearchCV and will give slightly different estimators, which lead to different results when running tester.py to compare performance.

The conclusion is still valid, but scores now increase, especially recall, which can reach above 0.6, with precision above 0.4. One example is below.

Pipeline(steps=[
    ('anova', SelectKBest(k='all', score_func=<function f_classif at 0x1184416a8>)),
    ('r_dim', PCA(copy=True, iterated_power='auto', n_components=4, random_state=None,
                  svd_solver='auto', tol=0.0, whiten=True)),
    ('clf', LogisticRegression(C=100, class_weight='balanced', dual=False,
                               fit_intercept=True, intercept_scaling=1, max_iter=100,
                               multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
                               solver='liblinear', tol=0.0001, verbose=0, warm_start=False))])

Accuracy: 0.82633 Precision: 0.40632 Recall: 0.65600 F1: 0.50182 F2: 0.58420 Total predictions: 15000 True positives: 1312 False positives: 1917 False negatives: 688 True negatives: 11083

Conclusions

The logistic regression combined with PCA and ANOVA feature selection offers an estimator which gives above 0.3 for both Precision and Recall. This achieves the objective criteria. This is a balanced model.

Other methods have been attempted. One which is documented is gradient boosting, which overfits the data, giving a high precision (0.45) but poor recall, meaning that it identifies too few of the persons of interest.

Further work could be undertaken to improve this. Further optimization could be attempted using Logistic Regression and its parameters.

New features could be generated from the email corpus, highlighting a key word set (for example, related to specific criminal activities like electric grid manipulation) that relates to POIs. This would expand the input variables and perhaps add information that improves performance.

Overall this is a challenging case due to the limited size of the dataset and missing values spread unevenly across different people.

Code issues and changes

Pickle

Code in both poi_id.py and tester.py was changed to fit Python 3's pickle, otherwise a TypeError is returned. Files now have to be opened with "rb" (read binary) and "wb" (write binary) instead of "r" and "w" respectively.

From:

with open(f, "r") as data_file:
    data_dict = pickle.load(data_file)

To:

with open(f, "rb") as data_file:
    data_dict = pickle.load(data_file)

Deprecation of CV

The code returns this warning.

DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.

This has not been corrected as the starter code iterates over the cross-validation objects and requires this.
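For reference, a hedged sketch of what the migration would look like, assuming the starter code uses StratifiedShuffleSplit as its CV iterator; illustrative only, since the deprecated module is kept here:

# Old interface (sklearn.cross_validation, removed in 0.20): the splitter is
# constructed with the labels and iterated over directly.
# from sklearn.cross_validation import StratifiedShuffleSplit
# cv = StratifiedShuffleSplit(labels, 1000, random_state=42)
# for train_idx, test_idx in cv: ...

# New interface (sklearn.model_selection): the splitter is label-free at
# construction time and indices come from .split(features, labels).
from sklearn.model_selection import StratifiedShuffleSplit

cv = StratifiedShuffleSplit(n_splits=1000, random_state=42)
for train_idx, test_idx in cv.split(features, labels):
    pass  # use the indices exactly as before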

Resources used

I hereby confirm that this submission is my work. I have cited above the origins of any parts of the submission that were taken from Websites, books, forums, blog posts, GitHub repositories, etc.

Sklearn API

Sklearn feature scaling

Pandas and sklearn scaling

Random forest parameter range suggestion

Sklearn pipeline

Sklearn pipeline ANOVA feature selection

Sklearn pipeline chaining PCA and logistic regression

Univariate feature selection Sklearn
