Adam McCarthy
Problem posed in Udacity Intro to machine learning
Following the biggest corporate scandal in American history, can emails and financial information be used to predict persons of interest in the subsequent criminal investigations?
To test results:
$ python tester.py
To re-run and store the classifier and the processed data set:
$ python poi_id.py
View report on GitHub pages.
The question and data set are provided in Udacity's Introduction to Machine Learning. The task is to identify a label (person of interest) using a predictive model: to predict those in the Enron scandal who came under some form of investigation and were given the title person of interest.
Machine learning is used here to predict whether a person is of interest based on a large number of variables. It can work through this high dimensionality in ways that a human manually interpreting and assessing the data cannot.
Screenshot from the Udacity Intro to Machine Learning course.
This report will work through the stages of a machine learning investigation. It will begin by giving an overview of the dataset and some insights from exploratory data analysis.
It will then move on to feature selection, scaling and engineering.
It will discuss the approach taken to validate and tune the algorithm and which metrics are used to evaluate the quality of the model.
Following this will be a review of the different approaches taken and their results.
The objective is to find a methodology which can achieve 0.3 or greater in both precision and recall.
The analysis of the data includes:
Jupyter Notebook Exploring the dataset
Jupyter Notebook Exploring the email data
Report for Exploratory data analysis using R markdown
The predicted label is the person of interest (POI). A person of interest reflects those in the Enron case who have been indicted, settled without admitting guilt or testified in exchange for immunity.
The list of POIs has been generated by Udacity. The list was hand drafted from various sources so could contain errors.
There are 35 persons of interest in total, 30 of whom worked for Enron.
Jeffrey Skilling was the CEO during the fraud period.
Kenneth Lay was chairman.
Andrew Fastow was CFO.
The email dataset is from here
The email dataset consists of 150 directories, each reflecting a person, named as the last name followed by the first letter of the first name.
There are 86 people with email data.
Within poi_names.txt, a yes (y) / no (n) column indicates whether each POI has an email directory in the dataset.
The financial information is sourced from the Enron insiderpay pdf which is from Case No. 01-16034.
There are POIs who have email information but do not have financial information.
Non-POIs all come from the financial spreadsheet.
Only POIs or non-POIs who have financial information are used, as including POIs without any financial information (i.e. all their financial values are NaN) would cause issues with the machine learning process.
An alternative approach would be to only use email information to be able to expand the POI and non-POI dataset but that will not be taken further here.
There are anomalous values in the dataset.
One entry is TOTAL, which is a sum across all people rather than a real person. It is removed during the data processing pipeline.
The dataset created by Udacity is aggregated to contain email and financial information.
It is set up as a key value pair where each key is a person with all the features stored in a dictionary as that person value.
There are 146 persons within the dataset. For each person there are 21 variables.
The dataset contains data on 18 of the POIs.
Note that when missing values occur, featureFormat() and targetFeatureSplit() replace them with 0.
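For reference, a minimal sketch of how these helper functions from the course's feature_format module are typically called (the feature subset here is only an example):

```python
from feature_format import featureFormat, targetFeatureSplit

# Convert the dict-of-dicts into a numpy array; remove_NaN=True replaces
# "NaN" strings with 0.0 for every numeric feature.
example_features = ["poi", "salary", "bonus"]  # illustrative subset only
data = featureFormat(data_dict, example_features, remove_NaN=True, sort_keys=True)

# Split off the first column (the poi label) from the remaining features.
labels, features = targetFeatureSplit(data)
```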
Most of the variables have some missing values; see the table below.
Variable | Missing Values |
---|---|
bonus | 64 |
deferral_payments | 107 |
deferred_income | 97 |
director_fees | 129 |
email_address | 35 |
exercised_stock_options | 44 |
expenses | 51 |
from_messages | 60 |
from_poi_to_this_person | 60 |
from_this_person_to_poi | 60 |
loan_advances | 142 |
long_term_incentive | 80 |
other | 53 |
poi | 0 |
restricted_stock | 36 |
restricted_stock_deferred | 128 |
salary | 51 |
shared_receipt_with_poi | 60 |
to_messages | 60 |
total_payments | 21 |
total_stock_value | 20 |
This will be challenging for the machine learning process. A feature selection step will be useful to remove variables that are not informative; for example, director fees has 129 missing values, so it is unlikely to be well suited to a predictive model.
The TOTAL key is an erroneous input: it is an order of magnitude larger than the other values because it is the sum over all people in the dataset. It is removed in the outlier-removal step shown below.
Other large values have been checked and are associated with real people. See enron61702insiderpay.pdf for evidence.
The variables are:
- Email address
- From messages
- From poi to this person
- From this person to poi
- Shared receipt with poi
- To messages
Email address is a string of the person's email address; it is not a useful variable for a predictive model, so it is not included in the machine learning.
See feature engineering for more information on email variables.
- Bonus
- Deferral payments
- Deferred income
- Director fees
- Exercised stock options
- Expenses
- Loan advances
- Long term incentive
- Other
- Restricted stock
- Restricted stock deferred
- Salary
- Total payments
- Total stock value
Bonuses are highly skewed with top bonuses being exceedingly high.
95 people have salary information. The minimum is $477, which seems a strangely low figure for a salary.
Salary can be compared to bonus as these are two variables that may be correlated.
The plot also splits the data into two sets to view how a linear regression model would behave. The data has a large spread with a couple of key outliers. These outliers mean that a linear model is only useful for the cluster of values associated with lower salary and smaller bonuses. The outliers drag the regression model, for example, see the blue trend line.
All outliers are interesting data points. High salary, high bonus pairs are the top paid in the company.
Using frequency polygons on each of the variables and splitting them into groups of POI and non-POI gives a quick way to see if any of the variables stand out as important.
In this case, few variables stand out. Loan advances appears to only because so few people have a value for it.
Restricted stock deferred has no members in the POI group, which will limit the use of this variable.
Using multivariate analysis to try and separate POIs from non-POIs is challenging with the financial variables. An initial assumption may be that salary, bonus and total payments are important, those committing crimes may have been receiving more money.
The plot shows a few such cases as extreme outliers away from the main cluster, like Kenneth Lay and Jeffrey Skilling, but there are also a number of POIs within the main cluster of people.
Some of the figures here are astonishing. The high values and skewed distributions suggest a number of these variables are over-dispersed.
There are also some suspiciously low values, like the minimum salary.
A different feature engineering approach could be to bin the values, for example using log spacing between bins. This will not be attempted during this first pass.
TOTAL is removed as this is a sum of all people.
THE TRAVEL AGENCY IN THE PARK is removed as this is not a valid person.
These are removed from the dataset at the start of the data processing pipeline.
ro = True  # flag controlling outlier removal

if ro:
    data_dict.pop("TOTAL", None)
    data_dict.pop("THE TRAVEL AGENCY IN THE PARK", None)
It can be turned off by setting ro to False.
Four ensemble or tree classifiers are run to investigate feature importance, using the entire dataset and all variables apart from the email address and the name of the person.
The prediction is for the target, POI.
Exercised stock options is the most important feature in three of the classifiers.
In AdaBoost, deferred income followed by bonus is the most important.
The decision tree does not use many of the variables.
Director fees consistently have low (almost no) importance.
Loan advances are of low importance, with only a minor impact.
restricted_stock_deferred is either of no importance or of minor importance. Similarly, deferral_payments is of little importance.
This gives four variables with very little importance: director fees, loan advances, restricted stock deferred and deferral payments.
A way to select these variables is to use a limit on importance. For example, an AdaBoost feature importance < 0.02 would remove the weakest four variables. Upon implementation, a default cut-off of 0.01 is used, based on the four graphs and on experimenting with different values.
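As a rough illustration of this cut-off idea (not the project's exact feature_selection.selection implementation), assuming features, labels and features_list are already prepared:

```python
from sklearn.ensemble import AdaBoostClassifier

cut_off = 0.01  # default importance cut-off used in this project

# Fit a tree-based classifier and keep only the features whose importance
# is at or above the cut-off. features_list[0] is the 'poi' label.
clf_fs = AdaBoostClassifier(random_state=42)
clf_fs.fit(features, labels)

selected = [name for name, imp in zip(features_list[1:], clf_fs.feature_importances_)
            if imp >= cut_off]
features_list = ["poi"] + selected
```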
Algorithm | Accuracy | Precision | Recall | F1 | F2 | Tot. pred. | True pos. | False pos. | False neg. | True neg. |
---|---|---|---|---|---|---|---|---|---|---|
Logistic Regression with feature selection | 0.84 | 0.33 | 0.17 | 0.23 | 0.19 | 15000 | 364 | 712 | 1654 | 12288 |
Logistic Regression without feature selection | 0.84 | 0.35 | 0.26 | 0.3 | 0.27 | 15000 | 520 | 950 | 1480 | 12050 |
The better scores without feature selection show that this is not the best approach to feature selection. See further down in the report for information about the logistic regression implementation; this version uses the total dataset.
After initial trials using this feature selection approach, an iterative univariate ANOVA feature selection step is added to the final pipeline. This uses SelectKBest and iterates through a range of k values during GridSearchCV to find the best number of features to use. See further in the report for more information.
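A minimal sketch of the univariate ANOVA selection step (the k value here is illustrative; in the final pipeline k is tuned by GridSearchCV as the anova__k parameter):

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Score each feature with the ANOVA F-test and keep the k highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=8)
features_selected = selector.fit_transform(features, labels)
```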
Within the email data, there are five variables.
The bubble chart highlights all five variables by combining two in ratios along x and y. These ratios seem suitable candidates for feature engineering.
One is the ratio of emails received from POIs to the total number of emails sent to that person (ratio_from_poi).
The second is the ratio of emails sent to POIs to the total number of emails that person has sent (ratio_to_poi).
The idea is that these ratios will highlight persons of interest better than the raw counts do separately.
When these ratios are used, the input variables are removed, so from_messages, to_messages, from_poi_to_this_person and from_this_person_to_poi are not used when feature engineering is enabled.
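A sketch of this engineering step, assuming the aggregated data_dict described earlier (the helper name and the NaN handling are illustrative, not necessarily the project's exact code):

```python
def add_email_ratios(data_dict):
    """Add ratio_from_poi and ratio_to_poi to every person in the dataset."""
    for person, record in data_dict.items():
        to_msgs = record.get("to_messages")
        from_msgs = record.get("from_messages")
        from_poi = record.get("from_poi_to_this_person")
        to_poi = record.get("from_this_person_to_poi")

        # Share of received mail that came from POIs.
        if to_msgs not in (0, "NaN", None) and from_poi not in ("NaN", None):
            record["ratio_from_poi"] = float(from_poi) / float(to_msgs)
        else:
            record["ratio_from_poi"] = 0.0

        # Share of sent mail that went to POIs.
        if from_msgs not in (0, "NaN", None) and to_poi not in ("NaN", None):
            record["ratio_to_poi"] = float(to_poi) / float(from_msgs)
        else:
            record["ratio_to_poi"] = 0.0
    return data_dict
```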
To check the result of implementing feature engineering, the final estimator is evaluated with and without the engineered features.
Three tests are run. The first is with feature engineering:
fe = True # Feature engineering
With parameters:
"ratio_to_poi", "ratio_from_poi"
The second is without feature engineering and with the original variables included:
fe = False # Feature engineering
with parameters:
"from_messages", "to_messages", "from_poi_to_this_person", "from_this_person_to_poi"
The third is with feature engineering but with the original features also included. Note these will likely be removed by the ANOVA feature selection layer in the pipeline. A grid search optimizes each of these estimators, so each will have different parameter combinations.
fe = True # Feature engineering
with parameters:
"ratio_to_poi", "ratio_from_poi", "from_messages", "to_messages", "from_poi_to_this_person", "from_this_person_to_poi"
Each of these uses the full data set for training.
Algorithm | Accuracy | Precision | Recall | F1 | F2 | Tot. pred. | True pos. | False pos. | False neg. | True neg. |
---|---|---|---|---|---|---|---|---|---|---|
Logistic Regression with feature engineering | 0.81 | 0.315 | 0.392 | 0.349 | 0.374 | 15000 | 784 | 1705 | 1216 | 11295 |
Logistic Regression without feature engineering | 0.806 | 0.318 | 0.396 | 0.352 | 0.377 | 15000 | 791 | 1700 | 1209 | 11300 |
Logistic Regression with feature engineering and original variables | 0.84 | 0.425 | 0.565 | 0.485 | 0.53 | 15000 | 1129 | 1526 | 871 | 11474 |
With feature engineering, k=8 is chosen for feature selection.
Without feature engineering, k=8 is also chosen. There is little performance difference between these two options, which may relate to the overall importance of the email variables compared to the financial variables.
With feature engineering and all original variables, k='all' is chosen. This led to a surprise: an increase was achieved in both precision and recall, giving an F1 score of 0.46. This approach leads to the best estimators; further information can be found in the results section under the pipeline with all features.
NOTE: with feature engineering and all variables, each run of GridSearchCV appears to lead to a different estimator.
Feature scaling is often a requirement for effective machine learning.
Exploratory data analysis has shown that, even after removing the extreme outlier TOTAL, a number of the variables contain over-dispersed data.
A robust scaler can be used for datasets with many outliers. This will use more robust estimates for central tendency and dispersion before scaling the dataset.
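A minimal sketch of applying the robust scaler with sklearn (added for illustration; the variable names are placeholders):

```python
from sklearn.preprocessing import RobustScaler

# Centre on the median and scale by the interquartile range so the
# extreme salaries and bonuses have less influence on the scaling.
scaler = RobustScaler()
features_scaled = scaler.fit_transform(features)
```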
To make a classifier that works well on new or unseen data, cross-validation helps stop the algorithm overfitting to the training data.
Firstly the data is separated into a train and test set using train test split, with 30% held back for testing. This gives 29 people for testing and 65 people for training.
The training data is then used further.
By splitting up the available data (e.g. only the training data) into separate groups, these can be used to cross-validate the performance of a classifier.
In sklearn one useful approach is GridSearchCV, which combines cross-validation and parameter optimization.
Each classifier will have a range of parameters that are not learnt when the classifier is fitted to the data. Each of these is passed as arguments. These can have a large impact on the performance of the classifier and fundamentally change how it approaches making predictions using this dataset.
Parameter optimization can be undertaken manually, running different combinations of parameters to see which performs best, but GridSearchCV compares combinations of the classifier parameters automatically and reports which performs best during cross-validation.
The cross validation method can be selected, for this use case stratified K fold is used to maintain an even proportion of labels across the folds of data.
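Putting those pieces together, a hedged sketch of the validation set-up might look like the following (the classifier, the C values and random_state are placeholders rather than the tuned estimator):

```python
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression

# Hold back 30% of the people as a final test set.
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.3, random_state=42)

# Stratified folds keep the POI / non-POI proportion similar in each fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
grid = GridSearchCV(LogisticRegression(), {"C": [0.1, 1, 10]},
                    scoring="f1_weighted", cv=cv)
grid.fit(features_train, labels_train)
```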
Note that with 3 folds the 65 training persons are split into folds of around 22, and with 2 folds around 32. This means the problem is always being worked on with a very small dataset, and having a large number of variables is not a good idea with so few samples.
The validation is undertaken using poi_id.py, which outputs the accuracy.
After a good estimator is found, it is compared against a test set using tester.py, which reports precision, recall and F1 and gives a better review of the performance.
A limitation of this approach is that the performance on the test set can leak into the evaluation and lead to overfitting on new data.
This problem is a skewed binary classification, therefore accuracy is not the best metric to judge the reliability of the evaluation.
There is an asymmetry in this problem: we can optimize for labelling more people as innocent or more people as guilty, or aim for a balance between the two.
Recall: True Positive / (True Positive + False Negative). Out of all the items that are truly positive, how many were correctly classified as positive. Or simply, how many positive items were 'recalled' from the dataset.
Precision: True Positive / (True Positive + False Positive). Out of all the items labelled as positive, how many truly belong to the positive class.
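As a worked example using the Gradient Boosting run reported later (373 true positives, 448 false positives, 1627 false negatives): precision = 373 / (373 + 448) ≈ 0.45 and recall = 373 / (373 + 1627) ≈ 0.19.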
A high precision low recall model would give greater confidence that flagged POIs are truly POI but may miss out on POIs. This would be suitable if avoiding flagging innocent people is the most important issue.
A high recall, low precision model would find nearly all POIs but would also flag innocent people as involved. This would be useful when screening a large number of people to quickly decide who to focus on for further investigation.
A high F1 score with balanced precision and recall is the best of both settings.
The preference here is to achieve a respectable F1 score and precision but to focus on recall. We can live with innocent people being flagged, as this model gives an overview of all those who may be POIs; further investigation can then check these predictions. This works well as a screening tool to quickly evaluate a range of people.
Using the default setting of one label and one feature, we can take an initial look at the predictions.
features_list = ['poi', 'salary']
The output of the initial algorithm (Gaussian Naive Bayes) is compared to three other algorithms.
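A minimal baseline sketch along these lines, assuming the train/test split described in the validation section (the exact starter-code wiring may differ):

```python
from sklearn.naive_bayes import GaussianNB

# Single-feature baseline: predict POI from salary alone.
clf = GaussianNB()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
```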
AdaBoost performs considerably slower.
KMeans gives warning about predicted labels not equal to 0 or 1.
Naive Bayes gives a very high recall value (0.798).
After completing a version of the machine learning pipeline including outlier removal, feature selection, feature engineering and feature scaling, a gradient boosting classifier is used with GridSearchCV. This means that parameters can be optimized across cross-validation folds (in this run 2 folds using stratified k-fold). The score optimized is weighted F1.
This run does not remove any zeros and uses all features as input apart from the email address and those duplicated by the ratio feature engineering.
This evaluation uses a broad parameter grid.
parameters = [{
"loss": ["deviance", "exponential"],
"n_estimators": [120, 300, 500, 800, 1200],
"max_depth": [3, 5, 7, 9, 12, 15, 17, 25],
"min_samples_split": [2, 5, 10, 15, 100],
"min_samples_leaf": [2, 5, 10],
"subsample": [0.6, 0.7, 0.8, 0.9, 1],
"max_features": ["sqrt", "log2", None]
}]
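A sketch of how this grid might be passed to GridSearchCV (the random_state and n_jobs settings are assumptions; the project's exact call may differ):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

grid = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    parameters,                      # the parameter grid defined above
    scoring="f1_weighted",           # optimize the weighted F1 score
    cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=42),
    n_jobs=-1,
)
grid.fit(features, labels)
print(grid.best_score_, grid.best_params_)
```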
This gives 18000 combinations to try in an exhaustive grid search. This is useful to get an overview of which parameter combinations perform well, however it comes at a computational cost. It takes a number of hours to fit the classifier. This resulted in:
Best classifier score: 0.894907227728 :
{'subsample': 0.8, 'n_estimators': 120, 'max_depth': 25, 'loss':'deviance', 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt'}
When applying this method using the testing function the results are:
Algorithm | Accuracy | Precision | Recall | F1 | F2 | Tot. pred. | True pos. | False pos. | False neg. | True neg. |
---|---|---|---|---|---|---|---|---|---|---|
Gradient Boosting | 0.862 | 0.454 | 0.186 | 0.264 | 0.211 | 15000 | 373 | 448 | 1627 | 12552 |
This method improves on the original methods but still does not achieve 0.3 for both precision and recall.
The 0.45 precision compared with the 0.19 recall suggests that nearly half of the people it flags are genuine POIs, but it misses most of the POIs.
Removing features with a high number of NaNs means dropping restricted_stock_deferred, loan_advances, director_fees, deferral_payments and deferred_income. These variables have over 100 missing values (apart from deferred_income, with 97). The current features passing feature selection are shown here:
['poi', 'deferred_income', 'exercised_stock_options', 'expenses', 'long_term_incentive', 'other', 'restricted_stock', 'salary', 'shared_receipt_with_poi', 'total_payments', 'total_stock_value', 'ratio_to_poi', 'ratio_from_poi']
Of these, only deferred_income currently passes through the feature selection process. Note that bonus has also been dropped. It is suspected that bonus is dropped because it is correlated with a number of other variables, as seen in the pair plot during EDA.
Increasing the cutoff to 0.03 drops total_stock_value and shared_receipt_with_poi. This does not improve the results using the current classifier.
The current classifier is likely overfitting the dataset and is giving more precision than recall.
Ensemble methods like gradient boosting can be prone to overfitting so trying a different model type may lead to different results.
Instead of the default, this uses a cut-off of 0.03:
features_list = feature_selection.selection(
data_dict,
features_list,
clf_fs,
cut_off=0.03
)
Algorithm | Accuracy | Precision | Recall | F1 | F2 | Tot. pred. | True pos. | False pos. | False neg. | True neg. |
---|---|---|---|---|---|---|---|---|---|---|
Logistic Regression | 0.85 | 0.368 | 0.177 | 0.239 | 0.197 | 15000 | 354 | 609 | 1646 | 12391 |
Similar problems occur as with the previous classifier, with higher precision than recall.
Further approaches like PCA and more advanced feature selection can be undertaken to see if this improves performance.
To expand the classifier, sklearn's Pipeline module can be used to increase the number of steps within the classifier. The main purpose of this is to allow GridSearchCV to explore different combinations automatically rather than relying on manual adjustments.
Feature selection will select a fixed number of features based on an ANOVA (analysis of variance) classification statistical test. The grid search can iterate over different values of k to explore which number of retained features works best.
Principal component analysis can reduce the dimensionality of the dataset and reduce the number of features used for machine learning further. This is beneficial in this case as there are few training data points and a high variance in the results. The standard PCA method is applied, with the number of components iterated through the grid search.
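A hedged sketch of this pipeline; the step names match the best-classifier output below, while the exact parameter ranges here are assumptions:

```python
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("anova", SelectKBest(score_func=f_classif)),  # univariate ANOVA selection
    ("r_dim", PCA()),                              # dimensionality reduction
    ("clf", LogisticRegression()),                 # final classifier
])

param_grid = {
    "anova__k": [6, 8, 10, "all"],
    "r_dim__n_components": [2, 3, 4],
    "r_dim__whiten": [True, False],
    "clf__C": [0.1, 1, 10, 100],
    "clf__class_weight": [None, "balanced"],
}

grid = GridSearchCV(pipe, param_grid, scoring="f1_weighted", cv=2)
grid.fit(features, labels)
```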
The plan is to get better performance by reducing the number of features used in a machine learning algorithm like logistic regression. The results are:
Best classifier score: 0.847349475383 : {'r_dim__n_components': 2, 'r_dim__whiten': True, 'clf__C': 0.1, 'anova__k': 8, 'clf__class_weight': 'balanced'}
Algorithm | Accuracy | Precision | Recall | F1 | F2 | Tot. pred. | True pos. | False pos. | False neg. | True neg. |
---|---|---|---|---|---|---|---|---|---|---|
Logistic Regression | 0.80 | 0.315 | 0.392 | 0.349 | 0.374 | 15000 | 784 | 1705 | 1216 | 11295 |
Logistic Regression full data set | 0.81 | 0.324 | 0.391 | 0.354 | 0.376 | 15000 | 782 | 1630 | 1218 | 11370 |
This just achieves the goal of being above 0.3 for precision and recall. Note that the method uses just 2 components of data based on only 8 features. This suggests that a pipeline approach is a good approach for this problem.
The F1 score here is 0.35, with a higher recall than precision. This suggests more POIs are being found, but a significant proportion of POIs are still not identified.
Following creation of the estimators, poi_id.py is changed to use all of the data for training, while tester.py is used to compare the results. These give similar results, as seen in the table above.
Following testing of the impact of feature engineering, all features were tested, including the engineered features and the original features used to create them.
This led to a surprising result: higher performance was achieved using the pipeline, and it would often select k='all'. So all features are used after the initial static feature selection, even though a number are linearly correlated or sparsely populated.
Furthermore, the pipeline is now unstable during GridSearchCV and will give slightly different estimators which lead to different results when running tester.py to compare performance.
The conclusion is still valid, but the scores now increase, especially recall, which can reach above 0.6 with precision above 0.4. One example is below.
Pipeline(steps=[
    ('anova', SelectKBest(k='all', score_func=f_classif)),
    ('r_dim', PCA(copy=True, iterated_power='auto', n_components=4,
                  random_state=None, svd_solver='auto', tol=0.0, whiten=True)),
    ('clf', LogisticRegression(C=100, class_weight='balanced', dual=False,
                               fit_intercept=True, intercept_scaling=1,
                               max_iter=100, multi_class='ovr', n_jobs=1,
                               penalty='l2', random_state=None,
                               solver='liblinear', tol=0.0001, verbose=0,
                               warm_start=False))])

Accuracy: 0.82633
Precision: 0.40632
Recall: 0.65600
F1: 0.50182
F2: 0.58420
Total predictions: 15000
True positives: 1312
False positives: 1917
False negatives: 688
True negatives: 11083
The logistic regression combined with PCA and ANOVA feature selection offers an estimator which gives above 0.3 for both Precision and Recall. This achieves the objective criteria. This is a balanced model.
Other methods have been attempted. One which is documented is Gradient Boosting, which overfits the data, giving a high precision (0.45) but poor recall, meaning it misses many persons of interest.
Further work could be undertaken to improve this, for example additional optimization of Logistic Regression and its parameters.
New features could be generated from the email corpus, for example highlighting a key word set (perhaps related to specific criminal activities such as electric grid manipulation) that relates to POIs. This would expand the input variables and could add information that improves performance.
Overall this is a challenging case due to the limited size of the dataset and mixed missing values across different people.
The code in both poi_id.py and tester.py was changed to work with Python 3 and pickle, otherwise a TypeError is returned. Files now have to be opened with "rb" (read binary) and "wb" (write binary) instead of "r" and "w" respectively.
From:
with open(f, "r") as data_file:
data_dict = pickle.load(data_file)
To:
with open(f, "rb") as data_file:
data_dict = pickle.load(data_file)
The code returns this warning.
DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
This has not been corrected as the starter code iterates over the cross-validation objects and requires this.
I hereby confirm that this submission is my work. I have cited above the origins of any parts of the submission that were taken from Websites, books, forums, blog posts, GitHub repositories, etc.
Random forest parameter range suggestion
Sklearn pipeline ANOVA feature selection