A Loan Prediction Risk Analysis for the Nonprofit Foundation, Alphabet Soup.
The purpose of this project was to design a binary classification deep learning model capable of predicting, with accuracy higher than 75%, whether applicants would be successful if funded by Alphabet Soup.
Alphabet Soup’s business team provided a CSV containing more than 34,000 organizations that have received funding from Alphabet Soup over the years. Within this dataset are a number of columns that capture metadata about each organization, such as the following:
- EIN and NAME—Identification columns
- APPLICATION_TYPE—Alphabet Soup application type
- AFFILIATION—Affiliated sector of industry
- CLASSIFICATION—Government organization classification
- USE_CASE—Use case for funding
- ORGANIZATION—Organization type
- STATUS—Active status
- INCOME_AMT—Income classification
- SPECIAL_CONSIDERATIONS—Special consideration for application
- ASK_AMT—Funding amount requested
- IS_SUCCESSFUL—Was the money used effectively
Using Pandas and scikit-learn’s StandardScaler(), the dataset was preprocessed so that the neural network model could be compiled, trained, and evaluated in Deliverable 2.
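The full preprocessing code is in the notebook; the sketch below outlines the general approach, assuming the source file is named charity_data.csv and using illustrative binning cutoffs and random state rather than the notebook's exact values:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the charity dataset (file name assumed; see the notebook for the actual path)
application_df = pd.read_csv("charity_data.csv")

# Drop the non-beneficial identification column.
# The original model also dropped NAME; the optimized model kept it (see the optimization notes below).
application_df = application_df.drop(columns=["EIN"])

# Bin rare APPLICATION_TYPE and CLASSIFICATION values into an "Other" category
# (cutoff values here are illustrative)
app_counts = application_df["APPLICATION_TYPE"].value_counts()
rare_apps = list(app_counts[app_counts < 500].index)
application_df["APPLICATION_TYPE"] = application_df["APPLICATION_TYPE"].replace(rare_apps, "Other")

class_counts = application_df["CLASSIFICATION"].value_counts()
rare_classes = list(class_counts[class_counts < 1000].index)
application_df["CLASSIFICATION"] = application_df["CLASSIFICATION"].replace(rare_classes, "Other")

# One-hot encode the categorical columns
application_df = pd.get_dummies(application_df)

# Split into features (X) and target (y), then into training and testing sets
y = application_df["IS_SUCCESSFUL"].values
X = application_df.drop(columns=["IS_SUCCESSFUL"]).values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)

# Scale the features with StandardScaler
scaler = StandardScaler()
X_scaler = scaler.fit(X_train)
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
```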
Using TensorFlow, a deep learning neural network was designed as a binary classification model that predicts whether an Alphabet Soup–funded organization will be successful based on the features in the dataset. The model was then compiled, trained, and evaluated to calculate its loss and accuracy.
See the AlphabetSoupCharity.ipynb file for the Deliverable 1 and 2 code.
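The sketch below shows the general shape of defining, compiling, training, and evaluating such a model with Keras. The layer sizes and epoch count are placeholders rather than the original model's exact parameters, and it reuses the scaled data from the preprocessing sketch above:

```python
import tensorflow as tf

# Number of input features comes from the preprocessed training data
number_input_features = len(X_train_scaled[0])

# Define the Sequential model (hidden layer sizes here are illustrative placeholders)
nn = tf.keras.models.Sequential()
nn.add(tf.keras.layers.Dense(units=80, activation="relu", input_dim=number_input_features))
nn.add(tf.keras.layers.Dense(units=30, activation="relu"))
nn.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))  # binary yes/no output

# Compile with binary cross-entropy, which matches the binary target
nn.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Train on the scaled training data (epoch count is a placeholder)
fit_model = nn.fit(X_train_scaled, y_train, epochs=100)

# Evaluate loss and accuracy on the scaled test data
model_loss, model_accuracy = nn.evaluate(X_test_scaled, y_test, verbose=2)
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")
```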
Original Model Parameters and Accuracy:
Using TensorFlow, the model was optimized to achieve a target predictive accuracy higher than 75%. See the AlphabetSoupCharity_Optimzation file for the Deliverable 3 code.
See TensorFlow Playground for more information.
Below is a written report on the performance of the deep learning model created for Alphabet Soup. The following questions were answered after completing Deliverables 1–3 above.
- What variable(s) are considered the target(s) for your model? The IS_SUCCESSFUL variable was the target; it answers the question of whether the money received from the foundation was used effectively. The variable has two values: Yes = 1, No = 0.
- What variable(s) are considered to be the features for your model?
- NAME--Identification column
- APPLICATION_TYPE—Alphabet Soup application type
- AFFILIATION—Affiliated sector of industry
- CLASSIFICATION—Government organization classification
- USE_CASE—Use case for funding
- ORGANIZATION—Organization type
- STATUS—Active status
- INCOME_AMT—Income classification
- SPECIAL_CONSIDERATIONS—Special consideration for application
- ASK_AMT—Funding amount requested
- What variable(s) are neither targets nor features, and should be removed from the input data? The EIN variable was removed as it was an identification column.
- How many neurons, layers, and activation functions did you select for your neural network model, and why? The input layer was set to the number of input features, equal to the number of variables in the feature DataFrame (number_input_features = len(X_train[0])). Three hidden layers were used, with hidden_nodes_layer1 = 50, hidden_nodes_layer2 = 25, and hidden_nodes_layer3 = 12; hidden layers 2 and 3 used sigmoid activations (see the optimization notes and sketch below).
- Were you able to achieve the target model performance? Yes, the optimized model achieved a predictive accuracy of 78%.
- What steps did you take to try and increase model performance? I changed the input variables so that only the EIN column was dropped and the NAME column was kept as a feature. I also reduced the number of epochs from 100 to 50, added a third hidden layer, and changed the activation functions on hidden layers 2 and 3 to sigmoid. A sketch of the optimized model is shown below.
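This is a minimal sketch of the optimized architecture described above (three hidden layers of 50, 25, and 12 neurons, sigmoid on hidden layers 2 and 3, and 50 epochs). It assumes ReLU remained on the first hidden layer and reuses the scaled data from the preprocessing sketch:

```python
import tensorflow as tf

number_input_features = len(X_train_scaled[0])
hidden_nodes_layer1 = 50
hidden_nodes_layer2 = 25
hidden_nodes_layer3 = 12

# Optimized model: ReLU assumed on the first hidden layer, sigmoid on hidden layers 2 and 3
nn_opt = tf.keras.models.Sequential()
nn_opt.add(tf.keras.layers.Dense(units=hidden_nodes_layer1, activation="relu", input_dim=number_input_features))
nn_opt.add(tf.keras.layers.Dense(units=hidden_nodes_layer2, activation="sigmoid"))
nn_opt.add(tf.keras.layers.Dense(units=hidden_nodes_layer3, activation="sigmoid"))
nn_opt.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))  # binary output

nn_opt.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Epochs were reduced from 100 to 50 during optimization
fit_model = nn_opt.fit(X_train_scaled, y_train, epochs=50)

# Evaluate loss and accuracy on the scaled test data
model_loss, model_accuracy = nn_opt.evaluate(X_test_scaled, y_test, verbose=2)
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")
```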
After making the changes to optimize our model:
- We are able to correctly classify points in the test data 78% of the time.
- After binning rare types into an Other category, APPLICATION_TYPE takes one of the following values: T3 (27,037), T4 (1,542), T6 (1,216), T5 (1,173), T19 (1,065), T8 (737), T7 (725), T10 (528), and Other (276).
- The CLASSIFICATION column has 71 unique values.
- Another model to recommend is a Random Forest classifier, because random forests handle tabular classification problems well and would most likely produce comparable or higher accuracy. A sketch of this alternative follows.
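As a sketch of that alternative, a Random Forest classifier could be trained on the same preprocessed data; the hyperparameters below are illustrative, and the resulting accuracy is not guaranteed to exceed the neural network's:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Random forests do not require scaled inputs, but the scaled data from the preprocessing sketch works as well
rf_model = RandomForestClassifier(n_estimators=128, random_state=78)
rf_model.fit(X_train_scaled, y_train)

# Evaluate on the test set
rf_predictions = rf_model.predict(X_test_scaled)
print(f"Random Forest accuracy: {accuracy_score(y_test, rf_predictions):.3f}")
```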