This repository contains code for analyzing user behavior within a mobile app. The primary focus is on understanding user interactions and their correlation with enrollment in the app.
Many companies offer free products or services in the hope of converting their customers to a paid membership. Because marketing is never free, these companies need to know which customers to target with offers and promotions.
- In this project, we work with a hypothetical company whose app lets customers track all of their finances in one place.
- The company has tasked us with identifying which users are least likely to enroll in the paid product. Because of marketing costs, the company does not want to extend offers to everyone, especially to customers who were going to enroll anyway.
The analysis uses the appdata10.csv dataset, which consists of user attributes and their activity within the app. The fields are manufactured, based on trends found in real-world case studies: they describe what companies usually track about their users, and the values follow observed distributions.
The data requires cleaning and substantial pre-processing before it is ready for modelling.
- The code starts by importing necessary libraries like Pandas, NumPy, Matplotlib, Seaborn, and dateutil.
- It then loads a dataset named 'appdata10.csv' using Pandas' read_csv function.
- Initial data exploration and cleaning are performed using df.describe() to understand basic statistics and identify potential issues. Cleaning operations include converting the 'hour' column from string to integer format using string slicing.
- The columns are described again to confirm the changes after cleaning.
- Histograms are plotted for numerical columns using Matplotlib to visualize the distribution of each feature.
- Correlation analysis is conducted between numerical columns and the 'enrolled' column using Pandas' corrwith function. A bar plot is created to visualize these correlations.
- A correlation matrix heatmap using Seaborn's heatmap function is generated to visualize the correlations between all pairs of numerical features.
- Feature engineering begins with transforming date columns ('first_open' and 'enrolled_date') to datetime objects for further analysis.
- A new feature 'dataset_diff' is created by calculating the time difference between 'enrolled_date' and 'first_open'.
- Histograms of this time difference are plotted to understand how long users take to enroll and to identify the time window in which most enrollments occur.
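The cleaning and date-handling steps above can be sketched as follows. This is a minimal illustration, not the repository's code: the tiny DataFrame stands in for appdata10.csv, the 'age' and 'numscreens' columns are hypothetical examples of numerical features, the 'hour' strings are assumed to start with a two-digit hour, and `pd.to_datetime` is used here where the project may rely on dateutil.

```python
import pandas as pd

# Tiny synthetic stand-in for appdata10.csv (illustrative values only).
df = pd.DataFrame({
    "first_open": ["2013-01-01 02:10:00", "2013-01-02 19:30:00",
                   "2013-01-03 08:05:00", "2013-01-04 23:45:00"],
    "hour": ["02:00:00", "19:00:00", "08:00:00", "23:00:00"],
    "age": [23, 35, 41, 29],            # hypothetical numerical feature
    "numscreens": [15, 4, 22, 9],       # hypothetical numerical feature
    "enrolled": [1, 0, 1, 0],
    "enrolled_date": ["2013-01-01 05:10:00", None,
                      "2013-01-06 08:05:00", None],
})

# Clean 'hour': slice out the two-digit hour prefix and cast to int.
df["hour"] = df["hour"].str.strip().str.slice(0, 2).astype(int)

# Correlation of each numerical feature with the 'enrolled' response.
numeric = df[["hour", "age", "numscreens"]]
corr_with_enrolled = numeric.corrwith(df["enrolled"])

# Parse the date columns; missing enrollment dates become NaT.
df["first_open"] = pd.to_datetime(df["first_open"])
df["enrolled_date"] = pd.to_datetime(df["enrolled_date"])

# Hours between first open and enrollment (NaN for users who never enrolled).
elapsed = df["enrolled_date"] - df["first_open"]
df["dataset_diff"] = elapsed.dt.total_seconds() / 3600
```

Expressing the difference in hours is an assumption made here so that an hour-based time limit can be applied to it directly.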
Based on the histogram analysis, the 'enrolled' column is modified to apply a 48-hour time limit: an enrollment is counted only if it occurs within 48 hours of the user first opening the app.
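A minimal sketch of that time limit, assuming 'dataset_diff' holds the elapsed time in hours and that enrollments strictly later than the limit are reset to 0. The sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative rows: elapsed hours from first open to enrollment.
df = pd.DataFrame({
    "enrolled": [1, 1, 0, 1],
    "dataset_diff": [3.0, 72.0, np.nan, 48.0],
})

CUTOFF_HOURS = 48  # time limit named in the text

# Enrollments later than the cutoff are treated as non-enrollments.
# NaN comparisons are False, so users who never enrolled are left unchanged.
df.loc[df["dataset_diff"] > CUTOFF_HOURS, "enrolled"] = 0
```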
- Feature engineering continues by processing the 'screen_list' column, which contains comma-separated strings of screens.
- This involves creating a binary column for each top screen that indicates whether the screen appears in 'screen_list', and counting any remaining screens in an 'Other' column.
- Funnel analysis is applied to group related screens ('Saving' screens) into a single column named 'SavingCount' to avoid correlation between individual screens.
- The updated dataframe is saved to new_data.csv using the to_csv function.
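The screen-processing steps above can be sketched as follows. The screen names and the `top_screens` list are hypothetical, and splitting each string into a list is one possible approach rather than necessarily the one used in the repository.

```python
import pandas as pd

# Illustrative 'screen_list' values (comma-separated screen names).
df = pd.DataFrame({
    "screen_list": ["Loan,Saving1,Home", "Saving1,Saving2,Profile", "Home"],
})

top_screens = ["Loan", "Saving1", "Saving2"]  # hypothetical top screens

# Turn each comma-separated string into a list of screen names.
screens = df["screen_list"].str.split(",")

# One binary column per top screen: 1 if the user visited it, else 0.
for name in top_screens:
    df[name] = screens.apply(lambda visited, n=name: int(n in visited))

# 'Other' counts the visited screens that are not in the top list.
df["Other"] = screens.apply(
    lambda visited: sum(s not in top_screens for s in visited)
)

# Funnel: collapse the related 'Saving' screens into a single count,
# then drop the individual columns so they do not correlate with each other.
saving_cols = ["Saving1", "Saving2"]
df["SavingCount"] = df[saving_cols].sum(axis=1)
df = df.drop(columns=saving_cols)
```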
- Data Loading: The initial step typically involves loading the dataset into memory. This could be from a CSV file, a database, or any other data source. For instance:
import pandas as pd
df = pd.read_csv('new_data.csv')
- Feature Engineering: Categorical variables are converted to numerical ones using techniques such as one-hot encoding or label encoding.
- Data Splitting: The dataset is divided into the predictor variables (X) and the target variable (y, the 'enrolled' column).
X = df.drop('enrolled', axis=1)
y = df['enrolled']
- Model Instantiation and Training:
- Choosing a Model: In this case, Logistic Regression is chosen. Other models could be utilized based on the problem type (classification/regression) and data characteristics.
- Splitting Data for Training and Testing
- Training the Model
- Model Prediction: After training, the model can be used to predict on the test data.
- Calculating evaluation metrics such as accuracy, precision, recall, and F1-score using functions from the sklearn.metrics module.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
f1 = f1_score(y_test, predictions)
- Creating a confusion matrix.
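The split, training, prediction, and confusion-matrix steps above might look like the sketch below. The data is randomly generated for illustration, the two feature columns are placeholders, and the 80/20 split, `random_state`, and `max_iter` values are assumptions rather than settings taken from the project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data standing in for new_data.csv (illustrative only).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "numscreens": rng.integers(1, 50, size=n),
    "SavingCount": rng.integers(0, 5, size=n),
})
noise = rng.normal(0, 10, size=n)
df["enrolled"] = ((df["numscreens"] + noise) > 25).astype(int)

# Predictors and target, as in the data-splitting step.
X = df.drop("enrolled", axis=1)
y = df["enrolled"]

# Hold out 20% of the rows for testing (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit logistic regression and predict on the held-out rows.
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

# Rows are true classes, columns are predicted classes (0 then 1).
cm = confusion_matrix(y_test, predictions, labels=[0, 1])
print(cm)
```

The names `y_test` and `predictions` match those used in the metric calls, so the same objects can be passed to `accuracy_score` and the other scoring functions.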
Model performance is also evaluated with k-fold cross-validation, using the cross_val_score function from sklearn.model_selection.
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10)
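For a self-contained illustration of the same call, the sketch below runs 10-fold cross-validation on generated data and summarizes the fold scores by their mean and standard deviation. The dataset from `make_classification` and the `max_iter` setting are assumptions for the example only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Generated training data standing in for the project's X_train / y_train.
X_train, y_train = make_classification(
    n_samples=300, n_features=5, random_state=0
)

classifier = LogisticRegression(max_iter=1000)

# One accuracy score per fold (10 folds).
accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10)

# The mean estimates typical performance; the spread shows its stability.
print(f"Mean accuracy: {accuracies.mean():.3f}")
print(f"Standard deviation: {accuracies.std():.3f}")
```

A small standard deviation across folds suggests the accuracy does not depend heavily on which rows end up in the training set.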
- Displaying and interpreting the evaluation metrics, cross-validation scores, and potentially feature importance scores.
- Creating a dataframe with these three columns: