OpenDataScience/ODS Open Course Kick-off

Greetings to everyone who has been waiting for the launch of the open course on hands-on data analysis and machine learning!


The first part is about primary data analysis with Pandas.

For now, we plan to have 7 articles that come along with Jupyter notebooks (mlcourse_open repository), competitions and homework assignments.

Below you will find the list of upcoming articles, the course description, and the first part itself.

Plan of articles

  1. Primary data analysis with Pandas
  2. Visual data analysis with Python
  3. Classification, decision trees and the nearest-neighbour method
  4. Linear models for classification and regression. Cross-validation and model evaluation
  5. Ensembles: bagging, random forest. Validation and learning curves
  6. Unsupervised learning: PCA, clusterization, anomaly detection
  7. The art of feature engineering and selection. Applications in text mining, image processing and geospatial data

Plan of this article

  1. About the course
  2. Homework in the course
  3. Demonstration of Pandas main methods
  4. First try to predict churn rate
  5. Homework #1
  6. Useful links and resources

1. About the course

Many in our OpenDataScience team are involved in state-of-the-art machine learning technologies: DL frameworks, Bayesian machine learning methods, probabilistic programming, and more. We are currently preparing lectures and practical exercises/workshops on these topics, and in order to prepare the audience for them, we decided to publish a series of introductory articles on Habr.

We do not aim to develop yet another comprehensive introductory course on machine learning or data analysis (so it is not a substitute for fundamental online and offline programs and books). The purpose of this series of articles is to quickly refresh your knowledge or help you find topics for further study. Our approach is similar to that of the authors of the Deep Learning book, which begins with a review of mathematics and the basics of machine learning - short, concise, and with many references to sources.

If you plan to take the course, you should know that when selecting topics and creating materials, we assume that our audience has at least a 2nd-year university level of knowledge in higher mathematics and can write code in Python. These are not strict entry criteria, just recommendations - you can enroll in the course without knowing math or Python (or even both!) and catch up in parallel using the following resources:

  • Basic math (calculus, linear algebra, optimization, probability theory and statistics) can be reviewed with the excellent online courses from MIPT and the Higher School of Economics on Coursera;
  • for Python, the small interactive tutorial on basic algorithms and data structures from DataCamp or this repository should be enough. For a more advanced level, see, for example, the course from the St. Petersburg Computer Science Center;
  • as for machine learning, in 95% of cases you will be recommended the classic course by Andrew Ng (Stanford, Coursera). There are also some must-reads: "The Elements of Statistical Learning" (Hastie, Tibshirani), "Pattern Recognition" (Bishop), "Machine Learning: A Probabilistic Perspective" (Murphy) and "Deep Learning" (Goodfellow, Bengio). The latter begins with a comprehensive and engaging introduction to machine learning and the inner workings of its algorithms.

What software you’ll need

For now, you will only need the Anaconda distribution with Python 3.6 to reproduce the code in the course. Later you will need other libraries; this will be discussed separately.

Update:

We hope that most of you already know about PyPI and can install Python 3.6, or have even heard of and used venv / virtualenvwrapper. If you are one of them, use whatever is more convenient for you (but we do not guarantee full compatibility of the course materials with all versions of the libraries!).

For everyone else, install Anaconda: see the Quick Install Guide. The distribution includes 100 Python libraries and supports another 720. There is also a light version, Miniconda, but with it you will need to install all the necessary libraries yourself.

You can also use the Docker container that already has all the necessary software. Details are in the README in the repository.

How to join the course

You can join the course any time, but homework deadlines are strict. To participate:

  • Fill in the form, indicating your real name and preferably a Google email address (you will need a Google account for the homework);
  • You can also join the OpenDataScience community; the course is discussed in the #mlcourse_open channel.

2. Course Homework

  • Each article comes with homework in the form of a Jupyter notebook, where you need to add code and, based on it, choose the correct answers in a Google form (that is what you need the Google account for);
  • You can get up to 10 points per homework assignment;
  • Each homework assignment must be completed within 1 week; this is a hard deadline;
  • Homework solutions will be posted in the repository right after the deadline;
  • At the end we will publish a ranking of participants.

3. Demonstration of Pandas main methods

Pandas is a Python library that provides extensive capabilities for data analysis. Data scientists often work with data stored in table formats like .csv, .tsv, or .xlsx. Pandas makes it very convenient to load, process and analyze such tabular data using SQL-like queries. In conjunction with the Matplotlib and Seaborn libraries, Pandas provides a wide range of opportunities for visual analysis of tabular data.

The main data structures in Pandas are the Series and DataFrame classes. The former is a one-dimensional indexed array of data of some fixed type. The latter is a two-dimensional data structure - a table - where each column contains data of the same type. You can think of it as a dictionary of Series objects. The DataFrame structure is great for representing real data: rows correspond to individual objects (instances), and columns correspond to their features.

# Make sure that Python 2 and 3 are compatible
# pip install future
from __future__ import (absolute_import, division,
                        print_function, unicode_literals)

# turn off Anaconda warnings
import warnings
warnings.simplefilter('ignore')

# import Pandas and Numpy
import pandas as pd
import numpy as np
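
To make the Series / DataFrame distinction concrete, here is a minimal sketch with made-up data (not part of the original notebook; it reuses the pd alias imported above):

# a Series is a one-dimensional indexed array of a fixed type
salaries = pd.Series([400, 300, 200], index=['Andrew', 'Bob', 'Charlie'])

# a DataFrame is a table; think of it as a dictionary of Series (one per column)
toy_df = pd.DataFrame({'age': [25, 32, 41],
                       'city': ['NY', 'LA', 'SF']})
print(salaries)
print(toy_df)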

We will show the main methods in action by analyzing a dataset on the churn of a telecom operator's customers. Let's read the data (the read_csv method) and look at the first 5 rows using the head method:

df = pd.read_csv('../../data/telecom_churn.csv')
df.head()


In Jupyter notebooks, Pandas dataframes are printed as these pretty tables, while `print(df.head())` looks worse. By default, Pandas displays 20 columns and 60 rows, so if your dataframe is bigger, use the `set_option` function:
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

Each row corresponds to one client, the object of our research, and the columns are the features of this object.

| Name | Description | Type |
| --- | --- | --- |
| **State** | Letter state code | nominal |
| **Account length** | How long the client has been with the company | numerical |
| **Area code** | Phone number prefix | numerical |
| **International plan** | International roaming (on/off) | binary |
| **Voice mail plan** | Voicemail (on/off) | binary |
| **Number vmail messages** | Number of voicemail messages | numerical |
| **Total day minutes** | Total duration of calls during the day | numerical |
| **Total day calls** | Total number of calls during the day | numerical |
| **Total day charge** | Total charge for services during the day | numerical |
| **Total eve minutes** | Total duration of calls during the evening | numerical |
| **Total eve calls** | Total number of calls during the evening | numerical |
| **Total eve charge** | Total charge for services during the evening | numerical |
| **Total night minutes** | Total duration of calls during the night | numerical |
| **Total night calls** | Total number of calls during the night | numerical |
| **Total night charge** | Total charge for services during the night | numerical |
| **Total intl minutes** | Total duration of international calls | numerical |
| **Total intl calls** | Total number of international calls | numerical |
| **Total intl charge** | Total charge for international calls | numerical |
| **Customer service calls** | Number of calls to the service center | numerical |

The target variable is Churn, a binary churn indicator (1 means the loss of a client). Later we will build models that predict this feature based on the others; that is why we call it the target.

Let's have a look at the data dimensionality, feature names and feature types.

print(df.shape)
(3333, 20)

We can see that the table contains 3333 rows and 20 columns. Let's print out the column names:

print(df.columns)
Index(['State', 'Account length', 'Area code', 'International plan',
       'Voice mail plan', 'Number vmail messages', 'Total day minutes',
       'Total day calls', 'Total day charge', 'Total eve minutes',
       'Total eve calls', 'Total eve charge', 'Total night minutes',
       'Total night calls', 'Total night charge', 'Total intl minutes',
       'Total intl calls', 'Total intl charge', 'Customer service calls',
       'Churn'],
      dtype='object')

We can use the info method to see general information about the dataframe:

print(df.info())
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 20 columns):
State                     3333 non-null object
Account length            3333 non-null int64
Area code                 3333 non-null int64
International plan        3333 non-null object
Voice mail plan           3333 non-null object
Number vmail messages     3333 non-null int64
Total day minutes         3333 non-null float64
Total day calls           3333 non-null int64
Total day charge          3333 non-null float64
Total eve minutes         3333 non-null float64
Total eve calls           3333 non-null int64
Total eve charge          3333 non-null float64
Total night minutes       3333 non-null float64
Total night calls         3333 non-null int64
Total night charge        3333 non-null float64
Total intl minutes        3333 non-null float64
Total intl calls          3333 non-null int64
Total intl charge         3333 non-null float64
Customer service calls    3333 non-null int64
Churn                     3333 non-null bool
dtypes: bool(1), float64(8), int64(8), object(3)
memory usage: 498.1+ KB
None

bool, int64, float64 and object are the feature types. We see that 1 feature is logical (bool), 3 features are of type object, and 16 are numeric. The info method also makes it easy to spot missing values in the data: in our case there are none, since each column contains 3333 observations.
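
If you specifically want to count missing values, a more direct check (a small sketch, not from the original article) is:

# number of missing values in each column; all zeros for this dataset
df.isnull().sum()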

We can change a column's type with the astype method. Let's apply this method to the Churn feature to convert it into int64:

df['Churn'] = df['Churn'].astype('int64')

The describe method shows the basic statistical characteristics of the data for each numerical feature (int64 and float64 types): the number of non-missing values, mean, standard deviation, range, median, and the 0.25 and 0.75 quartiles.

df.describe()


In order to see statistics on non-numerical features, one needs to explicitly indicate the types of interest in the include parameter.

df.describe(include=['object', 'bool'])
State Area code International plan Voice mail plan Churn
count 3333 3333 3333 3333 3333
unique 51 3 2 2 2
top WV 415 No No False
freq 106 1655 3010 2411 2850

For categorical (type `object`) and boolean (type `bool`) features we can use the **`value_counts`** method. Let's have a look at the distribution of our target variable, `Churn`:
df['Churn'].value_counts()
0    2850
1     483
Name: Churn, dtype: int64

2850 out of 3333 users are loyal; their Churn value is 0.

Let's have a look at the distribution of users over the Area code variable. We will set the parameter normalize=True to see relative frequencies instead of absolute ones.

df['Area code'].value_counts(normalize=True)
415    0.496550
510    0.252025
408    0.251425
Name: Area code, dtype: float64

Sorting

A DataFrame can be sorted by the value of one of its variables (i.e., columns). In our case, for example, we can sort by Total day charge (ascending=False sorts in descending order):

df.sort_values(by='Total day charge', 
        ascending=False).head()


We can also sort by a group of columns:

df.sort_values(by=['Churn', 'Total day charge'],
        ascending=[True, False]).head()


Thanks to @makkos for pointing out the deprecated sort method.


Indexing and retrieving data

A DataFrame can be indexed in different ways. With that in mind, let's consider various ways of indexing and retrieving the data we need from the dataframe, using simple questions as examples.

To retrieve a single column, you can use the DataFrame['Name'] construct. Let's use this to answer a question: what is the proportion of disloyal users in our dataframe?

df['Churn'].mean()
0.14491449144914492

14.5% is quite a bad churn rate for a company; with such a rate it is possible to go bankrupt.

Logical indexing of a DataFrame by one column is also very convenient. It looks like this: df[P(df['Name'])], where P is some logical condition that is checked for each element of the Name column. The result of such indexing is a DataFrame consisting only of the rows that satisfy the condition P on the Name column.

Let's use it to answer the question: what are the average values of the numerical features for disloyal users?

df[df['Churn'] == 1].mean()
Account length            102.664596
Number vmail messages       5.115942
Total day minutes         206.914079
Total day calls           101.335404
Total day charge           35.175921
Total eve minutes         212.410145
Total eve calls           100.561077
Total eve charge           18.054969
Total night minutes       205.231677
Total night calls         100.399586
Total night charge          9.235528
Total intl minutes         10.700000
Total intl calls            4.163561
Total intl charge           2.889545
Customer service calls      2.229814
Churn                       1.000000
dtype: float64

Let's combine the two methods above to answer the following question: how much time do disloyal users spend on the phone, on average, during the daytime?

df[df['Churn'] == 1]['Total day minutes'].mean()
206.91407867494823

What is the maximum length of international calls among loyal users (Churn == 0) who do not use the international roaming service ('International plan' == 'No')?

df[(df['Churn'] == 0) & (df['International plan'] == 'No')]['Total intl minutes'].max()
18.899999999999999

DataFrames can be indexed by column or row name, or by their serial (positional) number. The loc method is used for indexing by name, iloc for indexing by number.

In the first case we say "give us the values of the rows with index from 0 to 5 (inclusive, since loc includes both ends of the slice) for the columns from State to Area code", and in the second - "give us the values of the first five rows in the first three columns".

df.loc[0:5, 'State':'Area code']
State Account length Area code
0 KS 128 415
1 OH 107 415
2 NJ 137 415
3 OH 84 408
4 OK 75 415
5 AL 118 510
df.iloc[0:5, 0:3]
State Account length Area code
0 KS 128 415
1 OH 107 415
2 NJ 137 415
3 OH 84 408
4 OK 75 415

If we need the first or last row of the dataframe, we can use the `df[:1]` or `df[-1:]` construct:
df[-1:]


Applying Functions to Cells, Columns and Rows

To apply a function to each column, use `apply`:

df.apply(np.max) 
State                        WY
Account length              243
Area code                   510
International plan          Yes
Voice mail plan             Yes
Number vmail messages        51
Total day minutes         350.8
Total day calls             165
Total day charge          59.64
Total eve minutes         363.7
Total eve calls             170
Total eve charge          30.91
Total night minutes         395
Total night calls           175
Total night charge        17.77
Total intl minutes           20
Total intl calls             20
Total intl charge           5.4
Customer service calls        9
Churn                      True
dtype: object

The `apply` method can also be used to apply a function to each row. To do this, specify `axis=1`.
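
For instance, here is a hedged sketch of a row-wise apply (not from the original notebook):

# apply a function to each row (axis=1): total talk time across day, evening and night
df.apply(lambda row: row['Total day minutes']
                     + row['Total eve minutes']
                     + row['Total night minutes'],
         axis=1).head()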

To apply a function to each cell of a column, use `map`.

For example, the map method can be used to replace values in a column by passing it a dictionary of the form {old_value: new_value} as an argument:

d = {'No' : False, 'Yes' : True}
df['International plan'] = df['International plan'].map(d)
df.head()


The same thing can be done with the replace method:

df = df.replace({'Voice mail plan': d})
df.head()


Grouping

In general, grouping data in Pandas works as follows:

df.groupby(by=grouping_columns)[columns_to_show].function()
  1. First, the groupby method is applied to the DataFrame; it splits the data according to grouping_columns - one or several variables.
  2. Then, the columns of interest are selected (columns_to_show).
  3. Finally, one or several functions are applied to the obtained groups.
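
For instance, the simplest version of this pattern (a minimal sketch, not from the original notebook) uses a single aggregating function:

# average daytime talk time for loyal (Churn = 0) vs churned (Churn = 1) clients
df.groupby('Churn')['Total day minutes'].mean()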

Here is how we group the data according to the value of the Churn variable and display statistics for three columns in each group:

columns_to_show = ['Total day minutes', 'Total eve minutes', 'Total night minutes']

df.groupby(['Churn'])[columns_to_show].describe(percentiles=[])
Total day minutes Total eve minutes Total night minutes
Churn
0 count 2850.000000 2850.000000 2850.000000
mean 175.175754 199.043298 200.133193
std 50.181655 50.292175 51.105032
min 0.000000 0.000000 23.200000
50% 177.200000 199.600000 200.250000
max 315.600000 361.800000 395.000000
1 count 483.000000 483.000000 483.000000
mean 206.914079 212.410145 205.231677
std 68.997792 51.728910 47.132825
min 0.000000 70.900000 47.400000
50% 217.600000 211.300000 204.800000
max 350.800000 363.700000 354.900000

Let's do the same thing slightly differently by passing a list of functions to `agg`:
columns_to_show = ['Total day minutes', 'Total eve minutes', 'Total night minutes']

df.groupby(['Churn'])[columns_to_show].agg([np.mean, np.std, np.min, np.max])
Total day minutes Total eve minutes Total night minutes
mean std amin amax mean std amin amax mean std amin amax
Churn
0 175.175754 50.181655 0.0 315.6 199.043298 50.292175 0.0 361.8 200.133193 51.105032 23.2 395.0
1 206.914079 68.997792 0.0 350.8 212.410145 51.728910 70.9 363.7 205.231677 47.132825 47.4 354.9

Summary tables

Suppose we want to see how the observations in our sample are distributed with respect to two variables - Churn and International plan. To do so, we can build a contingency table using the crosstab method:

pd.crosstab(df['Churn'], df['International plan'])
International plan No Yes
Churn
0 2664 186
1 346 137
pd.crosstab(df['Churn'], df['Voice mail plan'], normalize=True)
Voice mail plan No Yes
Churn
0 0.602460 0.252625
1 0.120912 0.024002

We can see that most of the users are loyal and do not use additional services (international roaming / voicemail).

Advanced Excel users will probably remember the pivot table feature. In Pandas, the pivot_table method is responsible for pivot tables; it takes the following parameters:

  • values – a list of variables for which to calculate statistics;
  • index – a list of variables by which to group the data;
  • aggfunc – which statistics to calculate for the groups: sum, mean, maximum, minimum or something else.

Let's take a look at the average number of day, evening and night calls for different area codes:

df.pivot_table(['Total day calls', 'Total eve calls', 'Total night calls'], ['Area code'], aggfunc='mean').head(10)
Total day calls Total eve calls Total night calls
Area code
408 100.496420 99.788783 99.039379
415 100.576435 100.503927 100.398187
510 100.097619 99.671429 100.601190

DataFrame transformations

Like many other things in Pandas, adding columns to a DataFrame can be done in several ways.

For example, suppose we want to calculate the total number of calls for all users. Let's create a total_calls Series object and insert it into the dataframe:

total_calls = df['Total day calls'] + df['Total eve calls'] + df['Total night calls'] + df['Total intl calls']
df.insert(loc=len(df.columns), column='Total calls', value=total_calls) 
# loc is the position (index) at which to insert the new column;
# len(df.columns) places it at the very end of the dataframe
df.head()


It is also possible to add a column from existing ones more easily, without creating an intermediate Series object:

df['Total charge'] = df['Total day charge'] + df['Total eve charge'] + df['Total night charge'] + df['Total intl charge']

df.head()


To delete columns or rows, use the drop method, passing it the required indexes and the value of the axis parameter (1 to delete columns, 0 or nothing to delete rows):

df = df.drop(['Total charge', 'Total calls'], axis=1) # get rid of the just created columns

df.drop([1, 2]).head() # and here’s how you can delete rows


4. First Tries to Predict Churn Rate

Let's see how churn rate is connected with the International plan variable. We will do this using a crosstab contingency table and a Seaborn visualization (we will cover how to build such plots and how to read them in the next article).

pd.crosstab(df['Churn'], df['International plan'], margins=True)
International plan False True All
Churn
0 2664 186 2850
1 346 137 483
All 3010 323 3333


We see that when roaming is switched on, the churn rate is much higher - an interesting observation! Perhaps large and poorly controlled roaming expenses are very conflict-prone and lead to dissatisfaction among the telecom operator's customers and, accordingly, to their churn.

Next, let's look at another important feature - the number of calls to the service center (Customer service calls). Let's also make a summary table and a plot.

pd.crosstab(df['Churn'], df['Customer service calls'], margins=True)
Customer service calls 0 1 2 3 4 5 6 7 8 9 All
Churn
0 605 1059 672 385 90 26 8 4 1 0 2850
1 92 122 87 44 76 40 14 5 1 2 483
All 697 1181 759 429 166 66 22 9 2 2 3333


Perhaps it is not so obvious from the summary table (or it is tedious to scan rows of numbers), but the plot clearly shows that the churn rate increases sharply starting from 4 calls to the service center.

Let's now add a binary feature to our DataFrame - the result of the comparison Customer service calls > 3. And once again, let's see how it relates to churn.

df['Many_service_calls'] = (df['Customer service calls'] > 3).astype('int')

pd.crosstab(df['Many_service_calls'], df['Churn'], margins=True)
Churn 0 1 All
Many_service_calls
0 2721 345 3066
1 129 138 267
All 2850 483 3333


Let's combine the conditions from above and build a contingency table for this combination and churn.

pd.crosstab(df['Many_service_calls'] & df['International plan'] , df['Churn'])
Churn 0 1
row_0
False 2841 464
True 9 19

Therefore, by predicting customer loyalty when the number of calls to the service center is less than 4 and roaming is not enabled (and predicting churn otherwise), we can expect an accuracy of about 85.8% (we are wrong only 464 + 9 times out of 3333). This 85.8%, obtained with very simple reasoning, is a good starting point (*baseline*) for the machine learning models that we will build later.
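
As a sanity check, here is a small sketch (not in the original notebook) that reproduces this number, along with the naive "always loyal" baseline discussed below, directly from the dataframe:

# naive baseline: always predict "loyal"; accuracy equals the share of loyal clients
print(1 - df['Churn'].mean())  # ~0.855

# simple-rule baseline matching the crosstab above:
# predict churn only when both Many_service_calls and International plan are on
simple_pred = (df['Many_service_calls'].astype(bool) & df['International plan']).astype('int64')
print((simple_pred == df['Churn']).mean())  # ~0.858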

In general, before the advent of machine learning, the data analysis process looked something like this. Let's recap:

  • The share of loyal clients in the sample is 85.5%. The most naive model that always predicts a "loyal customer" will be right in about 85.5% of cases on such data. That is, the accuracy (proportion of correct answers) of subsequent models should be no lower, and preferably significantly higher, than this number;
  • With a simple rule that can be expressed by the formula "International plan = False & Customer service calls < 4 => Churn = 0, else Churn = 1", we can expect an accuracy of 85.8%, which is just above 85.5%. Later we will talk about decision trees and figure out how to find such rules automatically, based only on the input data;
  • We obtained these two baselines without any machine learning, and they will serve as the starting point for our subsequent models. If it turns out that after enormous effort we only increase the share of correct answers by, say, 0.5%, then perhaps we are doing something wrong, and it would suffice to stick with a simple model with two conditions;
  • Before training complex models, it is recommended to poke around the data a little and check simple hypotheses. Moreover, in business applications of machine learning, one usually starts with simple solutions and only then experiments with more complex ones.

5. Homework #1

The first homework assignment is devoted to the analysis of the Adult dataset from the UCI repository and is presented in this Jupyter notebook. The dataset contains demographic information about US residents.

We suggest that you complete this task and then answer 10 questions. Here is the link to the answer form (you can also find it in the notebook). Answers in the form can be changed after submission, but not after the deadline (well, technically it is possible even after the deadline, but the answers will not be credited).

Hard deadline: March 6 23:59

6. Useful links and resources

The article was written in collaboration with @yorko (Yuri Kashitski) and with the support of the ODS community.