OpenDataScience/ODS Open Course Kick-off

Greetings to everyone who has been waiting for the launch of the open course on hands-on data analysis and machine learning!


The first part is about primary data analysis with Pandas.

For now, we plan to have 7 articles that come along with Jupyter notebooks (mlcourse_open repository), competitions and homework assignments.

Below you will find the list of upcoming articles, the course description, and the first part itself.

Plan of articles

  1. Primary data analysis with Pandas
  2. Visual data analysis with Python
  3. Classification, decision trees and the nearest-neighbour method
  4. Linear models for classification and regression. Cross-validation and model evaluation
  5. Ensembles: bagging, random forest. Validation and learning curves
  6. Unsupervised learning: PCA, clusterization, anomaly detection
  7. The art of feature engineering and selection. Applications in text mining, image processing and geospatial data

Plan of this article

  1. About the course
  2. Homework in the course
  3. Demonstration of Pandas main methods
  4. First try to predict churn rate
  5. Homework #1
  6. Useful links and resources

1. About the course

Many in our OpenDataScience team are involved in state-of-the-art machine learning technologies: DL frameworks, Bayesian machine learning methods, probabilistic programming, and more. We are currently preparing lectures and practical exercises/workshops on these topics, and in order to prepare the audience for them, we decided to publish a series of introductory articles on Habr.

We do not aim to develop yet another comprehensive introductory course on machine learning or data analysis (so it is not a substitute for fundamental online and offline programs and books). The purpose of this series of articles is to quickly refresh your knowledge or help you find topics for further study. Our approach is similar to that of the authors of the Deep Learning book, which begins with a review of mathematics and the basics of machine learning - short, concise, and with many references to sources.

If you plan to take the course, you should know that when selecting topics and creating materials, we assume that our audience has at least a 2nd-year university level of knowledge in higher mathematics and can write code in Python. These are not strict entry criteria, just recommendations - you can enroll in the course without knowing math or Python (or even both!) and catch up in parallel using the following resources:

  • Basic math (calculus, linear algebra, optimization, probability theory and statistics) can be reviewed with the excellent online courses from MIPT and the Higher School of Economics on Coursera;
  • for Python, the small interactive tutorial on basic algorithms and data structures from DataCamp or this repository should be enough. For a more advanced level, see, for example, the course from the St. Petersburg Computer Science Center;
  • as for machine learning, in 95% of cases you will be recommended the classic course by Andrew Ng (Stanford, Coursera). There are also some must-reads: "The Elements of Statistical Learning" (Hastie, Tibshirani), "Pattern Recognition" (Bishop), "Machine Learning: A Probabilistic Perspective" (Murphy) and "Deep Learning" (Goodfellow, Bengio). The latter begins with a comprehensive and engaging introduction to machine learning and the inner workings of its algorithms.

What software you’ll need

For now, you will only need the Anaconda distribution with Python 3.6 to reproduce the code in the course. Later you will need other libraries; this will be discussed separately.

Update:

We hope that most of you already know about PyPI and can install Python 3.6, or have even heard of and used venv / virtualenvwrapper. If you are one of them, use whatever is more convenient for you (but we do not guarantee full compatibility of the course materials with all versions of the libraries!).

For everyone else, install Anaconda: see the Quick Install Guide. The distribution includes 100 Python libraries and supports another 720. There is also a light version, Miniconda, but with it you will need to install all the necessary libraries yourself.

You can also use the Docker container that already has all the necessary software. Details are in the README in the repository.

How to join the course

You can join the course any time, but homework deadlines are strict. To participate:

  • Fill in the form, indicating your real name and preferably a Google email address (you will need a Google account for the homework);
  • You can also join the OpenDataScience community; the course is discussed in the #mlcourse_open channel.

2. Course Homework

  • Each article comes with homework in the form of a Jupyter notebook, where you need to add code and, based on it, choose the correct answers in a Google form (that is what you need the Google account for);
  • You can get up to 10 points per homework assignment;
  • Each homework assignment must be completed within 1 week; this is a hard deadline;
  • Homework solutions will be posted in the repository right after the deadline;
  • At the end we will publish a ranking of participants.

3. Demonstration of Pandas main methods

Pandas is a Python library that provides extensive capabilities for data analysis. Data scientists often work with data stored in table formats like .csv, .tsv, or .xlsx. Pandas makes it very convenient to load, process and analyze such tabular data using SQL-like queries. In conjunction with the Matplotlib and Seaborn libraries, Pandas provides a wide range of opportunities for visual analysis of tabular data.

The main data structures in Pandas are the Series and DataFrame classes. The former is a one-dimensional indexed array of data of some fixed type. The latter is a two-dimensional data structure - a table - where each column contains data of the same type. You can think of it as a dictionary of Series objects. The DataFrame structure is great for representing real data: rows correspond to individual objects (instances), and columns correspond to their features.

# Make sure that Python 2 and 3 are compatible
# pip install future
from __future__ import (absolute_import, division,
                        print_function, unicode_literals)

# turn off Anaconda warnings
import warnings
warnings.simplefilter('ignore')

# import Pandas and Numpy
import pandas as pd
import numpy as np
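
To make the Series / DataFrame distinction concrete, here is a minimal sketch with made-up data (not part of the original notebook; it reuses the pd alias imported above):

# a Series is a one-dimensional indexed array of a fixed type
salaries = pd.Series([400, 300, 200], index=['Andrew', 'Bob', 'Charlie'])

# a DataFrame is a table; think of it as a dictionary of Series (one per column)
toy_df = pd.DataFrame({'age': [25, 32, 41],
                       'city': ['NY', 'LA', 'SF']})
print(salaries)
print(toy_df)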

We will show the main methods in action by analyzing a dataset on the churn of a telecom operator's customers. Let's read the data (the read_csv method) and look at the first 5 rows using the head method:

df = pd.read_csv('../../data/telecom_churn.csv')
df.head()


In Jupyter notebooks, Pandas dataframes are printed as these pretty tables, while `print(df.head())` looks worse. By default, Pandas displays 20 columns and 60 rows, so if your dataframe is bigger, use the `set_option` function:
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

Each row corresponds to one client, the object of our research, and the columns are the features of this object.

| Name | Description | Type |
| --- | --- | --- |
| **State** | Letter state code | nominal |
| **Account length** | How long the client has been with the company | numerical |
| **Area code** | Phone number prefix | numerical |
| **International plan** | International roaming (on/off) | binary |
| **Voice mail plan** | Voicemail (on/off) | binary |
| **Number vmail messages** | Number of voicemail messages | numerical |
| **Total day minutes** | Total duration of calls during the day | numerical |
| **Total day calls** | Total number of calls during the day | numerical |
| **Total day charge** | Total charge for services during the day | numerical |
| **Total eve minutes** | Total duration of calls during the evening | numerical |
| **Total eve calls** | Total number of calls during the evening | numerical |
| **Total eve charge** | Total charge for services during the evening | numerical |
| **Total night minutes** | Total duration of calls during the night | numerical |
| **Total night calls** | Total number of calls during the night | numerical |
| **Total night charge** | Total charge for services during the night | numerical |
| **Total intl minutes** | Total duration of international calls | numerical |
| **Total intl calls** | Total number of international calls | numerical |
| **Total intl charge** | Total charge for international calls | numerical |
| **Customer service calls** | Number of calls to the service center | numerical |

The target variable is Churn, a binary churn indicator (1 means the loss of a client). Later we will build models that predict this feature based on the others; that is why we call it the target.

Let's have a look at the data dimensionality, feature names and feature types.

print(df.shape)
(3333, 20)

We can see that the table contains 3333 rows and 20 columns. Let's print out the column names:

print(df.columns)
Index(['State', 'Account length', 'Area code', 'International plan',
       'Voice mail plan', 'Number vmail messages', 'Total day minutes',
       'Total day calls', 'Total day charge', 'Total eve minutes',
       'Total eve calls', 'Total eve charge', 'Total night minutes',
       'Total night calls', 'Total night charge', 'Total intl minutes',
       'Total intl calls', 'Total intl charge', 'Customer service calls',
       'Churn'],
      dtype='object')

We can use the info method to see general information about the dataframe:

print(df.info())
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 20 columns):
State                     3333 non-null object
Account length            3333 non-null int64
Area code                 3333 non-null int64
International plan        3333 non-null object
Voice mail plan           3333 non-null object
Number vmail messages     3333 non-null int64
Total day minutes         3333 non-null float64
Total day calls           3333 non-null int64
Total day charge          3333 non-null float64
Total eve minutes         3333 non-null float64
Total eve calls           3333 non-null int64
Total eve charge          3333 non-null float64
Total night minutes       3333 non-null float64
Total night calls         3333 non-null int64
Total night charge        3333 non-null float64
Total intl minutes        3333 non-null float64
Total intl calls          3333 non-null int64
Total intl charge         3333 non-null float64
Customer service calls    3333 non-null int64
Churn                     3333 non-null bool
dtypes: bool(1), float64(8), int64(8), object(3)
memory usage: 498.1+ KB
None

bool, int64, float64 and object are the feature types. We see that 1 feature is logical (bool), 3 features are of type object, and 16 are numeric. The info method also makes it easy to spot missing values in the data: in our case there are none, since each column contains 3333 observations.
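
If you specifically want to count missing values, a more direct check (a small sketch, not from the original article) is:

# number of missing values in each column; all zeros for this dataset
df.isnull().sum()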

We can change a column's type with the astype method. Let's apply this method to the Churn feature to convert it into int64:

df['Churn'] = df['Churn'].astype('int64')

The describe method shows the basic statistical characteristics of the data for each numerical feature (int64 and float64 types): the number of non-missing values, mean, standard deviation, range, median, and the 0.25 and 0.75 quartiles.

df.describe()


In order to see statistics on non-numerical features, one needs to explicitly indicate the types of interest in the include parameter.

df.describe(include=['object', 'bool'])
State Area code International plan Voice mail plan Churn
count 3333 3333 3333 3333 3333
unique 51 3 2 2 2
top WV 415 No No False
freq 106 1655 3010 2411 2850

For categorical (type `object`) and boolean (type `bool`) features we can use the **`value_counts`** method. Let's have a look at the distribution of our target variable, `Churn`:
df['Churn'].value_counts()
0    2850
1     483
Name: Churn, dtype: int64

2850 out of 3333 users are loyal; their Churn value is 0.

Let's have a look at the distribution of users over the Area code variable. We will set the parameter normalize=True to see relative frequencies instead of absolute ones.

df['Area code'].value_counts(normalize=True)
415    0.496550
510    0.252025
408    0.251425
Name: Area code, dtype: float64

Sorting

A DataFrame can be sorted by the value of one of its variables (i.e., columns). In our case, for example, we can sort by Total day charge (ascending=False sorts in descending order):

df.sort_values(by='Total day charge', 
        ascending=False).head()


We can also sort by a group of columns:

df.sort_values(by=['Churn', 'Total day charge'],
        ascending=[True, False]).head()


Thanks to @makkos for pointing out the deprecated sort method.


Indexing and retrieving data

A DataFrame can be indexed in different ways. With that in mind, let's consider various ways of indexing and retrieving the data we need from the dataframe, using simple questions as examples.

To retrieve a single column, you can use the DataFrame['Name'] construct. Let's use this to answer a question: what is the proportion of disloyal users in our dataframe?

df['Churn'].mean()
0.14491449144914492

14.5% is quite a bad churn rate for a company; with such a rate it is possible to go bankrupt.

Logical indexing of a DataFrame by one column is also very convenient. It looks like this: df[P(df['Name'])], where P is some logical condition that is checked for each element of the Name column. The result of such indexing is a DataFrame consisting only of the rows that satisfy the condition P on the Name column.

Let's use it to answer the question: what are the average values of the numerical features for disloyal users?

df[df['Churn'] == 1].mean()
Account length            102.664596
Number vmail messages       5.115942
Total day minutes         206.914079
Total day calls           101.335404
Total day charge           35.175921
Total eve minutes         212.410145
Total eve calls           100.561077
Total eve charge           18.054969
Total night minutes       205.231677
Total night calls         100.399586
Total night charge          9.235528
Total intl minutes         10.700000
Total intl calls            4.163561
Total intl charge           2.889545
Customer service calls      2.229814
Churn                       1.000000
dtype: float64

Let's combine the two methods above to answer the following question: how much time do disloyal users spend on the phone, on average, during the daytime?

df[df['Churn'] == 1]['Total day minutes'].mean()
206.91407867494823

What is the maximum length of international calls among loyal users (Churn == 0) who do not use the international roaming service ('International plan' == 'No')?

df[(df['Churn'] == 0) & (df['International plan'] == 'No')]['Total intl minutes'].max()
18.899999999999999

DataFrames can be indexed by column or row name, or by their serial (positional) number. The loc method is used for indexing by name, iloc for indexing by number.

In the first case we say "give us the values of the rows with index from 0 to 5 (inclusive, since loc includes both ends of the slice) for the columns from State to Area code", and in the second - "give us the values of the first five rows in the first three columns".

df.loc[0:5, 'State':'Area code']
State Account length Area code
0 KS 128 415
1 OH 107 415
2 NJ 137 415
3 OH 84 408
4 OK 75 415
5 AL 118 510
df.iloc[0:5, 0:3]
State Account length Area code
0 KS 128 415
1 OH 107 415
2 NJ 137 415
3 OH 84 408
4 OK 75 415

If we need the first or last row of the dataframe, we can use the `df[:1]` or `df[-1:]` construct:
df[-1:]


Applying Functions to Cells, Columns and Rows

To apply a function to each column, use `apply`:

df.apply(np.max) 
State                        WY
Account length              243
Area code                   510
International plan          Yes
Voice mail plan             Yes
Number vmail messages        51
Total day minutes         350.8
Total day calls             165
Total day charge          59.64
Total eve minutes         363.7
Total eve calls             170
Total eve charge          30.91
Total night minutes         395
Total night calls           175
Total night charge        17.77
Total intl minutes           20
Total intl calls             20
Total intl charge           5.4
Customer service calls        9
Churn                      True
dtype: object

The `apply` method can also be used to apply a function to each row. To do this, specify `axis=1`.
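
For instance, here is a hedged sketch of a row-wise apply (not from the original notebook):

# apply a function to each row (axis=1): total talk time across day, evening and night
df.apply(lambda row: row['Total day minutes']
                     + row['Total eve minutes']
                     + row['Total night minutes'],
         axis=1).head()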

To apply a function to each cell of a column, use `map`.

For example, the map method can be used to replace values in a column by passing it a dictionary of the form {old_value: new_value} as an argument:

d = {'No' : False, 'Yes' : True}
df['International plan'] = df['International plan'].map(d)
df.head()


The same thing can be done with the replace method:

df = df.replace({'Voice mail plan': d})
df.head()


Grouping

In general, grouping data in Pandas works as follows:

df.groupby(by=grouping_columns)[columns_to_show].function()
  1. First, the groupby method is applied to the DataFrame; it splits the data according to grouping_columns - one or several variables.
  2. Then, the columns of interest are selected (columns_to_show).
  3. Finally, one or several functions are applied to the obtained groups.
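
For instance, the simplest version of this pattern (a minimal sketch, not from the original notebook) uses a single aggregating function:

# average daytime talk time for loyal (Churn = 0) vs churned (Churn = 1) clients
df.groupby('Churn')['Total day minutes'].mean()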

Here is how we group the data according to the value of the Churn variable and display statistics for three columns in each group:

columns_to_show = ['Total day minutes', 'Total eve minutes', 'Total night minutes']

df.groupby(['Churn'])[columns_to_show].describe(percentiles=[])
Total day minutes Total eve minutes Total night minutes
Churn
0 count 2850.000000 2850.000000 2850.000000
mean 175.175754 199.043298 200.133193
std 50.181655 50.292175 51.105032
min 0.000000 0.000000 23.200000
50% 177.200000 199.600000 200.250000
max 315.600000 361.800000 395.000000
1 count 483.000000 483.000000 483.000000
mean 206.914079 212.410145 205.231677
std 68.997792 51.728910 47.132825
min 0.000000 70.900000 47.400000
50% 217.600000 211.300000 204.800000
max 350.800000 363.700000 354.900000

Let's do the same thing slightly differently by passing a list of functions to `agg`:
columns_to_show = ['Total day minutes', 'Total eve minutes', 'Total night minutes']

df.groupby(['Churn'])[columns_to_show].agg([np.mean, np.std, np.min, np.max])
Total day minutes Total eve minutes Total night minutes
mean std amin amax mean std amin amax mean std amin amax
Churn
0 175.175754 50.181655 0.0 315.6 199.043298 50.292175 0.0 361.8 200.133193 51.105032 23.2 395.0
1 206.914079 68.997792 0.0 350.8 212.410145 51.728910 70.9 363.7 205.231677 47.132825 47.4 354.9

Summary tables

Suppose we want to see how the observations in our sample are distributed with respect to two variables - Churn and International plan. To do so, we can build a contingency table using the crosstab method:

pd.crosstab(df['Churn'], df['International plan'])
International plan No Yes
Churn
0 2664 186
1 346 137
pd.crosstab(df['Churn'], df['Voice mail plan'], normalize=True)
Voice mail plan No Yes
Churn
0 0.602460 0.252625
1 0.120912 0.024002

We can see that most of the users are loyal and do not use additional services (international roaming / voicemail).

Advanced Excel users will probably remember the pivot table feature. In Pandas, the pivot_table method is responsible for pivot tables; it takes the following parameters:

  • values – a list of variables for which to calculate statistics;
  • index – a list of variables by which to group the data;
  • aggfunc – which statistics to calculate for the groups: sum, mean, maximum, minimum or something else.

Let's take a look at the average number of day, evening and night calls for different area codes:

df.pivot_table(['Total day calls', 'Total eve calls', 'Total night calls'], ['Area code'], aggfunc='mean').head(10)
Total day calls Total eve calls Total night calls
Area code
408 100.496420 99.788783 99.039379
415 100.576435 100.503927 100.398187
510 100.097619 99.671429 100.601190

DataFrame transformations

Like many other things in Pandas, adding columns to a DataFrame can be done in several ways.

For example, suppose we want to calculate the total number of calls for all users. Let's create a total_calls Series object and insert it into the dataframe:

total_calls = df['Total day calls'] + df['Total eve calls'] + df['Total night calls'] + df['Total intl calls']
df.insert(loc=len(df.columns), column='Total calls', value=total_calls) 
# loc is the position (index) at which to insert the new column;
# len(df.columns) places it at the very end of the dataframe
df.head()


It is also possible to add a column from existing ones more easily, without creating an intermediate Series object:

df['Total charge'] = df['Total day charge'] + df['Total eve charge'] + df['Total night charge'] + df['Total intl charge']

df.head()


To delete columns or rows, use the drop method, passing it the required indexes and the value of the axis parameter (1 to delete columns, 0 or nothing to delete rows):

df = df.drop(['Total charge', 'Total calls'], axis=1) # get rid of the just created columns

df.drop([1, 2]).head() # and here’s how you can delete rows


4. First Tries to Predict Churn Rate

Let's see how churn rate is connected with the International plan variable. We will do this using a crosstab contingency table and a Seaborn visualization (we will cover how to build such plots and how to read them in the next article).

pd.crosstab(df['Churn'], df['International plan'], margins=True)
International plan False True All
Churn
0 2664 186 2850
1 346 137 483
All 3010 323 3333


We see that when roaming is switched on, the churn rate is much higher - an interesting observation! Perhaps large and poorly controlled roaming expenses are very conflict-prone and lead to dissatisfaction among the telecom operator's customers and, accordingly, to their churn.

Next, let's look at another important feature - the number of calls to the service center (Customer service calls). Let's also make a summary table and a plot.

pd.crosstab(df['Churn'], df['Customer service calls'], margins=True)
Customer service calls 0 1 2 3 4 5 6 7 8 9 All
Churn
0 605 1059 672 385 90 26 8 4 1 0 2850
1 92 122 87 44 76 40 14 5 1 2 483
All 697 1181 759 429 166 66 22 9 2 2 3333


Perhaps it is not so obvious from the summary table (or it is tedious to scan rows of numbers), but the plot clearly shows that the churn rate increases sharply starting from 4 calls to the service center.

Let's now add a binary feature to our DataFrame - the result of the comparison Customer service calls > 3. And once again, let's see how it relates to churn.

df['Many_service_calls'] = (df['Customer service calls'] > 3).astype('int')

pd.crosstab(df['Many_service_calls'], df['Churn'], margins=True)
Churn 0 1 All
Many_service_calls
0 2721 345 3066
1 129 138 267
All 2850 483 3333


Let's combine the conditions from above and build a contingency table for this combination and churn.

pd.crosstab(df['Many_service_calls'] & df['International plan'] , df['Churn'])
Churn 0 1
row_0
False 2841 464
True 9 19

Therefore, by predicting customer loyalty when the number of calls to the service center is less than 4 and roaming is not enabled (and predicting churn otherwise), we can expect an accuracy of about 85.8% (we are wrong only 464 + 9 times out of 3333). This 85.8%, obtained with very simple reasoning, is a good starting point (*baseline*) for the machine learning models that we will build later.
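
As a sanity check, here is a small sketch (not in the original notebook) that reproduces this number, along with the naive "always loyal" baseline discussed below, directly from the dataframe:

# naive baseline: always predict "loyal"; accuracy equals the share of loyal clients
print(1 - df['Churn'].mean())  # ~0.855

# simple-rule baseline matching the crosstab above:
# predict churn only when both Many_service_calls and International plan are on
simple_pred = (df['Many_service_calls'].astype(bool) & df['International plan']).astype('int64')
print((simple_pred == df['Churn']).mean())  # ~0.858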

In general, before the advent of machine learning, the data analysis process looked something like this. Let's recap:

  • The share of loyal clients in the sample is 85.5%. The most naive model that always predicts a "loyal customer" will be right in about 85.5% of cases on such data. That is, the accuracy (proportion of correct answers) of subsequent models should be no lower, and preferably significantly higher, than this number;
  • With a simple rule that can be expressed by the formula "International plan = False & Customer service calls < 4 => Churn = 0, else Churn = 1", we can expect an accuracy of 85.8%, which is just above 85.5%. Later we will talk about decision trees and figure out how to find such rules automatically, based only on the input data;
  • We obtained these two baselines without any machine learning, and they will serve as the starting point for our subsequent models. If it turns out that after enormous effort we only increase the share of correct answers by, say, 0.5%, then perhaps we are doing something wrong, and it would suffice to stick with a simple model with two conditions;
  • Before training complex models, it is recommended to poke around the data a little and check simple hypotheses. Moreover, in business applications of machine learning, one usually starts with simple solutions and only then experiments with more complex ones.

5. Homework #1

The first homework assignment is devoted to the analysis of the Adult dataset from the UCI repository and is presented in this Jupyter notebook. The dataset contains demographic information about US residents.

We suggest that you complete this task and then answer 10 questions. Here is the link to the answer form (you can also find it in the notebook). Answers in the form can be changed after submission, but not after the deadline (well, technically it is possible even after the deadline, but the answers will not be credited).

Hard deadline: March 6 23:59

6. Useful links and resources

The article was written in collaboration with @yorko (Yuri Kashitski) and with the support of the ODS community.