This repository contains notebooks to get started with predictive analytics using scikit-learn and pandas.
This material is strongly inspired by the EuroPython 2014 scikit-learn tutorial by:
- Olivier Grisel @ogrisel | http://ogrisel.com
- Gael Varoquaux @GaelVaroquaux | http://gael-varoquaux.info

which was in turn inspired by http://github.com/jakevdp/sklearn_scipy2013 by Jake VanderPlas @jakevdp | http://jakevdp.github.com
This tutorial will require recent installations of numpy, scipy, matplotlib, scikit-learn, pandas and Pillow (or PIL).
For users who do not yet have these packages installed, a relatively painless way to install all the requirements is to use a distribution such as Anaconda, which can be downloaded and installed for free.
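To confirm that the required packages can be imported before the tutorial, a small check along these lines may help. It is only a sketch: the helper name `check_packages` is made up for illustration and is not part of the tutorial materials. Note that scikit-learn is imported as `sklearn` and Pillow as `PIL`.

```python
import importlib

# Import names for the packages listed above (assumption: default import
# names; scikit-learn imports as "sklearn", Pillow as "PIL").
REQUIRED = ["numpy", "scipy", "matplotlib", "sklearn", "pandas", "PIL"]


def check_packages(names):
    """Map each import name to its version string, or None if it is missing.

    Hypothetical helper for illustration, not part of the tutorial code.
    """
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            found[name] = None
        else:
            # Not every module exposes __version__, so fall back to "unknown".
            found[name] = getattr(module, "__version__", "unknown")
    return found
```

Calling `check_packages(REQUIRED)` in a Python session returns a dictionary; any entry whose value is `None` points to a package that still needs to be installed.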
Please download in advance the datasets mentioned in Data Downloads.
The recommended way to access the materials is to execute them in the
IPython/Jupyter notebook. If you have the notebook installed, you should
download the materials (see below), go to the notebooks directory, and
launch the notebook from there by typing:
cd notebooks
jupyter notebook # ipython notebook if old version
in your terminal window. This will open a notebook panel in your web browser.
I would highly recommend using git, not only for this tutorial, but for the general betterment of your life. Once git is installed, you can clone the material in this tutorial using the git address shown above.
If you can't or don't want to install git, there is a link above to download the contents of this repository as a zip file. I may make minor changes to the repository in the days before the tutorial, however, so cloning the repository is a much better option.
The data for this tutorial is not included in the repository. We will be
using several data sets during the tutorial: most are built into
scikit-learn, which includes code that automatically downloads and
caches them. Because the wireless network at conferences can often
be spotty, it would be a good idea to download these data sets before
arriving at the conference. You can do so by running the fetch_data.py
script included in the tutorial materials.
You will also need:
https://dl.dropboxusercontent.com/u/2140486/data/titanic_train.csv
https://dl.dropboxusercontent.com/u/2140486/data/adult_train.csv
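
The two files above can be saved with a browser, or fetched with a short script such as the sketch below. The helper name `download_if_missing` and the `datasets` target directory are assumptions made for illustration; they are not taken from the tutorial materials, so adjust the directory to wherever the notebooks expect the CSV files. The sketch also assumes the URLs are still reachable.

```python
import os
import urllib.request

# URLs copied from the list above.
DATA_URLS = [
    "https://dl.dropboxusercontent.com/u/2140486/data/titanic_train.csv",
    "https://dl.dropboxusercontent.com/u/2140486/data/adult_train.csv",
]


def download_if_missing(url, target_dir="datasets"):
    """Download url into target_dir unless a file with that name already exists.

    Hypothetical helper; target_dir is an assumed location, not one defined
    by the tutorial.
    """
    os.makedirs(target_dir, exist_ok=True)
    # Use the last path component of the URL as the local file name.
    filename = url.rsplit("/", 1)[-1]
    local_path = os.path.join(target_dir, filename)
    if not os.path.exists(local_path):
        urllib.request.urlretrieve(url, local_path)
    return local_path
```

Looping over `DATA_URLS` and calling `download_if_missing(url)` on each fetches both files once; later runs skip files that are already present, which is useful on an unreliable network.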