How-To Use Python to Remove or Modify Empty Values in a CSV Dataset

Problem

Data sets are not perfect. Sometimes they end up with invalid, corrupt, or missing values. For the project I was working on, I could not have any values that are null or empty. This How-To will walk you through writing a simple Python script to see if your data set has null or empty values, and if so, it will propose two options for how to modify your data.

Example

This simple data set shows you a flight and tells you its airline, flight number, and the reason it was cancelled. However, if a flight wasn't cancelled, it will have no cancelled reason, and therefore has a null/empty value.

Example of CSV missing data in excel - there is no "Cancelled Reason" for AwesomeAir flight 456:

Example of same CSV missing data in traditional comma-separated format - there is no "Cancelled Reason" for AwesomeAir flight 456:

Dependencies

Python
Python Data Analysis Library (Pandas)
A CSV dataset
Optional: iPython interactive shell

Disclaimers

This can significantly increase the size of your data set, because you are adding values to it. With large data sets, the pandas commands can take time.
These may not be the best solutions for your data. For more information on other ways to handle missing data with pandas, please refer to Handling missing data

Other Resources

I used this kaggle data set to perform these operations, and it includes some more instructions on using Pandas and other Python libraries to explore your data

Steps

The following steps can be written in a Python script and run at once, but I find it more interesting to explore in an interactive Python shell like iPython

If you want to see what the scripts look like all together, please check out Solution 1 and Solution 2. Otherwise, keep reading and follow along step by step.

Import pandas import pandas as pd
Import csv into a Pandas DataFrame object flights = pd.read_csv('flights.csv')
Check the shape of your data in (rows, columns) format flights.shape
(Optional) Check for all null values in your dataset. This will return a boolean stating if each cell is null. This can take a long time and may not be particularly useful in a very large dataset. flights.isnull()
Explore how many null values are in each column of your dataset flights.isnull().sum()
(Optional) Check how many null values are in a specific column, substituting the name of your column in string form where it says 'col' flights[col].isnull().sum()

At this point, you will either replace your values with a space or remove them entirely

Solution 1: Replace empty/null values with Space

Fill all null or empty cells in your original DataFrame with an empty space and set that to a new DataFrame variable, here, called 'modifiedFlights'*. modifiedFlights=flights.fillna(“ “)
Verify that you no longer have any null values by running modifiedFlights.isnull().sum()
Save your modified dataset to a new CSV, replacing 'modifiedFlights.csv' with whatever you would like to name your new file. modifiedFlights.to_csv('modifiedFlights.csv',index=False)

*If you wish, you can replace your original DataFrame, using flights=flights.fillna(" ")

Solution 2: Remove rows with empty values

If there are only a few null values and you know that deleting values will not cause adverse effects on your result, remove them from your DataFrame and store that in a new DataFrame* modifiedFlights = flights.dropna()
Verify that you no longer have any null values by running modifiedFlights.isnull().sum()
Save your modified dataset to a new CSV, replacing 'modifiedFlights.csv' with whatever you would like to name your new file. modifiedFlights.to_csv('modifiedFlights.csv',index=False)

*If you wish, you can replace your original DataFrame, using flights=flights.dropna()

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

How-To Use Python to Remove or Modify Empty Values in a CSV Dataset

Problem

Example

Dependencies

Disclaimers

Other Resources

Steps

Solution 1: Replace empty/null values with Space

Solution 2: Remove rows with empty values

Files

README.md

Latest commit

History

README.md

File metadata and controls

How-To Use Python to Remove or Modify Empty Values in a CSV Dataset

Problem

Example

Dependencies

Disclaimers

Other Resources

Steps

Solution 1: Replace empty/null values with Space

Solution 2: Remove rows with empty values