field | value
---|---
title | Data Manipulation in Pandas and Python
cover title | Pandas
description | In this workshop, we are going to learn some basic commands in Pandas, an expansive Python library for working with tabular data like CSV files. You can think of Pandas as a more powerful version of Excel that operates within the Python environment, where you can wrangle, clean, analyze, and visualize data. Knowing how to use Pandas is important if you plan on working with datasets that include qualitative and/or quantitative data points.
programming_language | jupyter
In this workshop, we are going to use JupyterLite as our Python environment, which is a lightweight version of Jupyter Notebooks.
You can think of Jupyter Notebooks as a comprehensive toolset for interactive computing and data exploration. It's like a fully equipped research laboratory where you can seamlessly blend narrative text, code, and visualizations. JupyterLite is a streamlined, lightweight version of this toolset, offering a more agile and accessible environment.
Jupyter Notebooks - Full Research Lab:
- Versatility: Jupyter Notebooks are like a fully equipped research lab. They support various programming languages, allowing you to conduct diverse computational experiments.
- Interactivity: Much like a lab where you can actively engage with your experiments, Jupyter Notebooks let you interactively run code cells, observe results, and refine your analyses on the fly.
- Rich Outputs: Visualizations, graphs, and other dynamic outputs enrich your narrative, creating a comprehensive and visually engaging research document.
JupyterLite - Portable Research Kit:
- Simplicity: JupyterLite is a simplified, more portable version. It's akin to having a compact research kit that allows you to conduct essential experiments without the full complexity of the lab.
- Accessibility: When you need a quick, accessible environment for coding and exploration, JupyterLite provides a more straightforward setup. It's like having your essential tools ready without the need for a fully equipped lab.
Learning and On-the-Go:
- Educational Tool: JupyterLite is often used for educational purposes or quick experiments. It serves as a friendly entry point for those learning about interactive computing without overwhelming them with the extensive features of the full Jupyter Notebooks.
- Portability: JupyterLite is convenient for situations where the full-scale Jupyter environment might be impractical or unnecessary. It's like having a research companion that fits into your backpack.
In essence, JupyterLite is a more nimble version of Jupyter Notebooks, suitable for scenarios where a lighter computational environment is sufficient. It shares the same core principles of interactive computing but in a more accessible and portable form, making it a valuable tool for quick experiments, learning, and situations where resource constraints or simplicity are paramount.
- Jupyter Notebook: A comprehensive toolset for interactive computing and data exploration, blending narrative text, code, and visualizations.
- JupyterLite: A lightweight version of Jupyter Notebook.
Keeping all your files for a particular project in a designated file directory will keep your project organized and will make it easier to read in your files.
Let’s create a “pandas_workshop” folder in our JupyterLite code editor workspace:
- Click on "Open Code Editor" in the top right corner of this browser window.
- In the folders toolbar, click the icon for the folder with the '+' on it. This will create a new "Untitled Folder."
- To rename the folder, control-click the folder name (macOS) or right-click it (Windows), and select "Rename" from the drop-down menu.
- Rename the folder: “pandas_workshop”.
We’ve saved all of the code for this section in a Jupyter Notebook file.
- To access this file, click the following button:
- Open the code editor in the top right corner and locate the `pandas_workshop_2024.ipynb` file.
- When prompted, select the "Python (Pyodide)" kernel.
- Finally, drag and drop the `pandas_workshop_2024.ipynb` file into your `pandas_workshop` folder.
- Once it's in that folder, you should be all set!
In this file you will find all of the workshop commands and the expected outputs. If you ever feel stuck or can't advance in the workshop, you can open this file and see how we did it.
For the best possible experience, we encourage you to:
- Follow the workshop typing all the code yourself.
- Avoid copying/pasting the code. Much of the learning comes from typing the code yourself.
- Only check the pandas_workshop_2024.ipynb file if you get lost or are not able to get the right output. Before opening it, spend some time trying to figure out on your own why it isn't working. A big part of coding is learning to identify what we are doing wrong.
- We also caution you against working with both files open at the same time. It is easy to get confused and start modifying the wrong one. That said, these are only suggestions. Maybe they will work for you, maybe they won't, so feel free to do what suits you best. You are in charge here!
Finally, let's launch a new JupyterLite Notebook from within the “pandas_workshop” folder.
- Make sure you are in the “pandas_workshop” folder. If not, navigate there.
- From the launcher window, click the "Python (Pyodide)" button
- A new JupyterLite Notebook will open. To rename the notebook, control-click the notebook name (macOS) or right-click it (Windows), and select "Rename" from the drop-down menu.
- Rename the notebook: “pandas_workshop”.
Hooray! You are all set to move forward!
- Pandas is a Python software library
- The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals
- You can think of it as a FREE equivalent to Stata or SPSS for data manipulation and analysis. It is also more powerful than Excel
- Knowing how to use Pandas is important if you plan on working with data organized in spreadsheets (either quantitative or qualitative)
In the first blank cell, type the following command to import the Pandas library into our Jupyter Notebook. To run the command, you can click the "Run" button in the top toolbar, or you can press shift + return.
import pandas as pd
import matplotlib
The first import statement not only imports the Pandas library but also gives it the alias “pd.” Using this alias will save us from having to type out the entire word “Pandas” each time we need to use it. Libraries are sets of instructions that Python can use to perform specialized functions.
The second statement imports the Matplotlib library, which is a plotting library for Python and Pandas. We will use Matplotlib to create visualizations of our data later in the workshop.
If you don't see an error when you run the notebook (that is, if there is no output), you can move on to the next step. In programming, it is not rare for doing things right to result in nothing visibly happening. This is what we like to call a silent success.
- Import: In Python, the "import" keyword is used to bring external modules or libraries into your script, making their functionality available for use. Once imported, you can access the functions, classes, or variables defined in the module using the module's name as a prefix.
- Silent Success: In Python, a "silent success" typically refers to a successful operation or function call that does not produce any visible output or raise explicit errors. It implies that the operation was completed successfully, but the result might not be explicitly displayed, making it important for developers to check return values or logs to confirm the success.
Next, we will read in our dataset saved as a CSV file. We will specifically work with the refugee-arrivals-by-destination.csv dataset, which contains data about refugee arrivals to the United States between 2005 and 2015, including the number of refugees who arrived in each U.S. city and state, the year that they arrived, and the country from which they arrived.
We will connect to the dataset directly from this GitHub link:
import pyodide_http # makes it possible to read https links in pyodide
pyodide_http.patch_all()
url = 'https://raw.githubusercontent.com/GC-DRI/DRI24/main/uploads/pandas/refugee-arrivals-by-destination.csv'
refugee_df = pd.read_csv(url, delimiter=",", encoding='utf-8')
- With this command, we are creating a Pandas DataFrame object, which is a 2-dimensional labeled data structure with columns of different types. You can think of it like a spreadsheet or SQL table, or a dictionary of Series objects.
- It is common practice to abbreviate DataFrame with “df”, as in refugee_df.
- Delimiter: In a CSV (Comma-Separated Values) file, a delimiter is a character or sequence of characters that separates individual fields or data elements within each row. Common delimiters include commas, tabs, or semicolons, and they enable the proper parsing of data when reading or writing CSV files.
- Encoding: In the context of text data, encoding refers to the method used to represent characters as binary data. UTF-8, for example, is a widely used encoding that can represent most characters in the Unicode standard, allowing for the storage and transmission of text in multiple languages.
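As an aside, here is a small, hypothetical sketch of how those two parameters change when a file uses a different delimiter or encoding; the file names below are invented for illustration:

```python
# Hypothetical examples, not part of this workshop's dataset:
# a tab-separated file uses '\t' as its delimiter...
tsv_df = pd.read_csv('example.tsv', delimiter='\t', encoding='utf-8')

# ...and a file exported from older software might use Latin-1 encoding.
legacy_df = pd.read_csv('example.csv', delimiter=',', encoding='latin-1')
```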
What is the primary data structure in Pandas for working with tabular data?

- Series
- DataFrame*
- Array
- List

What is a DataFrame?

- A two-dimensional data structure*
- A three-dimensional data structure
- A single-dimensional data structure
- A series object

In the following section we will learn how to view the DataFrame in addition to viewing smaller sections of it.
To display the DataFrame, we can run a cell with the variable name of the DataFrame:
refugee_df
year | origin | dest_state | dest_city | arrivals | |
---|---|---|---|---|---|
0 | 2005 | Afghanistan | Alabama | Mobile | 0 |
1 | 2006 | Afghanistan | Alabama | Mobile | 0 |
2 | 2007 | Afghanistan | Alabama | Mobile | 0 |
3 | 2008 | Afghanistan | Alabama | Mobile | 0 |
4 | 2009 | Afghanistan | Alabama | Mobile | 5 |
... | ... | ... | ... | ... | ... |
121240 | 2011 | Vietnam | Wisconsin | Milwaukee | 0 |
121241 | 2012 | Vietnam | Wisconsin | Milwaukee | 4 |
121242 | 2013 | Vietnam | Wisconsin | Milwaukee | 2 |
121243 | 2014 | Vietnam | Wisconsin | Milwaukee | 0 |
121244 | 2015 | Vietnam | Wisconsin | Milwaukee | 0 |
121245 rows × 5 columns
Let’s take a look at a few elements in this DataFrame:
- Index
- The bolded ascending numbers in the very left-hand column of the DataFrame are called the Pandas Index. You can select rows based on the Index.
- By default, the Index is a sequence of numbers starting with zero. However, you can change the Index to something else, such as one of the columns in your dataset.
- The default Index is a unique ID, but it does not have to be: you can set your Index to any column in your DataFrame.
- Truncation
- The DataFrame is truncated, signaled by the ellipses (...) in the middle of every column.
- The DataFrame is truncated because, by default, Pandas displays only 60 rows and 20 columns. You can change these default settings (see the Pandas documentation, and the sketch after this list).
- Rows x Columns
- Pandas reports how many rows and columns are in this dataset at the bottom of the output. Our DataFrame has 121,245 rows × 5 columns.
- NAN
- NaN is the Pandas value for any missing data.
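To make the Index and truncation points concrete, here is a short sketch. `pd.set_option()` and `.set_index()` are standard Pandas calls, but the particular values below are just examples, not workshop steps:

```python
# Show up to 100 rows before Pandas truncates the display (the default is 60).
pd.set_option('display.max_rows', 100)

# Use the 'year' column as the Index instead of the default 0, 1, 2, ...
# This returns a new DataFrame; refugee_df itself is unchanged.
refugee_df.set_index('year')

# Restore the default display setting.
pd.reset_option('display.max_rows')
```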
We can also display the first 2 rows of the DataFrame with the `.head()` method:
refugee_df.head(2)
year | origin | dest_state | dest_city | arrivals | |
---|---|---|---|---|---|
0 | 2005 | Afghanistan | Alabama | Mobile | 0 |
1 | 2006 | Afghanistan | Alabama | Mobile | 0 |
We can accomplish the same task using a slice instead:
refugee_df[:2]
year | origin | dest_state | dest_city | arrivals | |
---|---|---|---|---|---|
0 | 2005 | Afghanistan | Alabama | Mobile | 0 |
1 | 2006 | Afghanistan | Alabama | Mobile | 0 |
We can also display the last 10 rows of the DataFrame with the `.tail()` method:
refugee_df.tail(10)
year | origin | dest_state | dest_city | arrivals | |
---|---|---|---|---|---|
121235 | 2006 | Vietnam | Wisconsin | Milwaukee | 5 |
121236 | 2007 | Vietnam | Wisconsin | Milwaukee | 18 |
121237 | 2008 | Vietnam | Wisconsin | Milwaukee | 12 |
121238 | 2009 | Vietnam | Wisconsin | Milwaukee | 4 |
121239 | 2010 | Vietnam | Wisconsin | Milwaukee | 7 |
121240 | 2011 | Vietnam | Wisconsin | Milwaukee | 0 |
121241 | 2012 | Vietnam | Wisconsin | Milwaukee | 4 |
121242 | 2013 | Vietnam | Wisconsin | Milwaukee | 2 |
121243 | 2014 | Vietnam | Wisconsin | Milwaukee | 0 |
121244 | 2015 | Vietnam | Wisconsin | Milwaukee | 0 |
We can accomplish the same task using a slice instead:
refugee_df[-10:]
year | origin | dest_state | dest_city | arrivals | |
---|---|---|---|---|---|
121235 | 2006 | Vietnam | Wisconsin | Milwaukee | 5 |
121236 | 2007 | Vietnam | Wisconsin | Milwaukee | 18 |
121237 | 2008 | Vietnam | Wisconsin | Milwaukee | 12 |
121238 | 2009 | Vietnam | Wisconsin | Milwaukee | 4 |
121239 | 2010 | Vietnam | Wisconsin | Milwaukee | 7 |
121240 | 2011 | Vietnam | Wisconsin | Milwaukee | 0 |
121241 | 2012 | Vietnam | Wisconsin | Milwaukee | 4 |
121242 | 2013 | Vietnam | Wisconsin | Milwaukee | 2 |
121243 | 2014 | Vietnam | Wisconsin | Milwaukee | 0 |
121244 | 2015 | Vietnam | Wisconsin | Milwaukee | 0 |
- .head() & .tail() In Pandas, the .head() method is used to display the first few rows of a DataFrame, providing a quick overview of its structure, while the .tail() method shows the last few rows. Both methods help users inspect data frames efficiently.
We can also look at a random sample of data with the `.sample()` method:
refugee_df.sample(15)
year | origin | dest_state | dest_city | arrivals | |
---|---|---|---|---|---|
102329 | 2009 | Dem. Rep. Congo | Texas | Cedar Park | 0 |
40110 | 2006 | Ukraine | Illinois | Woodridge | 0 |
45485 | 2013 | Cuba | Louisiana | Avondale | 2 |
62900 | 2015 | Liberia | Minnesota | Brooklyn Center | 1 |
46604 | 2010 | Iraq | Maine | Norway | 10 |
116748 | 2007 | Latvia | Washington | Lynnwood | 0 |
5039 | 2014 | Burma | California | Chula Vista | 0 |
66487 | 2005 | Iraq | Missouri | Saint Louis | 0 |
108591 | 2012 | Iraq | Vermont | Essex Junction | 0 |
94980 | 2008 | Russia | Pennsylvania | Warminster | 2 |
4690 | 2006 | Bhutan | California | Alameda | 0 |
72042 | 2005 | Liberia | New Jersey | Englewood | 0 |
66931 | 2009 | Somalia | Missouri | Kansas City | 141 |
91265 | 2011 | Belarus | Pennsylvania | Millersburg | 0 |
2077 | 2011 | Iraq | Arizona | Goodyear | 1 |
We can tell it's a random sample since the index numbers are out of order.

- .sample(): In Pandas, the `.sample()` method is used to randomly select a specified number of rows from a DataFrame. It provides a way to obtain a random subset of data for exploratory analysis or testing purposes.

Using both the `.head()` method and a slice, write code to display the top 9 rows of the DataFrame:
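One possible answer, following the pattern of the other solutions in this workshop:

```python
refugee_df.head(9)
refugee_df[:9]
```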
Write code to display the last 5 rows of the DataFrame (there are two ways to do this):

```python
refugee_df.tail(5)
refugee_df[-5:]
```

Write code to display a random selection of 10 rows in the DataFrame:

```python
refugee_df.sample(10)
```

What is the purpose of the `.head()` method in Pandas?

Which of the following commands would return the top three rows in a dataframe? Select all that apply:

- `refugee_df[0:3]`*
- `refugee_df[0:4]`
- `refugee_df[0:2]`
- `refugee_df.head(3)`*

In the following sections we will review some basic data cleaning steps, including checking and converting our data types, and checking for and removing duplicate rows.
Data cleaning is a crucial process for refining tabular data. Begin by addressing missing values. Tackle outliers to maintain data integrity and accuracy. Remove duplicates for a clean and unique dataset. Standardize formats for consistency, and finally, engage in feature engineering to enhance data insights and prepare the data for analysis. These steps collectively ensure a refined and reliable foundation for effective data exploration and interpretation.

We can get information about the columns in the DataFrame by using the `.info()` method.
refugee_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 121245 entries, 0 to 121244
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 121245 non-null int64
1 origin 121245 non-null object
2 dest_state 121245 non-null object
3 dest_city 121245 non-null object
4 arrivals 121245 non-null int64
dtypes: int64(2), object(3)
memory usage: 4.6+ MB
This report tells us how many non-null, or non-blank, values are in each column, as well as what type of data is in each column.
Pandas uses a different lexicon to describe data types from the one we learned in our intro to Python curriculum. Below is a table that explains what each data type means:
Pandas data types | Python data types | Usage
---|---|---
object | String or mixed | Text or mixed numeric and non-numeric values
float64 | float | Floating point numbers
int64 | integer | Integer numbers
datetime64 | NA | Date and time values
Keeping this in mind, it looks as though the data type for the year column is an "int64" instead of a "datetime64." Let's convert it:
refugee_df['year'] = pd.to_datetime(refugee_df['year'], format="%Y")
This command translates to: for the "year" column in the refugee_df DataFrame, use the `to_datetime` method in the Pandas library to convert the values to datetime data types. We specify the format as `%Y` since this is the format of our data.
We can then check to see if the data type was properly converted using the `.dtypes` attribute, which is similar to the `.info()` method, except it only provides information on data types.
refugee_df.dtypes
year datetime64[ns]
origin object
dest_state object
dest_city object
arrivals int64
dtype: object
As we can see, the data in the “year” column was successfully transformed into the datetime64 data type.
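As a small aside (not one of the workshop's own steps), once a column holds datetime64 values, Pandas exposes their components through the `.dt` accessor. For example, we could pull the year back out as an integer:

```python
# The .dt accessor works on datetime64 columns; .dt.year returns integers.
refugee_df['year'].dt.year
```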
- Datetime: In Pandas, the datetime data type represents date and time information. It allows for efficient handling, manipulation, and analysis of temporal data in a tabular format, offering functionalities for parsing, formatting, and performing operations on date and time values.

As part of our data cleaning process, we want to check for duplicate rows. We can do this by using the `.duplicated()` method inside a filter to isolate only the rows in the DataFrame that are exact duplicates. Filtering data by certain values is similar to selecting columns. We add the parameter `keep=False`, which will display all the duplicated values in the dataset (meaning none are considered unique), rather than just the first duplicated value (`keep='first'`) or the last duplicated value (`keep='last'`).
refugee_df[refugee_df.duplicated(keep=False)]
year | origin | dest_state | dest_city | arrivals | |
---|---|---|---|---|---|
13 | 2007-01-01 | Burma | Alabama | Auburn | 0 |
14 | 2008-01-01 | Burma | Alabama | Auburn | 0 |
15 | 2009-01-01 | Burma | Alabama | Auburn | 0 |
16 | 2010-01-01 | Burma | Alabama | Auburn | 0 |
17 | 2007-01-01 | Burma | Alabama | Auburn | 0 |
18 | 2008-01-01 | Burma | Alabama | Auburn | 0 |
19 | 2009-01-01 | Burma | Alabama | Auburn | 0 |
20 | 2010-01-01 | Burma | Alabama | Auburn | 0 |
155 | 2013-01-01 | Cuba | Alabama | Russellville | 0 |
156 | 2014-01-01 | Cuba | Alabama | Russellville | 0 |
157 | 2015-01-01 | Cuba | Alabama | Russellville | 0 |
158 | 2005-01-01 | Cuba | Alabama | Theodore | 0 |
159 | 2006-01-01 | Cuba | Alabama | Theodore | 0 |
160 | 2013-01-01 | Cuba | Alabama | Russellville | 0 |
161 | 2014-01-01 | Cuba | Alabama | Russellville | 0 |
162 | 2015-01-01 | Cuba | Alabama | Russellville | 0 |
163 | 2005-01-01 | Cuba | Alabama | Theodore | 0 |
164 | 2006-01-01 | Cuba | Alabama | Theodore | 0 |
236 | 2012-01-01 | Iran | Alabama | Mobile | 1 |
237 | 2013-01-01 | Iran | Alabama | Mobile | 6 |
238 | 2014-01-01 | Iran | Alabama | Mobile | 3 |
239 | 2015-01-01 | Iran | Alabama | Mobile | 5 |
240 | 2005-01-01 | Iran | Alabama | Pelham | 0 |
247 | 2012-01-01 | Iran | Alabama | Mobile | 1 |
248 | 2013-01-01 | Iran | Alabama | Mobile | 6 |
249 | 2014-01-01 | Iran | Alabama | Mobile | 3 |
250 | 2015-01-01 | Iran | Alabama | Mobile | 5 |
251 | 2005-01-01 | Iran | Alabama | Pelham | 0 |
Looks like we have a few duplicate rows in our dataset.
To remove those duplicates, we can use the `.drop_duplicates()` method, choosing whether to keep the first or the last instance of each duplicate:
refugee_df = refugee_df.drop_duplicates(keep='first')
We can check to see if the command got rid of the duplicate rows by running the .duplicated() method again:
refugee_df[refugee_df.duplicated(keep=False)]
year | origin | dest_state | dest_city | arrivals |
---|---|---|---|---
Great news! We successfully removed our duplicate rows!
- .duplicated(): The .duplicated() method in Pandas identifies duplicate rows in a DataFrame based on the values in all columns. When keep=False is specified, all occurrences of the duplicated rows are marked as True, meaning none are considered unique, and all duplicates are included in the result. If keep is set to 'first' or 'last', it retains the first or last occurrence of each duplicated set, respectively, while marking the rest as True.

What does the keep parameter in the .duplicated() method of Pandas control?

- It specifies the number of duplicates to keep.
- It determines whether to display all occurrences or just the first occurrence of duplicated rows.*
- It controls the sorting order of duplicated values.
- It determines whether to keep or remove duplicated rows.

In the following section we will learn how to rename, select, drop, filter, and add new columns.
To see a full list of the columns in our DataFrame, we can run the following command:
refugee_df.columns
Our DataFrame has relatively few columns, so seeing the full list is not absolutely necessary in our case. This step becomes important when you are working with DataFrames with many columns.
To improve the readability of our dataset, we can rename columns. In our case, let's rename "dest_state" as "state" and "dest_city" as "city". We will use the `.rename()` method with the `columns=` parameter. Note that in this case we are setting the DataFrame equal to the returned value of the method so as to save the results into the DataFrame.
refugee_df = refugee_df.rename(columns={'dest_state': 'state', 'dest_city': 'city'})
refugee_df
year | origin | state | city | arrivals | |
---|---|---|---|---|---|
0 | 2005-01-01 | Afghanistan | Alabama | Mobile | 0 |
1 | 2006-01-01 | Afghanistan | Alabama | Mobile | 0 |
2 | 2007-01-01 | Afghanistan | Alabama | Mobile | 0 |
3 | 2008-01-01 | Afghanistan | Alabama | Mobile | 0 |
4 | 2009-01-01 | Afghanistan | Alabama | Mobile | 5 |
... | ... | ... | ... | ... | ... |
121240 | 2011-01-01 | Vietnam | Wisconsin | Milwaukee | 0 |
121241 | 2012-01-01 | Vietnam | Wisconsin | Milwaukee | 4 |
121242 | 2013-01-01 | Vietnam | Wisconsin | Milwaukee | 2 |
121243 | 2014-01-01 | Vietnam | Wisconsin | Milwaukee | 0 |
121244 | 2015-01-01 | Vietnam | Wisconsin | Milwaukee | 0 |
121231 rows × 5 columns
Let’s say we wanted to view data from just one column in the DataFrame. To do this, we could run the following command:
refugee_df[['state']]
state | |
---|---|
0 | Alabama |
1 | Alabama |
2 | Alabama |
3 | Alabama |
4 | Alabama |
... | ... |
121240 | Wisconsin |
121241 | Wisconsin |
121242 | Wisconsin |
121243 | Wisconsin |
121244 | Wisconsin |
121231 rows × 1 columns
Here we use double brackets around the column name to transform the column from a Series object into a DataFrame. Basically, the interior brackets are for lists, and the outside brackets are indexing operators. If you are curious to see the difference, try the following command instead: refugee_df['state'].
To view additional columns at the same time, you can add them to the list within the square brackets, separated by a comma. However, you can’t select multiple columns as a Series (try: refugee_df['state','city'])
refugee_df[['state','city']]
state | city | |
---|---|---|
0 | Alabama | Mobile |
1 | Alabama | Mobile |
2 | Alabama | Mobile |
3 | Alabama | Mobile |
4 | Alabama | Mobile |
... | ... | ... |
121240 | Wisconsin | Milwaukee |
121241 | Wisconsin | Milwaukee |
121242 | Wisconsin | Milwaukee |
121243 | Wisconsin | Milwaukee |
121244 | Wisconsin | Milwaukee |
121231 rows × 2 columns
To remove a column from the DataFrame, we can use the `.drop()` method and include the column name. In our case, we could drop the "city" column and save the result as a new DataFrame, refugee_drop_city_df, so we don't override our original DataFrame.
refugee_drop_city_df = refugee_df.drop(columns="city")
refugee_drop_city_df
year | origin | state | arrivals | |
---|---|---|---|---|
0 | 2005-01-01 | Afghanistan | Alabama | 0 |
1 | 2006-01-01 | Afghanistan | Alabama | 0 |
2 | 2007-01-01 | Afghanistan | Alabama | 0 |
3 | 2008-01-01 | Afghanistan | Alabama | 0 |
4 | 2009-01-01 | Afghanistan | Alabama | 5 |
... | ... | ... | ... | ... |
121240 | 2011-01-01 | Vietnam | Wisconsin | 0 |
121241 | 2012-01-01 | Vietnam | Wisconsin | 4 |
121242 | 2013-01-01 | Vietnam | Wisconsin | 2 |
121243 | 2014-01-01 | Vietnam | Wisconsin | 0 |
121244 | 2015-01-01 | Vietnam | Wisconsin | 0 |
121231 rows × 4 columns
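Although the workshop only drops one column, `.drop()` also accepts a list of column names. A quick sketch (the choice of columns here is arbitrary):

```python
# Drop several columns at once by passing a list; again we save the result
# to a new variable so the original DataFrame stays untouched.
refugee_skinny_df = refugee_df.drop(columns=["city", "state"])
```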
We can filter a Pandas DataFrame to select only certain values. Filtering data by certain values is similar to selecting columns.
We type the name of the DataFrame followed by square brackets and then, instead of inserting a column name, we insert a True/False condition. For example, to select only rows that contain the value “Iraq”, we could run the following command:
refugee_iraq_df = refugee_df[refugee_df['origin'] == 'Iraq']
refugee_iraq_df
year | origin | state | city | arrivals | |
---|---|---|---|---|---|
267 | 2005-01-01 | Iraq | Alabama | Birmingham | 0 |
268 | 2006-01-01 | Iraq | Alabama | Birmingham | 0 |
269 | 2007-01-01 | Iraq | Alabama | Birmingham | 0 |
270 | 2008-01-01 | Iraq | Alabama | Birmingham | 6 |
271 | 2009-01-01 | Iraq | Alabama | Birmingham | 0 |
... | ... | ... | ... | ... | ... |
120503 | 2011-01-01 | Iraq | Wisconsin | Wauwatosa | 0 |
120504 | 2012-01-01 | Iraq | Wisconsin | Wauwatosa | 0 |
120505 | 2013-01-01 | Iraq | Wisconsin | Wauwatosa | 2 |
120506 | 2014-01-01 | Iraq | Wisconsin | Wauwatosa | 0 |
120507 | 2015-01-01 | Iraq | Wisconsin | Wauwatosa | 6 |
18205 rows × 5 columns
We stored our results in a new variable, `refugee_iraq_df`.
If we wanted to exclude all rows containing the value “Iraq”, we could run the following command:
refugee_drop_iraq_df = refugee_df[refugee_df['origin'] != 'Iraq']
refugee_drop_iraq_df
year | origin | state | city | arrivals | |
---|---|---|---|---|---|
0 | 2005-01-01 | Afghanistan | Alabama | Mobile | 0 |
1 | 2006-01-01 | Afghanistan | Alabama | Mobile | 0 |
2 | 2007-01-01 | Afghanistan | Alabama | Mobile | 0 |
3 | 2008-01-01 | Afghanistan | Alabama | Mobile | 0 |
4 | 2009-01-01 | Afghanistan | Alabama | Mobile | 5 |
... | ... | ... | ... | ... | ... |
121240 | 2011-01-01 | Vietnam | Wisconsin | Milwaukee | 0 |
121241 | 2012-01-01 | Vietnam | Wisconsin | Milwaukee | 4 |
121242 | 2013-01-01 | Vietnam | Wisconsin | Milwaukee | 2 |
121243 | 2014-01-01 | Vietnam | Wisconsin | Milwaukee | 0 |
121244 | 2015-01-01 | Vietnam | Wisconsin | Milwaukee | 0 |
103026 rows × 5 columns
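Filters can also be combined, although the workshop itself doesn't cover this. As a hedged sketch, conditions can be joined with & (and) or | (or), with each condition wrapped in its own parentheses:

```python
# Keep only rows where the origin is Iraq AND the destination state is
# California. Note the parentheses around each condition.
iraq_california_df = refugee_df[
    (refugee_df['origin'] == 'Iraq') & (refugee_df['state'] == 'California')
]
iraq_california_df
```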
We can also add columns to the DataFrame. For example, we can add a `percent_total` column to calculate the percentage of total refugee arrivals for each row.
refugee_df['percent_total'] = (refugee_df['arrivals'] / refugee_df['arrivals'].sum())*100
refugee_df
year | origin | state | city | arrivals | percent_total | |
---|---|---|---|---|---|---|
0 | 2005-01-01 | Afghanistan | Alabama | Mobile | 0 | 0.000000 |
1 | 2006-01-01 | Afghanistan | Alabama | Mobile | 0 | 0.000000 |
2 | 2007-01-01 | Afghanistan | Alabama | Mobile | 0 | 0.000000 |
3 | 2008-01-01 | Afghanistan | Alabama | Mobile | 0 | 0.000000 |
4 | 2009-01-01 | Afghanistan | Alabama | Mobile | 5 | 0.000743 |
... | ... | ... | ... | ... | ... | ... |
121240 | 2011-01-01 | Vietnam | Wisconsin | Milwaukee | 0 | 0.000000 |
121241 | 2012-01-01 | Vietnam | Wisconsin | Milwaukee | 4 | 0.000595 |
121242 | 2013-01-01 | Vietnam | Wisconsin | Milwaukee | 2 | 0.000297 |
121243 | 2014-01-01 | Vietnam | Wisconsin | Milwaukee | 0 | 0.000000 |
121244 | 2015-01-01 | Vietnam | Wisconsin | Milwaukee | 0 | 0.000000 |
121231 rows × 6 columns
Note: `refugee_df['arrivals'].sum()` calculates the sum of all the values in the arrivals column.
You can read the command we just ran as: create a new column that calculates the number of arrivals in a row divided by the total number of arrivals in the dataset, times 100. The result of this calculation will equal the percentage of total refugee arrivals for each row.
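As an optional sanity check (not a required workshop step), the new column should add up to roughly 100 percent across the whole DataFrame:

```python
# The percentages should sum to (approximately) 100.
refugee_df['percent_total'].sum()
```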
- .sum(): In Pandas, the .sum() method is used to calculate the sum of values along a specified axis in a DataFrame or Series. It can be applied to numeric columns, providing the total sum of the values in the specified axis.

Write code for selecting the year, origin, and arrivals columns:

```python
new_df = refugee_df[['year', 'origin', 'arrivals']]
```

Write code to drop all rows where the origin is Vietnam:

```python
no_vietnam_df = refugee_df[refugee_df['origin'] != 'Vietnam']
```

Write code to only keep rows where the arrival state is California:

```python
california_df = refugee_df[refugee_df['state'] == 'California']
```

How do you select a specific column in a Pandas DataFrame named "df"?

- df.select("column_name")
- df.column_name
- df.get_column("column_name")
- df[["column_name"]]*

What is the primary purpose of the == operator in Python?

- It assigns a value to a variable.
- It makes sure two values are different.
- It is used for mathematical addition.
- It checks if two values or expressions are equal.*

In the following section we will learn how to sort and group columns in order to perform calculations.
Did you know you can perform both quantitative and qualitative manipulations across entire columns in Pandas? You can use the .apply() method along with a lambda function or a user-defined function to transform text values. For instance, apply the .str accessor to access string methods like .lower() or .replace() for case normalization or substitution. The .apply() method works element-wise, allowing you to process each element in the column independently. This approach empowers you to clean, modify, or extract information from text columns with ease, facilitating comprehensive data preparation for analysis (see the sketch below).
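Here is a brief sketch of that tip in action using the workshop DataFrame; the specific transformations are invented purely for illustration:

```python
# .str exposes string methods across a whole column, e.g. lowercasing...
refugee_df['origin'].str.lower()

# ...and .apply() runs a function on every element; this lambda flags
# long country names (purely illustrative).
refugee_df['origin'].apply(lambda name: len(name) > 10)
```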
In this lesson, we will be using commands that stack various requests such as methods, parameters, operators, and more. Pandas encourages this kind of stacking, but it can seem overwhelming at first to beginners. For example, as we will see below, a command could include two or more methods that stack on top of each other and end with a slice operator to view only the top N rows of the results. In addition, a command can include specific parameters to call out a particular column or to sort the data in descending order. We will move slowly through each of the following commands to break them down.
To sort a DataFrame, we can use the `.sort_values()` method with the `by=` parameter, including the name of the column we want to sort by in quotation marks.
For example, we can sort the DataFrame by the arrivals column:
refugee_df.sort_values(by='arrivals', ascending=False)[:15]
year | origin | state | city | arrivals | percent_total | |
---|---|---|---|---|---|---|
11359 | 2009-01-01 | Iraq | California | El Cajon | 2813 | 0.418279 |
8718 | 2008-01-01 | Iran | California | Glendale | 2025 | 0.301107 |
11360 | 2010-01-01 | Iraq | California | El Cajon | 1811 | 0.269286 |
8719 | 2009-01-01 | Iran | California | Glendale | 1722 | 0.256053 |
58761 | 2012-01-01 | Iraq | Michigan | Southfield | 1720 | 0.255755 |
11362 | 2012-01-01 | Iraq | California | El Cajon | 1698 | 0.252484 |
8717 | 2007-01-01 | Iran | California | Glendale | 1637 | 0.243414 |
64090 | 2006-01-01 | Somalia | Minnesota | Minneapolis | 1586 | 0.235830 |
11358 | 2008-01-01 | Iraq | California | El Cajon | 1569 | 0.233302 |
25699 | 2005-01-01 | Cuba | Florida | Miami | 1537 | 0.228544 |
87927 | 2006-01-01 | Somalia | Ohio | Columbus | 1319 | 0.196129 |
11363 | 2013-01-01 | Iraq | California | El Cajon | 1238 | 0.184084 |
40503 | 2014-01-01 | Burma | Indiana | Indianapolis | 1211 | 0.180070 |
25704 | 2010-01-01 | Cuba | Florida | Miami | 1151 | 0.171148 |
58829 | 2014-01-01 | Iraq | Michigan | Troy | 1095 | 0.162821 |
Note: In the command above, we used the `by=` parameter to specify that the data be sorted according to the arrivals column, and we added the `ascending=False` parameter to request that the data be displayed with the highest number first. By default, Pandas sorts in ascending order, meaning from the smallest value to the largest. We also added a Python list slice (i.e., [:15]) to view just the top 15 rows. For sorting on more than one column at a time, see the sketch below.
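As a hedged extension of this idea, by= also accepts a list of columns, and ascending= a matching list of booleans. For example, to sort states A to Z and, within each state, arrivals from highest to lowest:

```python
# Sort by two columns: state ascending, then arrivals descending.
refugee_df.sort_values(by=['state', 'arrivals'],
                       ascending=[True, False])[:15]
```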
We can group data and perform calculations on the groups using the `.groupby()` method. For example, to see the breakdown of the number of arrivals by country of origin, we can use the following command:
refugee_df.groupby('origin')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f9b81ebc100>
This command created a Groupby object—grouped data that we can use to perform calculations such as counting the number of non-blank values in each column for each arrival by country of origin.
Next, we will use the following command to sum the number of refugee arrivals by country of origin, with the output showing the top twenty rows sorted in descending order:
refugee_df.groupby('origin')['arrivals'].sum().sort_values(ascending=False)[:20]
origin
Burma 151875
Iraq 127326
Bhutan 85316
Somalia 71120
Cuba 39946
Iran 36683
Dem. Rep. Congo 25493
Russia 15656
Eritrea 14645
Sudan 12636
Ukraine 11727
Burundi 11152
Vietnam 10769
Liberia 9424
Ethiopia 8487
Afghanistan 6751
Laos 6609
Moldova 4482
Syria 2359
Colombia 1956
Name: arrivals, dtype: int64
These results show us the total number of arrivals by country of origin across the 2005-2015 period, in descending order, sliced for the top 20 results.
Let’s unpack the command to better understand these results:
- We have three stacked methods here: .groupby(), .sum(), and .sort_values().
- groupby('origin')['arrivals']: For the Groupby object we defined in the previous step, groupby('origin'), we are isolating the "arrivals" column. Basically, we are asking to view the number of refugee arrivals by country of origin.
- .sum(): This method adds up the non-blank cells for each column or row. The results we see in the output show the total number of refugee arrivals by country of origin.
- .sort_values(ascending=False): This method specifies how we want our output to be sorted. We include the ascending=False parameter in order to request that the data be displayed with the highest number first.
- [:20]: This Python slice specifies that we just want to see the top 20 rows.
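Beyond .sum(), a Groupby object supports several aggregations at once via the .agg() method. A short sketch, not part of the workshop's own steps:

```python
# Compute the total, mean, and maximum number of arrivals per origin.
refugee_df.groupby('origin')['arrivals'].agg(['sum', 'mean', 'max'])
```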
You will notice that our output is not a DataFrame. Instead, it's a Series object, which doesn't allow us to select data or make further calculations on the data. We can convert it to a DataFrame by first storing the command above in a new variable and then stacking two extra methods, `to_frame()` and `reset_index()`:
ref_sum_df = refugee_df.groupby('origin')['arrivals'].sum().sort_values(ascending=False)[:20]
ref_sum_df = ref_sum_df.to_frame().reset_index()
ref_sum_df
origin | arrivals | |
---|---|---|
0 | Burma | 151875 |
1 | Iraq | 127326 |
2 | Bhutan | 85316 |
3 | Somalia | 71120 |
4 | Cuba | 39946 |
5 | Iran | 36683 |
6 | Dem. Rep. Congo | 25493 |
7 | Russia | 15656 |
8 | Eritrea | 14645 |
9 | Sudan | 12636 |
10 | Ukraine | 11727 |
11 | Burundi | 11152 |
12 | Vietnam | 10769 |
13 | Liberia | 9424 |
14 | Ethiopia | 8487 |
15 | Afghanistan | 6751 |
16 | Laos | 6609 |
17 | Moldova | 4482 |
18 | Syria | 2359 |
19 | Colombia | 1956 |
- .reset_index() In Pandas, the .reset_index() method is used to reset the index of a DataFrame, converting the index values into columns and providing a default integer index. This operation is useful when you want to remove the current index or revert to the default integer-based index.
Write code to create a GroupBy object based on the 'state' column:
```python
refugee_df.groupby('state')
```

Write code to sum the number of refugee arrivals by state, with the output showing the top 15 rows sorted in ascending order:

```python
refugee_df.groupby('state')['arrivals'].sum().sort_values(ascending=True)[:15]
```

What is the primary purpose of the .reset_index() method in Pandas?

- It reorganizes the DataFrame based on specified column values.
- It converts a Pandas Series into a DataFrame.
- It resets the index of a DataFrame, converting index values into columns.*
- It sorts the values of a DataFrame based on specified columns.

When might you use the .to_frame() method in Pandas?

- To group rows of a DataFrame based on one or more columns.
- To reset the index of a DataFrame.
- To sort the rows of a DataFrame based on specific values.
- To convert a Pandas Series into a DataFrame.*

What is the primary purpose of the .groupby() method in Pandas?

- To sort the rows of a DataFrame based on specific column values.
- To group rows of a DataFrame based on one or more columns.*
- To reset the index of a DataFrame.
- To convert a Pandas Series into a DataFrame.

Which Pandas method is used to sort the rows of a DataFrame based on specified column values?

- .reset_index()
- .to_frame()
- .groupby()
- .sort_values()*

To create plots and data visualizations in Pandas, we can add the .plot() method to any DataFrame or Series object that has appropriate numeric data.
We can specify the title with the `title=` parameter and the kind of plot by altering the `kind=` parameter:

- `bar` or `barh` for bar plots (h is for horizontal)
- `hist` for histogram
- `box` for boxplot
- `kde` or `density` for density plots
- `area` for area plots
- `scatter` for scatter plots
- `hexbin` for hexagonal bin plots
- `pie` for pie plots
- .plot() In Pandas, the .plot() method is used to create basic visualizations from DataFrame or Series data. It provides a convenient way to generate various plots, such as line charts, bar plots, and scatter plots, directly from Pandas objects for quick data exploration and analysis.
For example, we can visualize the data we got from our Groupby command looking at the total number of refugees by country of origin as a bar chart:
ref_sum_df.plot(kind='bar', x= 'origin', y='arrivals',
title='Total number of refugee arrivals in the U.S. \n by country of origin')
<Axes: title={'center': 'Total number of refugee arrivals in the U.S. \n by country of origin'}, xlabel='origin'>
Let’s unpack the command to better understand these results:

- `ref_sum_df`: This is the variable we created in the previous lesson summing the number of refugee arrivals by country of origin, with the output showing the top twenty rows sorted in descending order.
- `.plot(kind='bar', x='origin', y='arrivals', title='Total number of refugee arrivals in the U.S. \n by country of origin')`:
  - Here we are using the `.plot()` method to create a visualization, and we are specifying that we want a bar chart with the `kind='bar'` parameter.
  - We also specify the values for the x axis (origin) and y axis (arrivals).
  - We are also giving the chart a title with the `title='Total number of refugee arrivals in the U.S. \n by country of origin'` parameter.
  - Note: By adding `\n` in the title text, we signify that the text that follows should be on a new line.
- '\n' In Python, "\n" is an escape sequence representing a newline character. When encountered in a string, it causes a line break, moving the subsequent text to a new line, making it useful for formatting and organizing output.
We can also visualize the data as a pie chart:
ref_sum_df.set_index('origin')[:10].plot(kind='pie', y='arrivals',
title='Refugee arrivals in the U.S.')
<Axes: title={'center': 'Refugee arrivals in the U.S.'}, ylabel='arrivals'>
- We start by setting the index of the DataFrame to the `origin` column. This ensures that our legend will show the country names.
- We slice our results to show just the top 10 countries. This makes our chart more legible.
- Next, we use the `.plot()` method, specifying `pie` as the type of plot.
- We also specify the y values (in our case, `arrivals`).
We can also create time series using the Groupby method. For example, if we wanted to visualize the total number of refugees resettled in the U.S. across the 2005-2015 period, we would first create a Groupby object based on the “year” column (refer back to lesson 7 for more on Groupby objects).
refugee_df.groupby('year')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f9bd6807e80>
Next, we can create a new variable calculating the total number of refugees resettled each year.
total_arrivals_by_year = refugee_df.groupby('year')['arrivals'].sum()
Let’s break this command down:
- We have two stacked methods here: `.groupby()` and `.sum()`.
- `groupby('year')['arrivals']`: For the Groupby object, groupby('year'), we are isolating the "arrivals" column. Basically, we are asking to view the number of refugee arrivals by year.
- `.sum()`: This method returns the sum of the values over the requested axis. In our case, it will calculate the total number of refugee arrivals per year.
- We store our results in the `total_arrivals_by_year` variable.
Finally, we can add the `.plot()` method to create a line chart.
total_arrivals_by_year.plot(kind='line', title="Total Number of Refugee Arrivals by Year")
<Axes: title={'center': 'Total Number of Refugee Arrivals by Year'}, xlabel='year'>
In this command, we are adding the `.plot()` method to request a chart, specifying that we want a line graph with the `kind='line'` parameter. We are also giving the chart a title with the `title='Total Number of Refugee Arrivals by Year'` parameter.
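Because Pandas plotting is built on Matplotlib, the chart can also be saved to disk through Matplotlib's pyplot interface. A sketch, with an invented file name:

```python
import matplotlib.pyplot as plt

# Re-draw the line chart, then write it out as a PNG file.
total_arrivals_by_year.plot(kind='line', title="Total Number of Refugee Arrivals by Year")
plt.savefig("arrivals_by_year.png", dpi=150, bbox_inches="tight")
```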
- Time series A time series is a sequence of data points collected or recorded over a period of time, typically at regular intervals. It is used to analyze and visualize trends, patterns, and behaviors in data that evolve over time.
Write code to visualize the data we got from our Groupby command looking at the total number of refugees by country of origin as a scatter plot:

```python
ref_sum_df.plot(kind='scatter', x='origin', y='arrivals',
                title='Total number of refugee arrivals in the U.S. \n by country of origin')
```

Which Pandas method is commonly used to create basic visualizations, such as line charts, bar plots, and scatter plots, directly from DataFrame or Series data?

- .format()
- .visualize()
- .plot()*
- .display()

What type of chart is best suited for illustrating numerical proportions in a dataset, where each slice represents a proportion of the whole?

- Line chart
- Bar chart
- Pie chart*
- Time series chart

To output a new CSV file, we can use the `.to_csv()` method with a name for the file in quotation marks. For example, since we added the percent_total column to the refugee_df DataFrame, we may want to download the updated DataFrame as a CSV file to use it with other programs.
refugee_df.to_csv("new_refugee.csv", encoding='utf-8', index=False)
In addition to a filename, we’re also specifying that the encoding is utf-8 and that the Index (the bolded left-most column) is not included in the CSV file.
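As a quick optional check (not a required step), we could read the new file back in and peek at the first rows:

```python
# Confirm the export worked by loading the CSV we just wrote.
pd.read_csv("new_refugee.csv").head()
```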
- .to_csv(): In Pandas, the .to_csv() method is used to write the contents of a DataFrame to a CSV (Comma-Separated Values) file. It allows users to save the data in a tabular format that can be easily shared, imported, or analyzed using other tools.

From theory to practice: Building your Pandas skills with the Pandas documentation and other resources
Learning how to ask the right questions in a search engine like Google in order to find the solution to what you are trying to accomplish is the name of the game when you are just starting out with Python. Since Pandas is a popular and well-documented Python package, you are bound to find myriad resources that can help you get where you are going.
A good first place to start when you are searching for answers with Pandas is to look at the Pandas documentation, which is fairly accessible to beginners, and is an incredible resource when you want to learn how to use a new command. It also offers a User Guide for beginners with some fun exercises to deepen your learning.
Let’s say you wanted to find out more about the .sort_values method we used and understand the different parameters the method accepts:
- You could first search for .sort_values on the Pandas documentation website (https://pandas.pydata.org) and navigate to the “pandas.DataFrame.sort_values” documentation page.
- Scroll through the page for the info, and look at the second section for examples of how to use the method and its various parameters.
- If you don’t find an answer that makes sense to you on the Pandas documentation page, then look on Google for other resources. Some of our go-to websites for help are Stack Overflow, Geeks for Geeks, and Data to Fish.
Generative AI platforms like ChatGPT can be incredibly helpful in deepening your understanding of coding concepts in addition to helping you build your code. In this workshop we learned critical vocabulary such as "method" and "dataframe" that you can use to communicate with ChatGPT in order to ask it to build a Python script for you. Make sure to ask the platform to explain what each line of code is accomplishing so you can better understand the output it is giving you.
Prompt examples include:
- Explain to me what a dataframe is like I'm an 8th grader
- Could you help me build a script to change the names of the columns in my dataframe?
- How can I convert a series object into a dataframe in Pandas?
- This workshop owes a huge debt to Melanie Walsh's Introduction to Cultural Analytics & Python. This accessible, easy-to-follow textbook provides additional support for using Pandas, and it is a great place to start if you want to continue building your Pandas skills.