field | value
---|---
title | Data Manipulation in Pandas and Python
cover title | Pandas
description | In this workshop, we are going to learn some basic commands in Pandas, an expansive Python library for working with tabular data like CSV files. You can think of Pandas as a more powerful version of Excel that operates within the Python environment, where you can wrangle, clean, analyze, and visualize data. Knowing how to use Pandas is important if you plan on working with datasets that include qualitative and/or quantitative data points.
programming_language | jupyter
In this workshop, we are going to use JupyterLite as our Python environment, which is a lightweight version of Jupyter Notebooks.
You can think of Jupyter Notebooks as a comprehensive toolset for interactive computing and data exploration. It's like a fully equipped research laboratory where you can seamlessly blend narrative text, code, and visualizations. JupyterLite is a streamlined, lightweight version of this toolset, offering a more agile and accessible environment.
Jupyter Notebooks - Full Research Lab:
- Versatility: Jupyter Notebooks are like a fully equipped research lab. They support various programming languages, allowing you to conduct diverse computational experiments.
- Interactivity: Much like a lab where you can actively engage with your experiments, Jupyter Notebooks let you interactively run code cells, observe results, and refine your analyses on the fly.
- Rich Outputs: Visualizations, graphs, and other dynamic outputs enrich your narrative, creating a comprehensive and visually engaging research document.
JupyterLite - Portable Research Kit:
- Simplicity: JupyterLite is a simplified, more portable version. It's akin to having a compact research kit that allows you to conduct essential experiments without the full complexity of the lab.
- Accessibility: When you need a quick, accessible environment for coding and exploration, JupyterLite provides a more straightforward setup. It's like having your essential tools ready without the need for a fully equipped lab.
Learning and On-the-Go:
- Educational Tool: JupyterLite is often used for educational purposes or quick experiments. It serves as a friendly entry point for those learning about interactive computing without overwhelming them with the extensive features of the full Jupyter Notebooks.
- Portability: JupyterLite is convenient for situations where the full-scale Jupyter environment might be impractical or unnecessary. It's like having a research companion that fits into your backpack.
In essence, JupyterLite is a more nimble version of Jupyter Notebooks, suitable for scenarios where a lighter computational environment is sufficient. It shares the same core principles of interactive computing but in a more accessible and portable form, making it a valuable tool for quick experiments, learning, and situations where resource constraints or simplicity are paramount.
- Jupyter Notebook: A comprehensive toolset for interactive computing and data exploration, blending narrative text, code, and visualizations.
- JupyterLite: A lightweight version of Jupyter Notebook.
Keeping all your files for a particular project in a designated file directory will keep your project organized and will make it easier to read in your files.
Let’s create a “pandas_workshop” folder in our JupyterLite code editor workspace:
- Click on "Open Code Editor" in the top right corner of this browser window.
- In the folders toolbar, click the icon for the folder with the '+' on it. This will create a new "Untitled Folder."
- To rename the folder, control-click the folder name (macOS) or right-click it (Windows), and select "Rename" from the drop-down menu.
- Rename the folder: “pandas_workshop”.
We’ve saved all of the code for this section in a Jupyter Notebook file.
- To access this file, click the following button:
- Open the code editor in the top right corner and locate the `pandas_workshop_2024.ipynb` file.
- When prompted, select the "Python (Pyodide)" kernel.
- Finally, drag and drop the `pandas_workshop_2024.ipynb` file into your `pandas_workshop` folder.
- Once it's in that folder, you should be all set!
In this file you will find all of the workshop commands and the expected outputs. If you ever feel stuck or can't advance in the workshop, you can open this file and see how we did it.
For the best possible experience, we encourage you to:
- Follow the workshop typing all the code yourself.
- Avoid copying/pasting the code. Much of the learning comes from typing the code yourself.
- Only check the pandas_workshop_2024.ipynb file if you get lost or are not able to get the right output. Before opening it, spend some time trying to figure out on your own why it isn't working. A big part of coding is learning to identify what we are doing wrong.
- We also caution you against working with both files open at the same time. It is easy to get confused and start modifying the wrong one. That said, these are only suggestions. Maybe they will work for you, maybe they won't, so feel free to do what suits you best. You are in charge here!
Finally, let's launch a new JupyterLite Notebook from within the “pandas_workshop” folder.
- Make sure you are in the “pandas_workshop” folder. If not, navigate there.
- From the launcher window, click the "Python (Pyodide)" button
- A new JupyterLite Notebook will open. To rename the notebook, control-click the notebook name (macOS) or right-click it (Windows), and select "Rename" from the drop-down menu.
- Rename the notebook: “pandas_workshop”.
Hooray! You are all set to move forward!
- Pandas is a Python software library
- The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals
- You can think of it as a FREE equivalent to Stata or SPSS for data manipulation and analysis. It is also more powerful than Excel
- Knowing how to use Pandas is important if you plan on working with data organized in spreadsheets (either quantitative or qualitative)
In the first blank cell, type the following command to import the Pandas library into our Jupyter Notebook. To run the command, you can click the "Run" button in the top toolbar, or you can press shift + return.
import pandas as pd
import matplotlib
The first import statement not only imports the Pandas library but also gives it the alias “pd.” Using this alias will save us from having to type out the entire word “Pandas” each time we need to use it. Libraries are sets of instructions that Python can use to perform specialized functions.
The second statement imports the Matplotlib library, which is a plotting library for Python and Pandas. We will use Matplotlib to create visualizations of our data later in the workshop.
If you don't see an error when you run the notebook (that is, if there is no output), you can move on to the next step. In programming, it is not rare for doing things right to result in nothing visibly happening. This is what we like to call a silent success.
- Import: In Python, the "import" keyword is used to bring external modules or libraries into your script, making their functionality available for use. Once imported, you can access the functions, classes, or variables defined in the module using the module's name as a prefix.
- Silent Success: In Python, a "silent success" typically refers to a successful operation or function call that does not produce any visible output or raise explicit errors. It implies that the operation was completed successfully, but the result might not be explicitly displayed, making it important for developers to check return values or logs to confirm the success.
Next, we will read in our dataset saved as a CSV file. We will specifically work with the refugee-arrivals-by-destination.csv dataset, which contains data about refugee arrivals to the United States between 2005 and 2015, including the number of refugees who arrived in each U.S. city and state, the year that they arrived, and the country from which they arrived.
We will connect to the dataset directly from this GitHub link:
import pyodide_http # makes it possible to read https links in pyodide
pyodide_http.patch_all()
url = 'https://raw.githubusercontent.com/GC-DRI/DRI24/main/uploads/pandas/refugee-arrivals-by-destination.csv'
refugee_df = pd.read_csv(url, delimiter=",", encoding='utf-8')
- With this command, we are creating a Pandas DataFrame object, which is a 2-dimensional labeled data structure with columns of different types. You can think of it like a spreadsheet or SQL table, or a dictionary of Series objects.
- It is common practice to abbreviate DataFrame with “df”, as in refugee_df.
- Delimiter: In a CSV (Comma-Separated Values) file, a delimiter is a character or sequence of characters that separates individual fields or data elements within each row. Common delimiters include commas, tabs, or semicolons, and they enable the proper parsing of data when reading or writing CSV files.
- Encoding: In the context of text data, encoding refers to the method used to represent characters as binary data. UTF-8, for example, is a widely used encoding that can represent most characters in the Unicode standard, allowing for the storage and transmission of text in multiple languages.
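As an aside, here is a small, hypothetical sketch of how those two parameters change when a file uses a different delimiter or encoding; the file names below are invented for illustration:

```python
# Hypothetical examples, not part of this workshop's dataset:
# a tab-separated file uses '\t' as its delimiter...
tsv_df = pd.read_csv('example.tsv', delimiter='\t', encoding='utf-8')

# ...and a file exported from older software might use Latin-1 encoding.
legacy_df = pd.read_csv('example.csv', delimiter=',', encoding='latin-1')
```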
What is the primary data structure in Pandas for working with tabular data?

- Series
- DataFrame*
- Array
- List

What is a DataFrame?

- A two-dimensional data structure*
- A three-dimensional data structure
- A single-dimensional data structure
- A series object

In the following section we will learn how to view the DataFrame in addition to viewing smaller sections of it.
To display the DataFrame, we can run a cell with the variable name of the DataFrame:
refugee_df
year | origin | dest_state | dest_city | arrivals | |
---|---|---|---|---|---|
0 | 2005 | Afghanistan | Alabama | Mobile | 0 |
1 | 2006 | Afghanistan | Alabama | Mobile | 0 |
2 | 2007 | Afghanistan | Alabama | Mobile | 0 |
3 | 2008 | Afghanistan | Alabama | Mobile | 0 |
4 | 2009 | Afghanistan | Alabama | Mobile | 5 |
... | ... | ... | ... | ... | ... |
121240 | 2011 | Vietnam | Wisconsin | Milwaukee | 0 |
121241 | 2012 | Vietnam | Wisconsin | Milwaukee | 4 |
121242 | 2013 | Vietnam | Wisconsin | Milwaukee | 2 |
121243 | 2014 | Vietnam | Wisconsin | Milwaukee | 0 |
121244 | 2015 | Vietnam | Wisconsin | Milwaukee | 0 |
121245 rows × 5 columns
Let’s take a look at a few elements in this DataFrame:
- Index
- The bolded ascending numbers in the very left-hand column of the DataFrame are called the Pandas Index. You can select rows based on the Index.
- By default, the Index is a sequence of numbers starting with zero. However, you can change the Index to something else, such as one of the columns in your dataset.
- The default Index is a unique ID, but it does not have to be: you can set your Index to any column in your DataFrame.
- Truncation
- The DataFrame is truncated, signaled by the ellipses (...) in the middle of every column.
- The DataFrame is truncated because, by default, Pandas displays only 60 rows and 20 columns. You can change these default settings (see the Pandas documentation, and the sketch after this list).
- Rows x Columns
- Pandas reports how many rows and columns are in this dataset at the bottom of the output. Our DataFrame has 121,245 rows × 5 columns.
- NAN
- NaN is the Pandas value for any missing data.
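To make the Index and truncation points concrete, here is a short sketch. `pd.set_option()` and `.set_index()` are standard Pandas calls, but the particular values below are just examples, not workshop steps:

```python
# Show up to 100 rows before Pandas truncates the display (the default is 60).
pd.set_option('display.max_rows', 100)

# Use the 'year' column as the Index instead of the default 0, 1, 2, ...
# This returns a new DataFrame; refugee_df itself is unchanged.
refugee_df.set_index('year')

# Restore the default display setting.
pd.reset_option('display.max_rows')
```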
We can also display the first 2 rows of the DataFrame with the `.head()` method:
refugee_df.head(2)
year | origin | dest_state | dest_city | arrivals | |
---|---|---|---|---|---|
0 | 2005 | Afghanistan | Alabama | Mobile | 0 |
1 | 2006 | Afghanistan | Alabama | Mobile | 0 |
We can accomplish the same task using a slice instead:
refugee_df[:2]
year | origin | dest_state | dest_city | arrivals | |
---|---|---|---|---|---|
0 | 2005 | Afghanistan | Alabama | Mobile | 0 |
1 | 2006 | Afghanistan | Alabama | Mobile | 0 |
We can also display the last 10 rows of the DataFrame with the `.tail()` method:
refugee_df.tail(10)
year | origin | dest_state | dest_city | arrivals | |
---|---|---|---|---|---|
121235 | 2006 | Vietnam | Wisconsin | Milwaukee | 5 |
121236 | 2007 | Vietnam | Wisconsin | Milwaukee | 18 |
121237 | 2008 | Vietnam | Wisconsin | Milwaukee | 12 |
121238 | 2009 | Vietnam | Wisconsin | Milwaukee | 4 |
121239 | 2010 | Vietnam | Wisconsin | Milwaukee | 7 |
121240 | 2011 | Vietnam | Wisconsin | Milwaukee | 0 |
121241 | 2012 | Vietnam | Wisconsin | Milwaukee | 4 |
121242 | 2013 | Vietnam | Wisconsin | Milwaukee | 2 |
121243 | 2014 | Vietnam | Wisconsin | Milwaukee | 0 |
121244 | 2015 | Vietnam | Wisconsin | Milwaukee | 0 |
We can accomplish the same task using a slice instead:
refugee_df[-10:]
year | origin | dest_state | dest_city | arrivals | |
---|---|---|---|---|---|
121235 | 2006 | Vietnam | Wisconsin | Milwaukee | 5 |
121236 | 2007 | Vietnam | Wisconsin | Milwaukee | 18 |
121237 | 2008 | Vietnam | Wisconsin | Milwaukee | 12 |
121238 | 2009 | Vietnam | Wisconsin | Milwaukee | 4 |
121239 | 2010 | Vietnam | Wisconsin | Milwaukee | 7 |
121240 | 2011 | Vietnam | Wisconsin | Milwaukee | 0 |
121241 | 2012 | Vietnam | Wisconsin | Milwaukee | 4 |
121242 | 2013 | Vietnam | Wisconsin | Milwaukee | 2 |
121243 | 2014 | Vietnam | Wisconsin | Milwaukee | 0 |
121244 | 2015 | Vietnam | Wisconsin | Milwaukee | 0 |
- .head() & .tail() In Pandas, the .head() method is used to display the first few rows of a DataFrame, providing a quick overview of its structure, while the .tail() method shows the last few rows. Both methods help users inspect data frames efficiently.
We can also look at a random sample of data with the `.sample()` method:
refugee_df.sample(15)
year | origin | dest_state | dest_city | arrivals | |
---|---|---|---|---|---|
102329 | 2009 | Dem. Rep. Congo | Texas | Cedar Park | 0 |
40110 | 2006 | Ukraine | Illinois | Woodridge | 0 |
45485 | 2013 | Cuba | Louisiana | Avondale | 2 |
62900 | 2015 | Liberia | Minnesota | Brooklyn Center | 1 |
46604 | 2010 | Iraq | Maine | Norway | 10 |
116748 | 2007 | Latvia | Washington | Lynnwood | 0 |
5039 | 2014 | Burma | California | Chula Vista | 0 |
66487 | 2005 | Iraq | Missouri | Saint Louis | 0 |
108591 | 2012 | Iraq | Vermont | Essex Junction | 0 |
94980 | 2008 | Russia | Pennsylvania | Warminster | 2 |
4690 | 2006 | Bhutan | California | Alameda | 0 |
72042 | 2005 | Liberia | New Jersey | Englewood | 0 |
66931 | 2009 | Somalia | Missouri | Kansas City | 141 |
91265 | 2011 | Belarus | Pennsylvania | Millersburg | 0 |
2077 | 2011 | Iraq | Arizona | Goodyear | 1 |
We can tell it's a random sample since the index numbers are out of order.

- .sample(): In Pandas, the `.sample()` method is used to randomly select a specified number of rows from a DataFrame. It provides a way to obtain a random subset of data for exploratory analysis or testing purposes.

Using both the `.head()` method and a slice, write code to display the top 9 rows of the DataFrame:
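One possible answer, following the pattern of the other solutions in this workshop:

```python
refugee_df.head(9)
refugee_df[:9]
```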
Write code to display the last 5 rows of the DataFrame (there are two ways to do this):

```python
refugee_df.tail(5)
refugee_df[-5:]
```

Write code to display a random selection of 10 rows in the DataFrame:

```python
refugee_df.sample(10)
```

What is the purpose of the `.head()` method in Pandas?

Which of the following commands would return the top three rows in a dataframe? Select all that apply:

- `refugee_df[0:3]`*
- `refugee_df[0:4]`
- `refugee_df[0:2]`
- `refugee_df.head(3)`*

In the following sections we will review some basic data cleaning steps, including checking and converting our data types, and checking for and removing duplicate rows.
Data cleaning is a crucial process for refining tabular data. Begin by addressing missing values. Tackle outliers to maintain data integrity and accuracy. Remove duplicates for a clean and unique dataset. Standardize formats for consistency, and finally, engage in feature engineering to enhance data insights and prepare the data for analysis. These steps collectively ensure a refined and reliable foundation for effective data exploration and interpretation.

We can get information about the columns in the DataFrame by using the `.info()` method.
refugee_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 121245 entries, 0 to 121244
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 121245 non-null int64
1 origin 121245 non-null object
2 dest_state 121245 non-null object
3 dest_city 121245 non-null object
4 arrivals 121245 non-null int64
dtypes: int64(2), object(3)
memory usage: 4.6+ MB
This report tells us how many non-null, or non-blank, values are in each column, as well as what type of data is in each column.
Pandas uses a different lexicon to describe data types from the one we learned in our intro to Python curriculum. Below is a table that explains what each data type means:
Pandas data types | Python data types | Usage
---|---|---
object | String or mixed | Text or mixed numeric and non-numeric values
float64 | float | Floating point numbers
int64 | integer | Integer numbers
datetime64 | NA | Date and time values
Keeping this in mind, it looks as though the data type for the year column is an "int64" instead of a "datetime64." Let's convert it:
refugee_df['year'] = pd.to_datetime(refugee_df['year'], format="%Y")
This command translates to: for the "year" column in the refugee_df DataFrame, use the `to_datetime` method in the Pandas library to convert the values to datetime data types. We specify the format as `%Y` since this is the format of our data.
We can then check to see if the data type was properly converted using the `.dtypes` attribute, which is similar to the `.info()` method, except it only provides information on data types.
refugee_df.dtypes
year datetime64[ns]
origin object
dest_state object
dest_city object
arrivals int64
dtype: object
As we can see, the data in the “year” column was successfully transformed into the datetime64 data type.
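As a small aside (not one of the workshop's own steps), once a column holds datetime64 values, Pandas exposes their components through the `.dt` accessor. For example, we could pull the year back out as an integer:

```python
# The .dt accessor works on datetime64 columns; .dt.year returns integers.
refugee_df['year'].dt.year
```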
- Datetime: In Pandas, the datetime data type represents date and time information. It allows for efficient handling, manipulation, and analysis of temporal data in a tabular format, offering functionalities for parsing, formatting, and performing operations on date and time values.

As part of our data cleaning process, we want to check for duplicate rows. We can do this by using the `.duplicated()` method inside a filter to isolate only the rows in the DataFrame that are exact duplicates. Filtering data by certain values is similar to selecting columns. We add the parameter `keep=False`, which will display all the duplicated values in the dataset (meaning none are considered unique), rather than just the first duplicated value (`keep='first'`) or the last duplicated value (`keep='last'`).
refugee_df[refugee_df.duplicated(keep=False)]
year | origin | dest_state | dest_city | arrivals | |
---|---|---|---|---|---|
13 | 2007-01-01 | Burma | Alabama | Auburn | 0 |
14 | 2008-01-01 | Burma | Alabama | Auburn | 0 |
15 | 2009-01-01 | Burma | Alabama | Auburn | 0 |
16 | 2010-01-01 | Burma | Alabama | Auburn | 0 |
17 | 2007-01-01 | Burma | Alabama | Auburn | 0 |
18 | 2008-01-01 | Burma | Alabama | Auburn | 0 |
19 | 2009-01-01 | Burma | Alabama | Auburn | 0 |
20 | 2010-01-01 | Burma | Alabama | Auburn | 0 |
155 | 2013-01-01 | Cuba | Alabama | Russellville | 0 |
156 | 2014-01-01 | Cuba | Alabama | Russellville | 0 |
157 | 2015-01-01 | Cuba | Alabama | Russellville | 0 |
158 | 2005-01-01 | Cuba | Alabama | Theodore | 0 |
159 | 2006-01-01 | Cuba | Alabama | Theodore | 0 |
160 | 2013-01-01 | Cuba | Alabama | Russellville | 0 |
161 | 2014-01-01 | Cuba | Alabama | Russellville | 0 |
162 | 2015-01-01 | Cuba | Alabama | Russellville | 0 |
163 | 2005-01-01 | Cuba | Alabama | Theodore | 0 |
164 | 2006-01-01 | Cuba | Alabama | Theodore | 0 |
236 | 2012-01-01 | Iran | Alabama | Mobile | 1 |
237 | 2013-01-01 | Iran | Alabama | Mobile | 6 |
238 | 2014-01-01 | Iran | Alabama | Mobile | 3 |
239 | 2015-01-01 | Iran | Alabama | Mobile | 5 |
240 | 2005-01-01 | Iran | Alabama | Pelham | 0 |
247 | 2012-01-01 | Iran | Alabama | Mobile | 1 |
248 | 2013-01-01 | Iran | Alabama | Mobile | 6 |
249 | 2014-01-01 | Iran | Alabama | Mobile | 3 |
250 | 2015-01-01 | Iran | Alabama | Mobile | 5 |
251 | 2005-01-01 | Iran | Alabama | Pelham | 0 |
Looks like we have a few duplicate rows in our dataset.
To remove those duplicates, we can use the `.drop_duplicates()` method, choosing whether to keep the first or the last instance of each duplicate:
refugee_df = refugee_df.drop_duplicates(keep='first')
We can check to see if the command got rid of the duplicate rows by running the .duplicated() method again:
refugee_df[refugee_df.duplicated(keep=False)]
year | origin | dest_state | dest_city | arrivals |
---|---|---|---|---
Great news! We successfully removed our duplicate rows!
- .duplicated(): The .duplicated() method in Pandas identifies duplicate rows in a DataFrame based on the values in all columns. When keep=False is specified, all occurrences of the duplicated rows are marked as True, meaning none are considered unique, and all duplicates are included in the result. If keep is set to 'first' or 'last', it retains the first or last occurrence of each duplicated set, respectively, while marking the rest as True.

What does the keep parameter in the .duplicated() method of Pandas control?

- It specifies the number of duplicates to keep.
- It determines whether to display all occurrences or just the first occurrence of duplicated rows.*
- It controls the sorting order of duplicated values.
- It determines whether to keep or remove duplicated rows.

In the following section we will learn how to rename, select, drop, filter, and add new columns.
To see a full list of the columns in our DataFrame, we can run the following command:
refugee_df.columns
Our DataFrame has relatively few columns, so seeing the full list is not absolutely necessary in our case. This step becomes important when you are working with DataFrames with many columns.
To improve the readability of our dataset, we can rename columns. In our case, let's rename "dest_state" as "state" and "dest_city" as "city". We will use the `.rename()` method with the `columns=` parameter. Note that in this case we are setting the DataFrame equal to the returned value of the method so as to save the results into the DataFrame.
refugee_df = refugee_df.rename(columns={'dest_state': 'state', 'dest_city': 'city'})
refugee_df
year | origin | state | city | arrivals | |
---|---|---|---|---|---|
0 | 2005-01-01 | Afghanistan | Alabama | Mobile | 0 |
1 | 2006-01-01 | Afghanistan | Alabama | Mobile | 0 |
2 | 2007-01-01 | Afghanistan | Alabama | Mobile | 0 |
3 | 2008-01-01 | Afghanistan | Alabama | Mobile | 0 |
4 | 2009-01-01 | Afghanistan | Alabama | Mobile | 5 |
... | ... | ... | ... | ... | ... |
121240 | 2011-01-01 | Vietnam | Wisconsin | Milwaukee | 0 |
121241 | 2012-01-01 | Vietnam | Wisconsin | Milwaukee | 4 |
121242 | 2013-01-01 | Vietnam | Wisconsin | Milwaukee | 2 |
121243 | 2014-01-01 | Vietnam | Wisconsin | Milwaukee | 0 |
121244 | 2015-01-01 | Vietnam | Wisconsin | Milwaukee | 0 |
121231 rows × 5 columns
Let’s say we wanted to view data from just one column in the DataFrame. To do this, we could run the following command:
refugee_df[['state']]
state | |
---|---|
0 | Alabama |
1 | Alabama |
2 | Alabama |
3 | Alabama |
4 | Alabama |
... | ... |
121240 | Wisconsin |
121241 | Wisconsin |
121242 | Wisconsin |
121243 | Wisconsin |
121244 | Wisconsin |
121231 rows × 1 columns
Here we use double brackets around the column name to transform the column from a Series object into a DataFrame. Basically, the interior brackets are for lists, and the outside brackets are indexing operators. If you are curious to see the difference, try the following command instead: refugee_df['state'].
To view additional columns at the same time, you can add them to the list within the square brackets, separated by a comma. However, you can’t select multiple columns as a Series (try: refugee_df['state','city'])
refugee_df[['state','city']]
state | city | |
---|---|---|
0 | Alabama | Mobile |
1 | Alabama | Mobile |
2 | Alabama | Mobile |
3 | Alabama | Mobile |
4 | Alabama | Mobile |
... | ... | ... |
121240 | Wisconsin | Milwaukee |
121241 | Wisconsin | Milwaukee |
121242 | Wisconsin | Milwaukee |
121243 | Wisconsin | Milwaukee |
121244 | Wisconsin | Milwaukee |
121231 rows × 2 columns
To remove a column from the DataFrame, we can use the `.drop()` method and include the column name. In our case, we could drop the "city" column and save the result as a new DataFrame, refugee_drop_city_df, so we don't override our original DataFrame.
refugee_drop_city_df = refugee_df.drop(columns="city")
refugee_drop_city_df
year | origin | state | arrivals | |
---|---|---|---|---|
0 | 2005-01-01 | Afghanistan | Alabama | 0 |
1 | 2006-01-01 | Afghanistan | Alabama | 0 |
2 | 2007-01-01 | Afghanistan | Alabama | 0 |
3 | 2008-01-01 | Afghanistan | Alabama | 0 |
4 | 2009-01-01 | Afghanistan | Alabama | 5 |
... | ... | ... | ... | ... |
121240 | 2011-01-01 | Vietnam | Wisconsin | 0 |
121241 | 2012-01-01 | Vietnam | Wisconsin | 4 |
121242 | 2013-01-01 | Vietnam | Wisconsin | 2 |
121243 | 2014-01-01 | Vietnam | Wisconsin | 0 |
121244 | 2015-01-01 | Vietnam | Wisconsin | 0 |
121231 rows × 4 columns
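Although the workshop only drops one column, `.drop()` also accepts a list of column names. A quick sketch (the choice of columns here is arbitrary):

```python
# Drop several columns at once by passing a list; again we save the result
# to a new variable so the original DataFrame stays untouched.
refugee_skinny_df = refugee_df.drop(columns=["city", "state"])
```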
We can filter a Pandas DataFrame to select only certain values. Filtering data by certain values is similar to selecting columns.
We type the name of the DataFrame followed by square brackets and then, instead of inserting a column name, we insert a True/False condition. For example, to select only rows that contain the value “Iraq”, we could run the following command:
refugee_iraq_df = refugee_df[refugee_df['origin'] == 'Iraq']
refugee_iraq_df
year | origin | state | city | arrivals | |
---|---|---|---|---|---|
267 | 2005-01-01 | Iraq | Alabama | Birmingham | 0 |
268 | 2006-01-01 | Iraq | Alabama | Birmingham | 0 |
269 | 2007-01-01 | Iraq | Alabama | Birmingham | 0 |
270 | 2008-01-01 | Iraq | Alabama | Birmingham | 6 |
271 | 2009-01-01 | Iraq | Alabama | Birmingham | 0 |
... | ... | ... | ... | ... | ... |
120503 | 2011-01-01 | Iraq | Wisconsin | Wauwatosa | 0 |
120504 | 2012-01-01 | Iraq | Wisconsin | Wauwatosa | 0 |
120505 | 2013-01-01 | Iraq | Wisconsin | Wauwatosa | 2 |
120506 | 2014-01-01 | Iraq | Wisconsin | Wauwatosa | 0 |
120507 | 2015-01-01 | Iraq | Wisconsin | Wauwatosa | 6 |
18205 rows × 5 columns
We stored our results in a new variable, `refugee_iraq_df`.
If we wanted to exclude all rows containing the value “Iraq”, we could run the following command:
refugee_drop_iraq_df = refugee_df[refugee_df['origin'] != 'Iraq']
refugee_drop_iraq_df
year | origin | state | city | arrivals | |
---|---|---|---|---|---|
0 | 2005-01-01 | Afghanistan | Alabama | Mobile | 0 |
1 | 2006-01-01 | Afghanistan | Alabama | Mobile | 0 |
2 | 2007-01-01 | Afghanistan | Alabama | Mobile | 0 |
3 | 2008-01-01 | Afghanistan | Alabama | Mobile | 0 |
4 | 2009-01-01 | Afghanistan | Alabama | Mobile | 5 |
... | ... | ... | ... | ... | ... |
121240 | 2011-01-01 | Vietnam | Wisconsin | Milwaukee | 0 |
121241 | 2012-01-01 | Vietnam | Wisconsin | Milwaukee | 4 |
121242 | 2013-01-01 | Vietnam | Wisconsin | Milwaukee | 2 |
121243 | 2014-01-01 | Vietnam | Wisconsin | Milwaukee | 0 |
121244 | 2015-01-01 | Vietnam | Wisconsin | Milwaukee | 0 |
103026 rows × 5 columns
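Filters can also be combined, although the workshop itself doesn't cover this. As a hedged sketch, conditions can be joined with & (and) or | (or), with each condition wrapped in its own parentheses:

```python
# Keep only rows where the origin is Iraq AND the destination state is
# California. Note the parentheses around each condition.
iraq_california_df = refugee_df[
    (refugee_df['origin'] == 'Iraq') & (refugee_df['state'] == 'California')
]
iraq_california_df
```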
We can also add columns to the DataFrame. For example, we can add a `percent_total` column to calculate the percentage of total refugee arrivals for each row.
refugee_df['percent_total'] = (refugee_df['arrivals'] / refugee_df['arrivals'].sum())*100
refugee_df
year | origin | state | city | arrivals | percent_total | |
---|---|---|---|---|---|---|
0 | 2005-01-01 | Afghanistan | Alabama | Mobile | 0 | 0.000000 |
1 | 2006-01-01 | Afghanistan | Alabama | Mobile | 0 | 0.000000 |
2 | 2007-01-01 | Afghanistan | Alabama | Mobile | 0 | 0.000000 |
3 | 2008-01-01 | Afghanistan | Alabama | Mobile | 0 | 0.000000 |
4 | 2009-01-01 | Afghanistan | Alabama | Mobile | 5 | 0.000743 |
... | ... | ... | ... | ... | ... | ... |
121240 | 2011-01-01 | Vietnam | Wisconsin | Milwaukee | 0 | 0.000000 |
121241 | 2012-01-01 | Vietnam | Wisconsin | Milwaukee | 4 | 0.000595 |
121242 | 2013-01-01 | Vietnam | Wisconsin | Milwaukee | 2 | 0.000297 |
121243 | 2014-01-01 | Vietnam | Wisconsin | Milwaukee | 0 | 0.000000 |
121244 | 2015-01-01 | Vietnam | Wisconsin | Milwaukee | 0 | 0.000000 |
121231 rows × 6 columns
Note: `refugee_df['arrivals'].sum()` calculates the sum of all the values in the arrivals column.
You can read the command we just ran as: create a new column that calculates the number of arrivals in a row divided by the total number of arrivals in the dataset, times 100. The result of this calculation will equal the percentage of total refugee arrivals for each row.
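As an optional sanity check (not a required workshop step), the new column should add up to roughly 100 percent across the whole DataFrame:

```python
# The percentages should sum to (approximately) 100.
refugee_df['percent_total'].sum()
```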
- .sum(): In Pandas, the .sum() method is used to calculate the sum of values along a specified axis in a DataFrame or Series. It can be applied to numeric columns, providing the total sum of the values in the specified axis.

Write code for selecting the year, origin, and arrivals columns:

```python
new_df = refugee_df[['year', 'origin', 'arrivals']]
```

Write code to drop all rows where the origin is Vietnam:

```python
no_vietnam_df = refugee_df[refugee_df['origin'] != 'Vietnam']
```

Write code to only keep rows where the arrival state is California:

```python
california_df = refugee_df[refugee_df['state'] == 'California']
```

How do you select a specific column in a Pandas DataFrame named "df"?

- df.select("column_name")
- df.column_name
- df.get_column("column_name")
- df[["column_name"]]*

What is the primary purpose of the == operator in Python?

- It assigns a value to a variable.
- It makes sure two values are different.
- It is used for mathematical addition.
- It checks if two values or expressions are equal.*

In the following section we will learn how to sort and group columns in order to perform calculations.
Did you know you can perform both quantitative and qualitative manipulations across entire columns in Pandas? You can use the .apply() method along with a lambda function or a user-defined function to transform text values. For instance, apply the .str accessor to access string methods like .lower() or .replace() for case normalization or substitution. The .apply() method works element-wise, allowing you to process each element in the column independently. This approach empowers you to clean, modify, or extract information from text columns with ease, facilitating comprehensive data preparation for analysis (see the sketch below).
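Here is a brief sketch of that tip in action using the workshop DataFrame; the specific transformations are invented purely for illustration:

```python
# .str exposes string methods across a whole column, e.g. lowercasing...
refugee_df['origin'].str.lower()

# ...and .apply() runs a function on every element; this lambda flags
# long country names (purely illustrative).
refugee_df['origin'].apply(lambda name: len(name) > 10)
```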
In this lesson, we will be using commands that stack various requests such as methods, parameters, operators, and more. Pandas encourages this kind of stacking, but it can seem overwhelming at first to beginners. For example, as we will see below, a command could include two or more methods that stack on top of each other and end with a slice operator to view only the top N rows of the results. In addition, a command can include specific parameters to call out a particular column or to sort the data in descending order. We will move slowly through each of the following commands to break them down.
To sort a DataFrame, we can use the `.sort_values()` method with the `by=` parameter, including the name of the column we want to sort by in quotation marks.
For example, we can sort the DataFrame by the arrivals column:
refugee_df.sort_values(by='arrivals', ascending=False)[:15]
year | origin | state | city | arrivals | percent_total | |
---|---|---|---|---|---|---|
11359 | 2009-01-01 | Iraq | California | El Cajon | 2813 | 0.418279 |
8718 | 2008-01-01 | Iran | California | Glendale | 2025 | 0.301107 |
11360 | 2010-01-01 | Iraq | California | El Cajon | 1811 | 0.269286 |
8719 | 2009-01-01 | Iran | California | Glendale | 1722 | 0.256053 |
58761 | 2012-01-01 | Iraq | Michigan | Southfield | 1720 | 0.255755 |
11362 | 2012-01-01 | Iraq | California | El Cajon | 1698 | 0.252484 |
8717 | 2007-01-01 | Iran | California | Glendale | 1637 | 0.243414 |
64090 | 2006-01-01 | Somalia | Minnesota | Minneapolis | 1586 | 0.235830 |
11358 | 2008-01-01 | Iraq | California | El Cajon | 1569 | 0.233302 |
25699 | 2005-01-01 | Cuba | Florida | Miami | 1537 | 0.228544 |
87927 | 2006-01-01 | Somalia | Ohio | Columbus | 1319 | 0.196129 |
11363 | 2013-01-01 | Iraq | California | El Cajon | 1238 | 0.184084 |
40503 | 2014-01-01 | Burma | Indiana | Indianapolis | 1211 | 0.180070 |
25704 | 2010-01-01 | Cuba | Florida | Miami | 1151 | 0.171148 |
58829 | 2014-01-01 | Iraq | Michigan | Troy | 1095 | 0.162821 |
Note: In the command above, we used the `by=` parameter to specify that the data be sorted according to the arrivals column, and we added the `ascending=False` parameter to request that the data be displayed with the highest number first. By default, Pandas sorts in ascending order, meaning from the smallest value to the largest. We also added a Python list slice (i.e., [:15]) to view just the top 15 rows. For sorting on more than one column at a time, see the sketch below.
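As a hedged extension of this idea, by= also accepts a list of columns, and ascending= a matching list of booleans. For example, to sort states A to Z and, within each state, arrivals from highest to lowest:

```python
# Sort by two columns: state ascending, then arrivals descending.
refugee_df.sort_values(by=['state', 'arrivals'],
                       ascending=[True, False])[:15]
```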
We can group data and perform calculations on the groups using the `.groupby()` method. For example, to see the breakdown of the number of arrivals by country of origin, we can use the following command:
refugee_df.groupby('origin')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f9b81ebc100>
This command created a Groupby object—grouped data that we can use to perform calculations such as counting the number of non-blank values in each column for each arrival by country of origin.
Next, we will use the following command to sum the number of refugee arrivals by country of origin, with the output showing the top twenty rows sorted in descending order:
refugee_df.groupby('origin')['arrivals'].sum().sort_values(ascending=False)[:20]
origin
Burma 151875
Iraq 127326
Bhutan 85316
Somalia 71120
Cuba 39946
Iran 36683
Dem. Rep. Congo 25493
Russia 15656
Eritrea 14645
Sudan 12636
Ukraine 11727
Burundi 11152
Vietnam 10769
Liberia 9424
Ethiopia 8487
Afghanistan 6751
Laos 6609
Moldova 4482
Syria 2359
Colombia 1956
Name: arrivals, dtype: int64
These results show us the total number of arrivals by country of origin across the 2005-2015 period, in descending order, sliced for the top 20 results.
Let’s unpack the command to better understand these results:
- We have three stacked methods here: .groupby(), .sum(), and .sort_values().
- groupby('origin')['arrivals']: For the Groupby object we defined in the previous step, groupby('origin'), we are isolating the "arrivals" column. Basically, we are asking to view the number of refugee arrivals by country of origin.
- .sum(): This method adds up the non-blank cells for each column or row. The results we see in the output show the total number of refugee arrivals by country of origin.
- .sort_values(ascending=False): This method specifies how we want our output to be sorted. We include the ascending=False parameter in order to request that the data be displayed with the highest number first.
- [:20]: This Python slice specifies that we just want to see the top 20 rows.
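Beyond .sum(), a Groupby object supports several aggregations at once via the .agg() method. A short sketch, not part of the workshop's own steps:

```python
# Compute the total, mean, and maximum number of arrivals per origin.
refugee_df.groupby('origin')['arrivals'].agg(['sum', 'mean', 'max'])
```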
You will notice that our output is not a DataFrame. Instead, it's a Series object, which doesn't allow us to select data or make further calculations on the data. We can convert it to a DataFrame by first storing the command above in a new variable and then stacking two extra methods, `to_frame()` and `reset_index()`:
ref_sum_df = refugee_df.groupby('origin')['arrivals'].sum().sort_values(ascending=False)[:20]
ref_sum_df = ref_sum_df.to_frame().reset_index()
ref_sum_df
origin | arrivals | |
---|---|---|
0 | Burma | 151875 |
1 | Iraq | 127326 |
2 | Bhutan | 85316 |
3 | Somalia | 71120 |
4 | Cuba | 39946 |
5 | Iran | 36683 |
6 | Dem. Rep. Congo | 25493 |
7 | Russia | 15656 |
8 | Eritrea | 14645 |
9 | Sudan | 12636 |
10 | Ukraine | 11727 |
11 | Burundi | 11152 |
12 | Vietnam | 10769 |
13 | Liberia | 9424 |
14 | Ethiopia | 8487 |
15 | Afghanistan | 6751 |
16 | Laos | 6609 |
17 | Moldova | 4482 |
18 | Syria | 2359 |
19 | Colombia | 1956 |
- .reset_index() In Pandas, the .reset_index() method is used to reset the index of a DataFrame, converting the index values into columns and providing a default integer index. This operation is useful when you want to remove the current index or revert to the default integer-based index.
Write code to create a GroupBy object based on the 'state' column:
```python
refugee_df.groupby('state')
```

Write code to sum the number of refugee arrivals by state, with the output showing the top 15 rows sorted in ascending order:

```python
refugee_df.groupby('state')['arrivals'].sum().sort_values(ascending=True)[:15]
```

What is the primary purpose of the .reset_index() method in Pandas?

- It reorganizes the DataFrame based on specified column values.
- It converts a Pandas Series into a DataFrame.
- It resets the index of a DataFrame, converting index values into columns.*
- It sorts the values of a DataFrame based on specified columns.

When might you use the .to_frame() method in Pandas?

- To group rows of a DataFrame based on one or more columns.
- To reset the index of a DataFrame.
- To sort the rows of a DataFrame based on specific values.
- To convert a Pandas Series into a DataFrame.*

What is the primary purpose of the .groupby() method in Pandas?

- To sort the rows of a DataFrame based on specific column values.
- To group rows of a DataFrame based on one or more columns.*
- To reset the index of a DataFrame.
- To convert a Pandas Series into a DataFrame.

Which Pandas method is used to sort the rows of a DataFrame based on specified column values?

- .reset_index()
- .to_frame()
- .groupby()
- .sort_values()*

To create plots and data visualizations in Pandas, we can add the .plot() method to any DataFrame or Series object that has appropriate numeric data.
We can specify the title with the `title=` parameter and the kind of plot by altering the `kind=` parameter:

- `bar` or `barh` for bar plots (h is for horizontal)
- `hist` for histogram
- `box` for boxplot
- `kde` or `density` for density plots
- `area` for area plots
- `scatter` for scatter plots
- `hexbin` for hexagonal bin plots
- `pie` for pie plots
- .plot() In Pandas, the .plot() method is used to create basic visualizations from DataFrame or Series data. It provides a convenient way to generate various plots, such as line charts, bar plots, and scatter plots, directly from Pandas objects for quick data exploration and analysis.
For example, we can visualize the data we got from our Groupby command looking at the total number of refugees by country of origin as a bar chart:
ref_sum_df.plot(kind='bar', x= 'origin', y='arrivals',
title='Total number of refugee arrivals in the U.S. \n by country of origin')
<Axes: title={'center': 'Total number of refugee arrivals in the U.S. \n by country of origin'}, xlabel='origin'>
Let’s unpack the command to better understand these results:

- `ref_sum_df`: This is the variable we created in the previous lesson summing the number of refugee arrivals by country of origin, with the output showing the top twenty rows sorted in descending order.
- `.plot(kind='bar', x='origin', y='arrivals', title='Total number of refugee arrivals in the U.S. \n by country of origin')`:
  - Here we are using the `.plot()` method to create a visualization, and we are specifying that we want a bar chart with the `kind='bar'` parameter.
  - We also specify the values for the x axis (origin) and y axis (arrivals).
  - We are also giving the chart a title with the `title='Total number of refugee arrivals in the U.S. \n by country of origin'` parameter.
  - Note: By adding `\n` in the title text, we signify that the text that follows should be on a new line.
- '\n' In Python, "\n" is an escape sequence representing a newline character. When encountered in a string, it causes a line break, moving the subsequent text to a new line, making it useful for formatting and organizing output.
We can also visualize the data as a pie chart:
ref_sum_df.set_index('origin')[:10].plot(kind='pie', y='arrivals',
title='Refugee arrivals in the U.S.')
<Axes: title={'center': 'Refugee arrivals in the U.S.'}, ylabel='arrivals'>
- We start by setting the index of the DataFrame to the `origin` column. This ensures that our legend will show the country names.
- We slice our results to show just the top 10 countries. This makes our chart more legible.
- Next, we use the `.plot()` method, specifying `pie` as the type of plot.
- We also specify the y values (in our case, `arrivals`).
We can also create time series using the Groupby method. For example, if we wanted to visualize the total number of refugees resettled in the U.S. across the 2005-2015 period, we would first create a Groupby object based on the “year” column (refer back to lesson 7 for more on Groupby objects).
refugee_df.groupby('year')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f9bd6807e80>
Next, we can create a new variable calculating the total number of refugees resettled each year.
total_arrivals_by_year = refugee_df.groupby('year')['arrivals'].sum()
Let’s break this command down:
- We have two stacked methods here: `.groupby()` and `.sum()`.
- `groupby('year')['arrivals']`: For the Groupby object, groupby('year'), we are isolating the "arrivals" column. Basically, we are asking to view the number of refugee arrivals by year.
- `.sum()`: This method returns the sum of the values over the requested axis. In our case, it will calculate the total number of refugee arrivals per year.
- We store our results in the `total_arrivals_by_year` variable.
Finally, we can add the `.plot()` method to create a line chart.
total_arrivals_by_year.plot(kind='line', title="Total Number of Refugee Arrivals by Year")
<Axes: title={'center': 'Total Number of Refugee Arrivals by Year'}, xlabel='year'>
In this command, we are adding the `.plot()` method to request a chart, specifying that we want a line graph with the `kind='line'` parameter. We are also giving the chart a title with the `title='Total Number of Refugee Arrivals by Year'` parameter.
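Because Pandas plotting is built on Matplotlib, the chart can also be saved to disk through Matplotlib's pyplot interface. A sketch, with an invented file name:

```python
import matplotlib.pyplot as plt

# Re-draw the line chart, then write it out as a PNG file.
total_arrivals_by_year.plot(kind='line', title="Total Number of Refugee Arrivals by Year")
plt.savefig("arrivals_by_year.png", dpi=150, bbox_inches="tight")
```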
- Time series A time series is a sequence of data points collected or recorded over a period of time, typically at regular intervals. It is used to analyze and visualize trends, patterns, and behaviors in data that evolve over time.
Write code to visualize the data we got from our Groupby command looking at the total number of refugees by country of origin as a scatter plot:

```python
ref_sum_df.plot(kind='scatter', x='origin', y='arrivals',
                title='Total number of refugee arrivals in the U.S. \n by country of origin')
```

Which Pandas method is commonly used to create basic visualizations, such as line charts, bar plots, and scatter plots, directly from DataFrame or Series data?

- .format()
- .visualize()
- .plot()*
- .display()

What type of chart is best suited for illustrating numerical proportions in a dataset, where each slice represents a proportion of the whole?

- Line chart
- Bar chart
- Pie chart*
- Time series chart

To output a new CSV file, we can use the `.to_csv()` method with a name for the file in quotation marks. For example, since we added the percent_total column to the refugee_df DataFrame, we may want to download the updated DataFrame as a CSV file to use it with other programs.
refugee_df.to_csv("new_refugee.csv", encoding='utf-8', index=False)
In addition to a filename, we’re also specifying that the encoding is utf-8 and that the Index (the bolded left-most column) is not included in the CSV file.
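As a quick optional check (not a required step), we could read the new file back in and peek at the first rows:

```python
# Confirm the export worked by loading the CSV we just wrote.
pd.read_csv("new_refugee.csv").head()
```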
- .to_csv(): In Pandas, the .to_csv() method is used to write the contents of a DataFrame to a CSV (Comma-Separated Values) file. It allows users to save the data in a tabular format that can be easily shared, imported, or analyzed using other tools.

From theory to practice: Building your Pandas skills with the Pandas documentation and other resources
Learning how to ask the right questions in a search engine like Google in order to find the solution to what you are trying to accomplish is the name of the game when you are just starting out with Python. Since Pandas is a popular and well-documented Python package, you are bound to find myriad resources that can help you get where you are going.
A good first place to start when you are searching for answers with Pandas is to look at the Pandas documentation, which is fairly accessible to beginners, and is an incredible resource when you want to learn how to use a new command. It also offers a User Guide for beginners with some fun exercises to deepen your learning.
Let’s say you wanted to find out more about the .sort_values method we used and understand the different parameters the method accepts:
- You could first search for .sort_values on the Pandas documentation website (https://pandas.pydata.org) and navigate to the “pandas.DataFrame.sort_values” documentation page.
- Scroll through the page for the info, and look at the second section for examples of how to use the method and its various parameters.
- If you don’t find an answer that makes sense to you on the Pandas documentation page, then look on Google for other resources. Some of our go-to websites for help are Stack Overflow, Geeks for Geeks, and Data to Fish.
Generative AI platforms like ChatGPT can be incredibly helpful in deepening your understanding of coding concepts in addition to helping you build your code. In this workshop we learned critical vocabulary such as "method" and "dataframe" that you can use to communicate with ChatGPT in order to ask it to build a Python script for you. Make sure to ask the platform to explain what each line of code is accomplishing so you can better understand the output it is giving you.
Prompt examples include:
- Explain to me what a dataframe is like I'm an 8th grader
- Could you help me build a script to change the names of the columns in my dataframe?
- How can I convert a series object into a dataframe in Pandas?
- This workshop owes a huge debt to Melanie Walsh's Introduction to Cultural Analytics & Python. This accessible, easy-to-follow textbook provides additional support for using Pandas, and it is a great place to start if you want to continue building your Pandas skills.