---
title: Data Manipulation in Pandas and Python
cover title: Pandas
description: "In this workshop, we are going to learn some basic commands in Pandas, an expansive Python library for working with tabular data like CSV files. You can think of Pandas as a more powerful version of Excel that operates within the Python environment, where you can wrangle, clean, analyze, and visualize data. Knowing how to use Pandas is important if you plan on working with datasets that include qualitative and/or quantitative data points."

programming_language: jupyter

learning objectives:
    - Import Pandas and read in a CSV file as a DataFrame
    - Explore your data, including displaying and sampling the data
    - Clean your data, including checking for duplicates and converting data types
    - Filter your data, including renaming, selecting, dropping, and adding columns
    - Analyze your data by sorting columns, grouping columns, and counting values
    - Visualize your data with basic bar charts, pie charts, and time series
    - Write a DataFrame to a CSV file
    - Build your Pandas skills with the Pandas documentation and other resources

estimated time:
    - 2 hours

prerequisites:
    - python:
        description: (required) This workshop relies heavily on concepts from the Intro to Python workshop, and having a basic understanding of how to use the commands discussed in the workshop will be central for anyone who wants to learn about data analysis with Python and Pandas.
        required: true
    - command-line:
        description: (recommended) This workshop makes some reference to concepts from the Command Line workshop, and having basic knowledge about how to use the command line will be central for anyone who wants to learn about text analysis with Python and Pandas.
        recommended: true
    - data ethics:
        description: (recommended) This workshop will give you a basis for thinking through the ethical considerations of your programming projects.
        recommended: true

instructors:
    - Rebecca Krisel

authors:
    - Rebecca Krisel

editors:
    - Stephen Zweibel

readings:
    - This workshop uses Jupyter Notebooks to process the Python commands in a clear and visual way. Read this very short introduction to how to use Notebooks, [A Beginner’s Tutorial to Jupyter Notebooks](https://towardsdatascience.com/a-beginners-tutorial-to-jupyter-notebooks-1b2f8705888a)
    - [Guide To Data Cleaning: Definition, Benefits, Components, And How To Clean Your Data](https://www.tableau.com/learn/articles/what-is-data-cleaning)

ethical considerations:
    - The dataset we are using for this workshop is from the [U.S. Refugee Arrivals Data](https://github.com/BuzzFeedNews/2015-11-refugees-in-the-united-states/blob/master/data/WRAPS-arrivals-by-destination-2005-2015-clean.csv). This dataset contains data about refugee arrivals to the United States between 2005 and 2015 and was originally compiled from the Department of State’s Refugee Processing Center by Jeremy Singer-Vine for his BuzzFeed article [“Where U.S. Refugees Come From — And Go — In Charts.”](https://www.buzzfeednews.com/article/jsvine/where-us-refugees-come-from-and-go-in-charts#.vooNwy74jO)
    - As with any dataset, responsible data analysis requires more than just technical tools like Pandas. We also need to interrogate the data. Who collected this data? How and why is this data being collected? What assumptions are baked into this data? What consequences does this data have in the world? What does this data tell us about our world? When exploring this dataset, we can consider the meaning of the status of refugee, who has a right to claim that status, and which refugees are considered eligible to legally resettle in the U.S. We could also think about who may not be represented in this data, including asylum seekers who migrate to the U.S. and become undocumented immigrants while waiting for their asylum claims to be processed.

projects:
    - The Simplest Data Science Project Using Pandas & Matplotlib:
        description: Further deepen your Pandas skills while making beautiful data visualizations in Matplotlib using a dataset (from Kaggle) of forest fires in Brazil.
        link:
    - Make a Gradebook With Python & pandas:
        description: Learn how you can automate the process of calculating grades at the end of the semester using Pandas.
        link:

resources:
    - Jupyter Notebook shortcuts, tips and tricks:
        description: Here are some of the keyboard shortcuts and text snippets to help you program faster!
        link:
---

Jupyter & JupyterLite

In this workshop, we are going to use JupyterLite as our Python environment, which is a lightweight version of Jupyter Notebooks.

You can think of Jupyter Notebooks as a comprehensive toolset for interactive computing and data exploration. It's like a fully equipped research laboratory where you can seamlessly blend narrative text, code, and visualizations. JupyterLite is a streamlined, lightweight version of this toolset, offering a more agile and accessible environment.

Jupyter Notebooks - Full Research Lab:

  • Versatility: Jupyter Notebooks are like a fully equipped research lab. They support various programming languages, allowing you to conduct diverse computational experiments.
  • Interactivity: Much like a lab where you can actively engage with your experiments, Jupyter Notebooks let you interactively run code cells, observe results, and refine your analyses on the fly.
  • Rich Outputs: Visualizations, graphs, and other dynamic outputs enrich your narrative, creating a comprehensive and visually engaging research document.

JupyterLite - Portable Research Kit:

  • Simplicity: JupyterLite is a simplified, more portable version. It's akin to having a compact research kit that allows you to conduct essential experiments without the full complexity of the lab.
  • Accessibility: When you need a quick, accessible environment for coding and exploration, JupyterLite provides a more straightforward setup. It's like having your essential tools ready without the need for a fully equipped lab.

Learning and On-the-Go:

  • Educational Tool: JupyterLite is often used for educational purposes or quick experiments. It serves as a friendly entry point for those learning about interactive computing without overwhelming them with the extensive features of the full Jupyter Notebooks.
  • Portability: JupyterLite is convenient for situations where the full-scale Jupyter environment might be impractical or unnecessary. It's like having a research companion that fits into your backpack.

In essence, JupyterLite is a more nimble version of Jupyter Notebooks, suitable for scenarios where a lighter computational environment is sufficient. It shares the same core principles of interactive computing but in a more accessible and portable form, making it a valuable tool for quick experiments, learning, and situations where resource constraints or simplicity are paramount.

- **Jupyter Notebook**: Comprehensive toolset for interactive computing and data exploration, blending narrative text, code, and visualizations
- **JupyterLite**: A lightweight version of Jupyter Notebook

Preparing your workspace and folders

Keeping all your files for a particular project in a designated file directory will keep your project organized and will make it easier to read in your files.

Creating a new folder

Let’s create a “pandas_workshop” folder in our JupyterLite code editor workspace:

  • Click on "Open Code Editor" in the top right corner of this browser window.

Open Code Editor

  • In the folders toolbar, click the icon for the folder with the '+' on it. This will create a new "Untitled Folder."

Create Folder

  • To rename the folder, control-click (macOS) or right-click (Windows) on the folder name and select "Rename" from the drop-down menu.
  • Rename the folder “pandas_workshop”.
- **Code editor**: A code editor is a software application that is specifically designed for writing and editing source code of computer programs. It provides features and tools to make the process of coding more efficient and productive for developers. Code editors are essential tools for software development and come in various forms, ranging from simple text editors to feature-rich integrated development environments (IDEs).

Download the Full Code in a Jupyter Notebook & upload it into the code editor

We’ve saved all of the code for this section in a Jupyter Notebook file.

  • To access this file, click the following button:
  • Open the code editor in the top right corner of this browser window and locate the pandas_workshop_2024.ipynb file
  • When prompted, select the "Python Pyodide" kernel
  • Finally, drag and drop the pandas_workshop_2024.ipynb file into your pandas_workshop folder
  • Once it's in that folder, you should be all set!

In this file you will find all of the workshop commands and the expected outputs. If you ever feel stuck or can’t seem to be able to advance in the workshop, you can open this file and see how we did it.

For the best possible experience, we encourage you to:

  • Follow the workshop by typing all the code yourself.
  • Avoid copying and pasting the code; much of the learning happens through typing it yourself.
  • Only check the pandas_workshop_2024.ipynb file if you get lost or cannot get the right output. Before opening it, spend some time trying to figure out on your own why it isn’t working. A big part of coding is learning to identify what we are doing wrong.
  • Avoid working with both files open at the same time; it is easy to get confused and start modifying the wrong one.

These are only suggestions. Maybe they will work for you, maybe they won’t, so feel free to do what suits you best. You are in charge here!

Launching a new JupyterLite Notebook

Finally, let's launch a new JupyterLite Notebook from within the “pandas_workshop” folder.

  • Make sure you are in the “pandas_workshop” folder. If not, navigate there.
  • From the launcher window, click the "Python (Pyodide)" button

Launch Notebook

  • A new JupyterLite Notebook will open. To rename it, control-click (macOS) or right-click (Windows) on the notebook name and select "Rename" from the drop-down menu.
  • Rename the notebook “pandas_workshop”.

Rename Notebook

Hooray! You are all set to move forward!

- **Jupyter Notebook**: Comprehensive toolset for interactive computing and data exploration, blending narrative text, code, and visualizations
- **JupyterLite**: A lightweight version of Jupyter Notebook

Getting started with Pandas

  • Pandas is a Python software library
  • The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals
  • You can think of it as a FREE equivalent to Stata or SPSS for data manipulation and analysis. It is also more powerful than Excel
  • Knowing how to use Pandas is important if you plan on working with data organized in spreadsheets (either quantitative or qualitative)
- **Pandas**: A powerful Python library for data manipulation and analysis, particularly designed to handle structured data. It provides data structures like DataFrames that allow users to easily organize, clean, and analyze tabular data, such as spreadsheets or SQL tables. With Pandas, non-technical users can perform tasks like filtering, aggregating, and visualizing data without extensive programming knowledge, making it a valuable tool for data exploration and preparation.

Import Pandas and Matplotlib

In the first blank cell, type the following commands to import the Pandas and Matplotlib libraries into our Jupyter Notebook. To run the cell, you can click the “Run” button in the top toolbar, or press Shift + Return.

import pandas as pd
import matplotlib

The first import statement not only imports the Pandas library but also gives it the alias “pd.” Using this alias will save us from having to type out the entire word “Pandas” each time we need to use it. Libraries are sets of instructions that Python can use to perform specialized functions.

The second import, matplotlib, is used to import the Matplotlib library, which is a plotting library for Python and Pandas. We will use Matplotlib to create visualizations of our data later in the workshop.

If you don’t see an error when you run the notebook—that is, if there is no output—you can move on to the next step. In programming, it is common for things to go right and for nothing visible to happen as a result. This is what we like to call a silent success.

- **Import**: In Python, the "import" keyword is used to bring external modules or libraries into your script, making their functionality available for use. Once imported, you can access the functions, classes, or variables defined in the module using the module's name as a prefix.
- **Silent success**: In Python, a "silent success" typically refers to a successful operation or function call that does not produce any visible output or raise explicit errors. It implies that the operation was completed successfully, but the result might not be explicitly displayed, making it important for developers to check return values or logs to confirm the success.

Read in a CSV file as a DataFrame

Next, we will read in our dataset saved as a CSV file. We will specifically work with the refugee-arrivals-by-destination.csv dataset, which contains data about refugee arrivals to the United States between 2005 and 2015, including the number of refugees who arrived in each U.S. city and state, the year that they arrived, and the country from which they arrived.

We will connect to the dataset directly from this GitHub link:

import pyodide_http # makes it possible to read https links in pyodide
pyodide_http.patch_all()
url = 'https://raw.githubusercontent.com/GC-DRI/DRI24/main/uploads/pandas/refugee-arrivals-by-destination.csv'
refugee_df = pd.read_csv(url, delimiter=",", encoding='utf-8')
  • With this command, we are creating a Pandas DataFrame object, which is a 2-dimensional labeled data structure with columns of different types. You can think of it like a spreadsheet or SQL table, or a dictionary of Series objects.
  • It is common practice to abbreviate DataFrame with “df”, as in refugee_df.
- **DataFrame**: In Python, a DataFrame is a two-dimensional, tabular data structure provided by the Pandas library. It organizes data into rows and columns, similar to a spreadsheet, and offers powerful tools for data manipulation, analysis, and cleaning.
- **Delimiter**: In a CSV (Comma-Separated Values) file, a delimiter is a character or sequence of characters that separates individual fields or data elements within each row. Common delimiters include commas, tabs, or semicolons, and they enable the proper parsing of data when reading or writing CSV files.
- **Encoding**: In the context of text data, encoding refers to the method used to represent characters as binary data. UTF-8, for example, is a widely used encoding that can represent most characters in the Unicode standard, allowing for the storage and transmission of text in multiple languages.
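To see what the delimiter parameter does, here is a small sketch (not part of the workshop dataset) that reads a CSV from an in-memory string instead of a URL; the column values are made up for illustration:

```python
import io
import pandas as pd

# A tiny CSV as a string; io.StringIO makes it readable like a file.
# (The encoding parameter applies when reading raw files or bytes.)
csv_text = "year,origin,arrivals\n2005,Afghanistan,0\n2009,Afghanistan,5\n"

# delimiter tells read_csv how the fields are separated
demo_df = pd.read_csv(io.StringIO(csv_text), delimiter=",")

print(demo_df.shape)             # (2, 3)
print(demo_df.columns.tolist())  # ['year', 'origin', 'arrivals']
```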

Evaluation

What is the primary data structure in Pandas for working with tabular data?

- Series
- DataFrame*
- Array
- List

What is a DataFrame?

- A two-dimensional data structure*
- A three-dimensional data structure
- A single-dimensional data structure
- A series object

Data exploration

In the following section we will learn how to view the DataFrame in addition to viewing smaller sections of it.

Display the DataFrame

To display the DataFrame, we can run a cell with the variable name of the DataFrame:

refugee_df
year origin dest_state dest_city arrivals
0 2005 Afghanistan Alabama Mobile 0
1 2006 Afghanistan Alabama Mobile 0
2 2007 Afghanistan Alabama Mobile 0
3 2008 Afghanistan Alabama Mobile 0
4 2009 Afghanistan Alabama Mobile 5
... ... ... ... ... ...
121240 2011 Vietnam Wisconsin Milwaukee 0
121241 2012 Vietnam Wisconsin Milwaukee 4
121242 2013 Vietnam Wisconsin Milwaukee 2
121243 2014 Vietnam Wisconsin Milwaukee 0
121244 2015 Vietnam Wisconsin Milwaukee 0

121245 rows × 5 columns

Let’s take a look at a few elements in this DataFrame:

  • Index
    • The bolded ascending numbers in the very left-hand column of the DataFrame are called the Pandas Index. You can select rows based on the Index.
    • By default, the Index is a sequence of numbers starting with zero, and each value serves as a unique identifier for its row. However, the Index does not have to be this default sequence: you can set it to any column in your DataFrame.
  • Truncation
    • The DataFrame is truncated, signaled by the ellipses (...) in the middle of every column.
    • The DataFrame is truncated because, by default, Pandas will display only 60 rows and 20 columns. You can change these default settings (see the Pandas documentation).
  • Rows x Columns
    • Pandas reports how many rows and columns are in this dataset at the bottom of the output. Our DataFrame has 121,245 rows × 5 columns.
  • NaN
    • NaN is the Pandas value for any missing data.
- **Index**: In a Pandas DataFrame, the index serves as a unique identifier for each row, allowing for easy access and manipulation of data. It provides a label for the rows, facilitating efficient data retrieval, alignment, and merging operations in various analytical tasks.
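The truncation limits mentioned above can be adjusted with Pandas' display options. A minimal sketch:

```python
import pandas as pd

# Pandas truncates output once a DataFrame exceeds the display limits
print(pd.get_option("display.max_rows"))  # the default is 60

# Raise the row limit to 100 and show all columns before truncating
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", None)

print(pd.get_option("display.max_rows"))  # now 100
```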

Display a selection of the DataFrame

We can also display the first 2 rows of the DataFrame with the .head() method

refugee_df.head(2)
year origin dest_state dest_city arrivals
0 2005 Afghanistan Alabama Mobile 0
1 2006 Afghanistan Alabama Mobile 0

We can accomplish the same task using a slice instead:

refugee_df[:2]
year origin dest_state dest_city arrivals
0 2005 Afghanistan Alabama Mobile 0
1 2006 Afghanistan Alabama Mobile 0

We can also display the last 10 rows of the DataFrame with the .tail() method

refugee_df.tail(10)
year origin dest_state dest_city arrivals
121235 2006 Vietnam Wisconsin Milwaukee 5
121236 2007 Vietnam Wisconsin Milwaukee 18
121237 2008 Vietnam Wisconsin Milwaukee 12
121238 2009 Vietnam Wisconsin Milwaukee 4
121239 2010 Vietnam Wisconsin Milwaukee 7
121240 2011 Vietnam Wisconsin Milwaukee 0
121241 2012 Vietnam Wisconsin Milwaukee 4
121242 2013 Vietnam Wisconsin Milwaukee 2
121243 2014 Vietnam Wisconsin Milwaukee 0
121244 2015 Vietnam Wisconsin Milwaukee 0

We can accomplish the same task using a slice instead:

refugee_df[-10:]
year origin dest_state dest_city arrivals
121235 2006 Vietnam Wisconsin Milwaukee 5
121236 2007 Vietnam Wisconsin Milwaukee 18
121237 2008 Vietnam Wisconsin Milwaukee 12
121238 2009 Vietnam Wisconsin Milwaukee 4
121239 2010 Vietnam Wisconsin Milwaukee 7
121240 2011 Vietnam Wisconsin Milwaukee 0
121241 2012 Vietnam Wisconsin Milwaukee 4
121242 2013 Vietnam Wisconsin Milwaukee 2
121243 2014 Vietnam Wisconsin Milwaukee 0
121244 2015 Vietnam Wisconsin Milwaukee 0
- **Function vs. method**: In Python, a function is a block of reusable code that can be defined and called independently of any object, while a method is a function associated with an object, often designed to operate on that specific object. Methods are invoked using dot notation on an instance of a class, whereas functions are typically standalone and can be called without an associated object.
- **.head() & .tail()**: In Pandas, the .head() method is used to display the first few rows of a DataFrame, providing a quick overview of its structure, while the .tail() method shows the last few rows. Both methods help users inspect data frames efficiently.

Display a random sample of the DataFrame

We can also look at a random sample of data with the .sample() method

refugee_df.sample(15)
year origin dest_state dest_city arrivals
102329 2009 Dem. Rep. Congo Texas Cedar Park 0
40110 2006 Ukraine Illinois Woodridge 0
45485 2013 Cuba Louisiana Avondale 2
62900 2015 Liberia Minnesota Brooklyn Center 1
46604 2010 Iraq Maine Norway 10
116748 2007 Latvia Washington Lynnwood 0
5039 2014 Burma California Chula Vista 0
66487 2005 Iraq Missouri Saint Louis 0
108591 2012 Iraq Vermont Essex Junction 0
94980 2008 Russia Pennsylvania Warminster 2
4690 2006 Bhutan California Alameda 0
72042 2005 Liberia New Jersey Englewood 0
66931 2009 Somalia Missouri Kansas City 141
91265 2011 Belarus Pennsylvania Millersburg 0
2077 2011 Iraq Arizona Goodyear 1

We can tell it's a random sample because the index numbers are out of order.

- **.sample()**: In Pandas, the `.sample()` method is used to randomly select a specified number of rows from a DataFrame. It provides a way to obtain a random subset of data for exploratory analysis or testing purposes.

Challenge

Using both the .head() method and a slice, write code to display the top 9 rows of the DataFrame:

Solution

```python
refugee_df.head(9)
refugee_df[:9]
```

Write code to display the last 5 rows of the DataFrame (there are two ways to do this):

Solution

```python
refugee_df.tail(5)
refugee_df[-5:]
```

Write code to display a random selection of 10 rows in the DataFrame:

Solution

```python
refugee_df.sample(10)
```

Evaluation

What is the purpose of the .head() method in Pandas?

- To extract the first few rows of a DataFrame*
- To display a summary of the data types in a DataFrame
- To sort the DataFrame in ascending order
- To count the number of rows in a DataFrame

Which of the following commands would return the top three rows in a dataframe? Select all that apply

- `refugee_df[0:3]`*
- `refugee_df[0:4]`
- `refugee_df[0:2]`
- `refugee_df.head(3)`*

Basic data cleaning

In the following sections we will review some basic data cleaning steps including checking and converting our data types in addition to checking and removing duplicate rows.

Data cleaning is a crucial process for refining tabular data. Begin by addressing missing values, then tackle outliers to maintain data integrity and accuracy, and remove duplicates so the dataset contains only unique records. Standardize formats for consistency, and finally, engage in feature engineering to enhance data insights and prepare the data for analysis. These steps collectively ensure a refined and reliable foundation for effective data exploration and interpretation.
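As a rough sketch of how a few of these steps look in Pandas, here is a small, made-up DataFrame (not the refugee dataset) with a missing value, a duplicate row, and numbers stored as strings:

```python
import pandas as pd

# Hypothetical example data for illustration only
df = pd.DataFrame({
    "city": ["Mobile", "Mobile", "Auburn", None],
    "arrivals": ["5", "5", "12", "3"],
})

df = df.dropna(subset=["city"])              # drop rows with missing values
df = df.drop_duplicates()                    # remove exact duplicate rows
df["arrivals"] = df["arrivals"].astype(int)  # standardize the data type

print(len(df))               # 2 rows remain
print(df["arrivals"].sum())  # 17
```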

Data Types

We can get information about the columns in the DataFrame by using the .info() method.

refugee_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 121245 entries, 0 to 121244
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   year        121245 non-null  int64 
 1   origin      121245 non-null  object
 2   dest_state  121245 non-null  object
 3   dest_city   121245 non-null  object
 4   arrivals    121245 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 4.6+ MB

This report tells us how many non-null, or non-blank, values are in each column, as well as what type of data is in each column.

Pandas uses a different lexicon to describe data types from the one we learned in our Intro to Python curriculum. Below is a table that explains what each data type means:

| Pandas data type | Python data type | Usage                                        |
|------------------|------------------|----------------------------------------------|
| object           | String or mixed  | Text or mixed numeric and non-numeric values |
| float64          | float            | Floating point numbers                       |
| int64            | integer          | Integer numbers                              |
| datetime64       | NA               | Date and time values                         |
- **Data type**: In Python, data types define the type of values that variables can hold, such as integers, floats, strings, and booleans. Each data type comes with specific operations that can be performed on the associated values.

Converting data types

Keeping this in mind, it looks as though the data type for the year column is “int64” instead of “datetime64.” We can convert it with the Pandas to_datetime method:

refugee_df['year'] = pd.to_datetime(refugee_df['year'], format="%Y")

This command translates to: use the to_datetime method from the Pandas library to convert the values in the “year” column of the “refugee_df” DataFrame to datetime data types, and save the result back into the “year” column. We specify format="%Y" since our data consists of four-digit years.

We can then check to see if the data type was properly converted using the .dtypes attribute, which is similar to the .info() method, except that it only provides information on data types.

refugee_df.dtypes
year          datetime64[ns]
origin                object
dest_state            object
dest_city             object
arrivals               int64
dtype: object

Data Types

As we can see, the data in the “year” column was successfully transformed into the datetime64 data type.

- **Datetime**: In Pandas, the datetime data type represents date and time information. It allows for efficient handling, manipulation, and analysis of temporal data in a tabular format, offering functionalities for parsing, formatting, and performing operations on date and time values.
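Once a column holds datetime64 values, the .dt accessor lets you pull components such as the year back out. A small sketch with made-up years:

```python
import pandas as pd

# Convert four-digit year strings to datetime values, as in the workshop
years = pd.DataFrame({"year": ["2005", "2010", "2015"]})
years["year"] = pd.to_datetime(years["year"], format="%Y")

# The .dt accessor exposes datetime components such as the integer year
print(years["year"].dt.year.tolist())  # [2005, 2010, 2015]
```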

Check for duplicate rows

As part of our data cleaning process, we want to check for duplicate rows. We can do this by using the .duplicated() method inside a filter to isolate only the rows in the DataFrame that are exact duplicates. Filtering data by certain values is similar to selecting columns. We add the parameter keep=False, which displays all of the duplicated values in the dataset (meaning none are considered unique), rather than only the first (keep='first') or the last (keep='last') occurrence.

refugee_df[refugee_df.duplicated(keep=False)]
year origin dest_state dest_city arrivals
13 2007-01-01 Burma Alabama Auburn 0
14 2008-01-01 Burma Alabama Auburn 0
15 2009-01-01 Burma Alabama Auburn 0
16 2010-01-01 Burma Alabama Auburn 0
17 2007-01-01 Burma Alabama Auburn 0
18 2008-01-01 Burma Alabama Auburn 0
19 2009-01-01 Burma Alabama Auburn 0
20 2010-01-01 Burma Alabama Auburn 0
155 2013-01-01 Cuba Alabama Russellville 0
156 2014-01-01 Cuba Alabama Russellville 0
157 2015-01-01 Cuba Alabama Russellville 0
158 2005-01-01 Cuba Alabama Theodore 0
159 2006-01-01 Cuba Alabama Theodore 0
160 2013-01-01 Cuba Alabama Russellville 0
161 2014-01-01 Cuba Alabama Russellville 0
162 2015-01-01 Cuba Alabama Russellville 0
163 2005-01-01 Cuba Alabama Theodore 0
164 2006-01-01 Cuba Alabama Theodore 0
236 2012-01-01 Iran Alabama Mobile 1
237 2013-01-01 Iran Alabama Mobile 6
238 2014-01-01 Iran Alabama Mobile 3
239 2015-01-01 Iran Alabama Mobile 5
240 2005-01-01 Iran Alabama Pelham 0
247 2012-01-01 Iran Alabama Mobile 1
248 2013-01-01 Iran Alabama Mobile 6
249 2014-01-01 Iran Alabama Mobile 3
250 2015-01-01 Iran Alabama Mobile 5
251 2005-01-01 Iran Alabama Pelham 0

Looks like we have a few duplicate rows in our dataset.

To remove those duplicates, we can use the .drop_duplicates() method, choosing to keep either the first or the last instance of each duplicate:

refugee_df = refugee_df.drop_duplicates(keep='first')

We can check to see if the command got rid of the duplicate rows by running the .duplicated() method again:

refugee_df[refugee_df.duplicated(keep=False)]
year origin dest_state dest_city arrivals

Great news! We successfully removed our duplicate rows!

- **.duplicated()**: The .duplicated() method in Pandas identifies duplicate rows in a DataFrame based on the values in all columns. When keep=False is specified, all occurrences of the duplicated rows are marked as True, meaning none are considered unique, and all duplicates are included in the result. If keep is set to 'first' or 'last', it retains the first or last occurrence of each duplicated set, respectively, while marking the rest as True.
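To see how the keep parameter changes which rows are flagged, here is a sketch on a tiny made-up DataFrame:

```python
import pandas as pd

# Rows 0 and 1 are exact duplicates of each other
df = pd.DataFrame({"origin": ["Burma", "Burma", "Cuba"],
                   "arrivals": [0, 0, 2]})

print(df.duplicated(keep=False).tolist())    # [True, True, False]: every copy flagged
print(df.duplicated(keep="first").tolist())  # [False, True, False]: first copy spared
print(df.duplicated(keep="last").tolist())   # [True, False, False]: last copy spared
```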

Evaluation

What does the keep parameter in the .duplicated() method of Pandas control?

- It specifies the number of duplicates to keep.
- It determines whether to display all occurrences or just the first occurrence of duplicated rows.*
- It controls the sorting order of duplicated values.
- It determines whether to keep or remove duplicated rows.

Rename, select, drop, filter and add new columns

In the following section we will learn how to rename, select, drop, filter, and add new columns.

See list of columns

To see a full list of the columns in our DataFrame, we can run the following command:

refugee_df.columns

List of Columns

Our DataFrame has relatively few columns, so seeing the full list is not absolutely necessary in our case. This step becomes important when you are working with DataFrames with many columns.

Rename columns

To improve the readability of our dataset, we can rename columns. In our case, let’s rename “dest_state” as “state” and “dest_city” as “city”. We will use the .rename() method and the columns= parameter. Note that in this case we are setting the DataFrame equal to the returned value of the method so as to save the results into the DataFrame.

refugee_df = refugee_df.rename(columns={'dest_state': 'state', 'dest_city': 'city'})
refugee_df
year origin state city arrivals
0 2005-01-01 Afghanistan Alabama Mobile 0
1 2006-01-01 Afghanistan Alabama Mobile 0
2 2007-01-01 Afghanistan Alabama Mobile 0
3 2008-01-01 Afghanistan Alabama Mobile 0
4 2009-01-01 Afghanistan Alabama Mobile 5
... ... ... ... ... ...
121240 2011-01-01 Vietnam Wisconsin Milwaukee 0
121241 2012-01-01 Vietnam Wisconsin Milwaukee 4
121242 2013-01-01 Vietnam Wisconsin Milwaukee 2
121243 2014-01-01 Vietnam Wisconsin Milwaukee 0
121244 2015-01-01 Vietnam Wisconsin Milwaukee 0

121231 rows × 5 columns

- **.rename()**: In Pandas, the rename method is used to change the labels (names) of columns or index in a DataFrame. It allows for both selective renaming and modifying all labels using a dictionary or a function.

Select columns

Let’s say we wanted to view data from just one column in the DataFrame. To do this, we could run the following command:

refugee_df[['state']]
state
0 Alabama
1 Alabama
2 Alabama
3 Alabama
4 Alabama
... ...
121240 Wisconsin
121241 Wisconsin
121242 Wisconsin
121243 Wisconsin
121244 Wisconsin

121231 rows × 1 columns

Here we use double brackets around the column name to transform the column from a Series object into a DataFrame. Basically, the interior brackets are for lists, and the outside brackets are indexing operators. If you are curious to see the difference, try the following command instead: refugee_df['state'].

To view additional columns at the same time, you can add them to the list within the square brackets, separated by a comma. However, you can’t select multiple columns as a Series (try: refugee_df['state','city'])

refugee_df[['state','city']]
state city
0 Alabama Mobile
1 Alabama Mobile
2 Alabama Mobile
3 Alabama Mobile
4 Alabama Mobile
... ... ...
121240 Wisconsin Milwaukee
121241 Wisconsin Milwaukee
121242 Wisconsin Milwaukee
121243 Wisconsin Milwaukee
121244 Wisconsin Milwaukee

121231 rows × 2 columns

- DataFrame vs Series Object In Pandas, a Series is a one-dimensional labeled array that can hold any data type, while a DataFrame is a two-dimensional tabular data structure comprising multiple columns, each of which is a Series. Essentially, a DataFrame is a collection of Series that share the same index, providing a structure similar to a table in a relational database.
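You can check the difference between single and double brackets directly, as this small sketch on toy data shows:

```python
import pandas as pd

# Toy DataFrame (hypothetical values, for illustration only)
df = pd.DataFrame({'state': ['Alabama', 'Wisconsin'],
                   'city': ['Mobile', 'Milwaukee']})

series = df['state']     # single brackets return a one-dimensional Series
frame = df[['state']]    # double brackets return a one-column DataFrame
```

Printing `type(series)` and `type(frame)` confirms the first is a `Series` and the second a `DataFrame`.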

Drop columns

To remove a column from the DataFrame, we can use the .drop() method and include the column name. In our case, we could drop the “city” column and save the result as a new DataFrame, refugee_drop_city_df, so we don’t overwrite our original DataFrame.

refugee_drop_city_df = refugee_df.drop(columns="city")
refugee_drop_city_df
year origin state arrivals
0 2005-01-01 Afghanistan Alabama 0
1 2006-01-01 Afghanistan Alabama 0
2 2007-01-01 Afghanistan Alabama 0
3 2008-01-01 Afghanistan Alabama 0
4 2009-01-01 Afghanistan Alabama 5
... ... ... ... ...
121240 2011-01-01 Vietnam Wisconsin 0
121241 2012-01-01 Vietnam Wisconsin 4
121242 2013-01-01 Vietnam Wisconsin 2
121243 2014-01-01 Vietnam Wisconsin 0
121244 2015-01-01 Vietnam Wisconsin 0

121231 rows × 4 columns

- .drop() In Pandas, the drop method is used to remove specified rows or columns from a DataFrame. It returns a new DataFrame with the specified elements removed, and the original DataFrame remains unchanged unless the inplace parameter is set to True or the new DataFrame is stored in a new variable.
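A minimal sketch of that behavior on toy data (hypothetical values), showing that the original DataFrame is left intact:

```python
import pandas as pd

# Toy DataFrame (hypothetical values, for illustration only)
df = pd.DataFrame({'state': ['Alabama'], 'city': ['Mobile'], 'arrivals': [0]})

# .drop() returns a new DataFrame; df itself still has the "city" column
dropped = df.drop(columns='city')
```

After running this, `dropped` has only the `state` and `arrivals` columns, while `df` still contains all three.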

Filter columns

We can filter a Pandas DataFrame to select only certain values. Filtering data by certain values is similar to selecting columns.

We type the name of the DataFrame followed by square brackets and then, instead of inserting a column name, we insert a True/False condition. For example, to select only rows that contain the value “Iraq”, we could run the following command:

refugee_iraq_df = refugee_df[refugee_df['origin'] == 'Iraq']
refugee_iraq_df
year origin state city arrivals
267 2005-01-01 Iraq Alabama Birmingham 0
268 2006-01-01 Iraq Alabama Birmingham 0
269 2007-01-01 Iraq Alabama Birmingham 0
270 2008-01-01 Iraq Alabama Birmingham 6
271 2009-01-01 Iraq Alabama Birmingham 0
... ... ... ... ... ...
120503 2011-01-01 Iraq Wisconsin Wauwatosa 0
120504 2012-01-01 Iraq Wisconsin Wauwatosa 0
120505 2013-01-01 Iraq Wisconsin Wauwatosa 2
120506 2014-01-01 Iraq Wisconsin Wauwatosa 0
120507 2015-01-01 Iraq Wisconsin Wauwatosa 6

18205 rows × 5 columns

We stored our results in a new variable, refugee_iraq_df.

- == In Python, the == operator is used for equality comparison, determining whether two values or expressions are equal. It returns True if the values are equal and False otherwise.

Drop Rows

If we wanted to exclude all rows containing the value “Iraq”, we could run the following command:

refugee_drop_iraq_df = refugee_df[refugee_df['origin'] != 'Iraq']
refugee_drop_iraq_df
year origin state city arrivals
0 2005-01-01 Afghanistan Alabama Mobile 0
1 2006-01-01 Afghanistan Alabama Mobile 0
2 2007-01-01 Afghanistan Alabama Mobile 0
3 2008-01-01 Afghanistan Alabama Mobile 0
4 2009-01-01 Afghanistan Alabama Mobile 5
... ... ... ... ... ...
121240 2011-01-01 Vietnam Wisconsin Milwaukee 0
121241 2012-01-01 Vietnam Wisconsin Milwaukee 4
121242 2013-01-01 Vietnam Wisconsin Milwaukee 2
121243 2014-01-01 Vietnam Wisconsin Milwaukee 0
121244 2015-01-01 Vietnam Wisconsin Milwaukee 0

103026 rows × 5 columns

- != In Python, the != operator is used for inequality comparison, checking whether two values or expressions are not equal. It returns True if the values are different and False if they are equal.
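Filtering conditions can also be combined, with each condition wrapped in parentheses and joined by & (and) or | (or). This is not covered above, so here is a quick hedged sketch on toy data (hypothetical values):

```python
import pandas as pd

# Toy DataFrame (hypothetical values, for illustration only)
df = pd.DataFrame({'origin': ['Iraq', 'Iraq', 'Cuba'],
                   'state': ['Alabama', 'Wisconsin', 'Florida']})

# Keep rows where origin is Iraq AND state is Alabama
iraq_alabama = df[(df['origin'] == 'Iraq') & (df['state'] == 'Alabama')]
```

Only the first row matches both conditions, so `iraq_alabama` contains a single row.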

Add columns

We can also add columns to the DataFrame. For example, we can add a percent_total column to calculate the percentage of total refugee arrivals for each row.

refugee_df['percent_total'] = (refugee_df['arrivals'] / refugee_df['arrivals'].sum())*100
refugee_df
year origin state city arrivals percent_total
0 2005-01-01 Afghanistan Alabama Mobile 0 0.000000
1 2006-01-01 Afghanistan Alabama Mobile 0 0.000000
2 2007-01-01 Afghanistan Alabama Mobile 0 0.000000
3 2008-01-01 Afghanistan Alabama Mobile 0 0.000000
4 2009-01-01 Afghanistan Alabama Mobile 5 0.000743
... ... ... ... ... ... ...
121240 2011-01-01 Vietnam Wisconsin Milwaukee 0 0.000000
121241 2012-01-01 Vietnam Wisconsin Milwaukee 4 0.000595
121242 2013-01-01 Vietnam Wisconsin Milwaukee 2 0.000297
121243 2014-01-01 Vietnam Wisconsin Milwaukee 0 0.000000
121244 2015-01-01 Vietnam Wisconsin Milwaukee 0 0.000000

121231 rows × 6 columns

Note: refugee_df['arrivals'].sum() calculates the sum of all the values in the arrivals column.

You can read the command we just ran as: create a new column that calculates the number of arrivals in a row divided by the total number of arrivals in the dataset, times 100. The result of this calculation will equal the percentage of total refugee arrivals for each row.

- .sum() In Pandas, the .sum() method is used to calculate the sum of values along a specified axis in a DataFrame or Series. It can be applied to numeric columns, providing the total sum of the values in the specified axis.
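The percent-of-total calculation above can be sketched on toy data (hypothetical values) like this:

```python
import pandas as pd

# Toy DataFrame (hypothetical values, for illustration only)
df = pd.DataFrame({'arrivals': [5, 4, 2]})

total = df['arrivals'].sum()                # total of the column: 11
share = (df['arrivals'] / total) * 100      # each row's percentage of the total
```

The individual percentages always add back up to 100, which is a handy sanity check for a new percent_total column.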

Challenge

Write code for selecting the year, origin, and arrivals columns:

Solution

```python
new_df = refugee_df[['year', 'origin', 'arrivals']]
```

Write code to drop all rows where the origin is Vietnam:

Solution

```python
no_vietnam_df = refugee_df[refugee_df['origin'] != 'Vietnam']
```

Write code to only keep rows where the arrival state is California:

Solution

```python
california_df = refugee_df[refugee_df['state'] == 'California']
```

Evaluation

How do you select a specific column in a Pandas DataFrame named "df"?

- df.select("column_name")
- df.column_name
- df.get_column("column_name")
- df[["column_name"]]*

What is the primary purpose of the == operator in Python?

- It assigns a value to a variable.
- It makes sure two values are different.
- It is used for mathematical addition.
- It checks if two values or expressions are equal.*

Sort Columns, Groupby Columns, & Calculations

In the following section we will learn how to sort and group columns in order to perform calculations.

Did you know you can perform both quantitative and qualitative manipulations across entire columns in Pandas? You can use the .apply() method along with a lambda function or a user-defined function to transform text values. For instance, apply the .str accessor to access string methods like .lower() or .replace() for case normalization or substitution. The apply function works element-wise, allowing you to process each element in the column independently. This approach empowers you to clean, modify, or extract information from text columns with ease, facilitating comprehensive data preparation for analysis.
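As a quick sketch of those ideas on a toy column (all values below are made up for illustration):

```python
import pandas as pd

# Toy column of country names with messy formatting (hypothetical values)
df = pd.DataFrame({'origin': ['Burma ', 'IRAQ', 'cuba']})

# Vectorized string methods via the .str accessor: strip whitespace, normalize case
df['origin_clean'] = df['origin'].str.strip().str.title()

# Element-wise transformation with .apply() and a lambda
df['origin_len'] = df['origin_clean'].apply(lambda name: len(name))
```

After this, `origin_clean` holds 'Burma', 'Iraq', and 'Cuba', and `origin_len` holds the length of each cleaned name.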

Stacking requests

In this lesson, we will be using commands that stack various requests such as methods, parameters, operators, and more to define the command. Pandas encourages this kind of stacking, but it can seem overwhelming at first to beginners. For example, as we will see below, a command could include two or more methods that stack on top of each other, and end with a slice operator to view only the top N rows of the results. In addition, a command can include specific parameters to call out a particular column or to sort the data in descending order.

We will move slowly through each of the following commands to break them down.

Sort columns

To sort a DataFrame, we can use the .sort_values() method with the parameter by= and including the name of the column we want to sort by written in quotation marks.

For example, we can sort the DataFrame by the arrivals column:

refugee_df.sort_values(by='arrivals', ascending=False)[:15]
year origin state city arrivals percent_total
11359 2009-01-01 Iraq California El Cajon 2813 0.418279
8718 2008-01-01 Iran California Glendale 2025 0.301107
11360 2010-01-01 Iraq California El Cajon 1811 0.269286
8719 2009-01-01 Iran California Glendale 1722 0.256053
58761 2012-01-01 Iraq Michigan Southfield 1720 0.255755
11362 2012-01-01 Iraq California El Cajon 1698 0.252484
8717 2007-01-01 Iran California Glendale 1637 0.243414
64090 2006-01-01 Somalia Minnesota Minneapolis 1586 0.235830
11358 2008-01-01 Iraq California El Cajon 1569 0.233302
25699 2005-01-01 Cuba Florida Miami 1537 0.228544
87927 2006-01-01 Somalia Ohio Columbus 1319 0.196129
11363 2013-01-01 Iraq California El Cajon 1238 0.184084
40503 2014-01-01 Burma Indiana Indianapolis 1211 0.180070
25704 2010-01-01 Cuba Florida Miami 1151 0.171148
58829 2014-01-01 Iraq Michigan Troy 1095 0.162821

Note: In the command above, we used the by= parameter to specify that the data be sorted according to the arrivals column and we added the ascending=False parameter in order to request that the data be displayed with the highest number first. By default, Pandas will sort in ascending order, meaning from the smallest value to the largest value. We also added a Python list slice (i.e., [:15]) to view just the top 15 rows.

- .sort_values() In Pandas, the .sort_values() method is used to sort the rows of a DataFrame based on the values in one or more columns. It allows users to arrange the data in ascending or descending order, facilitating better analysis and visualization of the dataset.
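The by= parameter can also take a list of columns, with a matching list for ascending=, to sort on several columns at once. A minimal sketch on toy data (hypothetical values):

```python
import pandas as pd

# Toy DataFrame (hypothetical values, for illustration only)
df = pd.DataFrame({'state': ['B', 'A', 'A'],
                   'arrivals': [1, 3, 2]})

# Sort by state ascending, then by arrivals descending within each state
result = df.sort_values(by=['state', 'arrivals'], ascending=[True, False])
```

The 'A' rows come first (sorted by arrivals from highest to lowest), followed by the 'B' row.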

Groupby Columns

We can group data and perform calculations on the groups using the .groupby() method. For example, to see the breakdown of the number of arrivals by country of origin, we can use the following command:

refugee_df.groupby('origin')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f9b81ebc100>

GroupBy Object

This command created a Groupby object—grouped data that we can use to perform calculations such as counting the number of non-blank values in each column for each arrival by country of origin.

Next, we will use the following command to sum the number of refugee arrivals by country of origin, with the output showing the top twenty rows sorted by descending order:

refugee_df.groupby('origin')['arrivals'].sum().sort_values(ascending=False)[:20]
origin
Burma              151875
Iraq               127326
Bhutan              85316
Somalia             71120
Cuba                39946
Iran                36683
Dem. Rep. Congo     25493
Russia              15656
Eritrea             14645
Sudan               12636
Ukraine             11727
Burundi             11152
Vietnam             10769
Liberia              9424
Ethiopia             8487
Afghanistan          6751
Laos                 6609
Moldova              4482
Syria                2359
Colombia             1956
Name: arrivals, dtype: int64

GroupBy Results

These results show us the total number of arrivals by country of origin across the 2005-2015 period, in descending order, sliced for the top 20 results.

Let’s unpack the command to better understand these results:

  • We have three stacked methods here: .groupby(), .sum(), and .sort_values().
  • groupby('origin')['arrivals']: For the Groupby object we defined in the previous step, groupby(‘origin’), we are isolating the “arrivals” column. Basically, we are asking to view the number of refugee arrivals by country of origin.
  • .sum(): This method adds non-blank cells for each column or row. The results we see in the output show the total number of refugee arrivals by country of origin.
  • .sort_values(ascending=False): This method specifies how we want our output to be sorted. We include the ascending=False parameter in order to request that the data be displayed with the highest number first.
  • [:20]: This Python slice specifies that we just want to see the top 20 rows.
- .groupby() In Pandas, the .groupby() method is used to group rows of a DataFrame based on one or more columns, creating a groupby object. This object allows for the application of various aggregation functions to analyze and summarize data within each group efficiently.
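The "various aggregation functions" mentioned above can be applied one at a time, or several at once with .agg(). A hedged sketch on toy data (hypothetical values):

```python
import pandas as pd

# Toy DataFrame (hypothetical values, for illustration only)
df = pd.DataFrame({'origin': ['Iraq', 'Iraq', 'Cuba'],
                   'arrivals': [6, 2, 5]})

# One aggregation per group...
sums = df.groupby('origin')['arrivals'].sum()

# ...or several at once with .agg()
stats = df.groupby('origin')['arrivals'].agg(['sum', 'mean', 'max'])
```

Here `sums` gives 8 for Iraq and 5 for Cuba, while `stats` is a small table with one row per country and one column per aggregation.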

Convert Series Object to DataFrame

You will notice that our output is not a DataFrame. Instead, it is a Series object, which does not support the column-based selection and manipulation we have been using. We can convert it to a DataFrame by first storing the command above in a new variable and then stacking two extra methods, .to_frame() and .reset_index():

ref_sum_df=refugee_df.groupby('origin')['arrivals'].sum().sort_values(ascending=False)[:20]
ref_sum_df = ref_sum_df.to_frame().reset_index()
ref_sum_df
origin arrivals
0 Burma 151875
1 Iraq 127326
2 Bhutan 85316
3 Somalia 71120
4 Cuba 39946
5 Iran 36683
6 Dem. Rep. Congo 25493
7 Russia 15656
8 Eritrea 14645
9 Sudan 12636
10 Ukraine 11727
11 Burundi 11152
12 Vietnam 10769
13 Liberia 9424
14 Ethiopia 8487
15 Afghanistan 6751
16 Laos 6609
17 Moldova 4482
18 Syria 2359
19 Colombia 1956
- .to_frame() In Pandas, the .to_frame() method is used to convert a Pandas Series into a DataFrame. It allows for the transformation of a one-dimensional series with an index into a two-dimensional tabular structure, preserving the index and data.
  • .reset_index() In Pandas, the .reset_index() method is used to reset the index of a DataFrame, converting the index values into columns and providing a default integer index. This operation is useful when you want to remove the current index or revert to the default integer-based index.
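To see both methods in isolation, here is a minimal sketch starting from a small named Series (hypothetical values):

```python
import pandas as pd

# A toy Series like the one groupby/sum produces: labeled index, named values
s = pd.Series([8, 5],
              index=pd.Index(['Iraq', 'Cuba'], name='origin'),
              name='arrivals')

# .to_frame() turns the Series into a one-column DataFrame;
# .reset_index() moves the 'origin' index into a regular column
df = s.to_frame().reset_index()
```

The result is a two-column DataFrame with columns 'origin' and 'arrivals' and a default integer index.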

Challenge

Write code to create a GroupBy object based on the 'state' column:

Solution

```python
refugee_df.groupby('state')
```

Write code to sum the number of refugee arrivals by state with the output showing the top 15 rows sorted by ascending order:

Solution

```python
refugee_df.groupby('state')['arrivals'].sum().sort_values(ascending=True)[:15]
```

Evaluation

What is the primary purpose of the .reset_index() method in Pandas?

- It reorganizes the DataFrame based on specified column values.
- It converts a Pandas Series into a DataFrame.
- It resets the index of a DataFrame, converting index values into columns.*
- It sorts the values of a DataFrame based on specified columns.

When might you use the .to_frame() method in Pandas?

- To group rows of a DataFrame based on one or more columns.
- To reset the index of a DataFrame.
- To sort the rows of a DataFrame based on specific values.
- To convert a Pandas Series into a DataFrame.*

What is the primary purpose of the .groupby() method in Pandas?

- To sort the rows of a DataFrame based on specific column values.
- To group rows of a DataFrame based on one or more columns.*
- To reset the index of a DataFrame.
- To convert a Pandas Series into a DataFrame.

Which Pandas method is used to sort the rows of a DataFrame based on specified column values?

- .reset_index()
- .to_frame()
- .groupby()
- .sort_values()*

Basic data visualizations

To create plots and data visualization in Pandas, we can add the .plot() method to any DataFrame or Series object that has appropriate numeric data.

We can specify the title with the title= parameter and the kind of plot by altering the kind= parameter:

  • bar or barh for bar plots (h is for horizontal)
  • hist for histogram
  • box for boxplot
  • kde or density for density plots
  • area for area plots
  • scatter for scatter plots
  • hexbin for hexagonal bin plots
  • pie for pie plots
- Data visualizations Data visualizations are important as they provide a clear and intuitive way to communicate complex information, making it easier for individuals to understand patterns, trends, and insights in data. Visualizations enhance decision-making by offering a more accessible and engaging representation of data, enabling quicker and more informed analysis.
  • .plot() In Pandas, the .plot() method is used to create basic visualizations from DataFrame or Series data. It provides a convenient way to generate various plots, such as line charts, bar plots, and scatter plots, directly from Pandas objects for quick data exploration and analysis.
Python boasts a vibrant ecosystem of open-source libraries for data visualizations, offering a wealth of options for enthusiasts and professionals alike. Matplotlib, the foundation for many other libraries, empowers users with high-quality and customizable plots. Seaborn, built on Matplotlib, enhances aesthetics and simplifies complex visualizations. Plotly, known for its interactivity, enables the creation of dynamic and shareable visualizations. Bokeh stands out for its interactive and real-time streaming capabilities. With these open-source tools, the Python community collaborates and innovates, fostering a dynamic environment for creative and effective data visualization solutions.

Bar Charts

For example, we can visualize the data we got from our Groupby command looking at the total number of refugees by country of arrival as a bar chart:

ref_sum_df.plot(kind='bar', x= 'origin', y='arrivals', 
                title='Total number of refugee arrivals in the U.S. \n by country of origin')
<Axes: title={'center': 'Total number of refugee arrivals in the U.S. \n by country of origin'}, xlabel='origin'>

Bar Chart

Let’s unpack the command to better understand these results:

  • ref_sum_df: This is the variable we created in the previous lesson summing the number of refugee arrivals by country of origin, with the output showing the top twenty rows sorted by descending order
  • .plot(kind='bar', x= 'origin', y='arrivals', title='Total number of refugee arrivals in the U.S. \n by country of origin'):
    • Here we are using the .plot() method to create a visualization, and we are specifying that we want a bar chart with the “kind=’bar’” parameter.
    • We also specify the values for the x-axis (origin) and the y-axis (arrivals).
    • We are also giving the chart a title with the title='Total number of refugee arrivals in the U.S. \n by country of origin' parameter.
      • Note: By adding \n in the title text, we signify that the text that follows should be on a new line.
- Bar chart A bar chart is a visual representation of data that uses rectangular bars to show the values of different categories. The length or height of each bar corresponds to the quantity or frequency of the data it represents, making it a popular tool for comparing and displaying categorical information.
  • '\n' In Python, "\n" is an escape sequence representing a newline character. When encountered in a string, it causes a line break, moving the subsequent text to a new line, making it useful for formatting and organizing output.

Pie Charts

We can also visualize the data as a pie chart:

ref_sum_df.set_index('origin')[:10].plot(kind='pie', y='arrivals', 
                                         title='Refugee arrivals in the U.S.')
<Axes: title={'center': 'Refugee arrivals in the U.S.'}, ylabel='arrivals'>

Pie Chart

  • We start by setting the index of the dataframe to the origin column. This ensures that our legend will show the country names.
  • We slice our results to show just the top 10 countries. This makes our chart more legible.
  • Next, we use the .plot() method, specifying pie as the type of plot
  • We also specify the y values - in our case arrivals
- Pie chart A pie chart is a circular statistical graphic that is divided into slices to illustrate numerical proportions. Each slice represents a proportion of the whole, and the size of each slice is proportional to the corresponding value it represents in the dataset.

Time Series

We can also create time series using the Groupby method. For example, if we wanted to visualize the total number of refugees resettled in the U.S. across the 2005-2015 period, we would first create a Groupby object based on the “year” column (refer back to lesson 7 for more on Groupby objects).

refugee_df.groupby('year')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f9bd6807e80>

GroupBy Object

Next, we can create a new variable calculating the total number of refugees resettled each year.

total_arrivals_by_year = refugee_df.groupby('year')['arrivals'].sum()

Let’s break this command down:

  • We have two stacked methods here: .groupby() and .sum()
  • groupby('year')['arrivals']: For the Groupby object, groupby(year), we are isolating the “arrivals” column. Basically, we are asking to view the number of refugee arrivals by year.
  • .sum(): This method returns the sum of the values over the requested axis. In our case, it will calculate the total number of refugee arrivals per year.
  • We store our results in the total_arrivals_by_year variable

Finally, we can add the .plot() method to create a line chart.

total_arrivals_by_year.plot(kind='line', title="Total Number of Refugee Arrivals by Year")
<Axes: title={'center': 'Total Number of Refugee Arrivals by Year'}, xlabel='year'>

Time Series

In this command, we are adding the .plot() method to request a chart, and specifying that we want a line graph with the “kind=line” parameter. We are also giving the chart a title with the “title='Total Number of Refugee Arrivals by Year'” parameter.

- Line chart A line chart is a graphical representation of data that uses points connected by straight lines to display trends or changes over a continuous interval. It is commonly used to visualize the relationship between two variables and to show how one variable changes in response to the other.
  • Time series A time series is a sequence of data points collected or recorded over a period of time, typically at regular intervals. It is used to analyze and visualize trends, patterns, and behaviors in data that evolve over time.

Challenge

Write code to visualize the data we got from our Groupby command looking at the total number of refugees by country of arrival as a scatter plot:

Solution

```python
ref_sum_df.plot(kind='scatter', x='origin', y='arrivals',
                title='Total number of refugee arrivals in the U.S. \n by country of origin')
```

Evaluation

Which Pandas method is commonly used to create basic visualizations, such as line charts, bar plots, and scatter plots, directly from DataFrame or Series data?

- .format()
- .visualize()
- .plot()*
- .display()

What type of chart is best suited for illustrating numerical proportions in a dataset, where each slice represents a proportion of the whole?

- Line chart
- Bar chart
- Pie chart*
- Time series chart

Write to CSV

To output a new CSV file, we can use the .to_csv() method with a name for the file in quotation marks. For example, since we added the percent_total column to the refugee_df DataFrame, we may want to download the updated DataFrame as a CSV file to use it with other programs.

refugee_df.to_csv("new_refugee.csv", encoding='utf-8', index=False)

In addition to a filename, we’re also specifying that the encoding is utf-8 and that the Index (the bolded left-most column) is not included in the CSV file.

- .to_csv In Pandas, the .to_csv() method is used to write the contents of a DataFrame to a CSV (Comma-Separated Values) file. It allows users to save the data in a tabular format that can be easily shared, imported, or analyzed using other tools.
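To see what index=False produces without writing a file to disk, here is a hedged sketch that round-trips a toy DataFrame (hypothetical values) through an in-memory text buffer:

```python
import io
import pandas as pd

# Toy DataFrame (hypothetical values, for illustration only)
df = pd.DataFrame({'origin': ['Iraq'], 'arrivals': [6]})

# Write CSV text to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read it back to confirm the index column was not written
roundtrip = pd.read_csv(io.StringIO(csv_text))
```

The first line of `csv_text` is the header `origin,arrivals`, with no extra index column, and `roundtrip` matches the original DataFrame.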

From theory to practice: Building your Pandas skills with the Pandas documentation and other resources

Learning how to ask the right questions in a search engine like Google is the name of the game when you are just starting out with Python. Since Pandas is a popular and well-documented Python package, you are bound to find myriad resources that can help you get where you are going.

Pandas documentation and online tutorials

A good first place to start when you are searching for answers with Pandas is to look at the Pandas documentation, which is fairly accessible to beginners, and is an incredible resource when you want to learn how to use a new command. It also offers a User Guide for beginners with some fun exercises to deepen your learning.

Let’s say you wanted to find out more about the .sort_values method we used and understand the different parameters the method accepts:

  • You could first search for .sort_values on the Pandas documentation website (https://pandas.pydata.org) and navigate to the “pandas.DataFrame.sort_values” documentation page.
  • Scroll through the page for the info, and look at the second section for examples of how to use the method and its various parameters.
  • If you don’t find an answer that makes sense to you on the Pandas documentation page, then look on Google for other resources. Some of our go-to websites for help are Stack Overflow, Geeks for Geeks, and Data to Fish.

Learning with Generative AI platforms

Generative AI platforms like ChatGPT can be incredibly helpful in deepening your understanding of coding concepts in addition to helping you build your code. In this workshop we learned critical vocabulary such as "method" and "dataframe" that you can use to communicate with ChatGPT in order to ask it to build a Python script for you. Make sure to ask the platform to explain what each line of code is accomplishing so you can better understand the output it is giving you.

Prompt examples include:

  • Explain to me what a dataframe is like I'm an 8th grader
  • Could you help me build a script to change the names of the columns in my dataframe?
  • How can I convert a series object into a dataframe in Pandas?

Other Resources

  • This workshop owes a huge debt to Melanie Walsh’s Introduction to Cultural Analytics & Python. This accessible, easy-to-understand textbook provides additional support for using Pandas and is a great place to continue building your skills.