Skip to content

Exploratory Data Analysis on data related to Olympic games(note that the data used goes up to 2016). Aiming at deriving or discovering some key patterns or observations from the data

Notifications You must be signed in to change notification settings

JulienAganze/EDA_Olympic_Games_Project

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Study Of data related to Olympic games

Exploratory Data Analysis Project

Brief introduction

This project is generally about the Exploratory Data Analyssis(EDA), meaning getting a general overview or understand about the data, with the aim of finding its main characterisitcs, identifying patterns and visualizations. And my case, I will be exporing the data related to exploring data related to Olympic games.

introduction of data

The data we are going to holds 120 years of Olympic history including personal information about the athletes as well as the game they participated in. The file athlete_events.csv contains 271116 rows and 15 columns; Each row corresponds to an individual athlete competing in an individual Olympic event. And the columns available in the dataset are as follows:

1. ID - Unique number for each athlete;
2. Name - Athlete's name;
3. Sex - M or F;
4. Age - Integer;
5. Height - In centimeters;
6. Weight - In kilograms;
7. Team - Team name;
8. NOC - National Olympic Committee 3-letter code;
9. Games - Year and season;
10. Year - Integer;
11. Season - Summer or Winter;
12. City - Host city;
13. Sport - Sport;
14. Event - Event;
15. Medal - Gold, Silver, Bronze, or NA.

Study methodology

To perform the required studies we will follow the following steps:

1. Importing data into the dataframe

This is done using pandas read_csv method. And all needed information about how to use this can be obtained from the later provided link.

2. Collecting basic information about the dataset

I need to need how the data looks like for me to be able to make sens out of it. SO for this reason we will explore additional pandas function like head used to look at data in a tabular form, describe performs descriptive statistics, isnull used to find out if there are null values in different columns

3. Cleaning and Completing the data

After knowing how our data looks like, the next step will now bw about dealing with missing values. And for this purpose the following operations will be performed:

● Exclude all records from data where we don’t have any information about medals.
● Fill missing age values with average age of other athletes.
● Fill missing height values for women and men with average height of women and men athletes respectively.
● Fill missing weight values for women and men with average weight of women and men athletes participating in same sports

4. Data visualizations analysis

Visualizing data in different type of graphs will provide us with greater insights into our data. We will explore different options on visualizing our data and find out any patterns within it.We will use here pandas library ass well as other concepts like seaborn and matplotlib.

● And we will base our analysis by looking at the following points
● Gold medals in gymnastic over age
● Medals won by China over years
● Gold medals won by china in summer olympics in sports
● Height of male athletes over years
● Height of female athletes over years.
● Top 5 countries with most medals
● Number of athletes in each olympic game
● Age distribution of male/female in Olympic games
● Variation of age for female over time
● Height and weight ratio of athletes
● Average age of medal winners in olympic games.\

Project details

All the project details and analysis will be in the project notebook.

Olyimpic_games_Analysis

Analysis

Source of the used data

The data we used in this project can be obtained from here data_source

Used libraries

The following libraries will be used in our analysis

Numpy NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

Pandas Pandas is a python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

Seaborn Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

Matplotlib Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.

About

Exploratory Data Analysis on data related to Olympic games(note that the data used goes up to 2016). Aiming at deriving or discovering some key patterns or observations from the data

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published