Inspected and cleaned data. Numeric or Categorical dtypes assigned #40

Merged
merged 3 commits into from
Mar 26, 2022
381 changes: 376 additions & 5 deletions README.md
@@ -76,11 +76,382 @@ The team determined that the machine learning model for implementation was the L

### Preliminary Data Preprocessing

1. Import Data into Pandas DataFrame
2. Convert columns to best inferred possible dtypes using dtypes supporting `pd.NA`
3. Convert `winby` column into categorical type
4. Drop the non-beneficial columns
5. Keep only wins and losses (i.e., Red & Blue) in the `winner` column
1. **Imported [Data](https://github.com/seven-foot-two/turtle-time/blob/main/Resources/scraped_data.csv) into a Pandas DataFrame.**
2. **Dropped duplicate rows:**
- To ensure that the newly scraped data doesn't contain any duplicate rows (fights), `.drop_duplicates(["Event_Date", "B_Name", "R_Name"])` is used. A row counts as a duplicate when its `Event_Date`, `B_Name` (Blue fighter's name), and `R_Name` (Red fighter's name) values all match those of an earlier row.
3. **Converted `Event_Date` column values to `datetime64` datatype.**
4. **Dropped rows (fights) that happened before May 3rd, 2001.**
- Rows before 5/3/2001 are dropped to eliminate the fights that had little to no rules. For example, before these major rule changes, there were fights with NO time limit. Fighters of different weight classes (Open Weight) were also matched against each other, and some fights had 100+ lb. weight discrepancies.
5. **Replaced `"--"`, `"---"` and `"No Time Limit"` with `np.nan`.**
- `No Time Limit` should no longer exist after the date restriction above, but if it does, it is replaced with `NaN`.
- `"--"` and `"---"` represent NO value. **Not** zero; nothing. An example is the **take-down percentage** column, where these values appear quite often because the fighter didn't attempt a single take-down. By contrast, a fighter who attempted a take-down but failed to land it has a take-down percentage of 0%.
6. **`R_Draws` and `B_Draws` were split to create a `No_Contest` column for each corner color.**
- Some of the `Draws` column values contained "(x NC)", where "x" represents the number of no contests.
- The "x" value was extracted and put into its own `No_Contest` column.
7. **Rearranged columns.**
- With the new `No_Contest` columns created, the DataFrame's columns are rearranged for organization.
- `R_No_Contest` column moved to the position after `R_Draws`.
- `B_No_Contest` column moved to the position after `B_Draws`.
8. **Used `.loc` on the `Weight_Class` column to keep only the fights in the standardized weight classes.**
- **Standardized weight classes:** `Heavyweight,
Light Heavyweight,
Middleweight,
Welterweight,
Lightweight,
Featherweight,
Bantamweight,
Flyweight,
Strawweight,
Women's Strawweight,
Women's Flyweight,
Women's Bantamweight,
Women's Featherweight.`
9. **`R_Height` and `B_Height` bucketed by quartile (4 buckets created).**
- `R_Height_Bucket` and `B_Height_Bucket` columns created.
10. **`R_Age` and `B_Age` bucketed by quartile (4 buckets created).**
- `R_Age_Bucket` and `B_Age_Bucket` columns created.
11. **`Gender` column created based on whether the `Weight_Class` column value contains "Women's".**
- If the fighter is a woman, the `Gender` column will contain a value of `0`.
- If the fighter is a man, the `Gender` column will contain a value of `1`.
12. **Converted columns to the best possible inferred dtypes using `.convert_dtypes()`, which uses dtypes supporting `pd.NA`.**
- These inferred data types may not be correct, and in our case many were incorrect.
- Columns with the data type of "string" or "object" were inspected to figure out why they were inferred this way.
- No issues were found in any of the columns, so they were converted to the correct data type (Categorical OR Numerical).
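
Steps 2 through 8 above can be sketched as follows. This is a minimal sketch on invented toy rows, not the project's actual notebook code: the `clean_fights` helper, the ISO date strings, and the regular expressions used to split the `Draws` values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Standardized weight classes listed in step 8.
STANDARD_WEIGHT_CLASSES = [
    "Heavyweight", "Light Heavyweight", "Middleweight", "Welterweight",
    "Lightweight", "Featherweight", "Bantamweight", "Flyweight", "Strawweight",
    "Women's Strawweight", "Women's Flyweight", "Women's Bantamweight",
    "Women's Featherweight",
]

# Invented toy rows standing in for the scraped data (assumed layout).
raw = pd.DataFrame({
    "Event_Date": ["2022-03-05", "2022-03-05", "1993-11-12", "2019-06-01", "2020-01-18"],
    "B_Name": ["Fighter A", "Fighter A", "Fighter C", "Fighter E", "Fighter G"],
    "R_Name": ["Fighter B", "Fighter B", "Fighter D", "Fighter F", "Fighter H"],
    "Weight_Class": ["Lightweight", "Lightweight", "Open Weight",
                     "Women's Flyweight", "Catch Weight"],
    "B_Draws": ["0", "0", "1", "1 (2 NC)", "0"],
    "R_Draws": ["2 (1 NC)", "2 (1 NC)", "0", "0", "0"],
    "B_Takedown_Perc": ["50%", "50%", "--", "---", "0%"],
})


def clean_fights(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper applying steps 2-8 to a fights DataFrame."""
    # Step 2: keep one row per (date, blue fighter, red fighter).
    df = df.drop_duplicates(["Event_Date", "B_Name", "R_Name"]).copy()
    # Step 3: parse the event date.
    df["Event_Date"] = pd.to_datetime(df["Event_Date"])
    # Step 4: drop fights that happened before May 3rd, 2001.
    df = df[df["Event_Date"] >= pd.Timestamp("2001-05-03")].copy()
    # Step 5: placeholders mean "no value", not zero.
    df = df.replace(["--", "---", "No Time Limit"], np.nan)
    # Step 6: split "2 (1 NC)" into draws (2) and no contests (1).
    for corner in ("R", "B"):
        draws = df[f"{corner}_Draws"].astype(str)
        no_contest = draws.str.extract(r"\((\d+)\s*NC\)", expand=False)
        df[f"{corner}_No_Contest"] = no_contest.fillna("0").astype(int)
        df[f"{corner}_Draws"] = pd.to_numeric(
            draws.str.extract(r"^\s*(\d+)", expand=False), errors="coerce"
        )
    # Step 7: place each No_Contest column right after its Draws column.
    cols = [c for c in df.columns if not c.endswith("_No_Contest")]
    for corner in ("R", "B"):
        cols.insert(cols.index(f"{corner}_Draws") + 1, f"{corner}_No_Contest")
    df = df[cols]
    # Step 8: keep only the standardized weight classes.
    df = df.loc[df["Weight_Class"].isin(STANDARD_WEIGHT_CLASSES)]
    return df.reset_index(drop=True)


cleaned = clean_fights(raw)
print(cleaned)
```

In the toy data, the duplicated 2022 fight, the 1993 fight, and the "Catch Weight" fight are all removed, so two rows remain.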

**Categorical Data:**
<details>
<summary>View Categorical Columns</summary>

`'Weight_Class',
'Win_By',
'B_Name',
'B_Stance',
'R_Name',
'R_Stance',
'R_Age_Bucket',
'B_Age_Bucket',
'R_Height_Bucket',
'B_Height_Bucket',
'Gender'`.

</details>
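
The bucket and `Gender` columns in the categorical list come from steps 9 through 11. The sketch below shows one way to build them, assuming `pd.qcut` for the quartile buckets and a substring check for "Women's"; the toy values are invented and the project's own code may differ.

```python
import pandas as pd

# Invented toy values; only the column names come from the README.
fights = pd.DataFrame({
    "Weight_Class": ["Lightweight", "Women's Flyweight", "Heavyweight",
                     "Women's Strawweight", "Welterweight",
                     "Women's Bantamweight", "Middleweight", "Featherweight"],
    "R_Age": [22, 25, 27, 29, 31, 33, 36, 40],
    "B_Age": [24, 26, 28, 30, 32, 34, 35, 38],
    "R_Height": [160, 165, 170, 175, 180, 185, 190, 195],
    "B_Height": [158, 163, 168, 173, 178, 183, 188, 193],
})

# Steps 9 and 10: four quartile buckets per column.
for corner in ("R", "B"):
    for measure in ("Height", "Age"):
        source = f"{corner}_{measure}"
        fights[f"{source}_Bucket"] = pd.qcut(fights[source], q=4)

# Step 11: 0 for women's weight classes, 1 otherwise.
is_womens = fights["Weight_Class"].str.contains("Women's", regex=False)
fights["Gender"] = (~is_womens).astype(int)

# Step 12 (categorical half): cast the categorical columns explicitly.
categorical_cols = [
    "Weight_Class", "Gender",
    "R_Age_Bucket", "B_Age_Bucket", "R_Height_Bucket", "B_Height_Bucket",
]
fights = fights.astype({col: "category" for col in categorical_cols})

print(fights.dtypes)
```

`pd.qcut` raises an error when quartile edges coincide, which can happen in real data with many repeated values; passing `duplicates="drop"` is one option in that case, at the cost of fewer than four buckets.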

**Numerical Data:**
<details>
<summary>View Numerical Columns</summary>

`'Max_Rounds',
'Ending_Round',
'B_Age',
'B_Height',
'B_Weight',
'B_Reach',
'B_Wins',
'B_Losses',
'B_Draws',
'B_No_Contest',
'B_Career_Significant_Strikes_Landed_PM',
'B_Career_Striking_Accuracy',
'B_Career_Significant_Strike_Defence',
'B_Career_Takedown_Average',
'B_Career_Takedown_Accuracy',
'B_Career_Takedown_Defence',
'B_Career_Submission_Average',
'B_Knockdowns',
'B_Significant_Strikes_Landed',
'B_Significant_Strikes_Attempted',
'B_Significant_Strike_Perc',
'B_Significant_Strikes_Distance_Landed',
'B_Significant_Strikes_Distance_Attempted',
'B_Significant_Strikes_Clinch_Landed',
'B_Significant_Strikes_Clinch_Attempted',
'B_Significant_Strikes_Ground_Landed',
'B_Significant_Strikes_Ground_Attempted',
'B_Head_Significant_Strikes_Attempted',
'B_Head_Significant_Strikes_Landed',
'B_Body_Significant_Strikes_Attempted',
'B_Body_Significant_Strikes_Landed',
'B_Leg_Significant_Strikes_Attempted',
'B_Leg_Significant_Strikes_Landed',
'B_Total_Strikes_Attempted',
'B_Total_Strikes_Landed',
'B_Takedowns_Attempted',
'B_Takedowns_Landed',
'B_Takedown_Perc',
'B_Submission_Attempts',
'B_Grappling_Reversals',
'B_Round_One_Knockdowns',
'B_Round_One_Significant_Strikes_Landed',
'B_Round_One_Significant_Strikes_Attempted',
'B_Round_One_Significant_Strike_Perc',
'B_Round_One_Significant_Strikes_Distance_Landed',
'B_Round_One_Significant_Strikes_Distance_Attempted',
'B_Round_One_Significant_Strikes_Clinch_Landed',
'B_Round_One_Significant_Strikes_Clinch_Attempted',
'B_Round_One_Significant_Strikes_Ground_Landed',
'B_Round_One_Significant_Strikes_Ground_Attempted',
'B_Round_One_Head_Significant_Strikes_Attempted',
'B_Round_One_Head_Significant_Strikes_Landed',
'B_Round_One_Body_Significant_Strikes_Attempted',
'B_Round_One_Body_Significant_Strikes_Landed',
'B_Round_One_Leg_Significant_Strikes_Attempted',
'B_Round_One_Leg_Significant_Strikes_Landed',
'B_Round_One_Total_Strikes_Attempted',
'B_Round_One_Total_Strikes_Landed',
'B_Round_One_Takedowns_Attempted',
'B_Round_One_Takedowns_Landed',
'B_Round_One_Takedown_Perc',
'B_Round_One_Submission_Attempts',
'B_Round_One_Grappling_Reversals',
'B_Round_Two_Knockdowns',
'B_Round_Two_Significant_Strikes_Landed',
'B_Round_Two_Significant_Strikes_Attempted',
'B_Round_Two_Significant_Strike_Perc',
'B_Round_Two_Significant_Strikes_Distance_Landed',
'B_Round_Two_Significant_Strikes_Distance_Attempted',
'B_Round_Two_Significant_Strikes_Clinch_Landed',
'B_Round_Two_Significant_Strikes_Clinch_Attempted',
'B_Round_Two_Significant_Strikes_Ground_Landed',
'B_Round_Two_Significant_Strikes_Ground_Attempted',
'B_Round_Two_Head_Significant_Strikes_Attempted',
'B_Round_Two_Head_Significant_Strikes_Landed',
'B_Round_Two_Body_Significant_Strikes_Attempted',
'B_Round_Two_Body_Significant_Strikes_Landed',
'B_Round_Two_Leg_Significant_Strikes_Attempted',
'B_Round_Two_Leg_Significant_Strikes_Landed',
'B_Round_Two_Total_Strikes_Attempted',
'B_Round_Two_Total_Strikes_Landed',
'B_Round_Two_Takedowns_Attempted',
'B_Round_Two_Takedowns_Landed',
'B_Round_Two_Takedown_Perc',
'B_Round_Two_Submission_Attempts',
'B_Round_Two_Grappling_Reversals',
'B_Round_Three_Knockdowns',
'B_Round_Three_Significant_Strikes_Landed',
'B_Round_Three_Significant_Strikes_Attempted',
'B_Round_Three_Significant_Strike_Perc',
'B_Round_Three_Significant_Strikes_Distance_Landed',
'B_Round_Three_Significant_Strikes_Distance_Attempted',
'B_Round_Three_Significant_Strikes_Clinch_Landed',
'B_Round_Three_Significant_Strikes_Clinch_Attempted',
'B_Round_Three_Significant_Strikes_Ground_Landed',
'B_Round_Three_Significant_Strikes_Ground_Attempted',
'B_Round_Three_Head_Significant_Strikes_Attempted',
'B_Round_Three_Head_Significant_Strikes_Landed',
'B_Round_Three_Body_Significant_Strikes_Attempted',
'B_Round_Three_Body_Significant_Strikes_Landed',
'B_Round_Three_Leg_Significant_Strikes_Attempted',
'B_Round_Three_Leg_Significant_Strikes_Landed',
'B_Round_Three_Total_Strikes_Attempted',
'B_Round_Three_Total_Strikes_Landed',
'B_Round_Three_Takedowns_Attempted',
'B_Round_Three_Takedowns_Landed',
'B_Round_Three_Takedown_Perc',
'B_Round_Three_Submission_Attempts',
'B_Round_Three_Grappling_Reversals',
'B_Round_Four_Knockdowns',
'B_Round_Four_Significant_Strikes_Landed',
'B_Round_Four_Significant_Strikes_Attempted',
'B_Round_Four_Significant_Strike_Perc',
'B_Round_Four_Significant_Strikes_Distance_Landed',
'B_Round_Four_Significant_Strikes_Distance_Attempted',
'B_Round_Four_Significant_Strikes_Clinch_Landed',
'B_Round_Four_Significant_Strikes_Clinch_Attempted',
'B_Round_Four_Significant_Strikes_Ground_Landed',
'B_Round_Four_Significant_Strikes_Ground_Attempted',
'B_Round_Four_Head_Significant_Strikes_Attempted',
'B_Round_Four_Head_Significant_Strikes_Landed',
'B_Round_Four_Body_Significant_Strikes_Attempted',
'B_Round_Four_Body_Significant_Strikes_Landed',
'B_Round_Four_Leg_Significant_Strikes_Attempted',
'B_Round_Four_Leg_Significant_Strikes_Landed',
'B_Round_Four_Total_Strikes_Attempted',
'B_Round_Four_Total_Strikes_Landed',
'B_Round_Four_Takedowns_Attempted',
'B_Round_Four_Takedowns_Landed',
'B_Round_Four_Takedown_Perc',
'B_Round_Four_Submission_Attempts',
'B_Round_Four_Grappling_Reversals',
'B_Round_Five_Knockdowns',
'B_Round_Five_Significant_Strikes_Landed',
'B_Round_Five_Significant_Strikes_Attempted',
'B_Round_Five_Significant_Strike_Perc',
'B_Round_Five_Significant_Strikes_Distance_Landed',
'B_Round_Five_Significant_Strikes_Distance_Attempted',
'B_Round_Five_Significant_Strikes_Clinch_Landed',
'B_Round_Five_Significant_Strikes_Clinch_Attempted',
'B_Round_Five_Significant_Strikes_Ground_Landed',
'B_Round_Five_Significant_Strikes_Ground_Attempted',
'B_Round_Five_Head_Significant_Strikes_Attempted',
'B_Round_Five_Head_Significant_Strikes_Landed',
'B_Round_Five_Body_Significant_Strikes_Attempted',
'B_Round_Five_Body_Significant_Strikes_Landed',
'B_Round_Five_Leg_Significant_Strikes_Attempted',
'B_Round_Five_Leg_Significant_Strikes_Landed',
'B_Round_Five_Total_Strikes_Attempted',
'B_Round_Five_Total_Strikes_Landed',
'B_Round_Five_Takedowns_Attempted',
'B_Round_Five_Takedowns_Landed',
'B_Round_Five_Takedown_Perc',
'B_Round_Five_Submission_Attempts',
'B_Round_Five_Grappling_Reversals',
'R_Age',
'R_Height',
'R_Weight',
'R_Reach',
'R_Wins',
'R_Losses',
'R_Draws',
'R_No_Contest',
'R_Career_Significant_Strikes_Landed_PM',
'R_Career_Striking_Accuracy',
'R_Career_Significant_Strike_Defence',
'R_Career_Takedown_Average',
'R_Career_Takedown_Accuracy',
'R_Career_Takedown_Defence',
'R_Career_Submission_Average',
'R_Knockdowns',
'R_Significant_Strikes_Landed',
'R_Significant_Strikes_Attempted',
'R_Significant_Strike_Perc',
'R_Significant_Strikes_Distance_Landed',
'R_Significant_Strikes_Distance_Attempted',
'R_Significant_Strikes_Clinch_Landed',
'R_Significant_Strikes_Clinch_Attempted',
'R_Significant_Strikes_Ground_Landed',
'R_Significant_Strikes_Ground_Attempted',
'R_Head_Significant_Strikes_Attempted',
'R_Head_Significant_Strikes_Landed',
'R_Body_Significant_Strikes_Attempted',
'R_Body_Significant_Strikes_Landed',
'R_Leg_Significant_Strikes_Attempted',
'R_Leg_Significant_Strikes_Landed',
'R_Total_Strikes_Attempted',
'R_Total_Strikes_Landed',
'R_Takedowns_Attempted',
'R_Takedowns_Landed',
'R_Takedown_Perc',
'R_Submission_Attempts',
'R_Grappling_Reversals',
'R_Round_One_Knockdowns',
'R_Round_One_Significant_Strikes_Landed',
'R_Round_One_Significant_Strikes_Attempted',
'R_Round_One_Significant_Strike_Perc',
'R_Round_One_Significant_Strikes_Distance_Attempted',
'R_Round_One_Significant_Strikes_Distance_Landed',
'R_Round_One_Significant_Strikes_Clinch_Attempted',
'R_Round_One_Significant_Strikes_Clinch_Landed',
'R_Round_One_Significant_Strikes_Ground_Attempted',
'R_Round_One_Significant_Strikes_Ground_Landed',
'R_Round_One_Head_Significant_Strikes_Attempted',
'R_Round_One_Head_Significant_Strikes_Landed',
'R_Round_One_Body_Significant_Strikes_Attempted',
'R_Round_One_Body_Significant_Strikes_Landed',
'R_Round_One_Leg_Significant_Strikes_Attempted',
'R_Round_One_Leg_Significant_Strikes_Landed',
'R_Round_One_Total_Strikes_Attempted',
'R_Round_One_Total_Strikes_Landed',
'R_Round_One_Takedowns_Attempted',
'R_Round_One_Takedowns_Landed',
'R_Round_One_Takedown_Perc',
'R_Round_One_Submission_Attempts',
'R_Round_One_Grappling_Reversals',
'R_Round_Two_Knockdowns',
'R_Round_Two_Significant_Strikes_Landed',
'R_Round_Two_Significant_Strikes_Attempted',
'R_Round_Two_Significant_Strike_Perc',
'R_Round_Two_Significant_Strikes_Distance_Attempted',
'R_Round_Two_Significant_Strikes_Distance_Landed',
'R_Round_Two_Significant_Strikes_Clinch_Attempted',
'R_Round_Two_Significant_Strikes_Clinch_Landed',
'R_Round_Two_Significant_Strikes_Ground_Attempted',
'R_Round_Two_Significant_Strikes_Ground_Landed',
'R_Round_Two_Head_Significant_Strikes_Attempted',
'R_Round_Two_Head_Significant_Strikes_Landed',
'R_Round_Two_Body_Significant_Strikes_Attempted',
'R_Round_Two_Body_Significant_Strikes_Landed',
'R_Round_Two_Leg_Significant_Strikes_Attempted',
'R_Round_Two_Leg_Significant_Strikes_Landed',
'R_Round_Two_Total_Strikes_Attempted',
'R_Round_Two_Total_Strikes_Landed',
'R_Round_Two_Takedowns_Attempted',
'R_Round_Two_Takedowns_Landed',
'R_Round_Two_Takedown_Perc',
'R_Round_Two_Submission_Attempts',
'R_Round_Two_Grappling_Reversals',
'R_Round_Three_Knockdowns',
'R_Round_Three_Significant_Strikes_Landed',
'R_Round_Three_Significant_Strikes_Attempted',
'R_Round_Three_Significant_Strike_Perc',
'R_Round_Three_Significant_Strikes_Distance_Attempted',
'R_Round_Three_Significant_Strikes_Distance_Landed',
'R_Round_Three_Significant_Strikes_Clinch_Attempted',
'R_Round_Three_Significant_Strikes_Clinch_Landed',
'R_Round_Three_Significant_Strikes_Ground_Attempted',
'R_Round_Three_Significant_Strikes_Ground_Landed',
'R_Round_Three_Head_Significant_Strikes_Attempted',
'R_Round_Three_Head_Significant_Strikes_Landed',
'R_Round_Three_Body_Significant_Strikes_Attempted',
'R_Round_Three_Body_Significant_Strikes_Landed',
'R_Round_Three_Leg_Significant_Strikes_Attempted',
'R_Round_Three_Leg_Significant_Strikes_Landed',
'R_Round_Three_Total_Strikes_Attempted',
'R_Round_Three_Total_Strikes_Landed',
'R_Round_Three_Takedowns_Attempted',
'R_Round_Three_Takedowns_Landed',
'R_Round_Three_Takedown_Perc',
'R_Round_Three_Submission_Attempts',
'R_Round_Three_Grappling_Reversals',
'R_Round_Four_Knockdowns',
'R_Round_Four_Significant_Strikes_Landed',
'R_Round_Four_Significant_Strikes_Attempted',
'R_Round_Four_Significant_Strike_Perc',
'R_Round_Four_Significant_Strikes_Distance_Attempted',
'R_Round_Four_Significant_Strikes_Distance_Landed',
'R_Round_Four_Significant_Strikes_Clinch_Attempted',
'R_Round_Four_Significant_Strikes_Clinch_Landed',
'R_Round_Four_Significant_Strikes_Ground_Attempted',
'R_Round_Four_Significant_Strikes_Ground_Landed',
'R_Round_Four_Head_Significant_Strikes_Attempted',
'R_Round_Four_Head_Significant_Strikes_Landed',
'R_Round_Four_Body_Significant_Strikes_Attempted',
'R_Round_Four_Body_Significant_Strikes_Landed',
'R_Round_Four_Leg_Significant_Strikes_Attempted',
'R_Round_Four_Leg_Significant_Strikes_Landed',
'R_Round_Four_Total_Strikes_Attempted',
'R_Round_Four_Total_Strikes_Landed',
'R_Round_Four_Takedowns_Attempted',
'R_Round_Four_Takedowns_Landed',
'R_Round_Four_Takedown_Perc',
'R_Round_Four_Submission_Attempts',
'R_Round_Four_Grappling_Reversals',
'R_Round_Five_Knockdowns',
'R_Round_Five_Significant_Strikes_Landed',
'R_Round_Five_Significant_Strikes_Attempted',
'R_Round_Five_Significant_Strike_Perc',
'R_Round_Five_Significant_Strikes_Distance_Attempted',
'R_Round_Five_Significant_Strikes_Distance_Landed',
'R_Round_Five_Significant_Strikes_Clinch_Attempted',
'R_Round_Five_Significant_Strikes_Clinch_Landed',
'R_Round_Five_Significant_Strikes_Ground_Attempted',
'R_Round_Five_Significant_Strikes_Ground_Landed',
'R_Round_Five_Head_Significant_Strikes_Attempted',
'R_Round_Five_Head_Significant_Strikes_Landed',
'R_Round_Five_Body_Significant_Strikes_Attempted',
'R_Round_Five_Body_Significant_Strikes_Landed',
'R_Round_Five_Leg_Significant_Strikes_Attempted',
'R_Round_Five_Leg_Significant_Strikes_Landed',
'R_Round_Five_Total_Strikes_Attempted',
'R_Round_Five_Total_Strikes_Landed',
'R_Round_Five_Takedowns_Attempted',
'R_Round_Five_Takedowns_Landed',
'R_Round_Five_Takedown_Perc',
'R_Round_Five_Submission_Attempts',
'R_Round_Five_Grappling_Reversals'`

</details>
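
The inspection described in step 12 can be sketched as follows: run `.convert_dtypes()`, list the columns that came back as strings, then assign numeric or categorical dtypes by hand. The toy values and the `numeric_cols` / `categorical_cols` lists here are assumptions for illustration; in the project, the lists above play that role.

```python
import numpy as np
import pandas as pd

# Invented toy values; numbers arrive as text, as they often do after scraping.
raw = pd.DataFrame({
    "B_Wins": ["10", "7", "12"],
    "B_Takedown_Perc": ["0.5", np.nan, "0.25"],
    "B_Stance": ["Orthodox", "Southpaw", "Orthodox"],
})

# Let pandas infer dtypes, then see which columns were inferred as strings.
inferred = raw.convert_dtypes()
string_cols = [
    col for col in inferred.columns
    if pd.api.types.is_string_dtype(inferred[col])
]
print(string_cols)

# After inspecting those columns, assign the intended dtypes explicitly.
numeric_cols = ["B_Wins", "B_Takedown_Perc"]   # assumed for this toy frame
categorical_cols = ["B_Stance"]                # assumed for this toy frame

typed = raw.copy()
for col in numeric_cols:
    # errors="coerce" turns anything unparseable into NaN instead of raising.
    typed[col] = pd.to_numeric(typed[col], errors="coerce")
for col in categorical_cols:
    typed[col] = typed[col].astype("category")

print(typed.dtypes)
```

Using `errors="coerce"` hides malformed values by turning them into `NaN`, so it is worth comparing the missing-value counts before and after the conversion.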





### Feature Engineering