Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

185 html report always cleaning #248

Merged
merged 11 commits into from
May 1, 2024
Merged

Conversation

CBROWN-ONS
Copy link
Collaborator

Description

This PR makes it so that gtfs::validation::GtfsIntance.html_report() does not clean when the clean_feed param is set to False

Fixes #185

Motivation and Context

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

How Has This Been Tested?

Test configuration details:

  • OS: Windows 10
  • Python version: 3.9.13
  • Java version: N/A
  • Python management system: MiniConda

Advice for reviewer

Checklist:

  • My code follows the intended structure of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

Additional comments

@codecov-commenter
Copy link

codecov-commenter commented Feb 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.18%. Comparing base (51ee214) to head (9f91aad).

Additional details and impacted files
@@           Coverage Diff           @@
##              dev     #248   +/-   ##
=======================================
  Coverage   98.18%   98.18%           
=======================================
  Files          21       21           
  Lines        1924     1927    +3     
=======================================
+ Hits         1889     1892    +3     
  Misses         35       35           
Flag Coverage Δ
unittests 98.18% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@SergioRec SergioRec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When running this with the clipped GTFS for Newport, the new flag clean_feed works as expected, with the report including some errors and warnings when the flag is set to False that would be removed otherwise.

However, when running the function with clean_feed == False in larger GTFS files (all Wales or Leeds), I get the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File Untitled-1:2
      [1](untitled-1:1) # %%
----> [2](untitled-1:2) instance.html_report(overwrite=True, clean_feed=False)

File [~/src/transport_performance/gtfs/validation.py:1501](~/src/transport_performance/gtfs/validation.py:1501), in GtfsInstance.html_report(self, report_dir, overwrite, summary_type, extended_validation, clean_feed)
   [1499](~/src/transport_performance/gtfs/validation.py:1499) # create extended reports if requested
   [1500](~/src/transport_performance/gtfs/validation.py:1500) if extended_validation:
-> [1501](~/src/transport_performance/gtfs/validation.py:1501)     self._extended_validation(output_path=report_dir)
   [1502](~/src/transport_performance/gtfs/validation.py:1502)     info_href = (
   [1503](~/src/transport_performance/gtfs/validation.py:1503)         validation_dataframe["message"].apply(
   [1504](~/src/transport_performance/gtfs/validation.py:1504)             lambda x: "_".join(x.split(" "))
   (...)
   [1508](~/src/transport_performance/gtfs/validation.py:1508)         + ".html"
   [1509](~/src/transport_performance/gtfs/validation.py:1509)     )
   [1510](~/src/transport_performance/gtfs/validation.py:1510)     validation_dataframe["info"] = [
   [1511](~/src/transport_performance/gtfs/validation.py:1511)         f"""<a href="{href}"> Further Info</a>"""
   [1512](~/src/transport_performance/gtfs/validation.py:1512)         if len(rows) > 1
   [1513](~/src/transport_performance/gtfs/validation.py:1513)         else "Unavailable"
   [1514](~/src/transport_performance/gtfs/validation.py:1514)         for href, rows in zip(info_href, validation_dataframe["rows"])
   [1515](~/src/transport_performance/gtfs/validation.py:1515)     ]

File [~/src/transport_performance/gtfs/validation.py:1375](~/src/transport_performance/gtfs/validation.py:1375), in GtfsInstance._extended_validation(self, output_path, scheme)
   [1370](~/src/transport_performance/gtfs/validation.py:1370)         duplicate_counts[col] = impacted_rows[
   [1371](~/src/transport_performance/gtfs/validation.py:1371)             impacted_rows[f"{col}_original"]
   [1372](~/src/transport_performance/gtfs/validation.py:1372)             == impacted_rows[f"{col}_duplicate"]
   [1373](~/src/transport_performance/gtfs/validation.py:1373)         ].shape[0]
   [1374](~/src/transport_performance/gtfs/validation.py:1374) else:
-> [1375](~/src/transport_performance/gtfs/validation.py:1375)     impacted_rows = table_map[table].copy().iloc[rows]
   [1377](~/src/transport_performance/gtfs/validation.py:1377) # create the html to display the impacted rows (clean possibly)
   [1378](~/src/transport_performance/gtfs/validation.py:1378) table_html = f"""
   [1379](~/src/transport_performance/gtfs/validation.py:1379) <head>
   [1380](~/src/transport_performance/gtfs/validation.py:1380)     <link rel="stylesheet" href="styles.css">
   (...)
   [1389](~/src/transport_performance/gtfs/validation.py:1389)             {msg_type}</span>
   [1390](~/src/transport_performance/gtfs/validation.py:1390) </h1>"""

KeyError: 'full_stop_schedule'

I suspect there's something in those GTFS that doesn't work with the HTML report, and it was getting cleaned by default. Setting the new flag to False is causing the report to fail. Could you please investigate? Thanks!

PS: did some minor changes (correcting a typo and merging with dev).

Copy link
Contributor

@SergioRec SergioRec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @CBROWN-ONS. I am now getting a different KeyError using Leeds:

KeyError: 'multiple_stops_invalid'.

Is this a case where we will need to manually add objects to the table map manually if a new issue appears (e.g. with other unfamiliar GTFS), or will there be a fixed number of them? Just wondering if this issue may appear again when trying a new location (e.g. Germany).

@CBROWN-ONS
Copy link
Collaborator Author

Hi @CBROWN-ONS. I am now getting a different KeyError using Leeds:

KeyError: 'multiple_stops_invalid'.

Is this a case where we will need to manually add objects to the table map manually if a new issue appears (e.g. with other unfamiliar GTFS), or will there be a fixed number of them? Just wondering if this issue may appear again when trying a new location (e.g. Germany).

HI Sergio. This shouldn't be the case of 1 by 1 adding unless we add new validation tables manually. Do you have the full error message?
Thanks

@SergioRec
Copy link
Contributor

Here's the code I ran:

# %%
from transport_performance.gtfs.multi_validation import MultiGtfsInstance
from pyprojroot import here

# %%
t = MultiGtfsInstance(here('data/interim/gtfs/itm_leeds_filtered_gtfs.zip'))
s = t.instances[0]
# %%
s.html_report(overwrite=True, clean_feed=False)
# %%

Here's the full traceback:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[3], [line 2](vscode-notebook-cell:?execution_count=3&line=2)
      [1](vscode-notebook-cell:?execution_count=3&line=1) # %%
----> [2](vscode-notebook-cell:?execution_count=3&line=2) s.html_report(overwrite=True, clean_feed=False)

File [~/src/transport_performance/gtfs/validation.py:1502](~//src/transport_performance/gtfs/validation.py:1502), in GtfsInstance.html_report(self, report_dir, overwrite, summary_type, extended_validation, clean_feed)
   [1500](~/src/transport_performance/gtfs/validation.py:1500) # create extended reports if requested
   [1501](~/src/transport_performance/gtfs/validation.py:1501) if extended_validation:
-> [1502](~/src/transport_performance/gtfs/validation.py:1502)     self._extended_validation(output_path=report_dir)
   [1503](~/src/transport_performance/gtfs/validation.py:1503)     info_href = (
   [1504](~/src/transport_performance/gtfs/validation.py:1504)         validation_dataframe["message"].apply(
   [1505](~/src/transport_performance/gtfs/validation.py:1505)             lambda x: "_".join(x.split(" "))
   (...)
   [1509](~/src/transport_performance/gtfs/validation.py:1509)         + ".html"
   [1510](~/src/transport_performance/gtfs/validation.py:1510)     )
   [1511](~/src/transport_performance/gtfs/validation.py:1511)     validation_dataframe["info"] = [
   [1512](~/src/transport_performance/gtfs/validation.py:1512)         f"""<a href="{href}"> Further Info</a>"""
   [1513](~/src/transport_performance/gtfs/validation.py:1513)         if len(rows) > 1
   [1514](~/src/transport_performance/gtfs/validation.py:1514)         else "Unavailable"
   [1515](~/src/transport_performance/gtfs/validation.py:1515)         for href, rows in zip(info_href, validation_dataframe["rows"])
   [1516](~/src/transport_performance/gtfs/validation.py:1516)     ]

File [~/src/transport_performance/gtfs/validation.py:1376](~/src/transport_performance/gtfs/validation.py:1376), in GtfsInstance._extended_validation(self, output_path, scheme)
   [1371](~/src/transport_performance/gtfs/validation.py:1371)         duplicate_counts[col] = impacted_rows[
   [1372](~/src/transport_performance/gtfs/validation.py:1372)             impacted_rows[f"{col}_original"]
   [1373](~/src/transport_performance/gtfs/validation.py:1373)             == impacted_rows[f"{col}_duplicate"]
   [1374](~/src/transport_performance/gtfs/validation.py:1374)         ].shape[0]
   [1375](~/src/transport_performance/gtfs/validation.py:1375) else:
-> [1376](~/src/transport_performance/gtfs/validation.py:1376)     impacted_rows = table_map[table].copy().iloc[rows]
   [1378](~/src/transport_performance/gtfs/validation.py:1378) # create the html to display the impacted rows (clean possibly)
   [1379](~/src/transport_performance/gtfs/validation.py:1379) table_html = f"""
   [1380](~/src/transport_performance/gtfs/validation.py:1380) <head>
   [1381](~/src/transport_performance/gtfs/validation.py:1381)     <link rel="stylesheet" href="styles.css">
   (...)
   [1390](~/src/transport_performance/gtfs/validation.py:1390)             {msg_type}</span>
   [1391](~/src/transport_performance/gtfs/validation.py:1391) </h1>"""

KeyError: 'multiple_stops_invalid'

@CBROWN-ONS
Copy link
Collaborator Author

Here's the code I ran:

# %%
from transport_performance.gtfs.multi_validation import MultiGtfsInstance
from pyprojroot import here

# %%
t = MultiGtfsInstance(here('data/interim/gtfs/itm_leeds_filtered_gtfs.zip'))
s = t.instances[0]
# %%
s.html_report(overwrite=True, clean_feed=False)
# %%

Here's the full traceback:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[3], [line 2](vscode-notebook-cell:?execution_count=3&line=2)
      [1](vscode-notebook-cell:?execution_count=3&line=1) # %%
----> [2](vscode-notebook-cell:?execution_count=3&line=2) s.html_report(overwrite=True, clean_feed=False)

File [~/src/transport_performance/gtfs/validation.py:1502](~//src/transport_performance/gtfs/validation.py:1502), in GtfsInstance.html_report(self, report_dir, overwrite, summary_type, extended_validation, clean_feed)
   [1500](~/src/transport_performance/gtfs/validation.py:1500) # create extended reports if requested
   [1501](~/src/transport_performance/gtfs/validation.py:1501) if extended_validation:
-> [1502](~/src/transport_performance/gtfs/validation.py:1502)     self._extended_validation(output_path=report_dir)
   [1503](~/src/transport_performance/gtfs/validation.py:1503)     info_href = (
   [1504](~/src/transport_performance/gtfs/validation.py:1504)         validation_dataframe["message"].apply(
   [1505](~/src/transport_performance/gtfs/validation.py:1505)             lambda x: "_".join(x.split(" "))
   (...)
   [1509](~/src/transport_performance/gtfs/validation.py:1509)         + ".html"
   [1510](~/src/transport_performance/gtfs/validation.py:1510)     )
   [1511](~/src/transport_performance/gtfs/validation.py:1511)     validation_dataframe["info"] = [
   [1512](~/src/transport_performance/gtfs/validation.py:1512)         f"""<a href="{href}"> Further Info</a>"""
   [1513](~/src/transport_performance/gtfs/validation.py:1513)         if len(rows) > 1
   [1514](~/src/transport_performance/gtfs/validation.py:1514)         else "Unavailable"
   [1515](~/src/transport_performance/gtfs/validation.py:1515)         for href, rows in zip(info_href, validation_dataframe["rows"])
   [1516](~/src/transport_performance/gtfs/validation.py:1516)     ]

File [~/src/transport_performance/gtfs/validation.py:1376](~/src/transport_performance/gtfs/validation.py:1376), in GtfsInstance._extended_validation(self, output_path, scheme)
   [1371](~/src/transport_performance/gtfs/validation.py:1371)         duplicate_counts[col] = impacted_rows[
   [1372](~/src/transport_performance/gtfs/validation.py:1372)             impacted_rows[f"{col}_original"]
   [1373](~/src/transport_performance/gtfs/validation.py:1373)             == impacted_rows[f"{col}_duplicate"]
   [1374](~/src/transport_performance/gtfs/validation.py:1374)         ].shape[0]
   [1375](~/src/transport_performance/gtfs/validation.py:1375) else:
-> [1376](~/src/transport_performance/gtfs/validation.py:1376)     impacted_rows = table_map[table].copy().iloc[rows]
   [1378](~/src/transport_performance/gtfs/validation.py:1378) # create the html to display the impacted rows (clean possibly)
   [1379](~/src/transport_performance/gtfs/validation.py:1379) table_html = f"""
   [1380](~/src/transport_performance/gtfs/validation.py:1380) <head>
   [1381](~/src/transport_performance/gtfs/validation.py:1381)     <link rel="stylesheet" href="styles.css">
   (...)
   [1390](~/src/transport_performance/gtfs/validation.py:1390)             {msg_type}</span>
   [1391](~/src/transport_performance/gtfs/validation.py:1391) </h1>"""

KeyError: 'multiple_stops_invalid'

This should be fixed now. It was the same error as before (an unfinished TODO)

@r-leyshon r-leyshon self-assigned this May 1, 2024
Copy link
Contributor

@r-leyshon r-leyshon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked reports against fixtures. Behaves as expected with a few caveats. Please see issues below for reference:

#262
#263
#264

@r-leyshon r-leyshon merged commit 4300b44 into dev May 1, 2024
9 checks passed
@r-leyshon r-leyshon deleted the 185-html-report-always-cleaning branch May 1, 2024 13:13
github-actions bot pushed a commit that referenced this pull request May 1, 2024
* feat: add doctstring;add param;add type defence

* feat: add condition on whether to clean or not

* fix: fixed small typo in error message

* fix: changed match in test to fit error message

* fix: update gtfs attr table for html reports

* test: Refactor tests that are now warning about expired feed

* refactor: Test targets new row thanks to feed expired warning

* refactor: Tests assert against fixture with expired feed warning

---------

Co-authored-by: Sergio Recio <[email protected]>
Co-authored-by: r-leyshon <[email protected]> 4300b44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

html_report() always cleaning
4 participants