Improve the robustness of police.hu crawler #14

Open
snorbi07 opened this issue Oct 22, 2019 · 0 comments
Labels
bug, enhancement

snorbi07 commented Oct 22, 2019

This is a larger issue that aggregates known issues and missing features. During implementation, we need to split it into multiple separate issues or PRs.

Primarily, there are 3 missing features:

  1. Currently, an unsupported format in the parsing phase leads to either an empty string or a crash. This needs to be more consistent and reliable: when a format is unsupported, the affected parts should fall back to a default value. For unsupported corner cases, the event should be logged at warning level, including the string that led to the fallback value, so we can add support for it in the future.
  2. The fetched HTML should also be stored in its raw format, so we don't lose data if the parsing fails for whatever reason. The storage should make it easy to pair the parsed output with the raw HTML it came from.
  3. The crawler loses the timezone information when storing the parsed data. It only supports UTC and does not take the Hungarian timezone into consideration. The stored and parsed data should clearly reflect when it was captured (CEST, CET).
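
Points 2 and 3 could be handled together by naming the raw and parsed files after the same timezone-aware capture time. The following is only a rough sketch, not the crawler's actual code: `capture_stem`, `store_capture`, the `raw/` and `parsed/` folder names, and the JSON extension are all assumptions made for illustration. It relies on the standard `zoneinfo` module (Python 3.9+), which needs the system timezone database or the `tzdata` package.

```python
from datetime import datetime, timezone
from pathlib import Path
from zoneinfo import ZoneInfo

# Hungarian local time; zoneinfo switches between CET and CEST automatically.
BUDAPEST = ZoneInfo("Europe/Budapest")


def capture_stem(captured_at: datetime) -> str:
    """Return a file name stem that keeps the local UTC offset (hypothetical helper)."""
    if captured_at.tzinfo is None:
        raise ValueError("captured_at must be timezone-aware")
    local = captured_at.astimezone(BUDAPEST)
    # %z appends the offset, e.g. +0200 during CEST and +0100 during CET.
    return local.strftime("%Y%m%dT%H%M%S%z")


def store_capture(raw_html: str, parsed_json: str,
                  captured_at: datetime, root: Path) -> tuple[Path, Path]:
    """Write the raw and parsed outputs under the same stem so they can be paired."""
    stem = capture_stem(captured_at)
    raw_path = root / "raw" / f"{stem}.html"        # assumed folder layout
    parsed_path = root / "parsed" / f"{stem}.json"  # assumed output format
    for path, content in ((raw_path, raw_html), (parsed_path, parsed_json)):
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content, encoding="utf-8")
    return raw_path, parsed_path
```

Because both files share one stem, a parsed file can be matched to its raw HTML by name alone, and the offset in the name shows whether the capture happened during CET or CEST.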

Definition of done:

  • Extend the crawler to store the fetched HTML page, not just the result of the parsing, to avoid data loss.
  • Extend the various parsing steps to handle malformed inputs by returning a default value and logging the input that could not be handled.
  • Cover the malformed-input parsing cases with unit tests.
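
As a rough illustration of the second and third bullets, a parsing step with a fallback value and a matching unit test might look like the sketch below. The input format is an assumption: the real strings on police.hu are not shown in this issue, so `parse_queue_minutes`, the `"30 perc"` / `"2 óra"` patterns, and the `crawler.parsing` logger name are all hypothetical.

```python
import logging
import re
import unittest

logger = logging.getLogger("crawler.parsing")  # hypothetical logger name

# Assumed input format: "<number> perc" (minutes) or "<number> óra" (hours).
_QUEUE_TIME = re.compile(r"^\s*(\d+)\s*(óra|perc)\s*$")


def parse_queue_minutes(text, default=None):
    """Parse a queue time into minutes; return `default` for unsupported input."""
    match = _QUEUE_TIME.match(text) if isinstance(text, str) else None
    if match is None:
        # Log the offending input so support for it can be added later.
        logger.warning("Unsupported queue time format, using default: %r", text)
        return default
    value, unit = int(match.group(1)), match.group(2)
    return value * 60 if unit == "óra" else value


class ParseQueueMinutesTest(unittest.TestCase):
    def test_supported_formats(self):
        self.assertEqual(parse_queue_minutes("30 perc"), 30)
        self.assertEqual(parse_queue_minutes("2 óra"), 120)

    def test_malformed_input_falls_back_and_logs(self):
        with self.assertLogs("crawler.parsing", level="WARNING") as captured:
            self.assertIsNone(parse_queue_minutes("n/a"))
        # The warning must contain the input that triggered the fallback.
        self.assertIn("n/a", captured.output[0])
```

The tests can be run with `python -m unittest`. Returning the default instead of raising keeps one malformed field from aborting the whole crawl, while the warning preserves the evidence needed to extend the parser.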
snorbi07 added the bug and enhancement labels Oct 22, 2019
MilanCsore added a commit to MilanCsore/6ar that referenced this issue Mar 30, 2020
snorbi07 pushed a commit that referenced this issue Mar 30, 2020
* WIP:#14_Malformed_Location_Names_and_Queue_times_have_been_handled_and_corner_cases_have_been_covered_with_unit_tests

* #14_Malformed_Location_Names_and_Queue_times_have_been_handled_and_corner_cases_have_been_covered_with_unit_tests
MilanCsore added a commit to MilanCsore/6ar that referenced this issue Apr 14, 2020
MilanCsore added a commit to MilanCsore/6ar that referenced this issue Apr 15, 2020
MilanCsore added a commit to MilanCsore/6ar that referenced this issue Apr 15, 2020
snorbi07 pushed a commit that referenced this issue Apr 22, 2020
* #14 unparsed data is saved in RAW HTML format at the same time as parsed data, to a different folder.

* #14 CLI has been enhanced to save parsed and unparsed files at the same time to different places.
MilanCsore added a commit to MilanCsore/6ar that referenced this issue May 11, 2020
MilanCsore added a commit to MilanCsore/6ar that referenced this issue May 21, 2020
MilanCsore added a commit to MilanCsore/6ar that referenced this issue May 21, 2020