Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Additional errors #106

Closed
wants to merge 8 commits into from
Closed

Conversation

tdelmas
Copy link
Collaborator

@tdelmas tdelmas commented Jun 26, 2023

Add non-jsonSchema validations

Checked errors and warnings are listed in README.md

Display a condensed summary of errors and warnings grouped by types and paths

Fix #92

Fix #93 by displaying a warning

Exemple: https://deploy-preview-106--gbfs-validator.netlify.app/validator?url=https%3A%2F%2Fdubai.publicbikesystem.net%2Fcustomer%2Fgbfs%2Fv2%2Fgbfs.json

image

@netlify
Copy link

netlify bot commented Jun 26, 2023

Deploy Preview for gbfs-validator ready!

Name Link
🔨 Latest commit eb20162
🔍 Latest deploy log https://app.netlify.com/sites/gbfs-validator/deploys/64f835537bdeb80007c92200
😎 Deploy Preview https://deploy-preview-106--gbfs-validator.netlify.app/
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@PierrickP PierrickP requested review from davidgamez and richfab June 26, 2023 13:24
@tdelmas tdelmas force-pushed the additional-errors branch from 12da2eb to 013f100 Compare July 12, 2023 11:51
@richfab
Copy link
Contributor

richfab commented Jul 12, 2023

Happy to see the wording nonSchema in the code 👍
Due to limited resources, we will only be able to take a closer look in a few weeks.
Please let us know if we are blocking you in any way.
Thank you for your patience 😌

@richfab
Copy link
Contributor

richfab commented Jul 17, 2023

Hi @PierrickP and @tdelmas,
Thank you very much for this very valuable contribution 🙇

I wonder if we could make the interface more consistent for users when there are both errors and warnings.
What do you think about these UI suggestions? (the "After" screenshots are collages, not implemented)

  • (summary collapsed) Reuse the danger and warning alert components for consistency and visibility
  • (summary expanded) Merge the schema and non-schema error arrays if this distinction is not useful to the users

cc @josee-sabourin @isabelle-dr

State Before After (suggestion)
summary collapsed image image
summary expanded image image

@isabelle-dr
Copy link
Contributor

For what it's worth, I am generally in favor of having the errors (schema and non-schema) in one place, because from the user perspective, it's a spec violation, regardless of if generated by the JSON Schema or the custom code.

@hbruch
Copy link

hbruch commented Jul 25, 2023

Great extension. I just found this PR searching for a check on duplicate_vehicle_type_id, which I encounter currently with this feed, and which could be a nice addition to the nonSchema tests.

@PierrickP
Copy link
Collaborator

@richfab @isabelle-dr i regrouped errors together and update the design

@hbruch I've added a new rule (duplicate_vehicle_type_id) to check duplicated vehicle_type_id on vehicle_types.json

Copy link
Contributor

@richfab richfab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very impressive work @tdelmas !

Thank you very much for this extremely valuable addition to the validator. I am sure many producers will appreciate this.

Your comments on my suggestions are welcome.

cc @PierrickP @davidgamez

gbfs-validator/nonSchemaValidation/check_pricing_plans.js Outdated Show resolved Hide resolved
gbfs-validator/nonSchemaValidation/check_pricing_plans.js Outdated Show resolved Hide resolved
gbfs-validator/gbfs.js Outdated Show resolved Hide resolved
gbfs-validator/gbfs.js Outdated Show resolved Hide resolved
gbfs-validator/gbfs.js Outdated Show resolved Hide resolved
gbfs-validator/nonSchemaValidation/check_vehicle_types.js Outdated Show resolved Hide resolved
gbfs-validator/nonSchemaValidation/README.md Show resolved Hide resolved
gbfs-validator/nonSchemaValidation/README.md Outdated Show resolved Hide resolved
gbfs-validator/nonSchemaValidation/README.md Outdated Show resolved Hide resolved
gbfs-validator/__test__/gbfs.v3.0-RC.test.js Outdated Show resolved Hide resolved
@tdelmas
Copy link
Collaborator Author

tdelmas commented Sep 6, 2023

Thank you all for your detailed review.

I've pushed a commit that apply most suggestions of your reviews, and marked them as "resolved" for readability.

Comments not marked as resolved are not fixed, but as they require more tests, I strongly feel they are more suited in a follow-up pull request.

@richfab
Copy link
Contributor

richfab commented Sep 14, 2023

Thank you @tdelmas for taking most suggestions into consideration. This is amazing!

I will review your changes as soon as possible.

Copy link
Collaborator

@testower testower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left some comments but would like to highlight some issues:

None of the new rules are implemented using partials / schema patching, although after skimming through them, I think a lot of them could well have been (?) - is this a strategic decision? What about the rules that are already implemented that way, are they going to be left like this or ported to the imperative style? The benefit of using the "old" convention is that the output automatically conforms to the regular schema validation error outputs, whereas in the imperative style, they have to be constructed manually. Especially since this project has no type-checking, this is prone to errors.

There is a breaking change in the response format of the validator api - which prompts me to raise the question of standardising this format for portability purposes. If validators respond the same way, frontends and validators can be used interchangeably, and validation reports can be consumed later by any "conformant" UI.

Last, the most serious issue for me, this PR introduces a number of warnings based on assumptions about a number of things that I don't think a generic / canonical GBFS validator should be concerned with. Business rules that depend on operational context should be left out of the validator.

If this validator was a library, one could imagine a feature that allowed library users to provide their own additional rulesets specific to their operational context - but as it stands now, this is just speculative.

Don't get me wrong, I applaud the good intentions behind this, but I don't think it belongs in the validator.

@@ -1,19 +1,92 @@
const o = require('../nonSchemaValidation')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is o the name of this variable?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After skimming the list of "non-schema" rules, I'm a bit curious. There's already the concept of schema-patching in gbfs-validator/versions/partials, and it seems to be that a lot of these rules could very well be implemented there. Why was not that convention followed where this was possible?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very confused by this. Are we producing validation warnings based on some assumptions about what different types of trips typically cost? Where do these assumptions come from? What is the currency? This seems very speculative imho.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I'm skeptical about the concept of giving warnings based on some arbitrary assumptions about what is a high number of available vehicles and docks.

@@ -1,5 +1,7 @@
const fastify = require('fastify')

const last_updated_fresh = Math.floor(Date.now() / 1000) - 30
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a general rule of thumb, testing based on an actual clock could lead to unexpected results and false negatives / positives. Using a mocked clock is preferable.

function build(opts = {}) {
const app = fastify(opts)

app.get('/gbfs.json', async function(request, reply) {
app.get('/gbfs.json', async function (request, reply) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very nit-picky, but why add a whitespace here?

expect(get_errors(result)).toEqual({
errors: [],
nonSchemaErrors: [],
warnings: []
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The taxonomy here feels odd. There are errors and there are warnings. But there are also a specific kind of error, which is not from the schema. Where do warnings come from?

My hope is that validation results become standardised for portability reasons, so I hope we can spend the extra effort to consider a rational api.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running the risk of repeating myself, again, I find the assumptions made here to produce warnings are highly speculative and I don't think they belong in a validator. They are business rules that depend on operational and geographic circumstances. This just adds unnecessary complexity.

@tdelmas
Copy link
Collaborator Author

tdelmas commented Sep 15, 2023

@testower thank you for your review.

I'll try to answer your concerns.

None of the new rules are implemented using partials / schema patching, although after skimming through them, I think a lot of them could well have been (?) - is this a strategic decision?

Yes. Logic errors are much more difficult to write using schemas, and even where they are, it leads to some problems and bugs. For example, if two rules needs to patch the same part of a schema, the results will be unpredictable. Or if the original schema changes. Also, most of them are cross-files, which least to reading a file A, then creating the schema for B and vice versa.

What about the rules that are already implemented that way, are they going to be left like this or ported to the imperative style?

I feel like some should, in the future.

The benefit of using the "old" convention is that the output automatically conforms to the regular schema validation error outputs, whereas in the imperative style, they have to be constructed manually. Especially since this project has no type-checking, this is prone to errors.

I agree, this could be improved in a future PR.

There is a breaking change in the response format of the validator api - which prompts me to raise the question of standardizing this format for portability purposes. If validators respond the same way, frontends and validators can be used interchangeably, and validation reports can be consumed later by any "conformant" UI.

That's a good goal.

Last, the most serious issue for me, this PR introduces a number of warnings based on assumptions about a number of things that I don't think a generic / canonical GBFS validator should be concerned with. Business rules that depend on operational context should be left out of the validator.

Warning are just warnings: here to catch the attention, because it caught something that may be an error. I insist on the may.

I strongly feel they are important because they help to catch subtle mistakes, such as prices in cents instead of base units, or meters vs kilometers, because some feeds had those problems.

If this validator was a library, one could imagine a feature that allowed library users to provide their own additional rulesets specific to their operational context - but as it stands now, this is just speculative.

A separation could be done, it's an interesting idea for a future PR

@testower
Copy link
Collaborator

Let's take a step back to look at first principles. The GBFS validator is, according to the README

A General Bikeshare Feed Specification dataset validator

furthermore

The Canonical GBFS Validator is a tool to check the conformity of a GBFS feed against the official specification.

The word "canonical" is instructive if taken to mean: "according to the canon", the canon being the "the standard, rule or primary source that is accepted as authoritative"[1]. The "canon" in this context is obviously the GBFS specification.

I will interpret this to mean that the validator is intended to be isomorphic[2] to the specification it validates against. Any departure from that principle should be seriously discussed.

Let me elaborate on why I think it's a bad idea to introduce validation rules that break this isomorphism. My argument is that any validation rule that doesn't have a corollary in the specification, is a business rule only relevant for a specific operational context.

So while it may be useful to get a warning if a car is reported to have room for more than 5 passengers, or a moped has a max range of more than 100km, or a car rental costs more than 50 USD/EUR for an hour (?) - just to name a few examples - these are useful in a specific operational context (perhaps yours?), but it's hard to argue that they are universal truisms about the domain over which the GBFS specification is meant to govern. If they were, they would have been encoded in the specification already.

Now I'm not saying that such rules don't have merit, quite the contrary. But I'm saying that they belong in the specific systems that operate in the specific contexts that give rise to their usefulness. Anywhere else, they create clutter and noise.

For the sake of argument, I shall entertain the idea that these rules should be in the validator. Now, what process or rationale should determine what these rules and their threshold values should be? Is it consensus-based, or democratic? The greatest common denominator of all producers? Who shall bear the burden of maintaining these over time?

As an exercise, imagine that ever more currencies are added to the "typical cost" structure. How is it possible to keep this structure updated and relevant as prices change and currencies inflate and deflate.

I believe that the best candidates for taking that responsibility are those who operate in the context where those rules, thresholds and values are useful and relevant. Offloading that to the canonical validator is a recipe for future decay and ultimately complete irrelevance of the validator as a general-purpose tool to validate against the specification.

If I were to pretend to be positive about accepting these new rules, I would still expect the following preconditions to be met:

  • The codebase clearly separates them from other schema or non-schema specification rules
  • The UI clearly separates the output in the same way. Marking them as warnings is not sufficient. They must be clearly labelled as not being part of the specification validation, and that they should be interpreted as advisory
  • Each specific rule should be opt-in as opposed to all of them opt-out (or as it is now, non-optional)
  • Each specific rule must be documented both in code and in the UI, the purpose must be clear
  • The threshold values should be configurable

Even with these preconditions met, I'm not enthusiastic about them, since they still introduce a maintenance burden. I.e. someone must take responsibility for keeping them up-to-date and relevant, and someone must make sure that the total set of such non-specification rules is kept at a reasonable level. I'm particularly skeptical about the "typical cost" check. For one because it equates USD with EUR which could quickly change - but more importantly because it opens the possibility to add checks for any number of currencies, which, needless to say, would very quickly grow to unmanageable scales. Barring additional currencies from being added to avoid this problem, just amounts to discrimination, so I just don't see a way to accept it in the first place.

Related to the point I made about the output from the validator being standardized, I'm not sure how this fits in. The taxonomy is very unclear and ambiguous about what constitutes a validation error based on the specification and what is a result of an opinionated "sanity check".

All in all, I think this marks a very significant shift in the goal and purpose of the validator that we should not take on lightly. I ask the community to think very carefully about accepting these kinds of changes.

[1] https://en.wikipedia.org/wiki/Canonical
[2] Isomorphic in the set-theoretic sense: The set of validation rules should be identical to the set of rules in the specification.

@tdelmas
Copy link
Collaborator Author

tdelmas commented Sep 18, 2023

So while it may be useful to get a warning if a car is reported to have room for more than 5 passengers, or a moped has a max range of more than 100km, or a car rental costs more than 50 USD/EUR for an hour (?) - just to name a few examples - these are useful in a specific operational context (perhaps yours?)

As I said, those rules are not here to detect "high" or "low" prices, but cases when the producer made a conversion error (between meters vs kilometers, $ vs cents, etc.). But yes, it's not a possible to detect that perfectly.

@testower Is there a list of rules we can remove from this PR, so it could be acceptable for you?

@JohanEntur
Copy link

@tdelmas The issue is that there is a severe break in principle between a format-driven validation, and inclusion of subjective detection of possible user errors. It's not about what is acceptable for @testower personally, it's about having a whitelabel validator that works equally well for any user. That is the main point here.

Validation rules (the warnings), like these, do have a role to play but they would work better as a separate validator, perhaps as a template that each user can configure for local use. That way, each user can define their own limitations in weight, size, doors, and prices according to their own needs - but this still has nothing to do with the GBFS standard.

Our intention is not to be troublesome but to ensure that certain principles are maintained.

@testower
Copy link
Collaborator

@testower Is there a list of rules we can remove from this PR, so it could be acceptable for you?

Yes, all the warnings.

@richfab
Copy link
Contributor

richfab commented Sep 20, 2023

Hello everyone,

Thank you all for sharing your views on the validator with such enthusiasm and quality arguments. It’s very exciting to see that this tool is getting the attention it deserves from the community.

Thank you @testower for reminding the purpose of the canonical GBFS validator, which is to validate the conformity of a GBFS feed with the specification. MobilityData will make sure to keep this at the forefront of our decisions.

To ensure isomorphism between the specification and the validator, MobilityData recommends continuing to include in the validator only the rules that are in the specification.

The data sanity checks proposed by @tdelmas is a very good idea and could have a place in a separate validator configurable by users, as suggested by @JohanEntur.

Here is MobilityData’s proposal:

  1. The validator should return an error for these rules defined as mandatory in the specification:
Rule Proposal GBFS spec extract
last_updated_future Keep in this PR "Indicates the last time data in the feed was updated."
additional_properties Keep in this PR "Field names of extensions SHOULD be prefixed with an underscore ( _ ) character”
missing_translation Keep in this PR "Translations MUST be provided for each supported language for all translatable fields of type Array"
unknown_language Keep in this PR "Must match one of the values specified by the languages field in system_information.json.”
missing_default_pricing_plan_id Keep in this PR "REQUIRED if system_pricing_plans.json is defined.”
missing_station_information Keep in this PR "Any station that is represented in station_status.json MUST have a corresponding entry in station_information.json.”
missing_station_status Keep in this PR "Any station that is represented in station_information.json MUST have a corresponding entry in station_status.json.”
duplicate_station_id Keep in this PR "Array that contains one object per station as defined below.”
default_reserve_time Keep in this PR "REQUIRED if reservation_price_per_min or reservation_price_flat_rate are defined.”
unclosed_polygon Keep in this PR "The first and last positions are equivalent, and they MUST contain identical values (https://datatracker.ietf.org/doc/html/rfc7946#section-3.1.6)”
duplicate_bike_id Keep in this PR "Array that contains one object per vehicle”
duplicate_vehicle_id Keep in this PR "Array that contains one object per vehicle”
invalid_vehicle_type_id Keep in this PR "The vehicle_type_id of this vehicle, as described in vehicle_types.json.”
duplicate_vehicle_type_id Keep in this PR "Array that contains one object per vehicle type”
unknown_plan_id, missing_system_pricing_plans and invalid_pricing_plan_id Combine the 3 alerts as they serve the same purpose "A plan_id, as defined in system_pricing_plans.json”
  1. The validator should return a warning for these recommendations defined in the specification:
Rule Proposal GBFS spec extract
last_updated_outdated Return a warning instead of an error "in no case should it be more than 5 minutes out-of-date”
num_vehicles_available_incorrect Return a warning instead of an error "The total number of vehicles from each of these objects SHOULD add up to match the value specified in the num_vehicles_available field.”
num_docks_available_incorrect Return a warning instead of an error "The total number of docks from each of these objects SHOULD add up to match the value specified in the num_docks_available field.”
  1. The validator should not return any error or warning for something that is not currently part of the specification: duplicate_translation, duplicate_vehicle_accessory, multiple_door_counts, num_docks_available_high, num_vehicles_available_high, station_capacity_too_low, unexpected_rider_capacity, unexpected_cargo_volume_capacity, unexpected_cargo_load_capacity, unexpected_propulsion_type, unexpectedly_low_range_meters, unexpectedly_high_range_meters, unexpected_vehicle_accessory, unexpectedly_low_wheel_count, unexpectedly_high_wheel_count, low_cost, high_cost.
    We suggest creating a separate issue to keep a list of potential errors/warnings that would be outside of the specification where others can add so that we have a good basis when/if we start to add data sanity checks to the validator.

Please let us know what you all think.
Thank you!

@testower
Copy link
Collaborator

@richfab I think that's a very good solution and it answers my concerns 🥇 if these changes are made I think we can move on to the more technical aspects of the review

@richfab
Copy link
Contributor

richfab commented Oct 5, 2023

Hi @tdelmas, please let me know if you have any questions or concerns about the proposal. Thank you!

@tdelmas
Copy link
Collaborator Author

tdelmas commented Oct 6, 2023

Thank you for the feedbacks and your ping.

  • I don't have any objections to those changes, I understand the rationale behind it.
  • Unfortunately, I don't have the time to work on it at the moment. If any of you want to take over this PR it will be appreciated

@richfab richfab added the help wanted Extra attention is needed label Oct 6, 2023
@richfab richfab mentioned this pull request Oct 19, 2023
@richfab
Copy link
Contributor

richfab commented Oct 19, 2023

Thank you @tdelmas and @testower. Since consensus was reached, I am planning to work on removing the code related to data sanity rules from this PR next week. The code will not be lost as it will remain in the commits.

The discussion about data sanity rules was moved to #138.

Thank you!

This was referenced Oct 24, 2023
@richfab
Copy link
Contributor

richfab commented Oct 24, 2023

if these changes are made I think we can move on to the more technical aspects of the review

@testower @tdelmas A technical review would be very welcome in #141, if you have the bandwidth.

Thank you in advance for making the GBFS validator even more useful to the community 🙌

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for v3.0 conditionally required fields Validator does not flag additional fields as invalid
7 participants