Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Be (even) more lenient #5

Open
8 of 10 tasks
simonpoole opened this issue Jun 29, 2016 · 15 comments
Open
8 of 10 tasks

Be (even) more lenient #5

simonpoole opened this issue Jun 29, 2016 · 15 comments

Comments

@simonpoole
Copy link
Owner

simonpoole commented Jun 29, 2016

Potential further improvements could be made by consuming:

  • times with missing trailing zeros
  • times with unusual separators (mainly ".",) (there are slightly less than 1000 such OH specs in the test data)
  • times with "h" as separator, see above
  • 4 digit times
  • PH/SH in weekday list (173 occurrences)
  • utf white space
  • "to" instead of "-"
  • times specified with am/pm
  • interpret "wd" as weekdays
  • translations of "to"

See https://github.com/simonpoole/OpeningHoursParser/blob/master/test-data/oh.txt-fail for list of strings that currently fail in non-strict mode.

@ypid
Copy link

ypid commented Jun 29, 2016

Sounds good. However note that this is not covered by the spec. Refer to opening_hours.js which supports all of those things but will give proper warnings.

@simonpoole
Copy link
Owner Author

@TannerYoung
Copy link

Should this include a closing time of 00:00 as well as 24:00? I've noticed that in a number of nodes and I'm not sure if that is an error in the data, or if this parser isn't handling it properly.

@simon04
Copy link
Contributor

simon04 commented Mar 3, 2020

My comment from https://josm.openstreetmap.de/ticket/18140#comment:17

Concerning "Wd": `Wd is used as a symbol for describing the syntax in ​https://wiki.openstreetmap.org/wiki/Key:opening_hours, but it is not an allowed value. Also, "wd" is negligible concerning the usage statistics: ​https://taginfo.openstreetmap.org/keys/opening_hours#values

@simonpoole
Copy link
Owner Author

simonpoole commented Mar 11, 2020

In principle Wd could be supported in a non-conflicting way, but it would need a clear definition (and even if that boils down to "the days considered as work days is defined by local convention".

Because now the problem is that the only way to produce a OH grammar conforming output would be to transform it to a list of normal weekdays, so say for DACH:

Wd -> Mo,Tu,We,Th,Fr

or would that be

Wd -> Mo,Tu,We,Th,Fr,Sa

? :-)

See also https://en.wikipedia.org/wiki/Workweek_and_weekend

@westnordost
Copy link
Contributor

westnordost commented Aug 22, 2020

Some ideas:

  • also support em dash (—)
  • some values are written like this Mo-Fr 10:00-12:00 Sa 12:00-14:00 (missing separator). Supporting will be problematic in the rare case where it would actually make a difference if either the , or ; makes a difference - where the weekdays collide. So proper handling of it would make it necessary to analyze after the parsing if that is the case, so might be more trouble than it is worth
  • there are some values with fully-written weekday names, "Monday" etc.
  • same for Spanish etc.
    -a is used I think in almost all romanic languages, might be worth to include, f.e. Lunes a Jueves 08:00-16:00

@simonpoole
Copy link
Owner Author

732edc6 adds some translations of "to" for use in non-strict mode.

I've had a look at supporting 4 digit times, but it is a bad case of diminishing returns, ~10 changes in 160'000 examples and two of those would be regressions because of mix up with years. As this would need some fiddling around with semantic lookup ahead to fix, I think we shouldn't do this.

@westnordost
Copy link
Contributor

What do you mean with

~10 changes in 160'000 examples

?

@simonpoole
Copy link
Owner Author

What do you mean with

~10 changes in 160'000 examples

?

It changed the parsing of ~10 of the values from our test set (there are more values that use 4 digit times but they continue to fail for other reasons).

@westnordost
Copy link
Contributor

westnordost commented Jun 19, 2021

Ah well, isn't the problem with the test set, that it only includes unique values? For example, the opening hours Mo-Fr 09:00-17:00 are used 15513 times. So it should be 15513 times more important to be able to parse that than for example Mo-su 09:00-22:00 (used 1 time).

In other words, the question is, how many opening hours use 4 digits?

@simonpoole
Copy link
Owner Author

Ah well, isn't the problem with the test set, that it only includes unique values? For example, the opening hours Mo-Fr 09:00-17:00 are used 15513 times. So it should be 15513 times more important to be able to parse that than for example Mo-su 09:00-22:00 (used 1 time).

In other words, the question is, how many opening hours use 4 digits.

I don't think your argument really holds water, because in the end we are testing constructs, not individual values. And I would say that it is just as important to parse Mo-Fr 09:00-17:00 correctly as it is Mo-su 09:00-22:00.

My argument was that handling 4 digit times causes issues with -legit- values with year values, and that handling illegal values in this case doesn't seem to be worth the effort because the gains are likely to be very small.

@westnordost
Copy link
Contributor

Let's say there is a city that has 15514 shops and all of these have the opening hours Mo-Fr 09:00-17:00, except one, which has Mo-su 09:00-22:00. For an app that is being used in that city to display the opening hours, it would certainly be more important that the former opening hours are parsed correctly than the latter.

The test set does not test constructs, but unique taggings. This is insofar a difference as testing for the construct, you'd need to test times like 0600-1200, 0601-1200, 0602-1200, 0603-1200 etc.

@simonpoole
Copy link
Owner Author

If you want to argue semantics, then lets say "constructs in use", the goal was and is to parse all legal constructs. Given that enumerating all of them is not possible, testing all that are actually used (plus additional specific tests) is a reasonable place holder. Weighting is irrelevant as not parsing a legal construct is a failure regardless of it it occurs once or 1'000'000 times.

But in any case I might actually add this in, because I've found some more cases which can be handled by this which might make the effort of avoiding trashing years worth while.

@westnordost
Copy link
Contributor

westnordost commented Jun 19, 2021

I created this Kotlin script that parses every single instance of an opening hours value in OSM and reports the result. In Java, the code would look very similar. Maybe you are interested in this:

https://gist.github.com/westnordost/b43d2e5100e913cdd2eb8d2d2eb2d296

With the current version of OpeningHoursParser, it outputs

.....
535447 opening hours
Of these, 97.508272% are supported.

@westnordost
Copy link
Contributor

westnordost commented Jun 19, 2021

I just noticed that the retrieved number of opening hours strings by that script is less than one fourth of the actual count of opening hours tagged. Probably just a temporary bug/issue with sophox.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants