[FEATURE] Add utility to abbreviate road names #14

Open
3 tasks
msbarry opened this issue Oct 29, 2021 · 23 comments

Comments

@msbarry
Contributor

msbarry commented Oct 29, 2021

When implementing the basemap layer, I did not port road name abbreviations from https://github.com/giggls/mapnik-german-l10n because of the licensing and just pass road names from OpenStreetMap directly. This is not ideal because clients like MapLibre won't show labels on a line if we can't fit the entire string, so overall shorter label names will increase label density.

I think it would be best to abbreviate all road names (not just long ones), for example:

Input | Output
Park Avenue | Park Ave
Northeast Boulevard | NE Blvd
East 61st Street | E 61st St
First Street | 1st St
South Park Street | S Park St
South Street | South St

Planning to use https://github.com/Project-OSRM/osrm-text-instructions/tree/master/languages/abbreviations as a starting data source. Other possibilities are the OpenCageData address formatter and geocoder-abbreviations, but search-specific datasets will likely be too aggressive.

Subtasks to build/test:

  • (coordinate) => (admin level 2/admin level 4) or (adm2/adm4)[] - this will support pluggable datasets (natural earth, overture division areas) to choose a resolution and POV and fall back to country-coder geojson files. It will also be usable for profiles to get the country/region for a feature for other purposes besides abbreviation
  • (element, country, region?) => language or languages[] to infer what language a name tag is based on its location and other tags on it
  • (name, language, country?, region?) => abbreviation - that applies rules to a name based on that language
@1ec5

1ec5 commented Feb 16, 2022

mapnik-german-l10n doesn’t produce very good results in English because it has to avoid stepping on the toes of the French abbreviation code and also abbreviates words out of context (e.g., “Court Street” becomes “Ct Street”). I’d imagine it would be straightforward to write more robust abbreviation code from scratch. Ideally, the abbreviator would know the country that the feature is in, allowing it to make language-specific assumptions about name and perhaps avoid abbreviating a French name in France that was copied to name:en.

/ref openmaptiles/openmaptiles#1360

@1ec5

1ec5 commented Feb 16, 2022

Some possible data sources:

These projects have different use cases, so they apply different inclusion criteria. The abbreviations in OSRM Text Instructions are used by the Mapbox Directions API, which tags words in name or destination with potential abbreviations that the Mapbox Navigation SDK can progressively apply [1] until the text fits the allotted space. Priority is given to the directions abbreviations, then the classifications abbreviations, and finally the abbreviations table itself, in a last-ditch attempt to make the text fit.

Mapnik has a similar ability to progressively abbreviate labels, but as you’ve noted, MapLibre does not yet have this capability. If you use OSRM Text Instructions’ abbreviations without progressive abbreviation, avoid the abbreviations table, which would make the results look desperate but probably wouldn’t salvage many labels of borderline length.
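A minimal sketch of that progressive scheme, for illustration only. The table entries are hypothetical samples rather than the actual OSRM Text Instructions data, the function name `abbreviate_to_fit` is made up, and character count stands in for rendered label width. It also applies a whole tier at once, which may be coarser than what the Navigation SDK does.

```python
# Hypothetical sample tables, ordered from least to most aggressive.
# These are illustrative entries, not the actual OSRM Text Instructions data.
TIERS = [
    ("directions", {"north": "N", "south": "S", "east": "E", "west": "W"}),
    ("classifications", {"street": "St", "avenue": "Ave", "boulevard": "Blvd"}),
    ("abbreviations", {"park": "Pk", "mount": "Mt"}),
]


def abbreviate_to_fit(name, max_chars):
    """Apply one tier at a time, stopping as soon as the text fits.

    Character count is an assumption standing in for rendered width.
    """
    words = name.split()
    for _tier_name, table in TIERS:
        if len(" ".join(words)) <= max_chars:
            break  # already fits, so skip the more aggressive tiers
        words = [table.get(word.lower(), word) for word in words]
    return " ".join(words)
```

Dropping the last tier from `TIERS` corresponds to the advice above about skipping the abbreviations table when there is no progressive fallback.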

Footnotes

  1. This is a link to the last open-source version of the code in question.

@msbarry
Contributor Author

msbarry commented Feb 16, 2022

Thanks for the feedback @1ec5 ! Of those 3 options above, I was leaning towards geocoder-abbreviations since it includes things like "one" -> "1" and "fifteenth" -> "15th" with something like: point -> country -> default language for that country -> tokens file for that language. Do you foresee any issues going that route? Or alternatively are you aware of any other better data sources to power these kinds of abbreviations?

@1ec5

1ec5 commented Feb 16, 2022

geocoder-abbreviations is more comprehensive; however, it doesn’t distinguish the more aggressive abbreviations from the less aggressive ones. You’d need to be careful about applying abbreviations only to words that are unlikely to be the base name, so that “South Park Street” would get abbreviated as “S Park St” rather than “S Pk St”, which wouldn’t be very recognizable. This touches on a broader problem with OSM’s unstructured representation of street names, combined with insisting on spelled-out words in name, but some heuristics like avoiding abbreviating the middle word(s) could help.
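One way the positional heuristic could look, as a rough sketch. It assumes English names, tiny hypothetical lookup tables, and a made-up function name `abbreviate_english`; it only touches a leading direction, a trailing street type, and spelled-out ordinals, and leaves middle words alone.

```python
# Hypothetical, deliberately small lookup tables for English.
DIRECTIONS = {
    "north": "N", "south": "S", "east": "E", "west": "W",
    "northeast": "NE", "northwest": "NW", "southeast": "SE", "southwest": "SW",
}
SUFFIXES = {
    "street": "St", "avenue": "Ave", "boulevard": "Blvd",
    "drive": "Dr", "court": "Ct", "road": "Rd",
}
ORDINALS = {"first": "1st", "second": "2nd", "third": "3rd"}


def abbreviate_english(name):
    """Abbreviate by position so the base name stays recognizable."""
    words = name.split()
    if len(words) < 2:
        return name
    out = list(words)
    # Street type: only when it is the last word ("Court Street" keeps "Court").
    last = words[-1].lower()
    if last in SUFFIXES:
        out[-1] = SUFFIXES[last]
    # Direction: only when it is the first word and something else remains
    # between it and the last word to act as the base name.
    first = words[0].lower()
    if first in DIRECTIONS and len(words) >= 3:
        out[0] = DIRECTIONS[first]
    # Ordinals: replace spelled-out ordinals anywhere except the last word.
    for i in range(len(words) - 1):
        ordinal = ORDINALS.get(words[i].lower())
        if ordinal:
            out[i] = ordinal
    return " ".join(out)
```

Under this rule a two-word name keeps its leading direction, so "Northeast Boulevard" comes out as "Northeast Blvd" rather than the "NE Blvd" listed in the issue description; which behavior is preferable is still an open question.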

@msbarry
Contributor Author

msbarry commented Feb 16, 2022

Good points, I'll keep a running list of test cases that should pass at the top of this issue.

Another thought I had was using libpostal (https://github.com/openvenues/jpostal) to try to extract some more structure from raw street names. Not sure it would handle street names and not addresses though...?

@msbarry
Contributor Author

msbarry commented Jul 29, 2022

Moved the openmaptiles profile to https://github.com/openmaptiles/planetiler-openmaptiles. This ticket will track adding the generic capability of abbreviating road-names, and openmaptiles/planetiler-openmaptiles#17 will track using that from the openmaptiles profile.

@1ec5

1ec5 commented Sep 10, 2024

@msbarry
Contributor Author

msbarry commented Sep 10, 2024

Thanks - I was wondering if this might even make more sense as a client-side capability, possibly requiring a new hook added to MapLibre. The raw data is the full-length name, and it seems like a client may want to render either the abbreviated or the unabbreviated version from the same raw tile data.

Either way, we need a datasource that describes the abbreviation rules by locale...

@1ec5

1ec5 commented Sep 10, 2024

The client-side renderer doesn’t have as much context as the tile generator – and the string processing capabilities inside MapLibre are comically limited. I think ideally the tiles would contain both the unabbreviated and abbreviated forms, maybe even a series of progressively abbreviated forms, that the client would apply as space allows.

In osm-americana/openstreetmap-americana#793 (comment), I started prototyping a purely client-side solution for dropping less important words from a street name, which is a similar problem. Identifying the words to delete and then deleting them is massively difficult, but choosing the best-fitting name turned out to be quite feasible, with the caveat of increased CPU usage due to extra symbol placement.

If the tiles already contain the candidate names, displaying the best one should be pretty straightforward with MapLibre’s existing capabilities. If MapLibre then adds a built-in way to try different labels as space allows, as Mapnik does, then it would be all that much easier.
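To illustrate the selection step only: a small sketch that picks the longest candidate that fits. The function name `best_fitting` is hypothetical, and character count again stands in for the width that symbol placement would actually measure; this is not MapLibre code.

```python
def best_fitting(candidates, max_chars):
    """Return the longest candidate that fits, or None if none does.

    `candidates` is the list of forms assumed to be stored in the tile,
    from unabbreviated to most abbreviated.
    """
    fitting = [c for c in candidates if len(c) <= max_chars]
    return max(fitting, key=len) if fitting else None
```

For example, with candidates "South Park Street", "S Park Street", and "S Park St", a budget of 13 characters selects "S Park Street", and a budget of 5 selects nothing, so the label would be dropped.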

@wipfli
Contributor

wipfli commented Sep 10, 2024

The feature properties transform hook could be good for this. I should finish the pull request at some point... maplibre/maplibre-gl-js#4199

@1ec5

1ec5 commented Sep 10, 2024

The transform hook would be desirable – it would make it easier to work around the lack of string processing expression operators. However, there still isn’t very much context, so the client-side code would need to make lots of assumptions, as I did in my proof of concept above. As with the other stuff that OSM Americana does with runtime styling, it isn’t very portable and makes it harder to integrate the stylesheet into an application centered around “basemap layers”. A tile generator is in a much better position to precalculate the abbreviations. It can always stick them in a separate field, just as with localized names, to maximize the tileset’s versatility.

@msbarry
Contributor Author

msbarry commented Sep 11, 2024

When you say "context" are you referring to just the country/default language that the feature is in?

Including both abbreviated and unabbreviated names in the tiles solves my concern of letting the client choose which one to use. We could also possibly default to name:abbreviation, short_name, or ref when available instead of computing it?

What do you think the best way to get started on this would be @1ec5 ? TBH the main reason I haven't done this yet is that your earlier comments made it seem like it might not even be possible to do correctly 😆 Maybe we could start with a conservative subset of one of those data sources (ie. just the ones in your label density branch) and gradually add more?

@1ec5

1ec5 commented Sep 11, 2024

> When you say "context" are you referring to just the country/default language that the feature is in?

Yes, the language and country are important context. Otherwise, we don’t know whether to abbreviate “boulevard” as “Bd.” (French) or as “Blvd.” (English), or whether to abbreviate “Calle” to “C.” (Spanish, but not when borrowed into English). For schemas such as OpenMapTiles, we know the language of most of the name tags but not name, so that would have to come from heuristics, like finding a matching name:* tag or a matching Wikidata label, or detecting certain character patterns, and then breaking any ties based on the country it’s in.
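A tiny sketch of why the language has to be an input. The tables only contain the three examples from this comment, and the names `RULES` and `abbreviate_word` are made up for illustration.

```python
# Per-language tables holding only the examples mentioned above.
RULES = {
    "en": {"boulevard": "Blvd."},
    "fr": {"boulevard": "Bd."},
    "es": {"calle": "C."},
}


def abbreviate_word(word, language):
    """Look up a word in the table for the given language.

    Unknown languages and unknown words are returned unchanged, so a
    Spanish word borrowed into an English name is left alone.
    """
    return RULES.get(language, {}).get(word.lower(), word)
```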

> We could also possibly default to name:abbreviation, short_name, or ref when available instead of computing it?

Yes, short_name should be one candidate abbreviation, though maybe not the final abbreviation. For example, a street named Dr. Martin Luther King Dr. might have a short name of “MLK Drive”.

> TBH the main reason I haven't done this yet is that your earlier comments made it seem like it might not even be possible to do correctly 😆 Maybe we could start with a conservative subset of one of those data sources (ie. just the ones in your label density branch) and gradually add more?

My branch took advantage of the style’s focus on North America and only targeted English users, but if Planetiler has context about the country and can guess the language, then it doesn’t have to make such assumptions and doesn’t need to be particularly conservative. That said, a small hard-coded list is better than nothing, as long as it’s a coherent list, rather than the arbitrary one OpenMapTiles uses.

@msbarry
Contributor Author

msbarry commented Sep 12, 2024

Thanks, so a rough approach in planetiler would be:

  • pick a name to start with: prefer short_name, otherwise name:abbreviation, otherwise name
  • collect context about the feature:
    • what country it is in (this can be a controversial topic 😬 )
    • best guess language (match a translation from osm or wikidata, otherwise fall back to ICU language detection breaking ties with predominant language in that country)
  • apply language (and country?) specific rules to abbreviate that name
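The first bullet above could be sketched roughly like this, assuming the OSM tags are available as a plain dict. The function name `pick_start_name` is hypothetical.

```python
def pick_start_name(tags):
    """Pick the name to start abbreviating from.

    Preference order follows the list above: short_name, then
    name:abbreviation, then name. Empty values are skipped.
    """
    for key in ("short_name", "name:abbreviation", "name"):
        value = tags.get(key)
        if value:
            return value
    return None
```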

I guess it would also be possible to generate abbreviations for each name translation like name:de -> name:abbrev:de which would let us bypass language detection but I assume the coverage wouldn't be as high so clients would want to fall back to the main abbreviated name.

Do you think it would be best to start from one of those existing data sources? Or build up a map-rendering-specific abbreviation data source from scratch?

@1ec5

1ec5 commented Sep 12, 2024

The most important heuristic for language identification would be finding a name:* that matches name.
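As a rough sketch of that heuristic, assuming the tags are a plain dict: the function name `guess_language`, the `country_default` parameter, and the two-or-three-letter check used to skip keys like name:abbreviation are all assumptions made for illustration.

```python
import re

# Assumption: a language suffix is 2-3 lowercase letters, which skips
# keys such as name:abbreviation, name:left, or name:etymology.
LANGUAGE_KEY = re.compile(r"^name:([a-z]{2,3})$")


def guess_language(tags, country_default=None):
    """Guess the language of `name` from matching name:* tags.

    Falls back to `country_default` when nothing matches, and uses it
    to break ties when several name:* tags match. Returns None when the
    tie cannot be broken.
    """
    name = tags.get("name")
    if not name:
        return None
    matches = []
    for key, value in tags.items():
        found = LANGUAGE_KEY.match(key)
        if found and value == name:
            matches.append(found.group(1))
    if not matches:
        return country_default
    if len(matches) == 1:
        return matches[0]
    return country_default if country_default in matches else None
```

Returning None on an unresolved tie is the conservative choice: the caller can leave the name unabbreviated.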

> what country it is in (this can be a controversial topic 😬 )

This is about what language name is in for the country in OSM in practice, not what it should be. In cases like India, where name is in English despite the “on the ground” principle, this functionality would go with English. I think more than any controversy, we’ll find many edge cases where there just isn’t a single national language, or where the name contains multiple languages. The conservative approach would ignore these regions for now, since abbreviation isn’t super critical.

In theory, default_language on the surrounding boundary can give you a good idea of the language, but in practice the values are often wrong, and I don’t think it’s a good idea to create that broad and deep a target for vandalism. I think it could be a starting point for a dataset, but I’d be wary of using it uncritically.

> I guess it would also be possible to generate abbreviations for each name translation like name:de -> name:abbrev:de which would let us bypass language detection but I assume the coverage wouldn't be as high so clients would want to fall back to the main abbreviated name.

Coverage for streets would be very low except in some bilingual cities. Also, the OpenMapTiles schema currently only stores road names in English and German, ignoring other languages. A possible half-measure would be to always produce a name:en:abbr based on name even if there’s no name:en, but this circumvents the language fallback rules that are currently the client’s responsibility.

> Do you think it would be best to start from one of those existing data sources? Or build up a map-rendering-specific abbreviation data source from scratch?

Either is fine I guess, but if we use an existing source, it should be one intended for display rather than search. A search-oriented dataset will tend to be too aggressive.

@wipfli
Contributor

wipfli commented Sep 13, 2024

I think if you give a name to an LLM together with a geolocation, it can guess the language quite well.

@msbarry
Contributor Author

msbarry commented Sep 13, 2024

> Either is fine I guess, but if we use an existing source, it should be one intended for display rather than search. A search-oriented dataset will tend to be too aggressive.

Yeah the geocoder one (https://github.com/geocoding/geocoder-abbreviations/) seems like it's going to be too aggressive. The OSRM one (https://github.com/Project-OSRM/osrm-text-instructions/tree/master/languages/abbreviations) seems closer to what we want, except the last update was by @1ec5 6 years ago 😆

To Oliver's point - I don't want to be invoking an LLM from planetiler directly, but I could see using one to generate static config files to start from and maybe test cases. It might be 90% of the way there and we can correct from that.

@1ec5

1ec5 commented Sep 13, 2024

> The OSRM one (https://github.com/Project-OSRM/osrm-text-instructions/tree/master/languages/abbreviations) seems closer to what we want, except the last update was by @1ec5 6 years ago 😆

It’s pretty stable; most languages don’t change their abbreviations very often. The nice thing about that project is that volunteers can sign up to add their language on Transifex today using a decently user-friendly interface.

> To Oliver's point - I don't want to be invoking an LLM from planetiler directly, but I could see using one to generate static config files to start from and maybe test cases. It might be 90% of the way there and we can correct from that.

I don’t think we need to treat OSM like a black box and rely on an LLM to interpret it for us. My only point was that default_language is unreliable – because it lacks visibility and nuance. That nuance is probably captured on this wiki page, which, sure, we could use an LLM to reformat into something more structured. For any gaps in coverage, CLDR maintains an open-source, industry-standard set of country-language mappings, and there’s no shortage of heuristic-based tools for language ID.

@msbarry
Contributor Author

msbarry commented Sep 15, 2024

OK cool, trying to think how to decouple this into simpler subtasks...it seems like there are 3 distinct steps here:

  1. (name, language) => abbreviated name should be relatively straightforward based on a data source like OSRM
  2. (element, country) => language or language[] most complex based on those rules you were describing
  3. (coordinate) => country or country[] can be controversial and I'd rather use a pre-constructed data source than need to build it from the OSM extract (requires an extra 2 passes). I'm thinking either:
    • natural earth - it's only a 5 MB download and you can specify a POV to choose the source, but it could be slightly off from OSM
    • or overture country division areas - 62 MB but should line up with OSM exactly, and it looks like they plan to eventually populate a "perspectives" field that would let you filter on a perspective from a single download

Or do you think it needs to go down to region instead of country in order to infer a language?

I could see building up a set of test cases for 1 and a separate set of test cases for 2 to make sure the rules from that wiki page and heuristics work together to produce the desired language inference. Or do you think we can't really separate 1 and 2 and we'll want the code/test cases to go directly from element and inferred country to abbreviated name? Or even worse that 3 can't be separated and we'd need to go from element and location to abbreviated name?

@1ec5

1ec5 commented Sep 15, 2024

> (element, country) => language or language[] most complex based on those rules you were describing

Yes, and to add to the complexity, we definitely would have to consider subnational divisions such as Québec and the cantons of Switzerland. default_language is defined recursively on subnational divisions, though I would probably cross-reference the wiki just in case.

> (coordinate) => country or country[] can be controversial and I'd rather use a pre-constructed data source than need to build it from the OSM extract (requires an extra 2 passes).

Natural Earth should be fine. We don’t really need precision, because all we’re doing with the data is choosing abbreviations for name. Plus, not every language even abbreviates road names or abbreviates at all. If you’re planning to make it more generic, so that a tileset could expose the geocoded country code on each feature, that would be a different story. Another alternative would be country-coder. It’s in JavaScript, but you could pull out the GeoJSON data.
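For a sense of what a low-precision lookup against extracted GeoJSON could involve, here is a minimal ray-casting sketch. It assumes features with Polygon geometries only (no MultiPolygon) and a hypothetical `country` property; the real property names in country-coder or Natural Earth would need to be checked. A real implementation would also want a spatial index.

```python
def point_in_ring(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside the closed ring?"""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Count edges that a ray heading east from the point crosses.
        if (yi > lat) != (yj > lat):
            crossing = (xj - xi) * (lat - yi) / (yj - yi) + xi
            if lon < crossing:
                inside = not inside
        j = i
    return inside


def countries_at(lon, lat, features):
    """Return the codes of all features whose polygon contains the point.

    A list is returned because overlapping (e.g. disputed) areas can
    yield more than one match. The "country" property is hypothetical.
    """
    codes = []
    for feature in features:
        geometry = feature["geometry"]
        if geometry["type"] != "Polygon":
            continue  # MultiPolygon is left out of this sketch
        outer, *holes = geometry["coordinates"]
        in_hole = any(point_in_ring(lon, lat, hole) for hole in holes)
        if point_in_ring(lon, lat, outer) and not in_hole:
            codes.append(feature["properties"]["country"])
    return codes
```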

Rest assured, just about anything we do here is already going to surpass OpenMapTiles, Mapzen, and I think Mapbox Streets in terms of comprehensiveness.

@msbarry
Contributor Author

msbarry commented Sep 16, 2024

OK cool, so sounds like a revised approach would be:

  1. (coordinate) => (admin level 2/admin level 4) or (adm2/adm4)[] - it's come up in other contexts that it would be helpful to have access to the country or region for other purposes as well, so I'd probably make this a pluggable thing where country-coder json is the built-in fallback but you can optionally choose natural earth or overture for higher-resolution or a specific POV
  2. (element, country, region?) => language or languages[] so there can be overrides per province or canton if the geocoder is detailed enough to provide region as well
  3. (name, language) => abbreviation - the abbreviations only seem to change by language, but would this need to be country or region-aware as well?
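The three steps above could be wired together along these lines. This is a sketch under stated assumptions: the `Place` type, the callable signatures, and the function name `abbreviate_feature` are all hypothetical, and each step is passed in as a function so that the geocoder stays pluggable and each step can be tested separately.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Place:
    """Hypothetical result of step 1: a country and an optional region."""
    country: str
    region: Optional[str] = None


# Hypothetical signatures for the three steps.
Geocoder = Callable[[float, float], Sequence[Place]]
LanguageGuesser = Callable[[dict, Place], Sequence[str]]
Abbreviator = Callable[[str, str, Place], str]


def abbreviate_feature(lon, lat, tags, geocode, guess_languages, abbreviate):
    """Run the three steps, leaving the name as-is whenever a step is ambiguous."""
    name = tags.get("name")
    if not name:
        return None
    places = geocode(lon, lat)
    if len(places) != 1:
        return name  # unknown or overlapping areas: do not guess
    languages = guess_languages(tags, places[0])
    if len(languages) != 1:
        return name  # no single language: do not guess
    return abbreviate(name, languages[0], places[0])
```

Passing the place to the last step leaves room for country-specific or region-specific rules without changing the overall shape.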

@1ec5

1ec5 commented Sep 17, 2024

> the abbreviations only seem to change by language, but would this need to be country or region-aware as well?

I think this is a fair starting assumption, though I wouldn’t be surprised if we eventually find out about country-specific abbreviations in some language.

@msbarry
Contributor Author

msbarry commented Sep 17, 2024

OK got it, I guess it doesn't change the last step much except it could take a country/region as well. This seems doable, the first step I'd get started on when I have time would be the element -> country/region provider.
