CLDR-15391 Update Canadian census and Fix Cans Script Match #4208

Merged

conradarcturus
Contributor

The original purpose of this change is to update the default language for Canadian Aboriginal syllabics [Cans] from Inuktitut [iu] to Cree [cr], since Cree has a larger population. Understandably, both of these languages are macrolanguages with many variations -- so it's funny to include both the Cree macrolanguage and its constituents. But I think it's better to cover both groupings because, depending on the consumer, they may want [cr] data or constituent data.

While I was doing this I updated all of the Canadian locale data to the 2021 Census. I also added a few missing aboriginal Canadian languages: Woods Cree [cwd] and Western Ojibway [ojw].

See the 2021 Census table here: https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=9810021601

CLDR-15391

  • This PR completes the ticket.

ALLOW_MANY_COMMITS=true

@macchiati
Member

Understandably, both of these languages are macrolanguages with many variations -- so it's funny to include both the Cree macrolanguage and its constituents.

Important: CLDR treats macrolanguage codes as regular languages, identifying each with its most common encompassed language. For example, zh is interpreted as identical to 'cmn' (and is preferred: cmn aliases to zh). So it should be expected that cr (=cwd) appears along with other languages that ISO considers encompassed by cr (e.g. crj).

We test in other cases, and should test here, that we don't have any aliased language codes.
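
A check of this kind could look roughly like the following sketch. The alias table holds only the two pairs mentioned in this thread (cmn to zh, cwd to cr); the function names and the sample code lists are hypothetical and are not part of the CLDR tooling.

```python
# Illustrative subset of language aliases, taken from this thread only:
# CLDR prefers the macrolanguage code over its predominant encompassed language.
LANGUAGE_ALIASES = {
    "cmn": "zh",  # Mandarin is written as zh
    "cwd": "cr",  # Woods Cree is written as cr
}


def canonical_language(code: str) -> str:
    """Return the preferred code, or the code itself if it has no alias."""
    return LANGUAGE_ALIASES.get(code, code)


def find_aliased_codes(codes):
    """Return the codes that should have been written in their preferred form."""
    return sorted(code for code in codes if code in LANGUAGE_ALIASES)


if __name__ == "__main__":
    # crj is encompassed by cr in ISO 639-3 but is not aliased,
    # so it may legitimately appear beside cr.
    print(find_aliased_codes(["cr", "crj", "iu"]))
    # cwd is aliased to cr, so a data file listing it would be flagged.
    print(find_aliased_codes(["cr", "cwd", "crj"]))
```

A test built this way would fail whenever a data file lists an aliased code such as cwd, while still allowing unaliased encompassed codes such as crj.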

@conradarcturus
Contributor Author

@srl295 ah this was a larger can of worms than I anticipated.

So it looks like cr currently defaults to Woods Cree/cwd. But, really, cr is the macrolanguage for all Cree variations: https://iso639-3.sil.org/code/cwd

The original reported bug is that "und_Cans" is matching to "iu_Cans", not "cr_Cans", even though the Cree community is bigger than the Inuktitut community. However, I need to figure out the macrolanguage matching to see how to move forward.
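
For readers unfamiliar with the matching in question, here is a simplified sketch of a likely-subtags lookup. It is an assumption-laden illustration, not the LDML algorithm in full: the table is a tiny hand-written subset (only the und_Cans entry is confirmed by this thread), and `parse` and `add_likely_subtags` are hypothetical helper names.

```python
# Tiny illustrative table; the real data lives in
# common/supplemental/likelySubtags.xml and is far larger.
LIKELY_SUBTAGS = {
    "und_Cans": "iu_Cans_CA",  # the mapping discussed in this thread
    "cr": "cr_Cans_CA",        # assumed entry, for illustration
    "iu": "iu_Cans_CA",        # assumed entry, for illustration
}


def parse(tag):
    """Split 'language[_Script][_REGION]' into (language, script, region)."""
    parts = tag.split("_")
    language, script, region = parts[0], "", ""
    for part in parts[1:]:
        if len(part) == 4:
            script = part  # four-letter subtags are scripts
        else:
            region = part
    return language, script, region


def add_likely_subtags(tag):
    """Fill in missing script/region using the first matching table entry."""
    language, script, region = parse(tag)
    candidates = [
        "_".join(p for p in (language, script, region) if p),
        "_".join(p for p in (language, script) if p),
        "_".join(p for p in (language, region) if p),
        language,
        "_".join(p for p in ("und", script) if p),
    ]
    for key in candidates:
        match = LIKELY_SUBTAGS.get(key)
        if match is None:
            continue
        m_language, m_script, m_region = parse(match)
        # Keep whatever the caller specified; take the rest from the match.
        return "_".join((
            language if language != "und" else m_language,
            script or m_script,
            region or m_region,
        ))
    return tag  # nothing known about this tag


if __name__ == "__main__":
    print(add_likely_subtags("und_Cans"))
    print(add_likely_subtags("cr_Cans"))
```

With this table, a request that specifies only the script resolves through the `und_Cans` entry, which is why changing that single entry changes the default language for the whole script.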

Happy to punt this and get back to this once we are all back in late November.

@macchiati
Member

The handling of the 'macrolanguage' concept was introduced in BCP47 for backwards compatibility, but causes its own — more severe — compatibility problems. If 'zh' truly means 'any Chinese language', then it would be perfectly fine for an implementation to request 'zh' and for us to serve up 'yue' content in LDML. So we have a longstanding policy for macro/encompassed languages that I outlined. See also https://cldr.unicode.org/index/cldr-spec/picking-the-right-language-code.

There are times where we will adjust the aliasing, where there is a strong shift one way or another. But for small changes we favor stability. The key is, just treat 'cr' exactly as if it were 'cwd', and don't use 'cwd'.

@conradarcturus
Contributor Author

Ooof, that makes sense, and I don't want this ticket to generate any more work than it already has, so I am inclined to punt on the macrolanguage conversation and return to the narrow origin of the ticket: making und_Cans match cr_Cans instead of iu_Cans. I'll still keep the official language and population updates, though.

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • common/supplemental/likelySubtags.xml is different
  • common/supplemental/supplementalData.xml is different
  • common/supplemental/supplementalMetadata.xml is no longer changed in the branch
  • common/testData/localeIdentifiers/likelySubtags.txt is different
  • common/testData/localeIdentifiers/localeCanonicalization.txt is no longer changed in the branch
  • tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateLikelySubtags.java is now changed in the branch
  • tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/country_language_population.tsv is different
  • tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/language_script.tsv is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

The main purpose of this change is to update the default language for Canadian Aboriginal syllabics [Cans] from Inuktitut [iu] to Cree [cr], since Cree has a larger population. Understandably, both of these languages are macrolanguages with many variations -- so it's funny to include both the Cree macrolanguage and its constituents. But I think it's better to cover both groupings because, depending on the consumer, they may want [cr] data or constituent data.

While I was doing this I updated all of the Canadian locale data to the 2021 Census. I also added a few missing aboriginal Canadian languages: Woods Cree [cwd] and Western Ojibway [ojw].

See the 2021 Census table here: https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=9810021601
I'll note that Cree is only official in the Northwest Territories (NT). However, [the NT law](https://web.archive.org/web/20090324202430/http://www.justice.gov.nt.ca/PDF/ACTS/Official_Languages.pdf) does not specify which Cree variation. The only Cree language present in NT is Plains Cree [crk], so I infer that is the correct match.
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateLikelySubtags.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Removed hardcoded _CA entries because they aren't necessary since the likely subtags can be derived from the population data.

Also edited the cr_Cans_CA comment because it was overflowing the line and to add more context.
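
The idea of deriving a script's default language from population data can be sketched as below. This is not the logic in GenerateLikelySubtags.java; it is a minimal illustration under stated assumptions. The speaker counts are made-up placeholders (not 2021 Census values), the function name is hypothetical, and the grouping table is a partial list of Cree codes mentioned in this thread.

```python
from collections import defaultdict

# PLACEHOLDER figures for illustration only; these are NOT census values.
ROWS = [
    # (language, script, region, speakers)
    ("iu", "Cans", "CA", 40),
    ("crk", "Cans", "CA", 30),
    ("csw", "Cans", "CA", 20),
    ("crj", "Cans", "CA", 10),
]

# Partial list of codes that ISO 639-3 groups under the cr macrolanguage.
MACROLANGUAGE = {"crk": "cr", "csw": "cr", "crj": "cr"}


def default_language_for_script(rows, script, region, roll_up=False):
    """Pick the language with the most speakers for a script in a region.

    With roll_up=True, encompassed languages are counted under their
    macrolanguage code before comparing totals.
    """
    totals = defaultdict(int)
    for language, row_script, row_region, speakers in rows:
        if row_script != script or row_region != region:
            continue
        if roll_up:
            language = MACROLANGUAGE.get(language, language)
        totals[language] += speakers
    return max(totals, key=totals.get) if totals else None


if __name__ == "__main__":
    # Counted separately, each Cree language is smaller than iu here.
    print(default_language_for_script(ROWS, "Cans", "CA"))
    # Rolled up under cr, the combined total is larger.
    print(default_language_for_script(ROWS, "Cans", "CA", roll_up=True))
```

The two calls show why the macrolanguage question matters for this ticket: whether the Cree languages are counted together or separately can change which language wins for und_Cans.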

CLDR-15391 Restore und_Cans -> iu_Cans_CA

After discussing with the DDL committee, especially with someone who works with the Cree community -- we decided it was best to treat the Cree languages as separate languages and not a macrolanguage. For CLDR's purposes, [cr] is an alias of [cwd] / Woods Cree, so the likely subtag should not be cr_Cans_CA.
@jira-pull-request-webhook

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@conradarcturus conradarcturus marked this pull request as ready for review November 25, 2024 23:39
@conradarcturus
Contributor Author

Per feedback from the CLDR DDL Subcommittee: for the purposes of CLDR, the cr language code is Woods Cree (ISO cwd). Therefore cr_Cans_CA should be considered Woods Cree specifically. Broadly, for the community, the Cree languages are not mutually intelligible, so they don't want Cree to be lumped together as one language anyway.

So I'm not going to complete the original ticket's purpose (sorry!) -- but I would still like this change to update the Canadian Population data to the latest census.

@conradarcturus conradarcturus dismissed macchiati’s stale review November 25, 2024 23:50

Feedback addressed -- treating "cr" Cree as the constituent language Woods Cree, not as a macrolanguage.

@@ -172,6 +172,8 @@ cs Czech primary Latn Latin
csb Kashubian secondary Latn Latin
csw Swampy Cree primary Cans Unified Canadian Aboriginal Syllabics
Member

csw, and probably others, may be secondary Latin (they call it SRO, Standard Roman Orthography).

@conradarcturus conradarcturus merged commit fccc7c2 into unicode-org:main Nov 26, 2024
12 checks passed