CLDR-17884 Regenerate AddPopulationData, ConvertLanguageData, reduce standard out noise (unicode-org#3965)

I started this ticket because I was seeing a lot of noisy warnings and errors in the regular tests -- I ended up in a rabbit hole with the generated population data. This change updates the data inputs and fixes errors in the scripts so that population data can now be regenerated in a stable way.

### Scripts ran:

* mvn package -DskipTests=true
* Re-ran these scripts. They need to be run more regularly, and some changes happen:
  * java -jar tools/cldr-code/target/cldr-code.jar AddPopulationData # Runs successfully now, some changes happen
  * java -jar tools/cldr-code/target/cldr-code.jar ConvertLanguageData # Runs successfully now, some changes happen
* These scripts do not run:
  * java -jar tools/cldr-code/target/cldr-code.jar WikipediaOfficialLanguages
  * java -jar tools/cldr-code/target/cldr-code.jar GenerateMaximalLocales
* Running tests on GitHub (I still cannot run all of the tests locally)

#### Script output changes

A lot of the script Standard Out messages mentioned in the original ticket are now fixed and will not appear -- mostly from fixing input data sources and a few processing scripts. If there are legitimate errors in the future, the warnings and errors will come back as appropriate.

* Suriname had 2 undistinguished sources of literacy data; the parser now takes the max value of the two
  * one was the overall number
  * the other had filtered institutional data
* Since the aggregate regions from `world_bank_data.csv` are now gone, there are no more warnings about aggregates without country codes, e.g. "Sub-Saharan Africa (all income levels)"

### Data changed:

* `country_language_population.tsv`
  * Fixed some areas where spaces were used that should be tabs -- this affected how scripts parsed Kara-Kalpak, bug introduced in unicode-org#3657
  * Added a `Cantonese (Traditional) yue` row, otherwise `yue` would disappear in the re-generated `supplementalData.xml` -- introduced in unicode-org#3945
* `factbook_gdp_ppp.csv` & `factbook_population.csv`: CIA Factbook data updated and imported using the CSV that's exported by the [CIA's website](https://www.cia.gov/the-world-factbook/references/guide-to-country-comparisons/) -- see also [the old CLDR update documentation](https://cldr.unicode.org/development/updating-codes/updating-population-gdp-literacy).
  * This will update all population counts in `supplementalData.xml`
  * Some stale data was removed from the Factbook, so I added the missing entries to `other_country_data.txt`
* `other_country_data.txt`: Added information that used to be in earlier versions of the CIA Factbook
* `world_bank_data.csv`: Re-generated from [the World Bank Website](https://databank.worldbank.org/reports.aspx?source=world-development-indicators#). See also [the old CLDR update documentation](https://cldr.unicode.org/development/updating-codes/updating-population-gdp-literacy).
  * A big difference is that I correctly read the instructions and did not import the country aggregates, e.g. "Sub-Saharan Africa (all income levels)"
* `alternate_country_names.txt`: Removed skipped names that are no longer needed, since we no longer import the World Bank aggregates

### Consequences for `supplementalData.xml`

* **Official Languages**: the `<language>` territories tag should list the territories where the language is **official**, so some entries were updated. For instance, Mocheno was incorrectly considered an official language of Italy in unicode-org#3665
* **Population counts** are incremented, so some **language population percentages** may increase or decrease if the input data is an absolute value (since the denominator changed)
* **GDPs** also changed
* **Literacy Rates**: some have changed
  * Note there was a wonderful bug where the UN literacy data was mis-parsed, so "96%" would be mis-read as "0.96%" -- I fixed that
* **References**: The two Kara-Kalpak references are now grouped correctly, and the Chinese reference has been given more context too

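
The literacy bug mentioned in the commit message comes down to mixing two scales: some sources report a percentage in the range 0 to 100, while the UN calculation produced a fraction in the range 0 to 1. The following is a minimal sketch of the difference, not the actual `AddPopulationData` code; the class and method names are hypothetical and the counts are made up for illustration.

```java
final class LiteracyScaleSketch {
    /** Returns the literate share as a percentage in the range 0 to 100 (post-fix scale). */
    static double percentLiterate(long literate, long illiterate) {
        double total = literate + illiterate;
        if (total == 0) {
            throw new IllegalArgumentException("No population counted");
        }
        // Multiply by 100 so the value matches sources that already report 0 to 100.
        return literate * 100.0 / total;
    }

    /** Returns the literate share as a fraction in the range 0 to 1 (pre-fix scale). */
    static double fractionLiterate(long literate, long illiterate) {
        return (double) literate / (literate + illiterate);
    }

    public static void main(String[] args) {
        // Illustrative counts only: 96 literate and 4 illiterate people.
        System.out.println(percentLiterate(96, 4)); // prints 96.0
        System.out.println(fractionLiterate(96, 4)); // prints 0.96
    }
}
```

If a caller treats both results as percentages, the second value reads as "0.96%" rather than "96%", which is the symptom described above.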
richgillam · Aug 16, 2024 · b4e6abf
1 parent e4264a8
commit b4e6abf

Showing 12 changed files with 10,593 additions and 4,253 deletions.
diff --git a/common/supplemental/supplementalData.xml b/common/supplemental/supplementalData.xml
diff --git a/tools/cldr-code/src/main/java/org/unicode/cldr/tool/AddPopulationData.java b/tools/cldr-code/src/main/java/org/unicode/cldr/tool/AddPopulationData.java
@@ -85,11 +85,13 @@ static ArrayList<Pair<WBLine, Integer>> parseHeader(final String[] pieces) {
         }
     }
 
-    enum FBLine {
-        Rank,
-        Country,
+    enum FactbookLine {
+        CountryName,
+        CountrySlug,
         Value,
-        Year;
+        DateOfInformation,
+        Ranking,
+        Region;
 
         String get(String[] pieces) {
             return pieces[ordinal()];
@@ -200,6 +202,10 @@ private static void showCountryData(String country) {
                         + percent.format(getLiteracy(country) / 100));
     }
 
+    /**
+     * Gets the percent of people that can read in a particular country. Values are in the range 0
+     * to 100
+     */
     public static Double getLiteracy(String country) {
         return firstNonZero(
                 factbook_literacy.getCount(country),
@@ -275,16 +281,13 @@ private static void loadFactbookInfo(String filename, final Counter2<String> fac
                 new LineHandler() {
                     @Override
                     public boolean handle(String line) {
-                        if (line.length() == 0
-                                || line.startsWith("This tab")
-                                || line.startsWith("Rank")
-                                || line.startsWith(" This file")) {
+                        String[] pieces = splitCommaSeparated(line);
+                        String countryName = FactbookLine.CountryName.get(pieces);
+                        if (countryName.equals("name")) {
                             return false;
                         }
-                        String[] pieces = line.split("\\s{2,}");
                         String code =
-                                CountryCodeConverter.getCodeFromName(
-                                        FBLine.Country.get(pieces), true, missing);
+                                CountryCodeConverter.getCodeFromName(countryName, true, missing);
                         if (code == null) {
                             return false;
                         }
@@ -295,7 +298,7 @@ public boolean handle(String line) {
                             return false;
                         }
                         code = code.toUpperCase(Locale.ENGLISH);
-                        String valueString = FBLine.Value.get(pieces).trim();
+                        String valueString = FactbookLine.Value.get(pieces).trim();
                         if (valueString.startsWith("$")) {
                             valueString = valueString.substring(1);
                         }
@@ -395,7 +398,9 @@ public boolean handle(String line) {
                             return false;
                         }
                         code = code.toUpperCase(Locale.ENGLISH);
-                        String valueString = FBLiteracy.Percent.get(pieces).trim();
+                        String valueString =
+                                FBLiteracy.Percent.get(pieces)
+                                        .trim(); // Values are in the range 0 to 100
                         double percent = Double.parseDouble(valueString);
                         factbook_literacy.put(code, percent);
                         if (ADD_POP) {
@@ -521,7 +526,10 @@ static List<Pair<String, Double>> getUnLiteracy(Output<Boolean> hadErr) throws I
                 continue;
             }
             double total = literate + illiterate;
-            double percent = ((double) literate) / total;
+            double percent =
+                    ((double) literate)
+                            * 100
+                            / total; // Multiply by 100 to put values in range 0 to 100
             result.add(Pair.of(code, percent));
         }
         if (result.isEmpty()) {
@@ -535,8 +543,8 @@ static List<Pair<String, Double>> getUnLiteracy(Output<Boolean> hadErr) throws I
             loadFactbookLiteracy();
             loadUnLiteracy();
 
-            loadFactbookInfo("external/factbook_gdp_ppp.txt", factbook_gdp);
-            loadFactbookInfo("external/factbook_population.txt", factbook_population);
+            loadFactbookInfo("external/factbook_gdp_ppp.csv", factbook_gdp);
+            loadFactbookInfo("external/factbook_population.csv", factbook_population);
             CldrUtility.handleFile("external/other_country_data.txt", new MyLineHandler(other));
 
             loadWorldBankInfo();
@@ -577,7 +585,7 @@ static List<Pair<String, Double>> getUnLiteracy(Output<Boolean> hadErr) throws I
             }
             if (myErrors.length() != 0) {
                 throw new IllegalArgumentException(
-                        "Missing Country values, the following and add to external/other_country_data to fix, chaning the 0 to the real value:"
+                        "Missing Country values, the following and add to external/other_country_data to fix, changing the 0 to the real value:"
                                 + myErrors);
             }
         } catch (IOException e) {
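
The `FactbookLine` enum in this diff names each CSV column by position and reads it with `pieces[ordinal()]`. The sketch below illustrates that pattern in isolation. It assumes the column order implied by the enum; `splitCsv` is a hypothetical stand-in for the tool's `splitCommaSeparated`, the comma stripping in `parseValue` is an assumption (the diff only shows the `$` being removed), and the sample row is made up.

```java
import java.util.ArrayList;
import java.util.List;

final class FactbookColumnsSketch {
    /** Column order assumed from the enum in the diff; each constant reads its own position. */
    enum Column {
        CountryName, CountrySlug, Value, DateOfInformation, Ranking, Region;

        String get(String[] pieces) {
            return pieces[ordinal()];
        }
    }

    /** Hypothetical splitter: handles quoted fields and doubled quotes inside them. */
    static String[] splitCsv(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // escaped quote inside a quoted field
                    i++;
                } else {
                    inQuotes = !inQuotes;
                }
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields.toArray(new String[0]);
    }

    /** Strips a leading "$" (as in the diff) and thousands separators (an assumption). */
    static long parseValue(String raw) {
        String value = raw.trim();
        if (value.startsWith("$")) {
            value = value.substring(1);
        }
        return Long.parseLong(value.replace(",", ""));
    }

    public static void main(String[] args) {
        // Made-up row, for illustration only.
        String row = "\"Exampleland\",\"exampleland\",\"$1,234,000\",\"2023 est.\",\"1\",\"Nowhere\"";
        String[] pieces = splitCsv(row);
        System.out.println(Column.CountryName.get(pieces) + " -> " + parseValue(Column.Value.get(pieces)));
    }
}
```

Tying column names to enum order keeps the parsing code readable, at the cost that any change in the exported column order requires reordering the enum constants.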

diff --git a/tools/cldr-code/src/main/java/org/unicode/cldr/tool/UnLiteracyParser.java b/tools/cldr-code/src/main/java/org/unicode/cldr/tool/UnLiteracyParser.java
@@ -180,9 +180,17 @@ private void handleRecord() {
             throw new IllegalArgumentException(
                     "Inconsistent reliability " + reliability + " for " + thisRecord);
         }
-        final Long old = pa.perLiteracy.put(literacy, getLongValue());
-        if (old != null) {
-            System.err.println("Duplicate record " + country + " " + year + " " + age);
+        final Long new_value = getLongValue();
+        final Long old_value = pa.perLiteracy.put(literacy, new_value);
+        if (old_value != null) {
+            // Suriname is known to include duplicate records, 1 normal and 1 "Excluding the
+            // institutional population"
+            // Resolve this by taking higher value
+            if (country.equals("Suriname")) {
+                pa.perLiteracy.put(literacy, Math.max(old_value, new_value));
+            } else {
+                System.err.println("Duplicate record " + country + " " + year + " " + age);
+            }
         }
     }
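
The duplicate handling in `handleRecord()` above can be read as: always store the new value, then, for the one country known to have two legitimate records, replace it with the larger of the two; any other duplicate keeps the newest value and is reported. A minimal sketch of that logic follows, with a hypothetical helper name, a `String` key type assumed for the map, and a boolean return value standing in for the `System.err` warning.

```java
import java.util.HashMap;
import java.util.Map;

final class DuplicateRecordSketch {
    /**
     * Stores a record and returns true when an unexpected duplicate was seen.
     * Hypothetical helper; the real code prints a warning instead of returning a flag.
     */
    static boolean putRecord(
            Map<String, Long> perLiteracy, String country, String literacy, long newValue) {
        Long oldValue = perLiteracy.put(literacy, newValue);
        if (oldValue == null) {
            return false; // first record for this key
        }
        if (country.equals("Suriname")) {
            // Known duplicate: keep the higher of the two values.
            perLiteracy.put(literacy, Math.max(oldValue, newValue));
            return false;
        }
        return true; // unexpected duplicate; the newest value stays in the map
    }

    public static void main(String[] args) {
        Map<String, Long> suriname = new HashMap<>();
        putRecord(suriname, "Suriname", "Literate", 100L);
        putRecord(suriname, "Suriname", "Literate", 90L);
        System.out.println(suriname.get("Literate")); // prints 100
    }
}
```

Special-casing a single country name keeps the warning active for every other source, so a new duplicate elsewhere in the UN data would still be reported.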
 

diff --git a/...s/cldr-code/src/main/resources/org/unicode/cldr/util/data/country_language_population.tsv b/...s/cldr-code/src/main/resources/org/unicode/cldr/util/data/country_language_population.tsv
@@ -251,7 +251,8 @@ Chad	TD	"15,833,116"	35%	"28,620,000,000"	official	French	fr	26%			https://www.c
 Chile	CL	"17,925,262"	99%	"452,100,000,000"		English	en	9.5%
 Chile	CL	"17,925,262"	99%	"452,100,000,000"		Mapuche	arn	"272,000"			http://en.wikipedia.org/wiki/Mapuche_language
 Chile	CL	"17,925,262"	99%	"452,100,000,000"	official	Spanish	es	98%			"http://en.wikipedia.org/wiki/Demographics_of_Chile#Languages Spanish ""universal"", set to 98%"
-China	CN	"1,384,688,986"	95%	"23,210,000,000,000"		Cantonese (Simplified)	yue_Hans	5.2%	5%		"Mainly in Guangdong Prov, ~70-80 million"
+China	CN	"1,384,688,986"	95%	"23,210,000,000,000"		Cantonese (Simplified)	yue_Hans	5.2%	5%		"Mainly in Guangdong Prov, ~70-80 million. Script unspecified so both listed"
+China	CN	"1,384,688,986"	95%	"23,210,000,000,000"		Cantonese (Traditional)	yue	5.2%	5%		"Mainly in Guangdong Prov, ~70-80 million. Script unspecified so both listed"
 China	CN	"1,384,688,986"	95%	"23,210,000,000,000"	official	Chinese	zh	90%
 China	CN	"1,384,688,986"	95%	"23,210,000,000,000"		English	en	"62,900"
 China	CN	"1,384,688,986"	95%	"23,210,000,000,000"		Gan Chinese	gan	"22,900,000"
@@ -1114,7 +1115,7 @@ Russia	RU	"142,122,776"	100%	"4,016,000,000,000"	official_regional	Erzya	myv	"43
 Russia	RU	"142,122,776"	100%	"4,016,000,000,000"		Finnish	fi	"17,000"
 Russia	RU	"142,122,776"	100%	"4,016,000,000,000"		Ingrian	izh	120
 Russia	RU	"142,122,776"	100%	"4,016,000,000,000"	official_regional	Ingush	inh	"230,000"
-Russia	RU	"142,122,776"	100%	"4,016,000,000,000"		Kara-Kalpak	kaa	0.0006%       https://joshuaproject.net/languages/kaa       
+Russia	RU	"142,122,776"	100%	"4,016,000,000,000"		Kara-Kalpak	kaa	0.0006%			https://joshuaproject.net/languages/kaa
 Russia	RU	"142,122,776"	100%	"4,016,000,000,000"	official_regional	Kabardian	kbd	"440,000"
 Russia	RU	"142,122,776"	100%	"4,016,000,000,000"	official_regional	Karachay-Balkar	krc	"235,000"
 Russia	RU	"142,122,776"	100%	"4,016,000,000,000"		Karelian	krl	"117,000"
@@ -1469,7 +1470,7 @@ Unknown Region	ZZ	0	0%	0		Novial	nov	0	99%		An artificial language.  See http://
 Unknown Region	ZZ	0	0%	0		Toki Pona 	tok	800			https://en.wikipedia.org/wiki/Toki_Pona
 Unknown Region	ZZ	0	0%	0		Volapük	vo	200	99%		"http://en.wikipedia.org/wiki/Volap%C3%BCk Artificial: 'There are an estimated 20-30 Volapük speakers in the world today.'; see also http://www.villagevoice.com/arts/0031,lafarge,16942,12.html"
 Uruguay	UY	"3,369,299"	98%	"78,160,000,000"	official	Spanish	es	88%
-Uzbekistan	UZ	"36,799,756"	99%	"223,000,000,000"		Kara-Kalpak	kaa	2.1%       https://joshuaproject.net/languages/kaa
+Uzbekistan	UZ	"36,799,756"	99%	"223,000,000,000"		Kara-Kalpak	kaa	2.1%			https://joshuaproject.net/languages/kaa
 Uzbekistan	UZ	"36,799,756"	99%	"223,000,000,000"		Russian	ru	14%
 Uzbekistan	UZ	"36,799,756"	99%	"223,000,000,000"		Turkish	tr	"228,000"
 Uzbekistan	UZ	"36,799,756"	99%	"223,000,000,000"	official	Uzbek	uz	85%			"http://en.wikipedia.org/wiki/Uzbek_language#Writing_systems https://www.cia.gov/library/publications/the-world-factbook/geos/uz.html Latin/Cyrillic balance is estimated, based on literacy; younger education now in Latin"
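
The Kara-Kalpak rows above show why spaces in place of tabs matter: a tab-splitting reader sees the percentage and the URL as one field. The sketch below demonstrates this with the two versions of the Russia row from this diff. It is only an illustration using `String.split`, not the parser CLDR actually uses, and the class name is hypothetical.

```java
final class TsvColumnsSketch {
    /** Splits on tabs; the -1 limit keeps trailing empty columns. */
    static String[] columns(String line) {
        return line.split("\t", -1);
    }

    // Row as it was before the fix: spaces between the percentage and the URL.
    static final String BROKEN =
            "Russia\tRU\t\"142,122,776\"\t100%\t\"4,016,000,000,000\"\t\tKara-Kalpak\tkaa\t"
                    + "0.0006%       https://joshuaproject.net/languages/kaa       ";

    // Row after the fix: tabs separate the percentage, two empty columns, and the URL.
    static final String FIXED =
            "Russia\tRU\t\"142,122,776\"\t100%\t\"4,016,000,000,000\"\t\tKara-Kalpak\tkaa\t"
                    + "0.0006%\t\t\thttps://joshuaproject.net/languages/kaa";

    public static void main(String[] args) {
        System.out.println(columns(BROKEN).length); // prints 9
        System.out.println(columns(FIXED).length); // prints 12
        System.out.println(columns(FIXED)[8]); // prints 0.0006%
    }
}
```

With the broken row, the ninth column holds both the percentage and the reference, so neither can be parsed on its own.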

diff --git a/...r-code/src/main/resources/org/unicode/cldr/util/data/external/alternate_country_names.txt b/...r-code/src/main/resources/org/unicode/cldr/util/data/external/alternate_country_names.txt
@@ -142,13 +142,17 @@ SY; Syria; Syrian Arab Republic
 SZ; Eswatini; eSwatini; Swaziland
 SZ; Eswatini;	Swaziland
 
+SH; Saint Helena;	Saint Helena
+SH; Saint Helena;	St. Helena
 SH; Saint Helena; Saint Helena, Ascension, and Tristan da Cunha
+SH; Saint Helena; Saint Helena, Ascension and Tristan da Cunha
 SH; Saint Helena; Saint Helena ex. dep.
 
 TL; East Timor; Timor-Leste
 TL; East Timor; East Timor
 
 TR; Turkey; Turkiye
+TR; Turkey; Turkey (Turkiye)
 TR; ; Turkey
 
 
@@ -198,11 +202,11 @@ RE; ;	Reunion
 PS; ;	Palestinian Territory
 CD; ;	Congo, Democratic Republic
 FX; ;	France, Metropolitan
-SH; ;	St. Helena
 SJ; ;	Svalbard and Jan Mayen Islands
 VA; ;	Vatican
 CW; ;	Netherlands Antilles
 WF; ;	Wallis and Futuna Islands
+WF; ;	Wallis and Futuna
 HM; ;	Heard and McDonald Islands
 PM; ;	St. Pierre and Miquelon
 
@@ -220,34 +224,9 @@ UK;; U.K.
 RS;; Yugoslavia
 KM;; Comros
 
-skip;	skip;	Arab World
-skip;	skip;	Caribbean small states
-skip;	skip;	Country Name
-skip;	skip;	East Asia & Pacific (all income levels)
-skip;	skip;	East Asia & Pacific (developing only)
-skip;	skip;	Euro area
-skip;	skip;	Europe & Central Asia (all income levels)
-skip;	skip;	Europe & Central Asia (developing only)
-skip;	skip;	Heavily indebted poor countries (HIPC)
-skip;	skip;	High income
-skip;	skip;	High income: nonOECD
-skip;	skip;	High income: OECD
-skip;	skip;	Latin America & Caribbean (all income levels)
-skip;	skip;	Latin America & Caribbean (developing only)
-skip;	skip;	Least developed countries: UN classification
-skip;	skip;	Low & middle income
-skip;	skip;	Low income
-skip;	skip;	Lower middle income
-skip;	skip;	Middle East & North Africa (all income levels)
-skip;	skip;	Middle East & North Africa (developing only)
-skip;	skip;	Middle income
-skip;	skip;	OECD members
-skip;	skip;	Other small states
-skip;	skip;	Pacific island small states
-skip;	skip;	Small states
-skip;	skip;	South Asia
-skip;	skip;	Sub-Saharan Africa (all income levels)
-skip;	skip;	Sub-Saharan Africa (developing only)
-skip;	skip;	Sudan (pre-secession)
-skip;	skip;	Upper middle income
-skip;	skip;	Paracel Islands
+419;	Latin America & Caribbean;	Latin America & Caribbean
+419;	Latin America & Caribbean;	Latin America & the Caribbean
+
+# Many of the skipped values below are aggregates from world_bank_data that we can ignore since they don't correspond to UN country groups
+
+skip;	skip;	Paracel Islands