Finishes scrape, adds restart command #340

lfashby · 2021-01-29T17:29:46Z

Updated Unreleased in CHANGELOG.md to reflect the changes in code or data.

This adds Cantonese, Russian, and Chinese data. Also adds Min Nan data, which necessitated modifying _skip_word() in wikipron/scrape.py and updating test_wikipron/test_scrape.py slightly.

Finally, this adds restart functionality to the wikipron module, which was used to scrape the data for the aforementioned languages. The addition of this restart functionality required some simplification of src/scrape.py, and modifications to config.py and wikipron/scrape.py.
This restart functionality assumes that timeouts and connection errors only ever occur when connecting to a new headword (the get request in _scrape_once()). It therefore records the sortkey associated with each headword before the get request and, should that request fail, restarts the scrape from that sortkey. In this way, it never prints duplicate entries.

Based on my testing, this assumption about where we experience timeouts appears to be correct. I never experienced any connection loss in the middle of scraping a headword (say, while scraping the 2nd of 4 phonetic transcriptions for a headword) or from our get request that connects to the Wiktionary API. I suppose that doesn't mean either of those is impossible (though I think the former might be), but should we find evidence of them we can revise the restart functionality.
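To make the restart idea above concrete, here is a minimal, self-contained sketch. It is not the code in wikipron/scrape.py: the names `scrape_with_restart`, `list_headwords`, `fetch_entry`, and `FlakyConnection` are hypothetical, the network is simulated with an in-memory dictionary, and it assumes the headword listing can be resumed from a given sortkey (inclusive).

```python
import time
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical stand-in for a listing of category members: (sortkey, headword).
Page = List[Tuple[str, str]]


class FlakyConnection(Exception):
    """Simulated timeout/connection error (an assumption for this sketch)."""


def scrape_with_restart(
    list_headwords: Callable[[Optional[str]], Page],
    fetch_entry: Callable[[str], List[str]],
    max_retries: int = 3,
    wait: float = 0.0,
) -> Iterator[Tuple[str, str]]:
    """Yields (headword, pron) pairs, resuming from the last sortkey on failure."""
    start_key: Optional[str] = None
    retries = 0
    while True:
        try:
            for sortkey, headword in list_headwords(start_key):
                # Record the position *before* the request that may fail.
                start_key = sortkey
                prons = fetch_entry(headword)  # May raise FlakyConnection.
                retries = 0
                for pron in prons:
                    yield headword, pron
            return
        except FlakyConnection:
            retries += 1
            if retries > max_retries:
                raise
            time.sleep(wait)


# Simulated data: "b" fails once before succeeding.
ENTRIES = {"a": ["/a/"], "b": ["/b/", "/be/"], "c": ["/c/"]}
failures = {"b": 1}


def list_headwords(start: Optional[str]) -> Page:
    # Returns every entry at or after the given sortkey.
    return [(k, k) for k in sorted(ENTRIES) if start is None or k >= start]


def fetch_entry(word: str) -> List[str]:
    if failures.get(word, 0) > 0:
        failures[word] -= 1
        raise FlakyConnection(word)
    return ENTRIES[word]


results = list(scrape_with_restart(list_headwords, fetch_entry))
```

Because the sortkey is saved before the failing request and nothing has been yielded yet for that headword, resuming from the saved sortkey repeats no output. If a failure could occur partway through a headword's pronunciations, this sketch would emit duplicates, which is the case the description says has not been observed.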

Closes #253
Closes #57

lfashby · 2021-01-29T17:43:13Z

Hi @jacksonllee, if you have some free time, your feedback on the changes I've made to config.py and wikipron/scrape.py in order to implement the restart functionality would be most appreciated.

kylebgorman

A few minor comments but other than that LGTM, this is great.

Once this is in are we ready to release?

kylebgorman · 2021-01-29T18:20:49Z

wikipron/scrape.py

@@ -1,11 +1,13 @@
 import re
 import unicodedata
-from typing import cast
+from typing import cast, Dict, Any


Very minor note but could you sort these lexicographically?

kylebgorman · 2021-01-29T18:22:05Z

wikipron/scrape.py


 import pkg_resources
 import requests
 import requests_html

+import time


This is a built-in, not 3rd party, module so it should be on line 2.

Also there should be a blank line before the typing imports (but that's a "drive-by" issue that you didn't cause...it was already like that).


kylebgorman · 2021-01-29T18:24:14Z

Could you add "Closes #253" to the description here, Lucas?

kylebgorman · 2021-01-29T18:24:50Z

I would love to hear from Jackson too before we do this.

lfashby · 2021-01-29T19:27:47Z

I think I've addressed your comments suitably. What's the status of #259 given these changes? Should we leave it around to remind us to rework the dash stuff or file a separate issue?

I'd say we are good for a release after this. I'm also in favor of waiting for Jackson before merging this.

kylebgorman · 2021-01-29T20:02:40Z

I would leave #259 open. We should just add that as a config param. Sounds good, let's see if Jackson can take a look this weekend before we submit. But then I'll do a release as soon as this is mergeable.


jacksonllee · 2021-01-29T20:12:52Z

I'm tied up at work at the moment, but can take a look tonight.

On making a new release, Kyle, you may want to take over #235 (I think most if not all items listed there are done, but someone needs to check), for the version bump etc.

kylebgorman · 2021-01-29T22:06:47Z

Will do re: #235.


kylebgorman · 2021-01-30T00:55:22Z

@jacksonllee when I do a release should this be 1.2.0? I don't really feel comfortable making semantic versioning decisions like this...

jacksonllee

The updated restart code is great! For the scraped data, I would recommend dropping Min Nan (as well as the corresponding changes in _skip_word) in this pull request -- details in my comments.


jacksonllee · 2021-01-30T07:11:31Z

data/README.md

@@ -41,11 +41,12 @@
 | [TSV](tsv/bul_phonetic.tsv) | bul | Bulgarian | Bulgarian | True | Phonetic | 6,377 |
 | [TSV](tsv/bur_phonemic.tsv) | bur | Burmese | Burmese | False | Phonemic | 4,636 |
 | [TSV](tsv/bur_phonemic_filtered.tsv) | bur | Burmese | Burmese | False | Phonemic_filtered | 4,631 |
+| [TSV](tsv/yue_phonemic.tsv) | yue | Yue Chinese | Cantonese | False | Phonemic | 87,961 |


I'm pleasantly surprised to see Cantonese data is finally in. Does this mean "Closes #57" should also be included in the pull request description?

I took a look at the scraped data, and (as a native speaker myself) I can confirm it's what I'd expect, with Chinese characters on the orthographic side. The data was scraped using the Cantonese custom extraction code checked in at #277. That is, the data actually came from the entries under the Category:Chinese_terms_with_IPA_pronunciation pages, with the extraction code pointing to the embedded Cantonese pronunciation, rather than from the Category:Cantonese_terms_with_IPA_pronunciation pages, where orthography is represented by the standardized Jyutping romanization system instead of Chinese characters (not really useful, and probably too easy for G2P!). I'm bringing all this up because this is in contrast with Min Nan below...

jacksonllee · 2021-01-30T07:59:01Z

data/README.md

@@ -193,6 +194,7 @@
 | [TSV](tsv/okm_phonemic.tsv) | okm | Middle Korean (10th-16th cent.) | Middle Korean | False | Phonemic | 334 |
 | [TSV](tsv/gml_phonemic.tsv) | gml | Middle Low German | Middle Low German | True | Phonemic | 170 |
 | [TSV](tsv/wlm_phonemic.tsv) | wlm | Middle Welsh | Middle Welsh | True | Phonemic | 144 |
+| [TSV](tsv/nan_phonetic.tsv) | nan | Min Nan Chinese | Min Nan | True | Phonetic | 431 |


I must have missed #259 coming through. Min Nan is another Chinese language with the same issue as Cantonese that I described in the previous comment. There are entries for Category:Min_Nan_terms_with_IPA_pronunciation, but they are orthographically in a standardized romanization system, probably not what we'd want for G2P purposes. I imagine the treatment for Min Nan would be similar to that for Cantonese: defining custom extraction code for Min Nan.

There's a complication. Min Nan, even when scraped in the way proposed in the previous paragraph, appears to have a sublist issue similar to #329 (and possibly sub-sublists). Here's a screenshot for Min Nan pronunciations, under the "Chinese" section, of the Wiktionary entry 一 "one" (click "more" to show IPA pronunciations):

I guess the TL;DR is... we might want to punt on Min Nan for this round.

lfashby · 2021-01-30

Good point. Bit of an oversight on my part! I guess it is back to the drawing board on Min Nan and it looks like it'll be quite a challenge.

If we have to scrape through the Chinese category twice (phonetic/phonemic) for each of those Min Nan subdialects we'd probably add another 3-4 days to the big scrape runtime.


jacksonllee · 2021-01-30T08:11:33Z

> when I do a release should this be 1.2.0?

@kylebgorman There have been no API-breaking changes since version 1.1.0, correct? No breaking changes either for changes related to CLI flags, right? If that's the case, then version 1.2.0 is right.
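The reasoning above follows the usual semantic versioning rule: a breaking change bumps the major version, backward-compatible additions bump the minor version, and anything else bumps the patch version. A small sketch of that rule, where the helper name `bump` and its flags are hypothetical and not part of wikipron:

```python
def bump(version: str, breaking: bool, new_features: bool) -> str:
    """Returns the next MAJOR.MINOR.PATCH version under semantic versioning."""
    major, minor, patch = (int(part) for part in version.split("."))
    if breaking:
        # Incompatible API or CLI changes reset minor and patch.
        return f"{major + 1}.0.0"
    if new_features:
        # Backward-compatible additions reset patch only.
        return f"{major}.{minor + 1}.0"
    # Backward-compatible fixes.
    return f"{major}.{minor}.{patch + 1}"


# No breaking changes since 1.1.0, but new functionality (the restart support).
next_release = bump("1.1.0", breaking=False, new_features=True)
```

Under those assumptions, `next_release` is "1.2.0", matching the conclusion in the comment.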

kylebgorman · 2021-01-30T14:48:29Z

Thanks for the clarification, Jackson. I think Jackson's proposal to split out the Min Nan change is a good idea, Lucas, if it's not too much trouble.


lfashby · 2021-01-30T16:00:57Z

I removed all the Min Nan changes and updated the description to at last close the Cantonese support ticket.

jacksonllee

LGTM

kylebgorman

LGTM.


lfashby added 15 commits January 17, 2021 17:46

hacked together restart command

c9ec3f8

fixes to restart command

794c582

Merge branch 'master' into restart

b38c43d

potentially finalized restart command

9339f9f

cleanup

04d7525

Merge branch 'master' into restart

11d5f58

logging for restart

ae11286

raw 'rus', 'yue' and 'cmn' data

8565a43

updates and rescrapes Min Nan

094a52d

final restart changes

517fadc

postprocessing on final data

2cc5c48

Merge branch 'master' into restart

e489e91

brings branch up to date, regenerates summary

5c25214

formatting fixes

9c9969a

updates CHANGELOG

a9240c2

lfashby requested a review from kylebgorman January 29, 2021 17:43

kylebgorman reviewed Jan 29, 2021


fixes to scrape.py

bb3071f

jacksonllee reviewed Jan 30, 2021


jacksonllee mentioned this pull request Jan 30, 2021

[nan] Support Min Nan #259

Open

lfashby added 2 commits January 30, 2021 10:40

excise nan changes

c95b115

removes nan changes from CHANGELOG and languages.json

7db87eb

jacksonllee approved these changes Jan 30, 2021


kylebgorman approved these changes Jan 30, 2021


kylebgorman merged commit d70eee5 into CUNY-CL:master Jan 30, 2021

lfashby deleted the restart branch January 30, 2021 19:50
