Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Finishes scrape, adds restart command #340

Merged
merged 18 commits into from
Jan 30, 2021
Merged

Conversation

lfashby
Copy link
Collaborator

@lfashby lfashby commented Jan 29, 2021

  • Updated Unreleased in CHANGELOG.md to reflect the changes in code or data.

This adds Cantonese, Russian, and Chinese data. Also adds Min Nan data, which necessitated modifying _skip_word() in wikipron/scrape.py and updating test_wikipron/test_scrape.py slightly.

Finally this adds restart functionality to the wikipron module, which was used to scrape the data for the aforementioned languages. The addition of this restart functionality required some simplification of src/scrape.py, and modifications to config.py and wikipron/scrape.py.
This restart functionality assumes that we only ever experience timeouts/connection errors when trying to connect to a new headword (the get request in _scrape_once()) and therefore it grabs the sortkey associated with each headword before the get request, and then restarts the scrape from that sortkey should the get request fail. In this way, it never prints duplicate entries.

Based on my testing, this assumption of where we experience timeouts appears to be correct. I never experienced any connection loss in the middle of scraping a headword (say, while scraping the 2nd of 4 phonetic transcriptions for a headword) or from our get request that connects to the Wiktionary API. I suppose that doesn't mean either of those are impossible (though I think the former might be), but should we find evidence of them we can revise the restart functionality.

Closes #253
Closes #57

@lfashby lfashby requested a review from kylebgorman January 29, 2021 17:43
@lfashby
Copy link
Collaborator Author

lfashby commented Jan 29, 2021

Hi @jacksonllee, if you have some free time your feedback on the changes I've made toconfig.py and wikipron/scrape.py in order to implement the restart functionality would be most appreciated.

Copy link
Collaborator

@kylebgorman kylebgorman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor comments but other than that LGTM, this is great.

Once this is in are we ready to release?

@@ -1,11 +1,13 @@
import re
import unicodedata
from typing import cast
from typing import cast, Dict, Any
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor note but could you sort these lexicographically?


import pkg_resources
import requests
import requests_html

import time
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a built-in, not 3rd party, module so it should be on line 2.

Also there should be a blank line before the typing imports (but that's a "drive-by" issue that you didn't cause...it was already like that).

wikipron/scrape.py Outdated Show resolved Hide resolved
wikipron/scrape.py Show resolved Hide resolved
@kylebgorman
Copy link
Collaborator

Could you add "Closes #253" to the description here, Lucas?

@kylebgorman
Copy link
Collaborator

I would love to hear from Jackson too before we do this.

@lfashby
Copy link
Collaborator Author

lfashby commented Jan 29, 2021

I think I've addressed your comments suitably. What's the status of #259 given these changes? Should we leave it around to remind us to rework the dash stuff or file a separate issue?

I'd say we are good for a release after this. I'm also in favor of waiting for Jackson before merging this.

@kylebgorman
Copy link
Collaborator

kylebgorman commented Jan 29, 2021 via email

@jacksonllee
Copy link
Collaborator

I'm tied up at work at the moment, but can take a look tonight.

On making a new release, Kyle, you may want to take over #235 (I think most if not all items listed there are done, but someone needs to check), for the version bump etc.

@kylebgorman
Copy link
Collaborator

kylebgorman commented Jan 29, 2021 via email

@kylebgorman
Copy link
Collaborator

@jacksonllee when I do a release should this be 1.2.0? I don't really feel comfortable making semantic versioning decisions like this...

Copy link
Collaborator

@jacksonllee jacksonllee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The updated restart code is great! For the scraped data, I would recommend dropping Min Nan (as well as the corresponding changes in _skip_word) in this pull request -- details in my comments.

wikipron/scrape.py Show resolved Hide resolved
data/src/scrape.py Show resolved Hide resolved
@@ -41,11 +41,12 @@
| [TSV](tsv/bul_phonetic.tsv) | bul | Bulgarian | Bulgarian | True | Phonetic | 6,377 |
| [TSV](tsv/bur_phonemic.tsv) | bur | Burmese | Burmese | False | Phonemic | 4,636 |
| [TSV](tsv/bur_phonemic_filtered.tsv) | bur | Burmese | Burmese | False | Phonemic_filtered | 4,631 |
| [TSV](tsv/yue_phonemic.tsv) | yue | Yue Chinese | Cantonese | False | Phonemic | 87,961 |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pleasantly surprised to see Cantonese data is finally in. Does this mean "Closes #57" should also be included in the pull request description?

I took a look at the scraped data, and (as a native speaker myself) I can confirm it's what I'd expect, with Chinese characters on the orthographic side. The data was scraped from the Cantonese custom extraction code checked in at #277, i.e., the data actually came from the entries under the Category:Chinese_terms_with_IPA_pronunciation pages (not the Category:Cantonese_terms_with_IPA_pronunciation pages, where orthography is represented by the standardized Jyutping romanization system instead of Chinese characters -- not really useful, and probably too easy for G2P!), but the extraction code pointed to the embedded Cantonese pronunciation instead. I'm bringing all this up because this is in contrast with Min Nan below...

data/README.md Outdated
@@ -193,6 +194,7 @@
| [TSV](tsv/okm_phonemic.tsv) | okm | Middle Korean (10th-16th cent.) | Middle Korean | False | Phonemic | 334 |
| [TSV](tsv/gml_phonemic.tsv) | gml | Middle Low German | Middle Low German | True | Phonemic | 170 |
| [TSV](tsv/wlm_phonemic.tsv) | wlm | Middle Welsh | Middle Welsh | True | Phonemic | 144 |
| [TSV](tsv/nan_phonetic.tsv) | nan | Min Nan Chinese | Min Nan | True | Phonetic | 431 |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must have missed #259 coming through. Min Nan is another Chinese language with the same issue as Cantonese that I've described in the previous comment above. There are entries for Category:Min_Nan_terms_with_IPA_pronunciation, but they are orthographically in a standardized romanization system, probably not what we'd want for G2P purposes. I imagine the treatment for Min Nan would be similar to that for Cantonese, by defining custom extraction code for Min Nan.

There's a complication. Min Nan, even when scraped in the way proposed in the previous paragraph, appears to have a similar sublist issue as #329 (and possibly sub-sublists). Here's a screenshot for Min Nan pronunciations, under the "Chinese" section, of the Wiktionary entry 一 "one" (click "more" to show IPA pronunciations):

Screen Shot 2021-01-30 at 1 59 34 AM

I guess the TL;DR is... we might want to punt on Min Nan for this round.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Bit of an oversight on my part! I guess it is back to the drawing board on Min Nan and it looks like it'll be quite a challenge.

If we have to scrape through the Chinese category twice (phonetic/phonemic) for each of those Min Nan subdialects we'd probably add another 3-4 days to the big scrape runtime.

wikipron/scrape.py Outdated Show resolved Hide resolved
@jacksonllee
Copy link
Collaborator

when I do a release should this be 1.2.0?

@kylebgorman There have been no API-breaking changes since version 1.1.0, correct? No breaking changes either for changes related to CLI flags, right? If that's the case, then version 1.2.0 is right.

@kylebgorman
Copy link
Collaborator

kylebgorman commented Jan 30, 2021 via email

@lfashby
Copy link
Collaborator Author

lfashby commented Jan 30, 2021

I removed all the Min Nan changes and updated the description to at last close the Cantonese support ticket.

Copy link
Collaborator

@jacksonllee jacksonllee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@kylebgorman kylebgorman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

wikipron/scrape.py Outdated Show resolved Hide resolved
@kylebgorman kylebgorman merged commit d70eee5 into CUNY-CL:master Jan 30, 2021
@lfashby lfashby deleted the restart branch January 30, 2021 19:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Restart command with partial extracted data [yue] Cantonese support
3 participants