-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Finishes scrape, adds restart command #340
Conversation
Hi @jacksonllee, if you have some free time your feedback on the changes I've made to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor comments but other than that LGTM, this is great.
Once this is in are we ready to release?
wikipron/scrape.py
Outdated
@@ -1,11 +1,13 @@ | |||
import re | |||
import unicodedata | |||
from typing import cast | |||
from typing import cast, Dict, Any |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very minor note but could you sort these lexicographically?
wikipron/scrape.py
Outdated
|
||
import pkg_resources | ||
import requests | ||
import requests_html | ||
|
||
import time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a built-in, not 3rd party, module so it should be on line 2.
Also there should be a blank line before the typing imports (but that's a "drive-by" issue that you didn't cause...it was already like that).
Could you add "Closes #253" to the description here, Lucas? |
I would love to hear from Jackson too before we do this. |
I think I've addressed your comments suitably. What's the status of #259 given these changes? Should we leave it around to remind us to rework the dash stuff or file a separate issue? I'd say we are good for a release after this. I'm also in favor of waiting for Jackson before merging this. |
I'm tied up at work at the moment, but can take a look tonight. On making a new release, Kyle, you may want to take over #235 (I think most if not all items listed there are done, but someone needs to check), for the version bump etc. |
Will do re: #235.
…On Fri, Jan 29, 2021 at 3:13 PM Jackson L. Lee ***@***.***> wrote:
I'm tied up at work at the moment, but can take a look tonight.
On making a new release, Kyle, you may want to take over #235
<https://github.com/kylebgorman/wikipron/issues/235> (I *think* most if
not all items listed there are done, but someone needs to check), for the
version bump etc.
—
You are receiving this because your review was requested.
Reply to this email directly, view it on GitHub
<https://github.com/kylebgorman/wikipron/pull/340#issuecomment-770025198>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABG4OKZXXGRD32QOG7JPWDS4MJFJANCNFSM4WZGXVIA>
.
|
@jacksonllee when I do a release should this be 1.2.0? I don't really feel comfortable making semantic versioning decisions like this... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The updated restart code is great! For the scraped data, I would recommend dropping Min Nan (as well as the corresponding changes in _skip_word
) in this pull request -- details in my comments.
@@ -41,11 +41,12 @@ | |||
| [TSV](tsv/bul_phonetic.tsv) | bul | Bulgarian | Bulgarian | True | Phonetic | 6,377 | | |||
| [TSV](tsv/bur_phonemic.tsv) | bur | Burmese | Burmese | False | Phonemic | 4,636 | | |||
| [TSV](tsv/bur_phonemic_filtered.tsv) | bur | Burmese | Burmese | False | Phonemic_filtered | 4,631 | | |||
| [TSV](tsv/yue_phonemic.tsv) | yue | Yue Chinese | Cantonese | False | Phonemic | 87,961 | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pleasantly surprised to see Cantonese data is finally in. Does this mean "Closes #57" should also be included in the pull request description?
I took a look at the scraped data, and (as a native speaker myself) I can confirm it's what I'd expect, with Chinese characters on the orthographic side. The data was scraped from the Cantonese custom extraction code checked in at #277, i.e., the data actually came from the entries under the Category:Chinese_terms_with_IPA_pronunciation pages (not the Category:Cantonese_terms_with_IPA_pronunciation pages, where orthography is represented by the standardized Jyutping romanization system instead of Chinese characters -- not really useful, and probably too easy for G2P!), but the extraction code pointed to the embedded Cantonese pronunciation instead. I'm bringing all this up because this is in contrast with Min Nan below...
data/README.md
Outdated
@@ -193,6 +194,7 @@ | |||
| [TSV](tsv/okm_phonemic.tsv) | okm | Middle Korean (10th-16th cent.) | Middle Korean | False | Phonemic | 334 | | |||
| [TSV](tsv/gml_phonemic.tsv) | gml | Middle Low German | Middle Low German | True | Phonemic | 170 | | |||
| [TSV](tsv/wlm_phonemic.tsv) | wlm | Middle Welsh | Middle Welsh | True | Phonemic | 144 | | |||
| [TSV](tsv/nan_phonetic.tsv) | nan | Min Nan Chinese | Min Nan | True | Phonetic | 431 | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I must have missed #259 coming through. Min Nan is another Chinese language with the same issue as Cantonese that I've described in the previous comment above. There are entries for Category:Min_Nan_terms_with_IPA_pronunciation, but they are orthographically in a standardized romanization system, probably not what we'd want for G2P purposes. I imagine the treatment for Min Nan would be similar to that for Cantonese, by defining custom extraction code for Min Nan.
There's a complication. Min Nan, even when scraped in the way proposed in the previous paragraph, appears to have a similar sublist issue as #329 (and possibly sub-sublists). Here's a screenshot for Min Nan pronunciations, under the "Chinese" section, of the Wiktionary entry 一 "one" (click "more" to show IPA pronunciations):
I guess the TL;DR is... we might want to punt on Min Nan for this round.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Bit of an oversight on my part! I guess it is back to the drawing board on Min Nan and it looks like it'll be quite a challenge.
If we have to scrape through the Chinese category twice (phonetic/phonemic) for each of those Min Nan subdialects we'd probably add another 3-4 days to the big scrape runtime.
@kylebgorman There have been no API-breaking changes since version 1.1.0, correct? No breaking changes either for changes related to CLI flags, right? If that's the case, then version 1.2.0 is right. |
Thanks for clarification Jackson.
I think Jackson's proposal to split out the Min Nan change is a good idea,
Lucas, if it's not too much trouble.
…On Sat, Jan 30, 2021 at 3:11 AM Jackson L. Lee ***@***.***> wrote:
when I do a release should this be 1.2.0?
@kylebgorman <https://github.com/kylebgorman> There have been no
API-breaking changes since version 1.1.0, correct? No breaking changes
either for changes related to CLI flags, right? If that's the case, then
version 1.2.0 is right.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<https://github.com/kylebgorman/wikipron/pull/340#issuecomment-770175790>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABG4OLRHCYO7ZOEVRPDBJ3S4O5MDANCNFSM4WZGXVIA>
.
|
I removed all the Min Nan changes and updated the description to at last close the Cantonese support ticket. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Unreleased
inCHANGELOG.md
to reflect the changes in code or data.This adds Cantonese, Russian, and Chinese data. Also adds Min Nan data, which necessitated modifying
_skip_word()
inwikipron/scrape.py
and updatingtest_wikipron/test_scrape.py
slightly.Finally this adds restart functionality to the wikipron module, which was used to scrape the data for the aforementioned languages. The addition of this restart functionality required some simplification of
src/scrape.py
, and modifications toconfig.py
andwikipron/scrape.py
.This restart functionality assumes that we only ever experience timeouts/connection errors when trying to connect to a new headword (the get request in
_scrape_once()
) and therefore it grabs thesortkey
associated with each headword before the get request, and then restarts the scrape from thatsortkey
should the get request fail. In this way, it never prints duplicate entries.Based on my testing, this assumption of where we experience timeouts appears to be correct. I never experienced any connection loss in the middle of scraping a headword (say, while scraping the 2nd of 4 phonetic transcriptions for a headword) or from our get request that connects to the Wiktionary API. I suppose that doesn't mean either of those are impossible (though I think the former might be), but should we find evidence of them we can revise the restart functionality.
Closes #253
Closes #57