[arm] finding IPA transcriptions outside of the Pronunciation block #470
Comments
It basically finds anything in the pronunciation section in // or []. TBF it is bizarre to be giving the pronunciation of an unrelated Russian word here. I'm going to edit the entry. The Wiktionary people have taken absolutely zero interest in our project, so I don't think there's a demand outside of WikiPron developers for this information.
But then this is a glitch, because the Russian word was not under the pronunciation section but under a separate heading. The original example is gone now, but another example is գրաբար. The usage notes explain a pronunciation tidbit. It's in a separate section, but it's getting scraped too.
Yes, it was a surprise to me that it did that all the same.
On Mon, Nov 7, 2022 at 2:34 PM Hossep Dolatian wrote:
It basically finds anything in the pronunciation section in // or [].
But then this is a glitch though because the Russian word was *not* under
the pronunciation section but under a separate heading. The original
example is gone now, but another example is գրաբար
<https://en.m.wiktionary.org/wiki/%D5%A3%D6%80%D5%A1%D5%A2%D5%A1%D6%80>.
The usage notes explain a pronunciation tidbit. It's in a separate section,
but it's getting scraped too.
WikiPron also found IPA transcriptions that were in the etymology section, before the pronunciation section. This word had a transcription there until I found and removed it (via the 'fake dialect' trick from the original post). This makes me think that WikiPron is looking for IPA anywhere in the entry, and not just in the pronunciation box. I'm not sure if that's an error (because the code isn't designed to go outside the pronunciation box) or a missing feature (because the code is designed to go outside the pronunciation box).
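To make the two behaviours concrete, here is a minimal sketch of "IPA anywhere in the entry" versus "IPA only under the Pronunciation heading". This is not WikiPron's actual code. The markup is a simplified toy stand-in (real Wiktionary HTML is more complicated), the Etymology transcription is made up, and the names `ipa_anywhere` and `ipa_in_section` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Toy, simplified stand-in for a single language entry (an assumption,
# not real Wiktionary markup). The Etymology transcription is invented.
ENTRY = """
<div>
  <h3>Etymology</h3>
  <p>Compare <span class="IPA">/ɛtɪm/</span>.</p>
  <h3>Pronunciation</h3>
  <ul><li><span class="IPA">[kɑɾkɑndɑk]</span></li></ul>
  <h3>Usage notes</h3>
  <p>Not to be confused with <span class="IPA">[pɛrɑʃˈki]</span>.</p>
</div>
"""

HEADINGS = {"h2", "h3", "h4", "h5"}


def ipa_spans(element):
    """Return the text of every IPA span at or below `element`."""
    return [s.text for s in element.iter("span") if s.get("class") == "IPA"]


def ipa_anywhere(root):
    """Collect IPA from the whole entry, whatever section it sits in."""
    return ipa_spans(root)


def ipa_in_section(root, title="Pronunciation"):
    """Collect IPA only from blocks between the `title` heading and the next heading."""
    found = []
    inside = False
    for child in root:  # direct children, in document order
        if child.tag in HEADINGS:
            inside = (child.text or "").strip() == title
            continue
        if inside:
            found.extend(ipa_spans(child))
    return found


root = ET.fromstring(ENTRY.strip())
print(ipa_anywhere(root))    # all three transcriptions
print(ipa_in_section(root))  # only the one under Pronunciation
```

If the scraper behaves like `ipa_anywhere`, the Etymology and Usage notes transcriptions come along for the ride; something shaped like `ipa_in_section` would leave them out, at the cost of depending on how consistently entries use the heading.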
For the word կարկանդակ, WikiPron finds the correct pronunciation, [kɑɾkɑndɑk], but it also finds the IPA transcriptions of other words in the Usage Notes section, like [pɛrɑʃˈki]. I'm not sure if this is an unavoidable glitch on WikiPron's side, or if it's a glitch that could be fixed from the Wiktionary side.
It seems that what's going on is that WikiPron is just finding any IPA transcription that's inside the Armenian entry, even if it's not associated with a dialect. E.g., if you run
wikipron arm --dialect='ladygaga' --no-skip-parens --narrow > randos.tsv
you get a handful of IPA transcriptions that aren't associated with the pre-defined dialects. These are either a) IPA transcriptions in the Usage notes or etymology, or b) IPA transcriptions for non-standard dialects. This isn't a problem for using WikiPron on a specific language (because the person can just filter those out manually). But I wonder if this glitch causes any other funny business for the other languages.
Side note: I wonder if there have been enough situations where people had to fix Wiktionary entries in order to optimize WikiPron's scraper (like on the various closed issues). If so, perhaps a "tips and tricks" page would be helpful down the line?
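For the "filter those out manually" part, a rough sketch of one way to do it: treat the fake-dialect scrape as a list of stray rows and subtract it from the regular scrape. This assumes both files use a word<TAB>pronunciation layout and that a stray row appears identically in both. The function names and the sample rows are hypothetical stand-ins, not real scrape output.

```python
def read_pairs(lines):
    """Parse (word, pronunciation) pairs from tab-separated lines."""
    pairs = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:  # skip blank or malformed lines
            pairs.append((fields[0], fields[1]))
    return pairs


def drop_strays(all_pairs, stray_pairs):
    """Keep only the pairs that do not show up in the stray list."""
    strays = set(stray_pairs)
    return [pair for pair in all_pairs if pair not in strays]


# Made-up rows standing in for a regular scrape and for randos.tsv.
# With real files you would pass open(path, encoding="utf-8") instead.
all_lines = [
    "կարկանդակ\tk ɑ ɾ k ɑ n d ɑ k\n",
    "կարկանդակ\tp ɛ r ɑ ʃ k i\n",
]
stray_lines = [
    "կարկանդակ\tp ɛ r ɑ ʃ k i\n",
]

cleaned = drop_strays(read_pairs(all_lines), read_pairs(stray_lines))
for word, pron in cleaned:
    print(f"{word}\t{pron}")
```

Note that this also throws away category b), the transcriptions for non-standard dialects, so it only makes sense if those aren't wanted either.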