Compile from the "キー" instead of the "語彙素" #2

Open
epistularum opened this issue Aug 22, 2022 · 4 comments

@epistularum

The 語彙素 (lexeme) reading and writing are generated from the キー (key) and aren't accurate.

A simple example: the writing "川蝉" doesn't appear a single time in any of the source material (i.e. the キー), yet it ranks higher than 翡翠, which appears more than a hundred times in the キー. This happens because the 語彙素 is merely generated from the キー for easier grouping, I suppose.
image
image

Generally, when only the writing is present in the キー, it will extrapolate the reading and store both in the 語彙素. When only the reading is present in the キー, it will extrapolate the writing. But sometimes oddities like this happen, adding to the fact that the 語彙素 is wildly inaccurate:
image
(the キー and the corresponding sentence are the first two columns)

@toasted-nutbread
Owner

It's been a while since I implemented this and I don't remember all the details of what was in the data. What change are you proposing be made to the Yomichan dictionary data that this script generates, or to how it's used in Yomichan?

@epistularum
Author

epistularum commented Sep 11, 2022

I'm proposing that the generated dictionary use the original data (キー) instead of the interpreted lexeme (語彙素).

The way BCCWJ works is that they analyze large amounts of text, break up the material with something like MeCab, call each resulting fragment a "キー", and then associate every キー with a 語彙素.
That way, words that are written in different ways get grouped into the same 語彙素 (for instance, the キー 後ろ盾 and 後ろ楯 both get grouped into the 語彙素 後ろ盾).

From what I understand, this dictionary is compiled using the 語彙素 instead of the キー, which makes very little sense for a frequency list, since it reflects the interpreted frequency and not the frequency of what was actually written in the source.
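
To make the difference concrete, here is a minimal Python sketch that counts the same tokens once by キー and once by 語彙素. The token pairs are made-up toy data, not BCCWJ data, and the mapping of 翡翠 and カワセミ to the 語彙素 川蝉 is an assumption for illustration only. The helper names `count_by_key` and `count_by_lexeme` are hypothetical.

```python
from collections import Counter

# Hypothetical (キー, 語彙素) pairs standing in for tokenized corpus output.
# Toy data for illustration only; the real corpus fields may differ.
tokens = [
    ("後ろ盾", "後ろ盾"),
    ("後ろ盾", "後ろ盾"),
    ("後ろ楯", "後ろ盾"),
    ("翡翠", "川蝉"),      # assumed grouping, see the note above
    ("翡翠", "川蝉"),
    ("カワセミ", "川蝉"),
]


def count_by_key(pairs):
    """Count how often each written form (キー) occurs in the source."""
    return Counter(key for key, _ in pairs)


def count_by_lexeme(pairs):
    """Count how often each grouped lexeme (語彙素) occurs."""
    return Counter(lexeme for _, lexeme in pairs)


by_key = count_by_key(tokens)
by_lexeme = count_by_lexeme(tokens)

# 川蝉 never appears as a written form, yet it gets every grouped occurrence.
print("川蝉:", by_key["川蝉"], "by キー,", by_lexeme["川蝉"], "by 語彙素")
print("翡翠:", by_key["翡翠"], "by キー,", by_lexeme["翡翠"], "by 語彙素")
```

With this toy data, a list built from the 語彙素 ranks 川蝉 above 翡翠, while a list built from the キー does the opposite.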

@toasted-nutbread
Owner

Is that information present in the .tsv files found on https://clrd.ninjal.ac.jp/bccwj/en/freq-list.html?

@epistularum
Author

epistularum commented Sep 11, 2022

I've been using 中納言 and assumed that it was using the same data as the published frequency lists. It looks like you have to pay for the full version of the corpus to have access to the キー.
https://clrd.ninjal.ac.jp/bccwj/en/dvd-index.html
It's barely visible, but you can see both the lemma and the origText, which I believe are what 中納言 would call the 語彙素 and the キー, respectively.
https://clrd.ninjal.ac.jp/bccwj/en/images/dvd-index/sashie1.png

It's a cool 200,000 yen
image
