Compile from the "キー" instead of the "語彙素" #2

epistularum · 2022-08-22T23:33:45Z

The 語彙素 reading and writing are generated from the キー and aren't accurate.

A simple example: the writing "川蝉" doesn't appear a single time in any of the source material (AKA the キー), yet it ranks higher than 翡翠, which appears more than a hundred times in the キー. This happens because the 語彙素 is merely generated from the キー for easier grouping, I suppose.

Generally, when only the writing is present in the キー, it will extrapolate only the reading and store both in the 語彙素. When only the reading is present in the キー, it will extrapolate only the writing. But sometimes oddities like this happen, adding to the fact that the 語彙素 is widely inaccurate:

(the キー and the corresponding sentence are the first two columns)

toasted-nutbread · 2022-09-11T03:06:16Z

It's been a while since I implemented this and I don't remember all the details of what was in the data. What change are you proposing be made to the Yomichan dictionary data that this script generates, or to how it's used in Yomichan?

epistularum · 2022-09-11T12:15:56Z

I'm proposing that the generated dictionary use the original data (キー) instead of the interpreted lexeme (語彙素).

The way BCCWJ works is that they analyze large amounts of text, break up the material with something like MeCab, call each fragment a "キー", and then associate every キー with a 語彙素.
That way, words that are written in different ways get grouped into the same 語彙素 (for instance, the キー 後ろ盾 and 後ろ楯 both get grouped into the 語彙素 後ろ盾).

From what I understand, this dictionary is compiled using the 語彙素 instead of the キー, which makes very little sense for a frequency list, since it reflects the interpreted frequency and not the actual source frequency.
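
To make the difference concrete, here is a minimal sketch of counting by キー versus counting by 語彙素. The token list and its counts are invented for illustration and are not taken from BCCWJ; the names `tokens`, `key_freq`, and `lexeme_freq` are hypothetical, and this is not the script this repository uses.

```python
from collections import Counter

# Hypothetical (キー, 語彙素) pairs, as a tokenizer might emit them.
# These tokens and their counts are made up for illustration only.
tokens = [
    ("後ろ盾", "後ろ盾"),
    ("後ろ盾", "後ろ盾"),
    ("後ろ楯", "後ろ盾"),
]

# Frequency by surface form (キー): what actually appears in the text.
key_freq = Counter(key for key, _ in tokens)

# Frequency by lexeme (語彙素): variant spellings are merged into one entry.
lexeme_freq = Counter(lexeme for _, lexeme in tokens)

print(key_freq)
print(lexeme_freq)
```

With the per-キー count, each spelling keeps its own frequency. With the per-語彙素 count, every occurrence is credited to the single spelling chosen for the lexeme, so a spelling can end up ranked by occurrences of other spellings.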

toasted-nutbread · 2022-09-11T14:27:10Z

Is that information present in the .tsv files found on https://clrd.ninjal.ac.jp/bccwj/en/freq-list.html?

epistularum · 2022-09-11T16:11:30Z

I've been using 中納言 and assumed that it was using the same data as the published lists. It looks like you have to pay for the full version of the corpus to get access to the キー.
https://clrd.ninjal.ac.jp/bccwj/en/dvd-index.html
It's barely visible, but you can see both the lemma and the origText, which I believe are what 中納言 would call the 語彙素 and the キー:
https://clrd.ninjal.ac.jp/bccwj/en/images/dvd-index/sashie1.png

It's a cool 200,000 yen
