Latin script: segmenter should support word segmentation #175

Kerollmops · 2023-01-17T19:11:01Z

We should be able to support splitting words by methods other than the text casing. Libraries like instant-segment exist to do that.

redneckbossryan -> redneck, boss and ryan can be extracted
massachusetsinstutitute -> massachusetts, institute
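
As a rough illustration of the kind of splitting described above, here is a minimal sketch of dictionary-based segmentation using dynamic programming over word frequencies. It is not how instant-segment or the Meilisearch tokenizer is implemented; the `WORD_FREQ` table, its counts, and the `segment` function are hypothetical and exist only for this example. Exact dictionary lookup like this would not handle misspelled input such as the second example.

```python
import math

# Hypothetical toy vocabulary with made-up counts (assumption for illustration).
WORD_FREQ = {"red": 40, "neck": 30, "redneck": 20, "boss": 40, "ryan": 20}
TOTAL = sum(WORD_FREQ.values())


def segment(text, max_word_len=20):
    """Return the most probable split of `text` into known words, or None."""
    text = text.lower()
    n = len(text)
    # best[i] holds (log-probability, words) for the best split of text[:i].
    best = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            if best[start] is None:
                continue  # text[:start] cannot be split into known words
            freq = WORD_FREQ.get(text[start:end])
            if freq is None:
                continue  # candidate piece is not in the vocabulary
            score = best[start][0] + math.log(freq / TOTAL)
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [text[start:end]])
    return best[n][1] if best[n] is not None else None


print(segment("redneckbossryan"))
```

With these made-up counts the single word "redneck" scores higher than "red" followed by "neck", so the split comes out as redneck, boss, ryan. Input containing anything outside the vocabulary returns `None`.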

ams-ryanolson · 2023-01-18T01:19:03Z

This issue came from a Discord chat I was having about how search works versus what I expected to happen. It all comes from the belief that a user should be able to search for key segments within the system. So, using the example above:

RedneckBossRyan - red, neck, redneck, boss, Ryan

However, this goes deeper for my use case: I would want someone to be able to search for redneckbo and get results accordingly. So instant-segment does make a lot of sense, but at that point we also have to consider how many languages and other factors come into play.

Would it maybe make sense to create a solution that takes words that are not in a dictionary and stores them in segments? RedneckBossRyan could be stored as word segments, sure, but it could also be stored in 3-4 character chunks.
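
To make the chunk idea concrete, here is a minimal sketch that indexes every 3- and 4-character chunk of each username and uses those chunks to narrow down candidates for a partial query like redneckbo. The function names, the sample usernames, and the chunk sizes are assumptions for illustration; nothing here reflects an existing Meilisearch feature.

```python
from collections import defaultdict


def char_ngrams(token, sizes=(3, 4)):
    """Return every contiguous chunk of the given sizes, lowercased."""
    token = token.lower()
    return [token[i:i + n] for n in sizes for i in range(len(token) - n + 1)]


def build_index(names, sizes=(3, 4)):
    """Map each chunk to the set of names that contain it."""
    index = defaultdict(set)
    for name in names:
        for gram in char_ngrams(name, sizes):
            index[gram].add(name)
    return index


def search(index, names, query, sizes=(3, 4)):
    """Return names containing `query`, using chunks to narrow the candidates."""
    query = query.lower()
    grams = char_ngrams(query, sizes)
    if grams:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # Query is shorter than the smallest chunk: scan every name instead.
        candidates = set(names)
    # Sharing all chunks does not guarantee a contiguous match, so confirm it.
    return sorted(n for n in candidates if query in n.lower())


# Hypothetical sample usernames.
NAMES = ["RedneckBossRyan", "BossBaby", "RyanTheRed"]
INDEX = build_index(NAMES)
print(search(INDEX, NAMES, "redneckbo"))
```

The trade-off is index size: every username contributes roughly two chunks per character, which is the cost of matching partial input without a dictionary for any particular language.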

ManyTheFish · 2023-01-18T09:22:39Z

Sorry if my question sounds a bit simple, but, why don't you add spaces when searching in Meilisearch? 🤔

ams-ryanolson · 2023-01-18T16:07:33Z

It's a valid question. The issue is usernames. People type these long-winded usernames, so searching becomes very "challenging".

ManyTheFish · 2023-01-31T10:05:24Z

Mmmmh, I see. Without the possibility to upload a custom dictionary, your use case can't be solved completely.
However, solving this issue (#129) could enhance your search experience. 🤔

If a contributor solves it, you could expect better Meilisearch behavior in a future version. 😃

pingiun · 2023-03-09T13:14:18Z

I'm investigating Meilisearch for my employer, whose main customers will be Dutch. For our use case, word segmentation is very important. Dutch (and German too) uses a lot of compound words, so searching for a part of a word should work correctly. In my testing, Meilisearch (and Typesense, for that matter) was unable to find documents containing "lichtgeel" and "zachtgeel" when searching for "geel", even though specific types of yellow should come up when searching for yellow in general.

Instant-segment may not work in our case, as it is meant for English text, for which corpora exist that have the bigrams separated. In Dutch, the correct spelling joins the two words into a single compound.

jakob-ledermann · 2024-02-09T15:22:09Z

Another German here with the same problem.
At least in German, compound words are also hyphenated at the word boundary, and iirc that word boundary is to be preferred.
So maybe https://github.com/typst/hypher/tree/main/patterns can serve as a source for these boundaries.

aersam · 2024-02-09T19:09:52Z

In the source of that repo, another repo is linked which seems to have quite a good database of all German compound words: http://repo.or.cz/w/wortliste.git / https://repo.or.cz/wortliste.git/blob/HEAD:/wortliste

In general I think a database of words would be the way to go (at least for my use cases)
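
A minimal sketch of what the word-database approach could look like at indexing time, assuming a word list is available: each compound is split into known words, and both the compound and its parts are indexed, so a search for "geel" also finds "lichtgeel". The `WORDS` set, the sample documents, and all function names are hypothetical. A real word list would come from a resource such as the one linked above, and real Dutch or German compounds would also need handling for linking letters, which this sketch ignores.

```python
from collections import defaultdict

# Hypothetical miniature word list (assumption for illustration).
WORDS = {"licht", "zacht", "geel", "groen"}


def _split(rest, words, min_part):
    """Split `rest` entirely into known words, trying the longest head first."""
    if not rest:
        return []
    for end in range(len(rest), min_part - 1, -1):
        head = rest[:end]
        if head in words:
            tail = _split(rest[end:], words, min_part)
            if tail is not None:
                return [head] + tail
    return None  # no complete split exists


def split_compound(word, words=WORDS, min_part=3):
    """Return the parts of a compound, or [word] if it cannot be fully split."""
    word = word.lower()
    parts = _split(word, words, min_part)
    return parts if parts else [word]


def index_terms(text, words=WORDS):
    """Return each token of `text` together with its compound parts."""
    terms = set()
    for token in text.lower().split():
        terms.add(token)
        terms.update(split_compound(token, words))
    return terms


# Hypothetical documents, indexed as term -> set of document ids.
DOCS = {1: "lichtgeel behang", 2: "zachtgeel kussen", 3: "groen gordijn"}
INDEX = defaultdict(set)
for doc_id, text in DOCS.items():
    for term in index_terms(text):
        INDEX[term].add(doc_id)

print(sorted(INDEX["geel"]))
```

Keeping the original compound in the index alongside its parts means an exact search for "lichtgeel" still matches, while "geel" now reaches both yellow documents.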

ManyTheFish changed the title ~~Latin segmenter should support word segmentation~~ Latin script: segmenter should support word segmentation Jan 31, 2023
