Adopt uax29 segmenter
Replaces blevesearch/segment with uax29. ~2x throughput improvement. Refactors allocations, which are now ~O(1).

Add tests & multilingual sample text to verify that behavior matches the previous segmenter. Known differences from the previous segmenter:
- The original segmenter splits runs of spaces into separate tokens; uax29 concatenates runs into a single token.
- The original segmenter doesn’t handle emoji skin tone modifiers; the new one does, owing to its newer Unicode version.
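
The whitespace difference can be illustrated with a minimal standard-library sketch. This is not code from either segmenter: `tokenize` and its `mergeSpaces` flag are hypothetical names used only for illustration, and the function does not implement the UAX #29 word-boundary rules (it only separates whitespace from non-whitespace).

```go
package main

import (
	"fmt"
	"unicode"
)

// tokenize is a hypothetical helper, not part of blevesearch/segment or uax29.
// It splits text into whitespace tokens and non-whitespace tokens.
// With mergeSpaces == false, every whitespace rune is its own token
// (the old behavior described above). With mergeSpaces == true, a run
// of whitespace becomes one token (the new behavior described above).
func tokenize(text string, mergeSpaces bool) []string {
	var tokens []string
	runes := []rune(text)
	for i := 0; i < len(runes); {
		start := i
		if unicode.IsSpace(runes[i]) {
			i++ // always consume at least one whitespace rune
			for mergeSpaces && i < len(runes) && unicode.IsSpace(runes[i]) {
				i++ // extend the token across the rest of the run
			}
		} else {
			for i < len(runes) && !unicode.IsSpace(runes[i]) {
				i++ // consume a run of non-whitespace runes
			}
		}
		tokens = append(tokens, string(runes[start:i]))
	}
	return tokens
}

func main() {
	text := "hello   world" // three spaces between the words
	fmt.Printf("%q\n", tokenize(text, false)) // 5 tokens: each space separate
	fmt.Printf("%q\n", tokenize(text, true))  // 3 tokens: spaces merged
}
```

Callers that count tokens or index into token positions would see different results for the same input, so tests comparing the two segmenters need to account for this.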
clipperhouse committed Nov 6, 2023
1 parent 80d9b18 commit 4bfba33
Showing 5 changed files with 562 additions and 106 deletions.