explore possibility of using a compressed fasttext wasm locally to generate embeddings instead of vector library #9

Open
Bluebie opened this issue Mar 22, 2021 · 0 comments

Comments

Bluebie commented Mar 22, 2021

This may be able to cut server disk use by around half a gigabyte and could provide better results (every word would have an embedding, the keywords system could die, and potentially less dataset patching would be needed for words like "DEEEEEEEAF").

fasttext.cc has info on how to build fastText to a wasm binary, and on how to compress the model down to just a few megabytes.
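For reference, a rough sketch of what generating an embedding in the browser could look like, assuming the fastText WebAssembly module described on fasttext.cc; the exact import shape, the loadModel/getWordVector calls, and the model path used here are assumptions to verify against the actual wasm bindings:

```ts
// Sketch only: assumes the fastText wasm build (fasttext.js + its .wasm file)
// is served alongside the app, and that a quantized model of a few megabytes
// (e.g. an .ftz file) is hosted at a known URL.
import { FastText, addOnPostRun } from './fasttext.js'

addOnPostRun(() => {
  const ft = new FastText()
  // hypothetical model path; loadModel fetches it and resolves with a model object
  ft.loadModel('/models/search-embeddings.ftz').then((model) => {
    // getWordVector returns a vector for any word, including ones never seen
    // in training (via subword information), which is what would let the
    // keywords system and most dataset patching go away
    const vector = model.getWordVector('DEEEEEEEAF')
    console.log(vector)
  })
})
```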

Upsides:

  • server storage requirements dramatically reduced
  • every word gets an embedding
  • slightly improved privacy traits
  • could possibly remove embeddings from search index data and generate them locally when the index is parsed. the index could be as simple as a result index number plus a list of keywords and hashtags (see the sketch after this list)
  • By ditching vector-library, server disk usage becomes mainly defined by the locally hosted videos in the video cache; this could make it possible to run find-sign off services like free-tier Glitch for smaller indexes, or for larger indexes too if videos are hotlinked or rehosted elsewhere (some cloud video provider maybe?)
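As a very rough illustration of how small index entries could get if embeddings are regenerated client-side, something like the following (the type and field names are hypothetical, not existing find-sign structures):

```ts
// Hypothetical stripped-down index entry: no embedding vectors stored at all,
// just enough text to regenerate them locally when the index is parsed.
interface IndexEntry {
  result: number       // result index number pointing into the results data
  keywords: string[]   // words to embed locally with the fastText model
  hashtags: string[]   // hashtags, kept as plain strings for exact matching
}

// on index parse, embeddings would be rebuilt locally, roughly:
//   entry.keywords.map((word) => model.getWordVector(word))
```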

Downsides:

  • initial load-to-interactive payload goes from < 1 MB to a few megabytes, to fetch the fastText model
  • devices which do not support wasm may be an issue (defer to server execution? improve the old cgi-bin fallback? see the feature-detection sketch after this list)
  • unknown impact on older devices: will ancient Android phones cope with running fastText locally?
  • even after ditching the embedding cache in the search index, the client is likely to be larger and more resource-intensive to load and run.
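For the no-wasm case, a minimal feature-detection sketch; the capability check is standard, but the model wrapper type and the server fallback endpoint here are placeholders, not existing find-sign code:

```ts
// detect WebAssembly support; fall back to server-side embedding when absent
function supportsWasm(): boolean {
  try {
    return typeof WebAssembly === 'object' &&
      typeof WebAssembly.instantiate === 'function'
  } catch {
    return false
  }
}

// localModel stands in for the fastText wasm model loaded at startup;
// the /embed endpoint below is a purely hypothetical server-side fallback
async function embed(
  localModel: { getWordVector(word: string): Float32Array } | null,
  word: string
): Promise<Float32Array> {
  if (supportsWasm() && localModel) {
    return localModel.getWordVector(word)
  }
  const response = await fetch(`/embed?word=${encodeURIComponent(word)}`)
  return new Float32Array(await response.json())
}
```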