This pipeline is part of a DSSG Berlin e.V. volunteering project that ran from 2022 to 2023 to support an aid organisation in uncovering the different services that sub-organisations offer to help homeless people. This public repository is for demonstration purposes and does not include all project outputs.
Contribute to add your name to the list of contributors.
Create and open a codespace with 8 GB RAM on the repository main branch
This will open a new tab in your browser
Go to https://github.com/codespaces
Change the codespace machine type to one with 8 GB RAM
For the machine type change to take effect, stop the codespace and then restart it
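After the restart, it can be worth confirming that the codespace really runs on the larger machine. The following is a minimal sketch, assuming a Linux codespace where `os.sysconf` exposes the page size and page count. The function names and the 0.5 GiB tolerance are illustrative assumptions and not part of this repository; the tolerance is there because the total reported by the operating system is often slightly below the nominal machine size.

```python
import os


def total_ram_gib() -> float:
    """Return the total physical memory in GiB (POSIX systems such as Linux)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    page_count = os.sysconf("SC_PHYS_PAGES")
    return page_size * page_count / 1024**3


def has_enough_ram(total_gib: float,
                   required_gib: float = 8.0,
                   tolerance_gib: float = 0.5) -> bool:
    """Check a reported total against the required size, minus a tolerance.

    The tolerance value is an assumption made for this sketch.
    """
    return total_gib >= required_gib - tolerance_gib


if __name__ == "__main__":
    ram = total_ram_gib()
    print(f"Total RAM: {ram:.1f} GiB")
    if not has_enough_ram(ram):
        print("Less than 8 GB RAM detected; change the machine type and restart.")
```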
Open the codespace tab in your browser and enter the following commands in the codespace terminal
Install dependencies
pip install -r pipeline/tfidf-fasttext-pipe-codespace/pipeline_requirements.txt
Download the fastText vector model provided by deepset.ai
wget https://s3.eu-central-1.amazonaws.com/int-emb-fasttext-de-wiki/20180917/model.bin
Unzip the keyword vectorizer
unzip data-assets/vectorizer.zip -d data-assets
Create and save keywords for each document using TF-IDF. This step also downloads the sample anonymized dataset and the sklearn TF-IDF vectorizer
python -W ignore pipeline/tfidf-fasttext-pipe-codespace/02_extract_keywords.py
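To illustrate what TF-IDF keyword extraction does, here is a small, self-contained sketch. It is not the code in `02_extract_keywords.py`, which relies on the sklearn TF-IDF vectorizer. The function names (`tokenize`, `top_keywords`), the smoothed IDF variant, and the toy documents are assumptions made for this example only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def top_keywords(documents, k=3):
    """Return the k highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Smoothed inverse document frequency (one common variant).
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in doc_freq.items()}

    results = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        scores = {term: count * idf[term] for term, count in term_freq.items()}
        # Sort by descending score; break ties alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:k])
    return results


if __name__ == "__main__":
    # Made-up toy documents, not taken from the project dataset.
    docs = [
        "shelter shelter food",
        "food advice",
        "advice housing",
    ]
    for doc, keywords in zip(docs, top_keywords(docs, k=2)):
        print(doc, "->", keywords)
```

Terms that are frequent in one document but rare across the collection score highest, which is why "shelter" ranks above "food" for the first toy document.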
Run the text search with predefined terms and a cosine similarity cutoff
python pipeline/tfidf-fasttext-pipe-codespace/03_search_documents_for_topic.py
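The idea of a search with a cosine similarity cutoff can be sketched as follows. This is not the code in `03_search_documents_for_topic.py`. It assumes that search terms and document keywords are both available as embedding vectors, and it uses tiny made-up 2-dimensional vectors instead of fastText vectors. The function names and the cutoff of 0.7 are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_documents(term_vector, keyword_vectors_by_doc, cutoff=0.7):
    """Return {doc_id: best_similarity} for documents that pass the cutoff.

    A document matches if at least one of its keyword vectors has a
    cosine similarity to the search term of `cutoff` or more.
    """
    hits = {}
    for doc_id, vectors in keyword_vectors_by_doc.items():
        best = max((cosine_similarity(term_vector, v) for v in vectors),
                   default=0.0)
        if best >= cutoff:
            hits[doc_id] = best
    return hits


if __name__ == "__main__":
    # Made-up vectors standing in for embeddings of a search term and keywords.
    term = [1.0, 0.0]
    keywords = {
        "doc_a": [[0.9, 0.1], [0.0, 1.0]],
        "doc_b": [[0.1, 0.9]],
    }
    print(search_documents(term, keywords))
```

A higher cutoff returns fewer but closer matches, while a lower cutoff trades precision for recall.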
fastText is provided under the Creative Commons Attribution-Share-Alike License 3.0, as described in P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, "Enriching Word Vectors with Subword Information", and is supplied by https://www.deepset.ai/german-word-embeddings.