Pipeline

This pipeline is part of a DSSG Berlin e.V. volunteering project that ran from 2022 to 2023. The project supported an aid organisation in identifying the different services that its sub-organisations offer to help homeless people. This public repository is for demonstration purposes and does not include all project outputs.

Thanks for the Contributions

Contribute to add your name to the list of contributors.

Instructions to run the Pipeline

Create a GitHub Codespace for this Repository

Create and open a codespace with 8 GB RAM on the repository main branch

This will open a new tab in your browser

Go to https://github.com/codespaces

Change the codespace machine type to one with 8 GB of RAM

For the machine type change to take effect, stop the codespace (https://docs.github.com/en/codespaces/developing-in-codespaces/stopping-and-starting-a-codespace#stopping-a-codespace) and then restart it (https://docs.github.com/en/codespaces/developing-in-codespaces/stopping-and-starting-a-codespace#restarting-a-codespace)
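To confirm that the restarted codespace really has the larger machine type, a small check like the following may help. It is only a sketch and not part of the repository: the function names and the 7.0 GiB threshold are assumptions (a machine advertised with 8 GB usually reports slightly less usable memory), and `os.sysconf` with these keys is available on Linux, which is what codespaces run.

```python
import os


def total_memory_gib():
    """Return the total physical memory of this machine in GiB (Linux)."""
    page_size = os.sysconf("SC_PAGE_SIZE")    # bytes per memory page
    page_count = os.sysconf("SC_PHYS_PAGES")  # number of physical pages
    return page_size * page_count / 1024**3


def has_enough_memory(total_gib, required_gib=7.0):
    """Hypothetical threshold check; 7.0 GiB is an assumed safety margin."""
    return total_gib >= required_gib


if __name__ == "__main__":
    total = total_memory_gib()
    status = "OK" if has_enough_memory(total) else "too small, change the machine type"
    print(f"Total memory: {total:.1f} GiB ({status})")
```

If the check reports too little memory, repeat the machine type change and the stop/restart steps above.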

Open the codespace tab in your browser and enter the following commands in the codespace terminal

Run the Pipeline

Install dependencies

pip install -r pipeline/tfidf-fasttext-pipe-codespace/pipeline_requirements.txt

Download the fastText vector model provided by deepset.ai

wget https://s3.eu-central-1.amazonaws.com/int-emb-fasttext-de-wiki/20180917/model.bin

Unzip the keyword vectorizer

unzip data-assets/vectorizer.zip -d data-assets

Create and save keywords for each document using TF-IDF. This step also downloads the sample anonymized dataset and the sklearn TF-IDF vectorizer

python -W ignore pipeline/tfidf-fasttext-pipe-codespace/02_extract_keywords.py
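For readers unfamiliar with TF-IDF keyword extraction, the idea behind this step can be sketched in plain Python. This is an illustration only, not the code in 02_extract_keywords.py, which uses the pre-fitted sklearn vectorizer. The function names, the simple tokenizer, and the `top_k` parameter are assumptions. The IDF term uses the smoothed formula that sklearn's TfidfVectorizer applies by default, but the term-frequency scaling here is simplified, so scores will not match sklearn exactly.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def tfidf_keywords(documents, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    keywords = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        scores = {
            # relative term frequency times smoothed inverse document frequency
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # Sort by descending score; break ties alphabetically for stable output.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        keywords.append(ranked[:top_k])
    return keywords


if __name__ == "__main__":
    # Made-up example documents, not taken from the project dataset.
    docs = [
        "Notunterkunft und warme Mahlzeiten",
        "Beratung und medizinische Hilfe",
        "warme Kleidung und Beratung",
    ]
    for doc, kws in zip(docs, tfidf_keywords(docs, top_k=2)):
        print(doc, "->", kws)
```

Terms that occur in every document, such as "und" in the made-up examples, receive the lowest IDF weight and therefore drop out of the keyword lists.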

Run the text search with predefined terms and a cosine similarity cutoff

python pipeline/tfidf-fasttext-pipe-codespace/03_search_documents_for_topic.py
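The search step compares word vectors by cosine similarity and keeps matches above a cutoff. The sketch below shows that idea with tiny hand-made vectors. It is not the code in 03_search_documents_for_topic.py: the function names, the data layout (one list of keyword vectors per document), and the cutoff of 0.7 are assumptions for illustration. In the actual pipeline the vectors would come from the fastText model downloaded above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat as no match
    return dot / (norm_a * norm_b)


def search_documents(term_vector, keyword_vectors_by_doc, cutoff=0.7):
    """Return {doc_id: best similarity} for documents at or above the cutoff.

    A document matches if at least one of its keyword vectors is
    similar enough to the search term vector.
    """
    hits = {}
    for doc_id, keyword_vectors in keyword_vectors_by_doc.items():
        best = max(
            (cosine_similarity(term_vector, vec) for vec in keyword_vectors),
            default=0.0,  # documents without keywords never match
        )
        if best >= cutoff:
            hits[doc_id] = best
    return hits


if __name__ == "__main__":
    # Toy 2-dimensional vectors; real fastText vectors have many more dimensions.
    term = [1.0, 0.0]
    docs = {
        "doc_a": [[1.0, 0.0], [0.0, 1.0]],
        "doc_b": [[0.0, 1.0]],
        "doc_c": [[1.0, 1.0]],
    }
    print(search_documents(term, docs, cutoff=0.7))
```

A higher cutoff returns fewer, closer matches; a lower cutoff returns more documents at the cost of more false positives.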

Licenses

The FastText vectors are licensed under the Creative Commons Attribution-Share-Alike License 3.0, as described in P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information, and are supplied by https://www.deepset.ai/german-word-embeddings.
