parla-document-processor

This repository contains scripts for pre-processing PDF files for later use in the exploratory project Parla. It offers a generic way of importing/registering and processing PDF documents. For the Parla use case, the publicly accessible PDF documents of "Schriftliche Anfragen" (written parliamentary inquiries) and "Hauptausschussprotokolle" (main committee minutes) are used.

Prerequisites / External Services

The scripts rely on three external services, configured via the environment variables below:

  • Supabase: stores registered documents, chunks, summaries and embeddings
  • OpenAI: generates summaries, tags and embedding vectors
  • LlamaParse: extracts Markdown text from PDFs

Features

  • Register relevant documents from various data sources, see ./src/importers. Registering documents means storing their download URL and any available metadata in the database.

  • Process registered documents (a minimal sketch of this pipeline follows this list) by

    1. Downloading the PDF
    2. Extracting text (Markdown) content from the PDF via the LlamaParse API
    3. Generating a summary of the PDF content via OpenAI
    4. Generating a list of tags describing the PDF content via OpenAI
    5. Generating embedding vectors of each PDF page via OpenAI
  • Regenerate embeddings for both chunks and summaries. This is particularly useful when the LLM provider (we use OpenAI) introduces a new embedding model, as happened in January 2024 (https://openai.com/blog/new-embedding-models-and-api-updates). Regeneration is done by the run_regenerate_embeddings.ts script (a second sketch follows this list) and performs the following steps:

    • For each chunk in processed_document_chunks, generate an embedding with the (new) model set in the env variable OPENAI_EMBEDDING_MODEL and store it in the column embedding_temp.
    • For each summary in processed_document_summaries, generate an embedding with the (new) model set in the env variable OPENAI_EMBEDDING_MODEL and store it in the column summary_embedding_temp.
    • After doing so, the API (https://github.com/technologiestiftung/parla-api) must be changed to use the new model as well.
    • The final migration must happen at the same time as the API changes by renaming the columns:
      ALTER TABLE processed_document_chunks RENAME COLUMN embedding TO embedding_old;
      ALTER TABLE processed_document_chunks RENAME COLUMN embedding_temp TO embedding;
      ALTER TABLE processed_document_chunks RENAME COLUMN embedding_old TO embedding_temp;
      
      and
      
      ALTER TABLE processed_document_summaries RENAME COLUMN summary_embedding TO summary_embedding_old;
      ALTER TABLE processed_document_summaries RENAME COLUMN summary_embedding_temp TO summary_embedding;
      ALTER TABLE processed_document_summaries RENAME COLUMN summary_embedding_old TO summary_embedding_temp;
      
    • After swapping the columns, the indices must be regenerated, see the section "Periodically regenerate indices" below.
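
To make the processing steps concrete, here is a minimal sketch of the per-document flow. It is illustrative only, not the repository's actual code: parseWithLlamaParse is a hypothetical stand-in for the LlamaParse API call, and the prompt is a placeholder; only the OpenAI SDK calls (chat.completions.create, embeddings.create) are real methods.

import OpenAI from "openai";

// Hypothetical helper standing in for the LlamaParse API call;
// returns the extracted Markdown, one string per PDF page.
declare function parseWithLlamaParse(pdf: ArrayBuffer): Promise<string[]>;

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function processDocument(downloadUrl: string) {
  // 1. Download the PDF from its registered URL
  const pdf = await fetch(downloadUrl).then((res) => res.arrayBuffer());

  // 2. Extract Markdown content via LlamaParse
  const pages = await parseWithLlamaParse(pdf);

  // 3. + 4. Summary and tags via OpenAI (placeholder prompt)
  const summary = await openai.chat.completions.create({
    model: process.env.OPENAI_MODEL!,
    messages: [
      { role: "user", content: `Summarize and tag:\n${pages.join("\n")}` },
    ],
  });

  // 5. One embedding vector per page
  const embeddings = await openai.embeddings.create({
    model: process.env.OPENAI_EMBEDDING_MODEL!,
    input: pages,
  });

  return { summary, embeddings };
}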
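
The regeneration loop itself can be pictured as follows. This is a sketch under assumptions, not the contents of run_regenerate_embeddings.ts: the chunk text column is assumed to be called content, and the pg package is assumed for database access. The key point is that the new vectors go into embedding_temp, leaving the live embedding column untouched until the column swap shown above.

import OpenAI from "openai";
import { Client } from "pg";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const db = new Client({ connectionString: process.env.SUPABASE_DB_CONNECTION });

async function regenerateChunkEmbeddings() {
  await db.connect();
  // "content" is an assumed column name for the chunk text.
  const { rows } = await db.query(
    "SELECT id, content FROM processed_document_chunks",
  );
  for (const row of rows) {
    // Embed with the *new* model configured in OPENAI_EMBEDDING_MODEL ...
    const result = await openai.embeddings.create({
      model: process.env.OPENAI_EMBEDDING_MODEL!,
      input: row.content,
    });
    // ... and store it in embedding_temp; pgvector accepts the
    // "[0.1,0.2,...]" text format produced by JSON.stringify.
    await db.query(
      "UPDATE processed_document_chunks SET embedding_temp = $1 WHERE id = $2",
      [JSON.stringify(result.data[0].embedding), row.id],
    );
  }
  await db.end();
}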

Limitations

  • Only PDF documents are supported
  • The download URL of the documents must be publicly accessible

Environment variables

See .env.sample

# Supabase Configuration
SUPABASE_URL=
SUPABASE_SERVICE_ROLE_KEY=
SUPABASE_DB_CONNECTION=

# OpenAI Configuration
OPENAI_API_KEY=
OPENAI_MODEL=
OPENAI_EMBEDDING_MODEL=

# Directory for processing temporary files
PROCESSING_DIR=

ALLOW_DELETION=false

# Max limit for the number of pages to process (with fallback strategy)
MAX_PAGES_LIMIT=5000

# Limit for the number of pages to process with LlamaParse
MAX_PAGES_FOR_LLM_PARSE_LIMIT=128

# LlamaParse token (get via LlamaParse Cloud)
LLAMA_PARSE_TOKEN=

# Max number of documents to process in one run (for limiting the maximum runtime)
MAX_DOCUMENTS_TO_PROCESS_IN_ONE_RUN=

# Max number of documents to import per document type in one run (for limiting the maximum runtime)
MAX_DOCUMENTS_TO_IMPORT_PER_DOCUMENT_TYPE=
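
A startup check along these lines catches missing configuration early. This is a hedged sketch, not code from the repository: it assumes the dotenv package is used to load the file, and which variables are strictly required is an assumption based on the sample above.

import "dotenv/config";

// Fail fast if a required variable is missing.
const required = [
  "SUPABASE_URL",
  "SUPABASE_SERVICE_ROLE_KEY",
  "SUPABASE_DB_CONNECTION",
  "OPENAI_API_KEY",
  "OPENAI_MODEL",
  "OPENAI_EMBEDDING_MODEL",
  "PROCESSING_DIR",
  "LLAMA_PARSE_TOKEN",
];

for (const name of required) {
  if (!process.env[name]) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
}

// Numeric limits fall back to the defaults shown above.
export const maxPagesLimit = Number(process.env.MAX_PAGES_LIMIT ?? 5000);
export const maxPagesForLlamaParse = Number(
  process.env.MAX_PAGES_FOR_LLM_PARSE_LIMIT ?? 128,
);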

Run locally

⚠️ Warning: Running these scripts on many PDF documents will incur significant costs. ⚠️

  • Set up the .env file based on .env.sample
  • Run npm ci to install dependencies
  • Run npx tsx ./src/run_import.ts to register the documents
  • Run npx tsx ./src/run_process.ts to process all unprocessed documents

Periodically regenerate indices

The indices on the processed_document_chunks and processed_document_summaries tables need to be regenerated when new data arrives, because the lists parameter of the underlying IVFFlat indices should grow with the number of rows (see https://github.com/pgvector/pgvector). To do this, we use the pg_cron extension (https://github.com/citusdata/pg_cron). To schedule the regeneration, we create two jobs which use functions defined in the API and database definition (https://github.com/technologiestiftung/parla-api):

select cron.schedule (
    'regenerate_embedding_indices_for_chunks',
    '30 5 * * *',
    $$ SELECT * from regenerate_embedding_indices_for_chunks() $$
);

select cron.schedule (
    'regenerate_embedding_indices_for_summaries',
    '30 5 * * *',
    $$ SELECT * from regenerate_embedding_indices_for_summaries() $$
);

Related repositories

  • https://github.com/technologiestiftung/parla-api: the Parla API and database definition, which serve the processed documents

Contributors ✨

Thanks goes to these wonderful people (emoji key):

Fabian Morón Zirfas: 💻 🤔
Jonas Jaszkowic: 💻 🤔 🚇

This project follows the all-contributors specification. Contributions of any kind welcome!

Credits

Made by

A project by

Supported by

