parla-document-processor

This repository contains scripts for pre-processing PDF files for later use in the exploratory project Parla. It offers a generic way of importing/registering and processing PDF documents. For the Parla use case, the publicly accessible PDF documents of "Schriftliche Anfragen" (written parliamentary inquiries) and "Hauptausschussprotokolle" (main committee minutes) are used.

Prerequisites / External Services

The scripts rely on three external services, configured via the environment variables below:

  • Supabase: stores registered documents, chunks, summaries and embeddings
  • OpenAI: generates summaries, tags and embedding vectors
  • LlamaParse: extracts Markdown text from PDFs

Features

  • Register relevant documents from various data sources, see ./src/importers. Registering documents means storing their download URL and any available metadata in the database.

  • Process registered documents (a minimal sketch of this pipeline follows this list) by

    1. Downloading the PDF
    2. Extracting text (Markdown) content from the PDF via the LlamaParse API
    3. Generating a summary of the PDF content via OpenAI
    4. Generating a list of tags describing the PDF content via OpenAI
    5. Generating embedding vectors of each PDF page via OpenAI
  • Regenerate embeddings for both chunks and summaries. This is particularly useful when the LLM provider (we use OpenAI) introduces a new embedding model, as happened in January 2024 (https://openai.com/blog/new-embedding-models-and-api-updates). Regeneration is done by the run_regenerate_embeddings.ts script (a second sketch follows this list) and performs the following steps:

    • For each chunk in processed_document_chunks, generate an embedding with the (new) model set in the env variable OPENAI_EMBEDDING_MODEL and store it in the column embedding_temp.
    • For each summary in processed_document_summaries, generate an embedding with the (new) model set in the env variable OPENAI_EMBEDDING_MODEL and store it in the column summary_embedding_temp.
    • After doing so, the API (https://github.com/technologiestiftung/parla-api) must be changed to use the new model as well.
    • The final migration must happen at the same time as the API changes by renaming the columns:
      ALTER TABLE processed_document_chunks RENAME COLUMN embedding TO embedding_old;
      ALTER TABLE processed_document_chunks RENAME COLUMN embedding_temp TO embedding;
      ALTER TABLE processed_document_chunks RENAME COLUMN embedding_old TO embedding_temp;
      
      and
      
      ALTER TABLE processed_document_summaries RENAME COLUMN summary_embedding TO summary_embedding_old;
      ALTER TABLE processed_document_summaries RENAME COLUMN summary_embedding_temp TO summary_embedding;
      ALTER TABLE processed_document_summaries RENAME COLUMN summary_embedding_old TO summary_embedding_temp;
      
    • After swapping the columns, the indices must be regenerated, see the section "Periodically regenerate indices" below.
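
To make the processing steps concrete, here is a minimal sketch of the per-document flow. It is illustrative only, not the repository's actual code: parseWithLlamaParse is a hypothetical stand-in for the LlamaParse API call, and the prompt is a placeholder; only the OpenAI SDK calls (chat.completions.create, embeddings.create) are real methods.

import OpenAI from "openai";

// Hypothetical helper standing in for the LlamaParse API call;
// returns the extracted Markdown, one string per PDF page.
declare function parseWithLlamaParse(pdf: ArrayBuffer): Promise<string[]>;

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function processDocument(downloadUrl: string) {
  // 1. Download the PDF from its registered URL
  const pdf = await fetch(downloadUrl).then((res) => res.arrayBuffer());

  // 2. Extract Markdown content via LlamaParse
  const pages = await parseWithLlamaParse(pdf);

  // 3. + 4. Summary and tags via OpenAI (placeholder prompt)
  const summary = await openai.chat.completions.create({
    model: process.env.OPENAI_MODEL!,
    messages: [
      { role: "user", content: `Summarize and tag:\n${pages.join("\n")}` },
    ],
  });

  // 5. One embedding vector per page
  const embeddings = await openai.embeddings.create({
    model: process.env.OPENAI_EMBEDDING_MODEL!,
    input: pages,
  });

  return { summary, embeddings };
}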
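
The regeneration loop itself can be pictured as follows. This is a sketch under assumptions, not the contents of run_regenerate_embeddings.ts: the chunk text column is assumed to be called content, and the pg package is assumed for database access. The key point is that the new vectors go into embedding_temp, leaving the live embedding column untouched until the column swap shown above.

import OpenAI from "openai";
import { Client } from "pg";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const db = new Client({ connectionString: process.env.SUPABASE_DB_CONNECTION });

async function regenerateChunkEmbeddings() {
  await db.connect();
  // "content" is an assumed column name for the chunk text.
  const { rows } = await db.query(
    "SELECT id, content FROM processed_document_chunks",
  );
  for (const row of rows) {
    // Embed with the *new* model configured in OPENAI_EMBEDDING_MODEL ...
    const result = await openai.embeddings.create({
      model: process.env.OPENAI_EMBEDDING_MODEL!,
      input: row.content,
    });
    // ... and store it in embedding_temp; pgvector accepts the
    // "[0.1,0.2,...]" text format produced by JSON.stringify.
    await db.query(
      "UPDATE processed_document_chunks SET embedding_temp = $1 WHERE id = $2",
      [JSON.stringify(result.data[0].embedding), row.id],
    );
  }
  await db.end();
}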

Limitations

  • Only PDF documents are supported
  • The download URL of the documents must be publicly accessible

Environment variables

See .env.sample

# Supabase Configuration
SUPABASE_URL=
SUPABASE_SERVICE_ROLE_KEY=
SUPABASE_DB_CONNECTION=

# OpenAI Configuration
OPENAI_API_KEY=
OPENAI_MODEL=
OPENAI_EMBEDDING_MODEL=

# Directory for processing temporary files
PROCESSING_DIR=

ALLOW_DELETION=false

# Max limit for the number of pages to process (with fallback strategy)
MAX_PAGES_LIMIT=5000

# Limit for the number of pages to process with LlamaParse
MAX_PAGES_FOR_LLM_PARSE_LIMIT=128

# LlamaParse token (get via LlamaParse Cloud)
LLAMA_PARSE_TOKEN=

# Max number of documents to process in one run (for limiting the maximum runtime)
MAX_DOCUMENTS_TO_PROCESS_IN_ONE_RUN=

# Max number of documents to import per document type in one run (for limiting the maximum runtime)
MAX_DOCUMENTS_TO_IMPORT_PER_DOCUMENT_TYPE=
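
A startup check along these lines catches missing configuration early. This is a hedged sketch, not code from the repository: it assumes the dotenv package is used to load the file, and which variables are strictly required is an assumption based on the sample above.

import "dotenv/config";

// Fail fast if a required variable is missing.
const required = [
  "SUPABASE_URL",
  "SUPABASE_SERVICE_ROLE_KEY",
  "SUPABASE_DB_CONNECTION",
  "OPENAI_API_KEY",
  "OPENAI_MODEL",
  "OPENAI_EMBEDDING_MODEL",
  "PROCESSING_DIR",
  "LLAMA_PARSE_TOKEN",
];

for (const name of required) {
  if (!process.env[name]) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
}

// Numeric limits fall back to the defaults shown above.
export const maxPagesLimit = Number(process.env.MAX_PAGES_LIMIT ?? 5000);
export const maxPagesForLlamaParse = Number(
  process.env.MAX_PAGES_FOR_LLM_PARSE_LIMIT ?? 128,
);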

Run locally

⚠️ Warning: Running these scripts on many PDF documents will incur significant costs. ⚠️

  • Set up the .env file based on .env.sample
  • Run npm ci to install dependencies
  • Run npx tsx ./src/run_import.ts to register the documents
  • Run npx tsx ./src/run_process.ts to process all unprocessed documents

Periodically regenerate indices

The indices on the processed_document_chunks and processed_document_summaries tables need to be regenerated when new data arrives, because the lists parameter of the underlying IVFFlat indices should grow with the number of rows (see https://github.com/pgvector/pgvector). To do this, we use the pg_cron extension (https://github.com/citusdata/pg_cron). To schedule the regeneration, we create two jobs which use functions defined in the API and database definition (https://github.com/technologiestiftung/parla-api):

select cron.schedule (
    'regenerate_embedding_indices_for_chunks',
    '30 5 * * *',
    $$ SELECT * from regenerate_embedding_indices_for_chunks() $$
);

select cron.schedule (
    'regenerate_embedding_indices_for_summaries',
    '30 5 * * *',
    $$ SELECT * from regenerate_embedding_indices_for_summaries() $$
);

Related repositories

  • https://github.com/technologiestiftung/parla-api: the Parla API and database definition, which serve the processed documents

Contributors ✨

Thanks goes to these wonderful people (emoji key):

Fabian Morón Zirfas: 💻 🤔
Jonas Jaszkowic: 💻 🤔 🚇

This project follows the all-contributors specification. Contributions of any kind welcome!

Credits

Made by

A project by

Supported by

