This repository contains scripts for pre-processing PDF files for later use in the exploratory project Parla. It offers a generic way of importing/registering and processing PDF documents. For Parla's use case, the publicly accessible PDF documents of "Schriftliche Anfragen" (written parliamentary inquiries) and "Hauptausschussprotokolle" (main committee minutes) are used.

Prerequisites:
- A running and accessible Supabase database with the schema defined in https://github.com/technologiestiftung/parla-api
- An OpenAI account and API key
- A LlamaParse account and API key
Features:

- Register relevant documents from various data sources (see ./src/importers). Registering a document means storing its download URL and possible metadata in the database (a hypothetical importer sketch appears below).
- Process registered documents by:
  - Downloading the PDF
  - Extracting text (Markdown) content from the PDF via the LlamaParse API
  - Generating a summary of the PDF content via OpenAI
  - Generating a list of tags describing the PDF content via OpenAI
  - Generating embedding vectors for each PDF page via OpenAI
- Regenerate embeddings for both chunks and summaries. This is particularly useful when the LLM provider (we use OpenAI) introduces a new embedding model, as happened in January 2024 (https://openai.com/blog/new-embedding-models-and-api-updates). Regenerating the embeddings is done in the `run_regenerate_embeddings.ts` script (a sketch follows this list), which performs the following steps:
  - For each chunk in `processed_document_chunks`, generate an embedding with the (new) model set in the env variable `OPENAI_EMBEDDING_MODEL` and store it in the column `embedding_temp`.
  - For each summary in `processed_document_summaries`, generate an embedding with the (new) model set in the env variable `OPENAI_EMBEDDING_MODEL` and store it in the column `summary_embedding_temp`.
  - After doing so, the API (https://github.com/technologiestiftung/parla-api) must be changed to use the new model as well.
  - The final migration must happen simultaneously with the API changes by renaming the columns:
```sql
ALTER TABLE processed_document_chunks RENAME COLUMN embedding TO embedding_old;
ALTER TABLE processed_document_chunks RENAME COLUMN embedding_temp TO embedding;
ALTER TABLE processed_document_chunks RENAME COLUMN embedding_old TO embedding_temp;

ALTER TABLE processed_document_summaries RENAME COLUMN summary_embedding TO summary_embedding_old;
ALTER TABLE processed_document_summaries RENAME COLUMN summary_embedding_temp TO summary_embedding;
ALTER TABLE processed_document_summaries RENAME COLUMN summary_embedding_old TO summary_embedding_temp;
```
- After swapping the columns, the indices must be regenerated; see the section [Periodically regenerate indices] below.
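The following is a minimal sketch of the chunk half of that regeneration, not the actual `run_regenerate_embeddings.ts`. It assumes the `openai` and `@supabase/supabase-js` packages and a text column `content` on `processed_document_chunks` (that column name is an assumption):

```typescript
// Minimal sketch of regenerating chunk embeddings, NOT the actual
// run_regenerate_embeddings.ts. The `content` column name is an assumption.
import OpenAI from "openai";
import { createClient } from "@supabase/supabase-js";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!,
);

async function regenerateChunkEmbeddings() {
  // Fetch chunks; a real script would paginate instead of loading all rows.
  const { data: chunks, error } = await supabase
    .from("processed_document_chunks")
    .select("id, content");
  if (error) throw error;

  for (const chunk of chunks ?? []) {
    // Embed with the (new) model configured via OPENAI_EMBEDDING_MODEL...
    const response = await openai.embeddings.create({
      model: process.env.OPENAI_EMBEDDING_MODEL!,
      input: chunk.content,
    });
    // ...and store it in embedding_temp, leaving the live column untouched.
    await supabase
      .from("processed_document_chunks")
      .update({ embedding_temp: response.data[0].embedding })
      .eq("id", chunk.id);
  }
}
```

The summary half works the same way against `processed_document_summaries` and `summary_embedding_temp`.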
Limitations:

- Only PDF documents are supported.
- The download URLs of the documents must be publicly accessible.
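To make the registration step from the Features list concrete, here is a hypothetical importer sketch. The table name `registered_documents` and its columns are assumptions; the real importers live in ./src/importers and the schema in the parla-api repository:

```typescript
// Hypothetical importer sketch: register documents by storing their
// download URL and metadata. Table and column names are assumptions.
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!,
);

interface DocumentToRegister {
  sourceUrl: string; // publicly accessible PDF download URL
  metadata: Record<string, unknown>; // e.g. title, publication date
}

async function registerDocuments(documents: DocumentToRegister[]) {
  for (const doc of documents) {
    const { error } = await supabase
      .from("registered_documents") // table name is an assumption
      .upsert(
        { source_url: doc.sourceUrl, metadata: doc.metadata },
        { onConflict: "source_url" }, // avoid re-registering known documents
      );
    if (error) throw error;
  }
}
```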
See `.env.sample` for the required environment variables:

```bash
# Supabase configuration
SUPABASE_URL=
SUPABASE_SERVICE_ROLE_KEY=
SUPABASE_DB_CONNECTION=

# OpenAI configuration
OPENAI_API_KEY=
OPENAI_MODEL=
OPENAI_EMBEDDING_MODEL=

# Directory for temporary files during processing
PROCESSING_DIR=
ALLOW_DELETION=false

# Max limit for the number of pages to process (with fallback strategy)
MAX_PAGES_LIMIT=5000
# Limit for the number of pages to process with LlamaParse
MAX_PAGES_FOR_LLM_PARSE_LIMIT=128

# LlamaParse token (get via LlamaParse Cloud)
LLAMA_PARSE_TOKEN=

# Max number of documents to process in one run (for limiting the maximum runtime)
MAX_DOCUMENTS_TO_PROCESS_IN_ONE_RUN=
# Max number of documents to import per document type in one run (for limiting the maximum runtime)
MAX_DOCUMENTS_TO_IMPORT_PER_DOCUMENT_TYPE=
```
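A minimal sketch of loading and validating this configuration, assuming the `dotenv` package (whether the actual scripts use dotenv is an assumption):

```typescript
// Minimal configuration loader; failing fast on missing required variables.
import "dotenv/config";

function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) throw new Error(`Missing required env variable: ${name}`);
  return value;
}

const config = {
  supabaseUrl: requireEnv("SUPABASE_URL"),
  supabaseServiceRoleKey: requireEnv("SUPABASE_SERVICE_ROLE_KEY"),
  openAiApiKey: requireEnv("OPENAI_API_KEY"),
  openAiEmbeddingModel: requireEnv("OPENAI_EMBEDDING_MODEL"),
  llamaParseToken: requireEnv("LLAMA_PARSE_TOKEN"),
  processingDir: requireEnv("PROCESSING_DIR"),
  allowDeletion: process.env.ALLOW_DELETION === "true",
  maxPagesLimit: Number(process.env.MAX_PAGES_LIMIT ?? 5000),
};
```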
To run locally:

- Set up an `.env` file based on `.env.sample`
- Run `npm ci` to install dependencies
- Run `npx tsx ./src/run_import.ts` to register the documents
- Run `npx tsx ./src/run_process.ts` to process all unprocessed documents
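To illustrate the processing steps listed under Features, here is a simplified, hypothetical sketch of what `run_process.ts` does per document. The LlamaParse helper is a stub, and all names are illustrative rather than the repository's actual code:

```typescript
// Simplified, hypothetical per-document pipeline; the real logic lives in
// ./src/run_process.ts. All helper names are illustrative.
import fs from "node:fs/promises";
import path from "node:path";
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// 1. Download the PDF into PROCESSING_DIR.
async function downloadPdf(url: string): Promise<string> {
  const response = await fetch(url);
  const target = path.join(process.env.PROCESSING_DIR ?? ".", "document.pdf");
  await fs.writeFile(target, Buffer.from(await response.arrayBuffer()));
  return target;
}

// 2. Stub for the LlamaParse call: the real script sends the file to the
// LlamaParse API (authenticated via LLAMA_PARSE_TOKEN) and gets Markdown back.
async function parseWithLlamaParse(filePath: string): Promise<string[]> {
  throw new Error(`LlamaParse call omitted in this sketch (${filePath})`);
}

async function processDocument(downloadUrl: string) {
  const filePath = await downloadPdf(downloadUrl);
  const pages = await parseWithLlamaParse(filePath); // Markdown, one entry per page

  // 3. Generate a summary of the content via OpenAI.
  const summary = await openai.chat.completions.create({
    model: process.env.OPENAI_MODEL!,
    messages: [
      { role: "user", content: `Summarize this document:\n${pages.join("\n")}` },
    ],
  });

  // 4. Tags would be generated by a similar chat call (omitted here).

  // 5. Generate one embedding vector per page via OpenAI.
  const embeddings = await openai.embeddings.create({
    model: process.env.OPENAI_EMBEDDING_MODEL!,
    input: pages,
  });

  return {
    summary: summary.choices[0].message.content,
    embeddings: embeddings.data.map((d) => d.embedding),
  };
}
```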
The indices on the `processed_document_chunks` and `processed_document_summaries` tables need to be regenerated when new data arrives, because the `lists` parameter of the indices should be adjusted to the number of rows, as recommended by pgvector (https://github.com/pgvector/pgvector). To do this, we use the pg_cron extension (https://github.com/citusdata/pg_cron). To schedule the regeneration of indices, we create two jobs which use functions defined in the API and database definition (https://github.com/technologiestiftung/parla-api):
```sql
SELECT cron.schedule(
  'regenerate_embedding_indices_for_chunks',
  '30 5 * * *',
  $$ SELECT * FROM regenerate_embedding_indices_for_chunks() $$
);

SELECT cron.schedule(
  'regenerate_embedding_indices_for_summaries',
  '30 5 * * *',
  $$ SELECT * FROM regenerate_embedding_indices_for_summaries() $$
);
```
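Those two functions are defined in the parla-api repository. Conceptually, a regeneration rebuilds the index with a `lists` value derived from the current row count (pgvector suggests roughly rows / 1000 for up to about one million rows). A hypothetical sketch for the chunks table, with an assumed index name and operator class:

```typescript
// Hypothetical sketch of what such a regeneration function might do; the
// real functions live in parla-api. Index name and vector_cosine_ops
// operator class are assumptions.
import pg from "pg";

async function regenerateChunkIndex() {
  const client = new pg.Client({
    connectionString: process.env.SUPABASE_DB_CONNECTION,
  });
  await client.connect();
  try {
    // pgvector recommends lists ≈ rows / 1000 (for up to ~1M rows).
    const { rows } = await client.query(
      "SELECT count(*)::int AS n FROM processed_document_chunks",
    );
    const lists = Math.max(10, Math.round(rows[0].n / 1000));
    await client.query(
      "DROP INDEX IF EXISTS processed_document_chunks_embedding_idx",
    );
    await client.query(
      `CREATE INDEX processed_document_chunks_embedding_idx
       ON processed_document_chunks
       USING ivfflat (embedding vector_cosine_ops)
       WITH (lists = ${lists})`,
    );
  } finally {
    await client.end();
  }
}
```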
- API and database definition: https://github.com/technologiestiftung/parla-api
- Parla frontend: https://github.com/technologiestiftung/parla-frontend
Thanks goes to these wonderful people (emoji key):

- Fabian Morón Zirfas 💻 🤔
- Jonas Jaszkowic 💻 🤔 🚇
This project follows the all-contributors specification. Contributions of any kind welcome!