Warning
Please be aware that, for development purposes, these experiments use experimental Large Language Models (LLMs) not intended for production. They can present inaccurate information, hallucinated statements and offensive text, either by random chance or through malevolent prompts.
- Under development / Experimental
- Tested on macOS only
- Peer-reviewed
- Depends on external APIs
This is an experimental application for semantic search of ONS statistical publications. It uses LangChain to implement a fairly simple Retrieval Augmented Generation (RAG) pipeline that combines embedding search with a question-answering (QA) information retrieval process.
Upon receiving a query, documents are returned as search results using embedding similarity to score relevance. Next, the relevant text is passed to a Large Language Model (LLM), which is prompted to write an answer to the original question, if it can, using only the information contained within the documents.
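The two stages described above can be illustrated with a minimal, self-contained sketch. It is not the statschat implementation: the bag-of-words `embed` function stands in for a real embedding model, `retrieve` and `build_prompt` are hypothetical helper names, and the three documents are made-up placeholder sentences. The actual project uses LangChain and learned embeddings.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vec, embed(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, contexts: list[str]) -> str:
    """Assemble a prompt that restricts the LLM to the retrieved text."""
    context_block = "\n\n".join(contexts)
    return (
        "Answer the question using only the information below. "
        "If the answer is not there, say so.\n\n"
        f"{context_block}\n\nQuestion: {question}\nAnswer:"
    )


# Made-up placeholder documents, not real ONS text.
documents = [
    "Consumer price inflation rose in the latest month.",
    "Retail sales volumes fell over the quarter.",
    "The unemployment rate was unchanged.",
]
question = "What happened to retail sales?"
top_docs = retrieve(question, documents, k=1)
prompt = build_prompt(question, top_docs)
print(prompt)
```

In a real pipeline the prompt would then be sent to the LLM; that call is left out here because it depends on the model backend in use.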
For this prototype, relevant web pages are scraped and the data is stored in data/bulletins. The docstore / embedding store that is created is likewise kept in local folders and files, and the LLM is either run in memory or accessed through VertexAI.
The project requires specific versions of some packages so it is recommended to set up a virtual environment. Using venv and pip:
python3.10 -m venv env
source env/bin/activate
python -m pip install --upgrade pip
python -m pip install .
Note
If you are doing development work on statschat, you should install the package locally as editable with our optional dev dependencies:
python -m pip install -e ".[dev]"
This repository contains a configuration of pre-commit hooks. These are language agnostic and focussed on repository security (such as detection of passwords and API keys).
If approaching this project as a developer, you are encouraged to install and enable pre-commits by running the following in your shell:
- Install pre-commit: pip install pre-commit
- Enable pre-commit: pre-commit install
Once pre-commits are activated, a series of checks will be executed whenever you commit to this repository. The use of active pre-commits is highly encouraged.
Note
Pre-commit hooks execute Python, so they expect a working Python build.
The main module, statschat, can either be called directly or deployed as an API (using FastAPI). A lightweight Flask front end is implemented separately in a subfolder and relies on the API running.
The first time you instantiate the Inquirer class, any ML models specified in the code will be downloaded to your machine. This will use a few GB of data and take a few minutes. App and search pipeline parameters are stored in statschat/_config/main.toml and can be updated by editing that file.
We have included a few example scraped data files in data/bulletins so that the preprocessing and app can be run as a small example system without waiting on webscraping.
If you wish to use Google's model API, update the model variables in statschat/_config/main.toml:
- To use the question-answering system with Google's PaLM2 API, set the generative_model_name parameter to text-unicorn or gemini-pro (their name for the model).
- For PaLM2 (Gecko) to create embeddings, set the embedding_model_name parameter to textembedding-gecko@001. You may also wish to disable the removal of near-identical documents in the preprocessing pipeline (line 59, statschat/embedding/preprocess.py), to reduce calls to the embedding API.
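As a rough illustration of the parameter changes described above, the sketch below rewrites the two model settings in a piece of TOML text. It assumes each key appears as a simple `key = "value"` line; the helper `set_toml_string` and the sample text (including its placeholder values) are hypothetical, and the real main.toml may be organised differently. Editing the file by hand works equally well.

```python
import json
import re


def set_toml_string(toml_text: str, key: str, value: str) -> str:
    """Replace the string value of `key` in TOML text, keeping other lines."""
    pattern = re.compile(
        rf"^([ \t]*{re.escape(key)}[ \t]*=[ \t]*).*$", re.MULTILINE
    )
    # json.dumps yields a double-quoted string, which is also valid TOML
    # for plain ASCII values such as model names.
    new_text, count = pattern.subn(
        lambda m: m.group(1) + json.dumps(value), toml_text
    )
    if count == 0:
        raise KeyError(f"{key!r} not found in config text")
    return new_text


# Hypothetical stand-in for the contents of statschat/_config/main.toml.
sample = (
    'generative_model_name = "local-model-placeholder"\n'
    'embedding_model_name = "local-embedding-placeholder"\n'
)
updated = set_toml_string(sample, "generative_model_name", "text-unicorn")
updated = set_toml_string(updated, "embedding_model_name", "textembedding-gecko@001")
print(updated)
```

Working on the text line by line, rather than parsing and re-serialising the file, leaves any comments and ordering in the config untouched.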
In addition to changing these parameters, you will need a Google Cloud Platform (GCP) project set up, with the Vertex AI API enabled. The GCP Command Line Interface must be installed on the machine running this code and logged in to an account with sufficient permissions to access the API (you may need to set up application default credentials). Usually this can be achieved by running:
gcloud config set project "<PROJECT_ID>"
gcloud auth application-default login
- Webscraping the source documents (not included in the public repository, only examples in data/bulletins): python statschat/webscraping/main.py
- python statschat/embedding/preprocess.py
- python statschat/embedding/preprocess_update_db.py
- uvicorn fast-api.main_api:app
  The FastAPI app is set to respond to HTTP requests on port 8000. When it is running, you can see the docs at http://localhost:8000/docs.
- python flask-app/app.py
  To use the UI, navigate in your browser to http://localhost:5000. Note that it requires the API to be running and the endpoint specified in the app.
- python statschat/model_evaluation/evaluation.py
  The StatsChat pipeline is currently evaluated on a small number of test questions. The main 'app_config.toml' determines the pipeline settings used in evaluation, and results are written to the data/model_evaluation folder.
- python -m pytest
  The preferred unit testing framework is PyTest.
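For the API step above, one way to discover the available routes programmatically is to read the OpenAPI schema that FastAPI serves alongside the interactive docs. The sketch below assumes the app keeps FastAPI's default schema location (/openapi.json) and is running on port 8000. `fetch_schema` and `list_endpoints` are hypothetical helpers, and the sample schema and its route are invented purely to show the shape of the data.

```python
import json
from urllib.request import urlopen

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def fetch_schema(base_url: str = "http://localhost:8000") -> dict:
    """Download the OpenAPI schema from a running API (needs the server up)."""
    with urlopen(f"{base_url}/openapi.json", timeout=10) as response:
        return json.load(response)


def list_endpoints(schema: dict) -> list[tuple[str, str]]:
    """Return sorted (METHOD, path) pairs described by an OpenAPI schema."""
    endpoints = []
    for path, operations in schema.get("paths", {}).items():
        for method in operations:
            # Skip non-operation keys such as "parameters" or "summary".
            if method.lower() in HTTP_METHODS:
                endpoints.append((method.upper(), path))
    return sorted(endpoints)


# Invented sample schema; call fetch_schema() instead when the API is running.
sample_schema = {"paths": {"/example": {"get": {"summary": "Illustrative route"}}}}
print(list_endpoints(sample_schema))
```

With the server running, `list_endpoints(fetch_schema())` would list the real routes, which avoids hard-coding endpoint names in client scripts.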
There are some key parameters in statschat/_config/main.toml that we're experimenting with to improve the search results and the generated text answer. The current values are initial guesses:
Parameter | Current Value | Function |
---|---|---|
k_docs | 10 | Maximum number of search results to return |
similarity_threshold | 2.0 | Cosine distance threshold: a retrieved document is only returned if its distance to the query is equal to or lower than this value |
k_contexts | 3 | Number of top documents to pass to generative QA LLM |
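To make the roles of the three parameters concrete, here is a minimal sketch of how such settings could be applied to scored search results. It is an assumption about the general pattern, not the statschat code: `select_documents` is a hypothetical helper, and the (document, distance) pairs are made-up inputs. Under the usual definition (one minus cosine similarity), cosine distance lies between 0 and 2, so a threshold of 2.0 would not exclude anything.

```python
def select_documents(
    scored_docs: list[tuple[str, float]],
    k_docs: int = 10,
    similarity_threshold: float = 2.0,
    k_contexts: int = 3,
) -> tuple[list[str], list[str]]:
    """Split scored results into search results and LLM contexts.

    scored_docs holds (document, distance) pairs; lower distance = closer.
    """
    # Keep only documents at or below the distance threshold.
    kept = [(doc, dist) for doc, dist in scored_docs if dist <= similarity_threshold]
    # Rank by ascending distance and cap the number of search results.
    kept.sort(key=lambda pair: pair[1])
    results = [doc for doc, _ in kept[:k_docs]]
    # Only the closest few results are passed to the generative model.
    contexts = results[:k_contexts]
    return results, contexts


# Made-up distances for illustration only.
scored = [("doc_a", 0.35), ("doc_b", 1.20), ("doc_c", 0.10), ("doc_d", 0.80)]
results, contexts = select_documents(
    scored, k_docs=3, similarity_threshold=1.0, k_contexts=2
)
print(results)   # ['doc_c', 'doc_a', 'doc_d']
print(contexts)  # ['doc_c', 'doc_a']
```

Lowering similarity_threshold trims weak matches before they reach the user, while k_contexts limits how much text the generative model has to read.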
At the Data Science Campus we apply data science, and build skills, for public good across the UK and internationally. Get in touch with the Campus at [email protected].
The code, unless otherwise stated, is released under the MIT License.
The documentation for this work is subject to © Crown copyright and is available under the terms of the Open Government Licence v3.0.