Harmony API

You can also join our Discord server!

Who to contact?

You can contact the Harmony team at https://harmonydata.ac.uk/, or Thomas Wood at http://fastdatascience.com/.

How to contribute

Read our guide to contributing to Harmony here.

You can raise an issue in the issue tracker, and you can open a pull request.

Please contact us at https://harmonydata.ac.uk/contact or write to [email protected] if you would like to be involved in the project.

Looking for the Harmony Python library?

Please visit https://github.com/harmonydata/harmony

Looking for the original (Plotly Dash-based) Harmony?

Please visit https://github.com/harmonydata/harmony_original

About Harmony

Harmony is a data harmonisation project that uses Natural Language Processing to help researchers make better use of existing data from different studies by supporting them with the harmonisation of various measures and items used in different studies. Harmony is a collaboration project between the University of Ulster, University College London, the Universidade Federal de Santa Maria in Brazil, and Fast Data Science Ltd in London.

You can read more at https://harmonydata.ac.uk.

There is a live demo at: https://harmonydata.ac.uk/app

Screenshot

How does Harmony work in layman's terms?

Harmony compares questions from different instruments by converting them to a vector representation and calculating their similarity. You can read more at https://harmonydata.ac.uk/how-does-harmony-work/
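To make the idea concrete, here is a minimal sketch of comparing questions once they have been turned into vectors. It assumes cosine similarity as the measure and uses made-up three-dimensional vectors; the real vectors come from a sentence transformer model and have many more dimensions, and the function name `cosine_similarity` is only illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings", invented for illustration only.
toy_vectors = {
    "Feeling nervous, anxious, or on edge": [0.9, 0.1, 0.0],
    "Sentir-se nervoso/a, ansioso/a ou muito tenso/a": [0.8, 0.2, 0.1],
    "An unrelated question": [0.0, 0.1, 0.9],
}

reference_text = "Feeling nervous, anxious, or on edge"
for text, vector in toy_vectors.items():
    score = cosine_similarity(toy_vectors[reference_text], vector)
    print(f"{score:.3f}  {text}")
```

With these toy numbers the English and Portuguese versions of the same item score much higher against each other than against the unrelated question, which is the behaviour Harmony relies on.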

Do you want to run Harmony in your browser locally from a pre-built Docker container?

Download and install Docker:

Open a Terminal and run

docker run -p 8000:8000 -p 3000:3000 harmonydata/harmonylocal

Then go to http://localhost:3000 in your browser.

Docker images

If you are a Docker user, you can run Harmony from a pre-built Docker image.

Getting started: running and developing the API on your computer using Docker

A prerequisite is Tika, which is a PDF parsing library. This must run as a server in Java. We use the Tika Python bindings.

First, clone the API and make sure to clone with --recurse-submodules.

git clone --recurse-submodules git@github.com:harmonydata/harmonyapi.git

The Harmony API includes the harmony repo as a submodule.

Troubleshooting the submodules after git clone

After you have cloned the repository, if the folder inside called harmony is empty, or at any point you get an error like the below, please check you have cloned with --recurse-submodules as below:

./images/error_no_submodules.png

git clone --recurse-submodules https://github.com/harmonydata/harmonyapi.git

1. Run Tika

Download and install Java if you don't have it already. Then download Apache Tika from https://tika.apache.org/download.html and run it on your computer:

java -jar tika-server-standard-2.3.0.jar

2. Build Docker container

docker build -t harmonyapi .

3. Run Docker container

Don't forget to map a host port (8080 here) to port 80 of the container:

docker run -p 8080:80 harmonyapi

You should now be able to visit http://localhost:8080/docs and view the interactive API documentation.

If you want to run the Harmony API container and execute Bash commands inside it, you can run:

docker run -it harmonyapi bash
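Once the container from step 3 is running, you may want to check from a script that the API is ready. The sketch below polls the /health-check endpoint listed in the API reference, assuming the API is reachable on host port 8080 as in the command above. The function name `wait_for_api` and its parameters are hypothetical, and only the HTTP status code is checked.

```python
import time
import urllib.error
import urllib.request


def wait_for_api(base_url="http://localhost:8080", attempts=10, delay=1.0,
                 opener=urllib.request.urlopen):
    """Poll GET /health-check until it answers with HTTP 200.

    `opener` can be swapped out so the function can be tried without a server.
    """
    url = base_url.rstrip("/") + "/health-check"
    for _ in range(attempts):
        try:
            with opener(url) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, ConnectionError):
            pass  # the container may still be starting up
        time.sleep(delay)
    return False
```

Calling `wait_for_api()` returns True as soon as the endpoint responds, or False after the given number of attempts.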

Architecture of deployed Harmony API server

On-premises deployment

Harmony team only: see details of how Harmony is deployed on-premises here:

https://github.com/harmonydata/harmony_deployment_ulster_private

Alternative Docker Compose deployment

You can deploy Harmony with Docker Compose - see docker_compose.yml.

MHC data

When the app runs, it reads the environment variable HARMONY_DATA_PATH, which is set to /data on the production server; that is where you need to put any data files. On your local machine you can set it to any folder you like, e.g. /home/xxx/data/, and the app will find the files you put there.

The app looks for the following three files in the /data folder, although it will run without them. Together they are a cached version of the Mental Health Catalogue:

mhc_all_metadatas.json
mhc_embeddings.npy
mhc_questions.json
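If you are unsure whether the files are in the right place, a quick check like the one below can help. It is only a sketch: the function name `find_mhc_files` is hypothetical, and the fallback to the home directory follows the default described for HARMONY_DATA_PATH in this README.

```python
import os
from pathlib import Path

MHC_FILES = ("mhc_all_metadatas.json", "mhc_embeddings.npy", "mhc_questions.json")


def find_mhc_files(data_path=None):
    """Map each expected MHC cache file name to True if it exists, else False."""
    if data_path is None:
        data_path = os.environ.get("HARMONY_DATA_PATH", str(Path.home()))
    folder = Path(data_path)
    return {name: (folder / name).is_file() for name in MHC_FILES}


for name, present in find_mhc_files().items():
    print(f"{name}: {'found' if present else 'missing'}")
```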

When Harmony is deployed to Azure, there is an Azure blob storage which is mounted under /data.

The data files can be found here: https://github.com/harmonydata/harmony_deployment_ulster_private

Environment variables

There are also other environment variables which tell the API where to look to load the sentence transformer or contact Tika:

 environment:
   HARMONY_DATA_PATH: /data
   HARMONY_SENTENCE_TRANSFORMER_PATH: /data/paraphrase-multilingual-MiniLM-L12-v2
   OPENAI_API_KEY:
   GOOGLE_APPLICATION_CREDENTIALS:
   AZURE_OPENAI_API_KEY:
   AZURE_OPENAI_ENDPOINT:
   TIKA_SERVER_ENDPOINT: http://tika:9998

HARMONY_DATA_PATH - The path used to store files such as the cache files. Defaults to the home directory.

OPENAI_API_KEY - The OpenAI API key.

GOOGLE_APPLICATION_CREDENTIALS - To make use of Google's Vertex AI, fill in this environment variable. This should be the content of your service account file, so a JSON object is expected as the value for the environment. Make sure to give the service account the required Vertex AI role.

AZURE_OPENAI_API_KEY - The Azure OpenAI API key.

AZURE_OPENAI_ENDPOINT - The Azure OpenAI endpoint.

TIKA_SERVER_ENDPOINT - This is the endpoint where Tika is served from.

AZURE_STORAGE_URL - The Azure Blob storage URL. This is required for downloading the catalogue data.

Setting these environment variables tells Harmony where to look for dependencies and data, but it will also work without them (for example, it will download the sentence transformer from HuggingFace Hub).
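As an illustration of how these variables might be read, here is a small sketch. It is not the API's actual start-up code: the function name `load_settings` and the dictionary keys are hypothetical. It shows the one detail that is easy to get wrong, namely that GOOGLE_APPLICATION_CREDENTIALS holds the JSON content of the service account file rather than a path to it.

```python
import json
import os


def load_settings(environ=os.environ):
    """Collect Harmony-related settings from a mapping of environment variables."""
    settings = {
        # Falls back to the home directory, as described above.
        "data_path": environ.get("HARMONY_DATA_PATH", os.path.expanduser("~")),
        "sentence_transformer_path": environ.get("HARMONY_SENTENCE_TRANSFORMER_PATH"),
        "tika_endpoint": environ.get("TIKA_SERVER_ENDPOINT"),
        "google_credentials": None,
    }
    raw_credentials = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if raw_credentials:
        # The value is expected to be a JSON object, so parse it to fail early if it is not.
        settings["google_credentials"] = json.loads(raw_credentials)
    return settings
```

Unset variables come back as None, which mirrors the fact that Harmony can run without them.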

The deployed Harmony uses an Azure Function to run spaCy, available in the repository here: https://github.com/harmonydata/spacyfunctionapp

So to run locally with Docker Compose you can do:

docker compose up

Providing environment variables in .env file

If you are working with external third-party services in the API, you may find it convenient to make a .env file in the base folder of the project. You can connect an IDE such as PyCharm to use this .env file. The file is ignored by .gitignore, so you don't need to worry about accidentally committing your credentials to the repo.

harmonyenv.png

Example content of the .env file:

GOOGLE_APPLICATION_CREDENTIALS='{   "type": "service_account",  ... }'
AZURE_OPENAI_API_KEY=f46axxxxxxxxxxxxxxxxxxxxxxxxxxxd
AZURE_OPENAI_ENDPOINT=https://xxxxxxxxx.openai.azure.com/
OPENAI_API_TYPE=azure
OPENAI_API_VERSION=2023-12-01-preview
OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxX

When deploying, you can use these environment variables in your Docker run command, e.g.

docker run -d -p 80:80 -p 3000:3000 -e GOOGLE_APPLICATION_CREDENTIALS=xxx -e AZURE_OPENAI_API_KEY=xxxx -e "HARMONY_DATA_PATH=/data" -v /home/thomaswood/data:/data harmonydata/harmonyapi:[DOCKER_TAG_HERE]

Harmony FastAPI API implementation

If you are not running with Docker, you can run the individual components of the Harmony API separately.

Architecture of the Harmony implementation on Azure with FastAPI:

Screenshot

Getting started with the Harmony Python library

Installing Python library

You can install from PyPI.

pip install harmonydata

You can read the user guide at ./harmony_pypi_package/README.md.

Troubleshooting running Harmony API on port e.g. 8000 on local machine

By default, Harmony API runs on port 8000 (see screenshot below)

./images/port.png

If you get errors running the API on this port, the likely causes are:

  1. a different program is already using port 8000
  2. you are trying to run on a privileged port, e.g. port 80, which your computer does not give ordinary programs permission to use

On Windows in particular, you may need to grant a Python program permission before it can use any port.
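To tell the two cases above apart, you can try to bind the port yourself before starting the API. This is a minimal sketch using Python's standard `socket` module; the function name `check_port` and the three return strings are hypothetical, and any bind error other than a permission error is reported as "in use" because that is the most common cause.

```python
import socket


def check_port(port, host="127.0.0.1"):
    """Try to bind a TCP socket and report 'free', 'in use' or 'forbidden'."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except PermissionError:
            return "forbidden"  # e.g. a privileged port such as 80
        except OSError:
            return "in use"  # most commonly: another program already holds the port
    return "free"


print("Port 8000 is", check_port(8000))
```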

Calling the Harmony API

Parsing a raw file into an Instrument

If you want to read in a raw (unstructured) PDF or Excel file, you can do this via a POST request to the REST API. This will convert the file into an Instrument object in JSON.

curl -X 'POST' \
  'https://api.harmonydata.ac.uk/text/parse' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '[
  {
    "file_id": "d39f31718513413fbfc620c6b6135d0c",
    "file_name": "GAD-7.pdf",
    "file_type": "pdf",
    "content": "data:application/pdf;base64,"
  }
]'

curl -X 'POST' \
  'https://api.harmonydata.ac.uk/text/parse' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '[
  {
    "file_id": "1d66bce4b80c4b0eaefe33f00cddedef",
    "file_name": "GAD-7.xlsx",
    "file_type": "xlsx",
    "content": "data:application/vnd.openxmlformats-officedocument.spreadsheetml.sheet;base64,UEsDBBQAAAAIAGmwhFZGWsEMggAAALEAAAAQAAAAZG9jUHJvcHMvYXBwLnhtbE2OTQvCMBBE/0rp3W5V8CAxINSj4Ml7SDc2kGRDdoX8fFPBj9s83jCMuhXKWMQjdzWGxKd+EclHALYLRsND06kZRyUaaVgeQM55ixPZZ8QksBvHA2AVTDPOm/wd7LU65xy8NeIp6au3hZicdJdqMSj4l2vzjoXXvB+2b/lhBb+T+gVQSwMEFAAAAAgAabCEVu3qrybuAAAAKwIAABEAAABkb2NQcm9wcy9jb3JlLnhtbM2Sz0rEMBCHX0VybydpRSR0e1E8KQguKN5CMrsbbP6QjLT79qZ1t4voAwi5ZOaXb76BdDpKHRI+pxAxkcV8NbnBZ6njhh2IogTI+oBO5bokfGnuQnKKyjXtISr9ofYIDec34JCUUaRgBlZxJbK+M1rqhIpCOuGNXvHxMw0LzGjAAR16yiBqAayfJ8bjNHRwAcwwwuTydwHNSlyqf2KXDrBTcsp2TY3jWI/tkis7CHh7enxZ1q2sz6S8xvIqW0nHiBt2nvza3t1vH1jf8Kat+HU520ZI3kpx+z67/vC7CLtg7M7+Y+OzYN/Br3/RfwFQSwMEFAAAAAgAabCEVplcnCMQBgAAnCcAABMAAAB4bC90aGVtZS90aGVtZTEueG1s7Vpbc9o4FH7vr9B4Z/ZtC8Y2gba0E3Npdtu0mYTtTh+FEViNbHlkkYR/v0c2EMuWDe2STbqbPAQs6fvORUfn6Dh58+4uYuiGiJTyeGDZL9vWu7cv3uBXMiQRQTAZp6/wwAqlTF61WmkAwzh9yRMSw9yCiwhLeBTL1lzgWxovI9bqtNvdVoRpbKEYR2RgfV4saEDQVFFab18gtOUfM/gVy1SNZaMBE1dBJrmItPL5bMX82t4+Zc/pOh0ygW4wG1ggf85vp+ROWojhVMLEwGpnP1Zrx9HSSICCyX2UBbpJ9qPTFQgyDTs6nVjOdnz2xO2fjMradDRtGuDj8Xg4tsvSi3AcBOBRu57CnfRsv6RBCbSjadBk2PbarpGmqo1TT9P3fd/rm2icCo1bT9Nrd93TjonGrdB4Db7xT4fDronGq9B062kmJ/2ua6TpFmhCRuPrehIVteVA0yAAWHB21szSA5ZeKfp1lBrZHbvdQVzwWO45iRH+xsUE1mnSGZY0RnKdkAUOADfE0UxQfK9BtorgwpLSXJDWzym1UBoImsiB9UeCIcXcr/31l7vJpDN6nX06zmuUf2mrAaftu5vPk/xz6OSfp5PXTULOcLwsCfH7I1thhyduOxNyOhxnQnzP9vaRpSUyz+/5CutOPGcfVpawXc/P5J6MciO73fZYffZPR24j16nAsyLXlEYkRZ/ILbrkETi1SQ0yEz8InYaYalAcAqQJMZahhvi0xqwR4BN9t74IyN+NiPerb5o9V6FYSdqE+BBGGuKcc+Zz0Wz7B6VG0fZVvNyjl1gVAZcY3zSqNSzF1niVwPGtnDwdExLNlAsGQYaXJCYSqTl+TUgT/iul2v6c00DwlC8k+kqRj2mzI6d0Js3oMxrBRq8bdYdo0jx6/gX5nDUKHJEbHQJnG7NGIYRpu/AerySOmq3CEStCPmIZNhpytRaBtnGphGBaEsbReE7StBH8Waw1kz5gyOzNkXXO1pEOEZJeN0I+Ys6LkBG/HoY4SprtonFYBP2eXsNJweiCy2b9uH6G1TNsLI73R9QXSuQPJqc/6TI0B6OaWQm9hFZqn6qHND6oHjIKBfG5Hj7lengKN5bGvFCugnsB/9HaN8Kr+ILAOX8ufc+l77n0PaHStzcjfWfB04tb3kZuW8T7rjHa1zQuKGNXcs3Ix1SvkynYOZ/A7P1oPp7x7frZJISvmlktIxaQS4GzQSS4/IvK8CrECe
hkWyUJy1TTZTeKEp5CG27pU/VKldflr7kouDxb5OmvoXQ+LM/5PF/ntM0LM0O3ckvqtpS+tSY4SvSxzHBOHssMO2c8kh22d6AdNfv2XXbkI6UwU5dDuBpCvgNtup3cOjiemJG5CtNSkG/D+enFeBriOdkEuX2YV23n2NHR++fBUbCj7zyWHceI8qIh7qGGmM/DQ4d5e1+YZ5XGUDQUbWysJCxGt2C41/EsFOBkYC2gB4OvUQLyUlVgMVvGAyuQonxMjEXocOeXXF/j0ZLj26ZltW6vKXcZbSJSOcJpmBNnq8reZbHBVR3PVVvysL5qPbQVTs/+Wa3InwwRThYLEkhjlBemSqLzGVO+5ytJxFU4v0UzthKXGLzj5sdxTlO4Ena2DwIyubs5qXplMWem8t8tDAksW4hZEuJNXe3V55ucrnoidvqXd8Fg8v1wyUcP5TvnX/RdQ65+9t3j+m6TO0hMnHnFEQF0RQIjlRwGFhcy5FDukpAGEwHNlMlE8AKCZKYcgJj6C73yDLkpFc6tPjl/RSyDhk5e0iUSFIqwDAUhF3Lj7++TaneM1/osgW2EVDJk1RfKQ4nBPTNyQ9hUJfOu2iYLhdviVM27Gr4mYEvDem6dLSf/217UPbQXPUbzo5ngHrOHc5t6uMJFrP9Y1h75Mt85cNs63gNe5hMsQ6R+wX2KioARq2K+uq9P+SWcO7R78YEgm/zW26T23eAMfNSrWqVkKxE/Swd8H5IGY4xb9DRfjxRiraaxrcbaMQx5gFjzDKFmON+HRZoaM9WLrDmNCm9B1UDlP9vUDWj2DTQckQVeMZm2NqPkTgo83P7vDbDCxI7h7Yu/AVBLAwQUAAAACABpsIRWZJCgEIMBAADfAgAAGAAAAHhsL3dvcmtzaGVldHMvc2hlZXQxLnhtbH1STU/cMBD9K5bPFV52Ba1QEolSIXpotQW1PTvJJLFwPOl4wsK/70xgoz2UHizPl9+beePigPSYBwA2z2NMubQD83TlXG4GGH0+wwmSZDqk0bO41Ls8Efh2eTRGt91sLt3oQ7JVscT2VBU4cwwJ9mTyPI6eXj5DxENpz+0xcB/6gTXgqmLyPTwA/5z2JJ5bUdowQsoBkyHoSnt9fnW90/ql4FeAQz6xjU5SIz6q87Ut7UYbgggNK4KX6wluIEYFkjb+vGHalVIfntpH9Ntldpml9hluMP4OLQ+l/WRNC52fI9/j4Q7e5rlYG/zi2VcF4cGQzlkVjRoL9yKEVIekKj0wSTYIHVc/Zsjab+FYWtGYa+QIygq1XaG274DcAkioNwnoCef8wfj0HBYDyYgW0PbwH4LdSrB7h+A7sqlBKXwdwTCazDgpeoOJCaMug16k4F807kQeXf03T31I2UTohG1z9vHCGnqV89UR7EWxGplxXMxBfiCQFki+Q+Sjo9tc/3T1F1BLAwQUAAAACABpsIRW4O5QR6kCAAAWCwAADQAAAHhsL3N0eWxlcy54bWzdVtuK2zAQ/RXhD6iTmJq4xIE2ECi0ZWH3oa9KLMcCWXJlOST79Z2RHOeymqXtYxM2Hs3RmTOaGeFd9e6sxHMjhGOnVum+TBrnuk9p2u8b0fL+g+mEBqQ2tuUOlvaQ9p0VvOqR1Kp0MZvlaculTtYrPbTb1vVsbwbtymSWpOtVbfTVs0iCA7byVrAjV2Wy4UrurPR7eSvVObgX6NgbZSxzkIookzl6+tcAz8MKsxzjtFIbi840KITf3bj9BvCPHjZIpe4zA8d61XHnhNVbWHiOd76B2Gi/nDtI7WD5eb74mFwJ/gEiO2MrYe9kgmu9UqJ2QLDy0ODTmS5F0DnTglFJfjCa+xwujFsm860rE9dA6S9hHp0Q89EVBB69k8RoQOZ7odQz7vpZT+nPIf1TzUKfv1bYYobVvJhw5tEMYcIC499GC7Fvwi7+KSzr5NG4LwOcR/v1r8E48WRFLU9+faonfSr6nIgOft516vxZyYNuRTj7HwuuV/zCY42x8hXUcA
r34BA2YUdhndyjBxrky3OqxxpN5fHFuiv85GV4ecrkB95JdVVlu0EqJ/W4amRVCf2m/hDe8R1c+rv4sL8SNR+Ue5nAMrna30Ulh7aYdj1hJcZdV/sbzuA8n24uaEldiZOoNuPSHnbeZGCA6vjx8/uAbP0njlCcgMURxCgdKgOKE1iUzv90niV5noBRuS2jyJLkLElOYMWQjf9SOnFOAZ/4SYsiy/KcquhmE81gQ9Utz/EvHo3KDRmUDir9Xa3pbtMT8v4cUD19b0Kok9KTSJ2UrjUi8bohoyji3aZ0kEF1gZod1I/r4EzFOVmGXaVyo24wjRQFheAsxmc0z4nq5PiN94e6JVlWFHEEsXgGWUYheBtphMoAc6CQLPPvwYf3UXp5T6XX/4TXvwFQSwMEFAAAAAgAabCEVpeKuxzAAAAAEwIAAAsAAABfcmVscy8ucmVsc52SuW7DMAxAf8XQnjAH0CGIM2XxFgT5AVaiD9gSBYpFnb+v2qVxkAsZeT08EtweaUDtOKS2i6kY/RBSaVrVuAFItiWPac6RQq7ULB41h9JARNtjQ7BaLD5ALhlmt71kFqdzpFeIXNedpT3bL09Bb4CvOkxxQmlISzMO8M3SfzL38ww1ReVKI5VbGnjT5f524EnRoSJYFppFydOiHaV/Hcf2kNPpr2MitHpb6PlxaFQKjtxjJYxxYrT+NYLJD+x+AFBLAwQUAAAACABpsIRWGrobqzABAAAjAgAADwAAAHhsL3dvcmtib29rLnhtbI1R0UrDQBD8lXAfYFLRgqXpi0UtiBYrfb8km2bp3W3Y27Tar3eTECz44tPezizDzNzyTHwsiI7Jl3ch5qYRaRdpGssGvI031EJQpib2VnTlQxpbBlvFBkC8S2+zbJ56i8GslpPWltPrhQRKQQoK9sAe4Rx/+X5NThixQIfynZvh7cAkHgN6vECVm8wksaHzCzFeKIh1u5LJudzMRmIPLFj+gXe9yU9bxAERW3xYNZKbeaaCNXKU4WLQt+rxBHo8bp3QEzoBXluBZ6auxXDoZTRFehVj6GGaY4kL/k+NVNdYwprKzkOQsUcG1xsMscE2miRYD7kZLA6BdG6qMZyoq6uqeIFK8KYa/U2mKqgxQPWmOlFxLajcctKPQef27n72oEV0zj0q9h5eyVZTxul/Vj9QSwMEFAAAAAgAabCEViQem6KtAAAA+AEAABoAAAB4bC9fcmVscy93b3JrYm9vay54bWwucmVsc7WRPQ6DMAyFrxLlADVQqUMFTF1YKy4QBfMjEhLFrgq3L4UBkDp0YbKeLX/vyU6faBR3bqC28yRGawbKZMvs7wCkW7SKLs7jME9qF6ziWYYGvNK9ahCSKLpB2DNknu6Zopw8/kN0dd1pfDj9sjjwDzC8XeipRWQpShUa5EzCaLY2wVLiy0yWoqgyGYoqlnBaIOLJIG1pVn2wT06053kXN/dFrs3jCa7fDHB4dP4BUEsDBBQAAAAIAGmwhFZlkHmSGQEAAM8DAAATAAAAW0NvbnRlbnRfVHlwZXNdLnhtbK2TTU7DMBCFrxJlWyUuLFigphtgC11wAWNPGqv+k2da0tszTtpKoBIVhU2seN68z56XrN6PEbDonfXYlB1RfBQCVQdOYh0ieK60ITlJ/Jq2Ikq1k1sQ98vlg1DBE3iqKHuU69UztHJvqXjpeRtN8E2ZwGJZPI3CzGpKGaM1ShLXxcHrH5TqRKi5c9BgZyIuWFCKq4Rc+R1w6ns7QEpGQ7GRiV6lY5XorUA6WsB62uLKGUPbGgU6qL3jlhpjAqmxAyBn69F0MU0mnjCMz7vZ/MFmCsjKTQoRObEEf8edI8ndVWQjSGSmr3ghsvXs+0FOW4O+kc3j/QxpN+SBYljmz/h7xhf/G87xEcLuvz+xvNZOGn/mi+E/Xn8BUEsBAhQDFAAAAAgAabCEVkZawQyCAAAAsQAAABAAAAAAAAAAAAAAAIABAAAAAGRvY1Byb3BzL2FwcC54bW
xQSwECFAMUAAAACABpsIRW7eqvJu4AAAArAgAAEQAAAAAAAAAAAAAAgAGwAAAAZG9jUHJvcHMvY29yZS54bWxQSwECFAMUAAAACABpsIRWmVycIxAGAACcJwAAEwAAAAAAAAAAAAAAgAHNAQAAeGwvdGhlbWUvdGhlbWUxLnhtbFBLAQIUAxQAAAAIAGmwhFZkkKAQgwEAAN8CAAAYAAAAAAAAAAAAAACAgQ4IAAB4bC93b3Jrc2hlZXRzL3NoZWV0MS54bWxQSwECFAMUAAAACABpsIRW4O5QR6kCAAAWCwAADQAAAAAAAAAAAAAAgAHHCQAAeGwvc3R5bGVzLnhtbFBLAQIUAxQAAAAIAGmwhFaXirscwAAAABMCAAALAAAAAAAAAAAAAACAAZsMAABfcmVscy8ucmVsc1BLAQIUAxQAAAAIAGmwhFYauhurMAEAACMCAAAPAAAAAAAAAAAAAACAAYQNAAB4bC93b3JrYm9vay54bWxQSwECFAMUAAAACABpsIRWJB6boq0AAAD4AQAAGgAAAAAAAAAAAAAAgAHhDgAAeGwvX3JlbHMvd29ya2Jvb2sueG1sLnJlbHNQSwECFAMUAAAACABpsIRWZZB5khkBAADPAwAAEwAAAAAAAAAAAAAAgAHGDwAAW0NvbnRlbnRfVHlwZXNdLnhtbFBLBQYAAAAACQAJAD4CAAAQEQAAAAA="
  }
]'

Example response from the /parse endpoint:

[
  {
    "file_id": "fd60a9a64b1b4078a68f4bc06f20253c",
    "instrument_id": "7829ba96f48e4848abd97884911b6795",
    "instrument_name": "GAD-7 English",
    "file_name": "GAD-7.pdf",
    "file_type": "pdf",
    "file_section": "GAD-7 English",
    "language": "en",
    "study": "MCS",
    "sweep": "Sweep 1",
    "questions": [
      {
        "question_no": "1",
        "question_intro": "Over the last two weeks, how often have you been bothered by the following problems?",
        "question_text": "Feeling nervous, anxious, or on edge",
        "options": [
          "Not at all",
          "Several days",
          "More than half the days",
          "Nearly every day"
        ],
        "source_page": 0
      }
    ]
  }
]
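The same parse request can be prepared from Python. The sketch below builds the JSON body from a local file, following the shape of the curl examples above (a data URI with the MIME type and base64-encoded bytes). The helper names `build_parse_payload` and `post_json` are hypothetical, and using a random hex UUID for `file_id` is an assumption based on the example values.

```python
import base64
import json
import urllib.request
import uuid
from pathlib import Path

# MIME types taken from the two curl examples above.
MIME_TYPES = {
    "pdf": "application/pdf",
    "xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def build_parse_payload(path):
    """Build the JSON body for POST /text/parse from a local PDF or Excel file."""
    path = Path(path)
    file_type = path.suffix.lstrip(".").lower()
    if file_type not in MIME_TYPES:
        raise ValueError(f"Unsupported file type in this sketch: {file_type!r}")
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return [{
        "file_id": uuid.uuid4().hex,
        "file_name": path.name,
        "file_type": file_type,
        "content": f"data:{MIME_TYPES[file_type]};base64,{encoded}",
    }]


def post_json(url, payload):
    """Send `payload` as JSON with a POST request and decode the JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, `post_json("https://api.harmonydata.ac.uk/text/parse", build_parse_payload("GAD-7.pdf"))` would send the file and return the list of parsed instruments, assuming the file exists and the server is reachable.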

Matching instruments

You can request the similarities between instruments with a second POST request:

curl -X 'POST' \
  'https://api.harmonydata.ac.uk/text/match' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "instruments": [
    {
      "file_id": "fd60a9a64b1b4078a68f4bc06f20253c",
      "instrument_id": "7829ba96f48e4848abd97884911b6795",
      "instrument_name": "GAD-7 English",
      "file_name": "GAD-7 EN.pdf",
      "file_type": "pdf",
      "file_section": "GAD-7 English",
      "language": "en",
      "questions": [
        {
          "question_no": "1",
          "question_intro": "Over the last two weeks, how often have you been bothered by the following problems?",
          "question_text": "Feeling nervous, anxious, or on edge",
          "options": [
            "Not at all",
            "Several days",
            "More than half the days",
            "Nearly every day"
          ],
          "source_page": 0
        },
        {
          "question_no": "2",
          "question_intro": "Over the last two weeks, how often have you been bothered by the following problems?",
          "question_text": "Not being able to stop or control worrying",
          "options": [
            "Not at all",
            "Several days",
            "More than half the days",
            "Nearly every day"
          ],
          "source_page": 0
        }
      ]
    },
    {
      "file_id": "fd60a9a64b1b4078a68f4bc06f20253c",
      "instrument_id": "7829ba96f48e4848abd97884911b6795",
      "instrument_name": "GAD-7 Portuguese",
      "file_name": "GAD-7 PT.pdf",
      "file_type": "pdf",
      "file_section": "GAD-7 Portuguese",
      "language": "pt",
      "questions": [
        {
          "question_no": "1",
          "question_intro": "Durante as últimas 2 semanas, com que freqüência você foi incomodado/a pelos problemas abaixo?",
          "question_text": "Sentir-se nervoso/a, ansioso/a ou muito tenso/a",
          "options": [
            "Nenhuma vez",
            "Vários dias",
            "Mais da metade dos dias",
            "Quase todos os dias"
          ],
          "source_page": 0
        },
        {
          "question_no": "2",
          "question_intro": "Durante as últimas 2 semanas, com que freqüência você foi incomodado/a pelos problemas abaixo?",
          "question_text": " Não ser capaz de impedir ou de controlar as preocupações",
          "options": [
            "Nenhuma vez",
            "Vários dias",
            "Mais da metade dos dias",
            "Quase todos os dias"
          ],
          "source_page": 0
        }
      ]
    }
  ],
  "query": "anxiety",
  "parameters": {
    "framework": "huggingface",
    "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
  }
}'

Example response

The response contains a dictionary with three key-value pairs: questions (the questions matched, in order), matches (the matrix of matches between all items), and query_similarity (the degree of similarity to the query term).

{
  "questions": [
    ...
  ],
  "matches": [
    [
      1.0000001192092896,
      ...
      0.9999998807907104
    ]
  ],
  "query_similarity": [
    0.7244994640350342,
    ...
  ]
}
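A response of this shape can be post-processed to find, for each question, its closest counterpart in another instrument. The sketch below assumes that `matches[i][j]` is the similarity between `questions[i]` and `questions[j]`, that a higher value means more similar, and that each question carries an `instrument_id` as in the API reference. The function name, the instrument IDs and the scores in `toy_response` are made up for illustration.

```python
def best_cross_instrument_matches(response):
    """For each question, return (question, best match in another instrument, score)."""
    questions = response["questions"]
    matches = response["matches"]
    results = []
    for i, question in enumerate(questions):
        best_j, best_score = None, float("-inf")
        for j, other in enumerate(questions):
            if other["instrument_id"] == question["instrument_id"]:
                continue  # skip the question itself and its own instrument
            if matches[i][j] > best_score:
                best_j, best_score = j, matches[i][j]
        if best_j is not None:
            results.append(
                (question["question_text"], questions[best_j]["question_text"], best_score)
            )
    return results


# Toy response with invented IDs and similarity scores.
toy_response = {
    "questions": [
        {"instrument_id": "en", "question_text": "Feeling nervous, anxious, or on edge"},
        {"instrument_id": "en", "question_text": "Not being able to stop or control worrying"},
        {"instrument_id": "pt", "question_text": "Sentir-se nervoso/a, ansioso/a ou muito tenso/a"},
        {"instrument_id": "pt", "question_text": "Não ser capaz de impedir ou de controlar as preocupações"},
    ],
    "matches": [
        [1.0, 0.5, 0.9, 0.4],
        [0.5, 1.0, 0.3, 0.8],
        [0.9, 0.3, 1.0, 0.5],
        [0.4, 0.8, 0.5, 1.0],
    ],
}

for source, target, score in best_cross_instrument_matches(toy_response):
    print(f"{score:.2f}  {source}  <->  {target}")
```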

Alternative serverless deployment on AWS Lambda

This repository also contains code for an alternative serverless deployment on AWS Lambda. The deployment has been divided into four AWS Lambda functions, managed by Terraform. Please refer to folder serverless_deployment for details.

Screenshot

License

License: MIT License

Contact

[email protected]

Built With

Licences of Third Party Software

How do I cite Harmony?

If you would like to cite the tool alone, you can cite:

Wood, T.A., McElroy, E., Moltrecht, B., Ploubidis, G.B., Scopel Hoffmann, M., Harmony [Computer software], Version 1.0, accessed at https://harmonydata.ac.uk/app. Ulster University (2022)

A BibTeX entry for LaTeX users is

@unpublished{harmony,
    AUTHOR = {Wood, T.A. and McElroy, E. and Moltrecht, B. and Ploubidis, G.B. and Scopel Hoffmann, M.},
    TITLE  = {Harmony (Computer software), Version 1.0},
    YEAR   = {2022},
    Note   = {To appear},
    url = {https://harmonydata.ac.uk/app}
}

You can also cite the wider Harmony project, which is registered with the Open Science Framework:

McElroy, E., Moltrecht, B., Scopel Hoffmann, M., Wood, T. A., & Ploubidis, G. (2023, January 6). Harmony – A global platform for contextual harmonisation, translation and cooperation in mental health research. Retrieved from osf.io/bct6k

@misc{McElroy_Moltrecht_ScopelHoffmann_Wood_Ploubidis_2023,
  title={Harmony - A global platform for contextual harmonisation, translation and cooperation in mental health research},
  url={osf.io/bct6k},
  publisher={OSF},
  author={McElroy, Eoin and Moltrecht, Bettina and Scopel Hoffmann, Mauricio and Wood, Thomas A and Ploubidis, George},
  year={2023},
  month={Jan}
}

API Reference

Harmony API

API Version: 2.

Documentation for Harmony API.

Harmony is a tool using AI which allows you to compare items from questionnaires and identify similar content. You can try Harmony at harmonydata.ac.uk/app and you can read our blog at harmonydata.ac.uk/blog/.

CONTACT

NAME: Thomas Wood URL: https://fastdatascience.com

INDEX

  1. HEALTH CHECK
     • 1.1 GET /health-check
  2. INFO
     • 2.1 GET /info/version
  3. TEXT
     • 3.1 POST /text/parse
     • 3.2 POST /text/match
     • 3.3 POST /text/examples
     • 3.4 GET /text/cache

API

1. HEALTH CHECK

1.1 GET /health-check

Health Check

REQUEST

No request parameters

RESPONSE

STATUS CODE - 200: Successful Response
RESPONSE MODEL - application/json
undefined

2. INFO

2.1 GET /info/version

Show Version

REQUEST

No request parameters

RESPONSE

STATUS CODE - 200: Successful Response
RESPONSE MODEL - application/json
undefined

3. TEXT

3.1 POST /text/parse

Parse Instruments

Parses PDF, Excel, or text files into Instruments and identifies the language.

If the file is binary (Excel or PDF), you must supply each file with the content in MIME format and the bytes in base64 encoding, like the example RawFile in the schema.

If the file is plain text, supply the file content as a standard string.

REQUEST

REQUEST BODY - application/json
[{
Array of object:
file_id string Unique identifier for the file (UUID-4)
file_name string DEFAULT:Untitled file
The name of the input file
file_type* enum ALLOWED:pdf, xlsx, txt, docx
The file type (pdf, xlsx, txt)
content* string The raw file contents
text_content string The plain text content
tables [undefined]
}]

RESPONSE

STATUS CODE - 200: Successful Response
RESPONSE MODEL - application/json
[{
Array of object:
file_id string Unique identifier for the file (UUID-4)
instrument_id string Unique identifier for the instrument (UUID-4)
instrument_name string DEFAULT:Untitled instrument
Human-readable name of the instrument
file_name string DEFAULT:Untitled file
The name of the input file
file_type enum ALLOWED:pdf, xlsx, txt, docx
The file type (pdf, xlsx, txt)
file_section string The sub-section of the file, e.g. Excel tab
study string The study
sweep string The sweep
metadata {
Optional metadata about the instrument (URL, citation, DOI, copyright holder)
}
language enum DEFAULT:en
ALLOWED:de, el, en, es, fr, it, he, ja, ko, pt, ru,
uk, zh, ar, la, tr, af, ak, am, as, ay, az, be, bg,
bho, bm, bn, bs, ca, ceb, ckb, co, cs, cy, da, doi,
dv, ee, eo, et, eu, fa, fi, fil, fy, ga, gd, gl, gn,
gom, gu, ha, haw, hi, hmn, hr, ht, hu, hy, id, ig,
ilo, is, jv, ka, kk, km, kn, kri, ku, ky, lb, lg,
ln, lo, lt, lus, lv, mai, mg, mi, mk, ml, mn, mni-mtei,
mr, ms, mt, my, ne, nl, no, nso, ny, om, or,
pa, pl, ps, qu, ro, rw, sa, sd, si, sk, sl, sm, sn,
so, sq, sr, st, su, sv, sw, ta, te, tg, th, ti, tk,
tl, ts, tt, ug, ur, uz, vi, xh, yi, yo, zh-tw, zu,
yue
The ISO 639-2 (alpha-2) encoding of the instrument language
questions* [{
Array of object:
question_no string Number of the question
question_intro string Introductory text applying to the question
question_text* string Text of the question
options [string]
source_page integer DEFAULT: 0
The page of the PDF on which the question was located, zero-indexed
instrument_id string Unique identifier for the instrument (UUID-4)
instrument_name string Human readable name for the instrument
topics_auto [undefined]
nearest_match_from_mhc_auto {
Automatically identified nearest MHC match
}
}]
}]
STATUS CODE - 422: Validation Error
RESPONSE MODEL - application/json
{
detail [{
Array of object:
loc*
ANY OF
prop
string
prop
integer
msg* string
type* string
}]
}
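When a request body does not match the schema, the 422 response lists each problem in `detail`, with `loc` giving the path to the offending field as a mix of strings and integers. The sketch below turns such a body into readable lines. The function name `format_validation_errors` is hypothetical, and the values in `example_body` are illustrative rather than copied from a real response.

```python
def format_validation_errors(body):
    """Turn a 422 response body into one readable line per validation error."""
    lines = []
    for error in body.get("detail", []):
        # `loc` mixes strings (field names) and integers (array indices).
        location = " -> ".join(str(part) for part in error["loc"])
        lines.append(f"{location}: {error['msg']} ({error['type']})")
    return lines


# Illustrative body only; real messages and types depend on the server.
example_body = {
    "detail": [
        {"loc": ["body", 0, "file_type"], "msg": "field required", "type": "value_error.missing"}
    ]
}

for line in format_validation_errors(example_body):
    print(line)
```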

3.2 POST /text/match

Match

Match instruments

REQUEST

REQUEST BODY - application/json
{
instruments* [{
Array of object:
file_id string Unique identifier for the file (UUID-4)
instrument_id string Unique identifier for the instrument (UUID-4)
instrument_name string DEFAULT:Untitled instrument
Human-readable name of the instrument
file_name string DEFAULT:Untitled file
The name of the input file
file_type enum ALLOWED:pdf, xlsx, txt, docx
The file type (pdf, xlsx, txt)
file_section string The sub-section of the file, e.g. Excel tab
study string The study
sweep string The sweep
metadata {
Optional metadata about the instrument (URL, citation, DOI, copyright holder)
}
language enum DEFAULT:en
ALLOWED:de, el, en, es, fr, it, he, ja, ko, pt, ru, uk,
zh, ar, la, tr, af, ak, am, as, ay, az, be, bg, bho, bm,
bn, bs, ca, ceb, ckb, co, cs, cy, da, doi, dv, ee, eo,
et, eu, fa, fi, fil, fy, ga, gd, gl, gn, gom, gu, ha,
haw, hi, hmn, hr, ht, hu, hy, id, ig, ilo, is, jv, ka,
kk, km, kn, kri, ku, ky, lb, lg, ln, lo, lt, lus, lv,
mai, mg, mi, mk, ml, mn, mni-mtei, mr, ms, mt, my, ne,
nl, no, nso, ny, om, or, pa, pl, ps, qu, ro, rw, sa, sd,
si, sk, sl, sm, sn, so, sq, sr, st, su, sv, sw, ta, te,
tg, th, ti, tk, tl, ts, tt, ug, ur, uz, vi, xh, yi, yo,
zh-tw, zu, yue
The ISO 639-2 (alpha-2) encoding of the instrument language
questions* [{
Array of object:
question_no string Number of the question
question_intro string Introductory text applying to the question
question_text* string Text of the question
options [string]
source_page integer DEFAULT: 0
The page of the PDF on which the question was located, zero-indexed
instrument_id string Unique identifier for the instrument (UUID-4)
instrument_name string Human readable name for the instrument
topics_auto [undefined]
nearest_match_from_mhc_auto {
Automatically identified nearest MHC match
}
}]
}]
query string Search term
parameters {
Parameters on how to match
framework string DEFAULT:huggingface
The framework to use for matching
model string DEFAULT:sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
Model
}
}

RESPONSE

STATUS CODE - 200: Successful Response
RESPONSE MODEL - application/json
{
questions* [{
Array of object:
question_no string Number of the question
question_intro string Introductory text applying to the question
question_text* string Text of the question
options [string]
source_page integer DEFAULT: 0
The page of the PDF on which the question was located, zero-indexed
instrument_id string Unique identifier for the instrument (UUID-4)
instrument_name string Human readable name for the instrument
topics_auto [undefined]
nearest_match_from_mhc_auto {
Automatically identified nearest MHC match
}
}]
matches* [{
Array of object:
}]
query_similarity [undefined]
}
STATUS CODE - 422: Validation Error
RESPONSE MODEL - application/json
{
detail [{
Array of object:
loc*
ANY OF
prop
string
prop
integer
msg* string
type* string
}]
}

3.3 POST /text/examples

Get Example Instruments

Get example instruments

REQUEST

No request parameters

RESPONSE

STATUS CODE - 200: Successful Response
RESPONSE MODEL - application/json
[{
Array of object:
file_id string Unique identifier for the file (UUID-4)
instrument_id string Unique identifier for the instrument (UUID-4)
instrument_name string DEFAULT:Untitled instrument
Human-readable name of the instrument
file_name string DEFAULT:Untitled file
The name of the input file
file_type enum ALLOWED:pdf, xlsx, txt, docx
The file type (pdf, xlsx, txt)
file_section string The sub-section of the file, e.g. Excel tab
study string The study
sweep string The sweep
metadata {
Optional metadata about the instrument (URL, citation, DOI, copyright holder)
}
language enum DEFAULT:en
ALLOWED:de, el, en, es, fr, it, he, ja, ko, pt, ru, uk, zh,
ar, la, tr, af, ak, am, as, ay, az, be, bg, bho, bm, bn,
bs, ca, ceb, ckb, co, cs, cy, da, doi, dv, ee, eo, et,
eu, fa, fi, fil, fy, ga, gd, gl, gn, gom, gu, ha, haw,
hi, hmn, hr, ht, hu, hy, id, ig, ilo, is, jv, ka, kk, km,
kn, kri, ku, ky, lb, lg, ln, lo, lt, lus, lv, mai, mg,
mi, mk, ml, mn, mni-mtei, mr, ms, mt, my, ne, nl, no,
nso, ny, om, or, pa, pl, ps, qu, ro, rw, sa, sd, si, sk,
sl, sm, sn, so, sq, sr, st, su, sv, sw, ta, te, tg, th,
ti, tk, tl, ts, tt, ug, ur, uz, vi, xh, yi, yo, zh-tw,
zu, yue
The ISO 639-2 (alpha-2) encoding of the instrument language
questions* [{
Array of object:
question_no string Number of the question
question_intro string Introductory text applying to the question
question_text* string Text of the question
options [string]
source_page integer DEFAULT: 0
The page of the PDF on which the question was located, zero-indexed
instrument_id string Unique identifier for the instrument (UUID-4)
instrument_name string Human readable name for the instrument
topics_auto [undefined]
nearest_match_from_mhc_auto {
Automatically identified nearest MHC match
}
}]
}]

3.4 GET /text/cache

Get Cache

Get all items in cache

REQUEST

No request parameters

RESPONSE

STATUS CODE - 200: Successful Response
RESPONSE MODEL - application/json
{
instruments* [{
Array of object:
file_id string Unique identifier for the file (UUID-4)
instrument_id string Unique identifier for the instrument (UUID-4)
instrument_name string DEFAULT:Untitled instrument
Human-readable name of the instrument
file_name string DEFAULT:Untitled file
The name of the input file
file_type enum ALLOWED:pdf, xlsx, txt, docx
The file type (pdf, xlsx, txt)
file_section string The sub-section of the file, e.g. Excel tab
study string The study
sweep string The sweep
metadata {
Optional metadata about the instrument (URL, citation, DOI, copyright holder)
}
language enum DEFAULT:en
ALLOWED:de, el, en, es, fr, it, he, ja, ko, pt, ru, uk, zh,
ar, la, tr, af, ak, am, as, ay, az, be, bg, bho, bm, bn,
bs, ca, ceb, ckb, co, cs, cy, da, doi, dv, ee, eo, et,
eu, fa, fi, fil, fy, ga, gd, gl, gn, gom, gu, ha, haw,
hi, hmn, hr, ht, hu, hy, id, ig, ilo, is, jv, ka, kk, km,
kn, kri, ku, ky, lb, lg, ln, lo, lt, lus, lv, mai, mg,
mi, mk, ml, mn, mni-mtei, mr, ms, mt, my, ne, nl, no,
nso, ny, om, or, pa, pl, ps, qu, ro, rw, sa, sd, si, sk,
sl, sm, sn, so, sq, sr, st, su, sv, sw, ta, te, tg, th,
ti, tk, tl, ts, tt, ug, ur, uz, vi, xh, yi, yo, zh-tw,
zu, yue
The ISO 639-2 (alpha-2) encoding of the instrument language
questions* [{
Array of object:
question_no string Number of the question
question_intro string Introductory text applying to the question
question_text* string Text of the question
options [string]
source_page integer DEFAULT: 0
The page of the PDF on which the question was located, zero-indexed
instrument_id string Unique identifier for the instrument (UUID-4)
instrument_name string Human readable name for the instrument
topics_auto [undefined]
nearest_match_from_mhc_auto {
Automatically identified nearest MHC match
}
}]
}]
vectors* [{
Array of object:
}]
}

📜 How do I cite Harmony?

You can cite our validation paper:

McElroy, Wood, Bond, Mulvenna, Shevlin, Ploubidis, Scopel Hoffmann, Moltrecht, Using natural language processing to facilitate the harmonisation of mental health questionnaires: a validation study using real-world data. BMC Psychiatry 24, 530 (2024), https://doi.org/10.1186/s12888-024-05954-2

A BibTeX entry for LaTeX users is

@article{mcelroy2024using,
  title={Using natural language processing to facilitate the harmonisation of mental health questionnaires: a validation study using real-world data},
  author={McElroy, Eoin and Wood, Thomas and Bond, Raymond and Mulvenna, Maurice and Shevlin, Mark and Ploubidis, George B and Hoffmann, Mauricio Scopel and Moltrecht, Bettina},
  journal={BMC psychiatry},
  volume={24},
  number={1},
  pages={530},
  year={2024},
  publisher={Springer}
}