Skip to content

Latest commit

 

History

History
265 lines (171 loc) · 14.5 KB

README.md

File metadata and controls

265 lines (171 loc) · 14.5 KB

READ-IT backend

Public interface for READ-IT

This is a server side web application based on Django, Django REST framework (DRF) and RDF. Its primary purpose is to provide a JSON API with authentication and authorization, in order to support a separate frontend application.

Before you start

You need to install the following software:

  • PostgreSQL >= 9.3, client, server and C libraries
  • Python >= 3.8
  • virtualenv
  • Apache Jena Fuseki (see notes below) (requires Java)
    • Elasticsearch 8 (see notes below), (requires Java)
  • RabbitMQ or other message broker and Celery (see notes below)
  • WSGI-compatible webserver (deployment only)
  • Visual C++ for Python (Windows only)

Notes for setting up Blazegraph

The development settings included with this application assume that you have a Blazegraph server running on port 9999 (the default) and the namespace readit is created. The following steps suffice to make this true.

Follow the Blazegraph quick start guide to download and start the database server and a foreground process. While the server is running, you can access its web interface at http://localhost:9999. This lets you upload and download data, try out queries and review statistics about the dataset. The server can be stopped by typing ctrl-c.

Setting up Blazegraph namespaces

Visit the web interface, navigate to the NAMESPACES tab. Use the create namespace form to create a new namespace. Choose readit as a name, and set the mode to quads. All other checkboxes should be disabled. A popup is shown with additional settings. Leave these at their default values and choose Create. The created namespace should now appear in the list of namespaces. Choose use to use the readit namespace when operating the web interface

In order to support the unittests, visit the Blazegraph web interface and create an additional namespace by the name readit-test.

If you are new to Blazegraph but not to READ-IT, i.e., you have previously deployed READ-IT version 0.4.0 or older, or done local development work on any commit that did not descend from 0063b21, then you should also read the following section about migrating your triples from the rdflib-django store to Blazegraph.

Migrating triples from rdflib-django to Blazegraph

If you are setting up the READ-IT backend anew, you can skip this section.

To copy pre-existing triples from the rdflib-django store to Blazegraph, only a few commands are needed. Ensure that Blazegraph is running and that your virtualenv is activated before you start.

First, open the interactive Django shell, for example with the following command.

$ python manage.py shell

In the interactive console, just two lines will do the trick:

>>> from scripts.move_to_sparqlstore import move
>>> move()

Setting up Elasticsearch

Download Elasticsearch 8 from the Elastic website. Optionally, also download Kibana (for easier index management). Unzip to a location of your choice. Navigate to the location of Elasticsearch and start with bin/elasticsearch (requires JAVA). This will start up the Elasticsearch server at localhost:9200. If you wish to use another port, you can set this in the Elasticsearch settings (config/elasticsearch.yml). In that case, adjust the settings in the settings.py document to use the correct port.

Set up an Elasticsearch index

Note: The following commands include localhost:9200. Omit everything until the first / after PUT or POST when using Kibana.

From Kibana, Postman, curl, or similar, create an index readit-1 with the following mapping:

PUT localhost:9200/readit-1
{
  "mappings": {
    "properties": {
      "id": {"type": "keyword"},
      "language": {"type": "keyword"},
      "text": {"type": "text"},
      "text_en":  {
        "type": "text",
        "analyzer": "english"
      },
      "text_fr":  {
        "type": "text",
        "analyzer": "french"
      },
      "text_de":  {
        "type": "text",
        "analyzer": "german"
      },
      "text_nl":  {
        "type": "text",
        "analyzer": "dutch"
      }
    }
  }
}

Also add an alias named readit, like so:

POST localhost:9200/_aliases
{
    "actions" : [
        { "add" : { "index" : "readit-1", "alias" : "readit" } }
    ]
}

This will make sure that the id and languate will be saved. The source texts will be saved with standard analyzer in the text field, and depending on the source language, in a text_{lang} field with a language-specific analyzer.

Indexing and reading will be performed via the alias readit, which is set in settings.py as ES_ALIASNAME. The alias is used such that indices can be rolled over to a new version if necessary. Then the alias will have to be unset from the old index, and set to the new index.

Run the conversion script

If you have sources in the media/sources folder, you can add them to the Elasticsearch index with a conversion script as follows:

>>> from scripts.sources_to_elasticsearch import text_to_index
>>> text_to_index()

Setting up Celery

  • Install RabbitMQ or another message broker.
  • Adjust settings in Celery accordingly (see Celery documentation).
  • Activate your virtual environment, make sure you installed all Python packages, then run:
$ cd backend
$ celery -A readit worker -l INFO

How it works

The readit package is our "project" in Django jargon. It contains all central administration. The settings, urls and wsgi modules inside this package play the same roles as in any Django project. The settings module contains defaults that can be immediately used in development, but should be overridden in production. The urls module registers DRF viewsets besides the regular Django registrations.

The index module contains a special view factory function which is meant to facilitate a client side application. It can generate views that always attempt to find a static HTML file with a particular name and return it as the response. Two views are generated in this way: index, which tries to respond with an index.html, and specRunner, which tries to respond with a specRunner.html. In the urls module, specRunner is configured to respond on the /specRunner.html path, but only in debug mode. The index view is configured as a global fallback route. The index.html should launch a client side (frontend) application that handles routing.

Note: this backend application doesn't—and shouldn't—contain a root index.html or specRunner.html in any of its static folders. Instead, you should add an external directory to Django's STATICFILES_DIRS setting which contains these files in its root, if you wish to combine this backend application with your frontend application of choice.

As in any Django application, you may add an arbitrary number of "application" (Django jargon) packages next to the readit package. Each "application" may contain its own models and migrations, as well as admin, signals, validators, urls etcetera. A views module may contain DRF viewsets instead of native Django views, in which case there should also be a serializers module which intermediates between the models and the views.

Unittest modules live directly next to the module they belong to. Each directory may contain a conftest.py with test fixtures available to all tests in the directory.

Data are stored in two places. The RDF triplestore contains the data of primary interest, i.e., sources, annotations and supporting concepts. The RDF data are segmented in several graphs, each represented by a separate Django application. The relational database takes care of user profiles, privileges and other bits of administration.

Each type of storage has its own way of describing the data model and of performing migrations. RDF is inherently self-describing, so the datamodel is stored alongside the data. Changes in the datamodel are performed using the rdfmigrate management command, which is implemented in our own rdf package.

The relational database follows the Django ORM conventions and can be migrated using the standard migrate command. The user list is however also exposed in RDF format, as if the users were stored in the triplestore. This facilitates linking annotations to users in RDF data.

Development

Quickstart

Create and activate a virtualenv. Ensure your working directory is the one that contains this README. Run the following commands as yourself (i.e., not in sudo mode nor with elevated privileges). You may need to reconfigure PostgreSQL and/or pass additional arguments to psql (in particular, your own PostgreSQL dbname and username) in order to be able to run the first command. You need to execute this sequence of commands only once after cloning the repository.

$ psql -f create_db.sql
$ pip install pip-tools
$ pip install -r requirements.txt
$ python manage.py migrate
$ python manage.py rdfmigrate
$ python manage.py createsuperuser

We need to install psycopg2 with the --no-binary flag until version 2.8 of psycopg2 is available. If this were not the case, we could use pip-sync instead of pip install -r; the former currently doesn't work because of the --no-binary flag being present in the requirements.txt.

If you are overriding the default settings, you may pass --pythonpath and --settings arguments to every invocation of python manage.py. --settings should be the name of the module (without .py) with your settings overrides. --pythonpath should be the path to the directory with your overridden settings module.

Running the application (development server)

$ python manage.py runserver

Once you see this line:

Starting development server at http://127.0.0.1:8000/

you can visit http://localhost:8000/admin/ and http://localhost:8000/api/ in your browser of choice. If you attached an external frontend application, its main page will be at http://localhost:8000/ and its unittests will be at http://localhost:8000/specRunner.html.

Enabling livereload

Run the following command in parallel with the development server:

$ python manage.py livereload

This works for all Python modules, templates and static files that Django knows about. This also includes external directories that you may have added to the STATICFILES_DIRS setting. The DEBUG setting should be True, otherwise the livereload script is not inserted in HTML pages by the livereload middleware.

Running the unittests

Run pytest to execute all tests once or pytest --looponfail to retest continuously as files change. Use the pytest-django helpers when writing new tests. pytest has all bells and whistles you may ever dream of; see the documentation.

Package management

When adding a new package to the requirements, it is recommended that you manually install it first and check that it works. Then, add the name of the package to the requirements.in. The entry should not include a version specification, unless you want to set an upper bound on the version. See the django entry for an example. After editing the requirements.in, run

$ pip-compile

to update the requirements.txt with pinned versions of the package and all of its dependencies. Commit the changes to requirements.in and requirements.txt together to VCS.

Deployment

Deployment is quite different from development. Please read the Django documentation and also the documentation of whatever webserver you are using. This section will only address some application specifics.

Overriding the settings

Make a copy of readit/settings.py and keep it out of reach from spying eyes. Change at least the following settings.

  • BASE_DIR should point to the directory containing this README.
  • SECRET_KEY should change to a different but equally long and random value. It is recommended that you use os.urandom for this.
  • DEBUG must be False.
  • ALLOWED_HOSTS should contain the hostname(s) on which you wish to serve your application. Just hostnames, e.g. example.com rather than http://example.com:88.
  • DATABASES['default']['PASSWORD'] should change and should also be impractically hard to guess.
  • STATIC_ROOT should point to a directory where you want to collect all static files.

See also the Django documentation.

Creating the database

You can follow the steps from create_db.sql, with two important differences:

  • The createdb permission is not needed in production, so you shouldn't include it.
  • The username, password and database name should be the same as the one in your settings overrides from the previous section.

Configuring your webserver

How to configure your webserver is completely beyond the scope of this README. However, we can mention a few things to keep in mind:

  • Django will not serve static files in production mode. You need to configure the webserver to directly serve files from the STATIC_ROOT in your settings at the STATIC_URL in your settings.
  • Your webserver configuration should set environment variables or pass arguments to the WSGI application so it will use the settings overrides rather than the defaults from readit/settings.py.