
EWB's TM API

Lorena Calvo-Bartolomé edited this page May 5, 2024 · 4 revisions

Indexing

For the Evaluation Workbench's frontend to display relevant information, we need to index a logical corpus and one or more topic models associated with that corpus. Both the corpus and the models must have been created using the Interactive Model Trainer.

Corpus Indexing

To index a corpus into the Evaluation Workbench (EWB), the raw corpus must be present in the mounted volume "/data/source" of the "ewb-restapi" service.

Then, to index the Cordis corpus, named "Cordis.parquet", we proceed as depicted in the following image:

Logical corpus indexing example

This process creates a corpus collection named "cordis" in Solr. The collection includes all the metadata available in the parquet file specified by the "parquet" field of the logical corpus, plus the lemmas used for topic modeling calculations ("all_lemmas"). To keep all corpora indexed in Solr consistent, the fields holding each document's identifier, title, and date are renamed to "id", "title", and "date", regardless of their original names. These field equivalences must be specified in the "ewb_config/config.cf" file before indexing. For more detailed information, you can refer to here.
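The renaming step described above can be sketched as a simple key mapping. This is only an illustration: the source column names ("projectID", "projectTitle", "startDate") and the sample record are invented, and the actual equivalences are read from "ewb_config/config.cf", whose format is not reproduced here.

```python
from typing import Any, Dict

# Hypothetical mapping from a corpus's original column names to the
# canonical names used in Solr. The source names are invented for illustration.
FIELD_EQUIVALENCES: Dict[str, str] = {
    "projectID": "id",
    "projectTitle": "title",
    "startDate": "date",
}


def to_canonical_fields(record: Dict[str, Any], equivalences: Dict[str, str]) -> Dict[str, Any]:
    """Return a copy of `record` with mapped keys renamed; other keys are kept as-is."""
    return {equivalences.get(key, key): value for key, value in record.items()}


# Invented sample record, used only to show the effect of the mapping.
raw = {
    "projectID": "101",
    "projectTitle": "Graph learning",
    "startDate": "2021-01-01",
    "all_lemmas": "graph learn",
}
doc = to_canonical_fields(raw, FIELD_EQUIVALENCES)
```

After the call, `doc` carries the keys "id", "title", and "date", while "all_lemmas" is left untouched.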

During corpus indexing, an entry is also created in the "corpora" collection. This collection stores information about all the corpus collections indexed in the Solr instances, along with their indexed models.

Model Indexing

To index a model into the Evaluation Workbench (EWB), the following requirements must be met:

  1. The topic model entity, as specified by the Interactive Topic model, should be present in the mounted volume "/data/source" of the "ewb-restapi" service. This entity is a folder named after the model, containing at least the "TMmodel" folder and the training configuration file ("trainconfig.json").

  2. The model to be indexed must be associated with a logical corpus that has already been indexed into the Solr instance.
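The first requirement can be verified before calling the indexing endpoint. The following is a minimal sketch, not part of the EWB code base: the function name `missing_model_items` is hypothetical, and it only checks the folder layout described above (it does not check whether the associated corpus is already indexed).

```python
from pathlib import Path
from typing import List


def missing_model_items(source_dir: Path, model_name: str) -> List[str]:
    """List the required items that are absent for `model_name` under `source_dir`.

    An empty list means the folder layout described above is satisfied.
    """
    model_dir = source_dir / model_name
    if not model_dir.is_dir():
        # The model folder itself is missing, so nothing else can be checked.
        return [model_name]

    missing = []
    if not (model_dir / "TMmodel").is_dir():
        missing.append("TMmodel")
    if not (model_dir / "trainconfig.json").is_file():
        missing.append("trainconfig.json")
    return missing
```

For instance, `missing_model_items(Path("/data/source"), "Mallet-10")` would return an empty list when the model folder is complete.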

To index a model (e.g., a model named "Mallet-10"), follow the steps illustrated in the image below:

Model indexing example

This process creates a model collection named "mallet-10" in Solr. The collection includes all the metadata available in the model's "TMmodel" folder, namely word distribution, size, entropy, coherence, number of active documents, chemical description, labels, vocabulary, and coordinates in a 2D-space for each topic in the model.

Additionally, the corpus collection associated with the model is modified by adding two fields to each document that has a topical representation for that model:

  • "doctpc_{model_name}" contains the document-topic distribution given by the model with the name "model_name".
  • "sim_{model_name}" contains a list of the 50 most similar documents to the given document, according to the model with the name "model_name".

These additional fields are included in the corpus information within the "corpora" collection. Furthermore, the name of the model collection is added to the list of models associated with that corpus, as shown in the example below:

Collection corpora after indexing corpus and model
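As a rough sketch of the bookkeeping described above, the snippet below derives the two per-document field names for a model and updates a corpus entry accordingly. The entry keys "fields" and "models" are hypothetical, since the actual layout of the "corpora" collection is not spelled out in the text, and passing the model collection name ("mallet-10") as `model_name` is likewise an assumption.

```python
from typing import Dict, List


def model_field_names(model_name: str) -> Dict[str, str]:
    """Names of the two per-document fields added for a given model."""
    return {
        "doctpc": f"doctpc_{model_name}",
        "sim": f"sim_{model_name}",
    }


def register_model(corpus_entry: Dict[str, List[str]], model_name: str) -> Dict[str, List[str]]:
    """Return a copy of a "corpora" entry updated for a newly indexed model.

    The keys "fields" and "models" are hypothetical; the real entry may differ.
    """
    names = model_field_names(model_name)
    return {
        "fields": corpus_entry.get("fields", []) + [names["doctpc"], names["sim"]],
        "models": corpus_entry.get("models", []) + [model_name],
    }


# Invented corpus entry before any model has been indexed.
entry = {"fields": ["id", "title", "date", "all_lemmas"], "models": []}
updated = register_model(entry, "mallet-10")
```

The original `entry` is not modified, so the state before and after indexing can be compared.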

Endpoints

Collections

The endpoints in this category refer to generic Solr-related operations that, in principle, will only be used internally:

  • /collections/createCollection/: Creates a Solr collection.
  • /collections/deleteCollection/: Deletes a Solr collection.
  • /collections/listCollections/: Lists all collections available in the Solr instance.
  • /collections/query/: Performs a generic Solr query.

Corpora

These endpoints perform corpus-related operations, that is, the management, indexing, and listing of the linguistic data sets known as corpora:

  • /corpora/deleteCorpus/: Deletes a corpus collection.
  • /corpora/indexCorpus/: Indexes a corpus in a Solr collection. The collection takes the name of the logical corpus that describes the corpus.
  • /corpora/listAllCorpus/: Lists all the corpus collections available in the Solr instance.
  • /corpora/listCorpusModels/: Lists all the models associated with a specific corpus previously indexed in Solr.

Note that /corpora/deleteCorpus/ and /corpora/indexCorpus/ will be invoked either internally or through the ITMT, but not from the EWB frontend.

Models

These endpoints perform model-related operations, that is, the management, indexing, and listing of topic models:

  • /models/deleteModel/: Deletes a model collection.
  • /models/indexModel/: Indexes the model information in a model collection and in its corresponding corpus collection.
  • /models/listAllModels/: Lists all model collections available in the Solr instance.

Note that /models/deleteModel/ and /models/indexModel/ will be invoked either internally or through the ITMT, but not from the EWB frontend.
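A client would typically only need the listing endpoints, since indexing and deletion are triggered internally or through the ITMT. The sketch below builds their URLs. The base URL is a placeholder (the host and port of the "ewb-restapi" service are not given in this page), and any query parameters the endpoints expect are deliberately left out because they are not documented here.

```python
from urllib.parse import urljoin

# Hypothetical base URL of the "ewb-restapi" service; adjust to your deployment.
BASE_URL = "http://localhost:8080/"

# Listing endpoints taken from the lists above. Indexing and deletion endpoints
# are omitted because they are invoked internally or through the ITMT.
LISTING_ENDPOINTS = {
    "corpora": "corpora/listAllCorpus/",
    "corpus_models": "corpora/listCorpusModels/",
    "models": "models/listAllModels/",
}


def endpoint_url(name: str, base_url: str = BASE_URL) -> str:
    """Build the full URL of a listing endpoint from its short name."""
    return urljoin(base_url, LISTING_ENDPOINTS[name])


corpora_url = endpoint_url("corpora")
```

Relative paths are used in the table so that `urljoin` keeps any path prefix present in the base URL.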

Queries

| Query | Endpoint | Description | Returns | Ready? |
|---|---|---|---|---|
| Q1 | getThetasDocById | Retrieve the document-topic distribution of a selected document in a corpus collection for a given topic model | {"thetas": thetas} | |
| Q2 | getCorpusMetadataFields | Get the available metadata fields for a specific corpus collection | {"metadata_fields": meta_fields} | |
| Q3 | getNrDocsColl | Get the number of documents in a collection | {"ndocs": ndocs} | |
| Q4 | getDocsWithThetasLargerThanThr | Get documents with a topic proportion larger than a threshold according to a selected topic model | [{"id": id1, "doctpc_{model_name}": doctpc1}, {"id": id2, "doctpc_{model_name}": doctpc2}, ...] | |
| Q5 | getDocsWithHighSimWithDocByid | Retrieve documents that have a high semantic relationship with a selected document, i.e., its most similar documents | [{"id": id1, "score": score1}, {"id": id2, "score": score2}, ...] | |
| Q6 | getMetadataDocById | Get the metadata of a selected document in a corpus collection | {"metadata1": metadata1, "metadata2": metadata2, "metadata3": metadata3, ...} | |
| Q7 | getDocsWithString | Retrieve the IDs of documents whose title contains a specific string in a corpus collection | [{"id": id1}, {"id": id2}, ...] | |
| Q8 | getTopicsLabels | Get the labels associated with each topic in a given model | [{"id": id1, "tpc_labels": label1}, {"id": id2, "tpc_labels": label2}, ...] | |
| Q9 | getTopicTopDocs | Get the top documents for a given topic in a model collection. Two criteria are considered: first, the thematic representation for the requested topic and second, the number of words in the document. | [{"id": id1, "thetas": thetas1, "num_words_per_doc": num_words_per_doc1}, {"id": id2, "thetas": thetas2, "num_words_per_doc": num_words_per_doc2}, ...] | |
| Q10 | getModelInfo | Get information (chemical description, label, statistics, top docs, etc.) for each topic in a model collection | [{"id": id1, "betas": betas1, "alphas": alphas1, "topic_entropy": entropies1, "topic_coherence": cohrs1, "ndocs_active": active1, "tpc_descriptions": desc1, "tpc_labels": labels1, "coords": coords1, "top_words_betas": top_words_betas1}, {"id": id2, "betas": betas2, "alphas": alphas2, "topic_entropy": entropies2, "topic_coherence": cohrs2, "ndocs_active": active2, "tpc_descriptions": desc2, "tpc_labels": labels2, "coords": coords2, "top_words_betas": top_words_betas2}, ...] | |
| Q11 | getBetasTopicById | Get the word distribution of a selected topic in a model collection | {"betas": betas} | |
| Q12 | getMostCorrelatedTopics | Get the most correlated topics to a given topic in a selected model | [{"id": id1, "betas": betas1}, {"id": id2, "betas": betas2}, ...] | |
| Q13 | getPairsOfDocsWithHighSim | Retrieve pairs of documents with a semantic similarity larger than a certain threshold in a given topic model, filtered by year | [{"id_1": id1, "id_2": id2, "score": score1}, {"id_1": id1, "id_2": id2, "score": score2}, ...] | |
| Q14 | getDocsSimilarToFreeText | Get documents that are semantically similar to a free text according to a given topic model | [{"id": id1, "score": score1}, {"id": id2, "score": score2}, ...] | |
| Q15 | getLemmasDocById | Retrieve the lemmas of a selected document in a corpus collection | {"thetas": thetas} | |
| Q16 | getThetasAndDateAllDocs | Get the date and document-topic representation associated with a given model for all documents in a corpus collection | [{"id": id1, "date": date1, "doctpc_{model_name}": doctpc1}, {"id": id2, "date": date2, "doctpc_{model_name}": doctpc2}, ...] | |
| Q17 | getBetasByWordAndTopicId | Get the topic-word distribution of a given word in a given topic associated with a given model | {"betas": betas} | |
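Several queries (for example Q5 and Q14) return a list of objects with "id" and "score" keys. The sketch below shows one way a client might keep only the highest-scoring documents from such a payload. Whether the API already returns results sorted by score is not stated here, so the sorting is done on the client side as a precaution; the helper name and the sample values are invented.

```python
from typing import Dict, List, Union

ScoredDoc = Dict[str, Union[str, float]]


def top_k_by_score(results: List[ScoredDoc], k: int) -> List[ScoredDoc]:
    """Return the `k` entries with the highest "score", best first."""
    ranked = sorted(results, key=lambda doc: float(doc["score"]), reverse=True)
    return ranked[:k]


# Invented payload following the [{"id": ..., "score": ...}, ...] shape above.
sample = [
    {"id": "d1", "score": 0.42},
    {"id": "d2", "score": 0.91},
    {"id": "d3", "score": 0.67},
]
best = top_k_by_score(sample, 2)
```

Here `best` holds the entries for "d2" and "d3", in that order, and `sample` is left unchanged.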