Merge pull request #10505 from Recherche-Data-Gouv/9276-allow-mapping-of-indexable-fields-in-cvoc-conf

CVOC : Indexed field accuracy (Ontoportal integration)
sekmiller authored Jun 13, 2024
2 parents 5bf6b6d + 4a6d3e4 commit ad58f3e
Showing 8 changed files with 485 additions and 35 deletions.
18 changes: 18 additions & 0 deletions doc/release-notes/9276-doc-cvoc-index-in.md
@@ -0,0 +1,18 @@
## Release Highlights

### Updates on Support for External Vocabulary Services

Multiple extensions of the External Vocabulary mechanism have been added. These extensions allow interaction with services based on the Ontoportal software and are expected to be generally useful for other service types.

These changes include:

#### Improved Indexing with Compound Fields

When using an external vocabulary service with compound fields, you can now specify which field(s) will include additional indexed information, such as translations of an entry into other languages. This is done by adding an `indexIn` parameter to the relevant entry in `retrieval-filtering`. (#10505)
For more information, please check [GDCC/dataverse-external-vocab-support documentation](https://github.com/gdcc/dataverse-external-vocab-support/tree/main/docs).
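
As a rough illustration of how `indexIn` selects the target field, the sketch below mirrors the intent stated in the code comments of this change: a key with `indexIn` is indexed only in the field it names, and a key without it is indexed in the term URI field only if it is `termName` or `personName`. The class name, method name, and the field names in the usage note are hypothetical and are not part of Dataverse.

```java
import java.util.Set;

// Hypothetical helper, not Dataverse code: decides whether the values stored under a
// retrieval-filtering key should be written to a given Solr field.
final class IndexTargetSketch {

    // Keys indexed by default when no "indexIn" is configured for them.
    private static final Set<String> DEFAULT_KEYS = Set.of("termName", "personName");

    static boolean shouldIndex(String key, String indexIn, String termUriField, String indexingField) {
        if (indexIn != null) {
            // Mapping mode: the key names its own target field.
            return indexIn.equals(indexingField);
        }
        // Default mode: only the default keys, and only into the term URI field.
        return termUriField.equals(indexingField) && DEFAULT_KEYS.contains(key);
    }
}
```

For example, with an illustrative term URI field `keywordTermURI`, a key `altLabel` configured with `indexIn` set to `keywordValue` would be indexed in `keywordValue` and nowhere else, while `termName` without `indexIn` would still be indexed in `keywordTermURI`.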

#### Broader Support for Indexing Service Responses

Indexing of the results from `retrieval-filtering` responses can now handle additional formats, including JSON arrays of strings and values from arbitrary keys within a JSON object. (#10505)
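
The shapes involved can be sketched with plain `java.util` collections standing in for JSON values (a `String`, a `List`, or a `Map`). This is a simplified illustration under that assumption, not the Dataverse implementation, which works on `JsonValue` objects; the class and method names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: collects indexable strings from the value shapes listed above.
final class IndexableStringsSketch {

    static Set<String> collect(Object value) {
        Set<String> out = new LinkedHashSet<>();
        if (value instanceof String) {
            out.add((String) value);
        } else if (value instanceof List) {
            for (Object item : (List<?>) value) {
                if (item instanceof String) {
                    // Array of strings
                    out.add((String) item);
                } else if (item instanceof Map) {
                    // Array of objects such as {"lang": "en", "value": "non-apis bee"}
                    Map<?, ?> entry = (Map<?, ?>) item;
                    Object text = entry.containsKey("value") ? entry.get("value") : entry.get("content");
                    if (text instanceof String) {
                        out.add((String) text);
                    }
                }
            }
        } else if (value instanceof Map) {
            for (Object member : ((Map<?, ?>) value).values()) {
                if (member instanceof String) {
                    // Object such as {"fr": "association de quartier"}
                    out.add((String) member);
                } else if (member instanceof List) {
                    // Object such as {"en": ["neighbourhood societies"]}
                    for (Object item : (List<?>) member) {
                        if (item instanceof String) {
                            out.add((String) item);
                        }
                    }
                }
            }
        }
        // Any other shape yields no strings.
        return out;
    }
}
```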

**** This documentation must be merged with 9276-allow-flexible-params-in-retrievaluri-cvoc.md (#10404)
4 changes: 2 additions & 2 deletions doc/sphinx-guides/source/admin/metadatacustomization.rst
@@ -579,9 +579,9 @@ In general, the external vocabulary support mechanism may be a better choice for
The specifics of the user interface for entering/selecting a vocabulary term and how that term is then displayed are managed by third-party Javascripts. The initial Javascripts that have been created provide auto-completion, displaying a list of choices that match what the user has typed so far, but other interfaces, such as displaying a tree of options for a hierarchical vocabulary, are possible.
Similarly, existing scripts do relatively simple things for displaying a term - showing the term's name in the appropriate language and providing a link to an external URL with more information, but more sophisticated displays are possible.

Scripts supporting use of vocabularies from services supporting the Skosmos protocol (see https://skosmos.org), retrieving ORCIDs (from https://orcid.org), and using ROR (https://ror.org/) are available at https://github.com/gdcc/dataverse-external-vocab-support. (Custom scripts can also be used and community members are encouraged to share new scripts through the dataverse-external-vocab-support repository.)
Scripts supporting use of vocabularies from services supporting the Skosmos protocol (see https://skosmos.org), retrieving ORCIDs (from https://orcid.org), services based on the Ontoportal product (see https://ontoportal.org/), and using ROR (https://ror.org/) are available at https://github.com/gdcc/dataverse-external-vocab-support. (Custom scripts can also be used and community members are encouraged to share new scripts through the dataverse-external-vocab-support repository.)

Configuration involves specifying which fields are to be mapped, whether free-text entries are allowed, which vocabulary(ies) should be used, what languages those vocabulary(ies) are available in, and several service protocol and service instance specific parameters, including the ability to send HTTP headers on calls to the service.
Configuration involves specifying which fields are to be mapped, to which Solr field they should be indexed, whether free-text entries are allowed, which vocabulary(ies) should be used, what languages those vocabulary(ies) are available in, and several service protocol and service instance specific parameters, including the ability to send HTTP headers on calls to the service.
These are all defined in the :ref:`:CVocConf <:CVocConf>` setting as a JSON array. Details about the required elements as well as example JSON arrays are available at https://github.com/gdcc/dataverse-external-vocab-support, along with an example metadata block that can be used for testing.
The scripts required can be hosted locally or retrieved dynamically from https://gdcc.github.io/ (similar to how dataverse-previewers work).

91 changes: 61 additions & 30 deletions src/main/java/edu/harvard/iq/dataverse/DatasetFieldServiceBean.java
@@ -42,6 +42,7 @@

import org.apache.commons.codec.digest.DigestUtils;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.lang3.StringUtils;
import org.apache.http.HttpResponse;
import org.apache.http.HttpResponseInterceptor;
import org.apache.http.client.methods.HttpGet;
@@ -322,14 +323,15 @@ public Map<Long, JsonObject> getCVocConf(boolean byTermUriField){
+ jo.getString("term-uri-field"));
}
}
if (jo.containsKey("child-fields")) {
JsonArray childFields = jo.getJsonArray("child-fields");
for (JsonString elm : childFields.getValuesAs(JsonString.class)) {
dft = findByNameOpt(elm.getString());
logger.info("Found: " + dft.getName());
if (jo.containsKey("managed-fields")) {
JsonObject managedFields = jo.getJsonObject("managed-fields");
for (String s : managedFields.keySet()) {
dft = findByNameOpt(managedFields.getString(s));
if (dft == null) {
logger.warning("Ignoring External Vocabulary setting for non-existent child field: "
+ elm.getString());
+ managedFields.getString(s));
} else {
logger.fine("Found: " + dft.getName());
}
}
}
@@ -346,7 +348,7 @@ public Map<Long, JsonObject> getCVocConf(boolean byTermUriField){
* @param df - the primitive/parent compound field containing a newly saved value
*/
public void registerExternalVocabValues(DatasetField df) {
DatasetFieldType dft =df.getDatasetFieldType();
DatasetFieldType dft = df.getDatasetFieldType();
logger.fine("Registering for field: " + dft.getName());
JsonObject cvocEntry = getCVocConf(true).get(dft.getId());
if (dft.isPrimitive()) {
@@ -371,38 +373,48 @@ public void registerExternalVocabValues(DatasetField df) {
}
}
}

/**
* Retrieves indexable strings from a cached externalvocabularyvalue entry.
*
* This method assumes externalvocabularyvalue entries have been filtered and
* the externalvocabularyvalue entry contain a single JsonObject whose "personName" or "termName" values
* are either Strings or an array of objects with "lang" and ("value" or "content") keys. The
* string, or the "value/content"s for each language are added to the set.
*
* Retrieves indexable strings from a cached externalvocabularyvalue entry filtered through retrieval-filtering configuration.
* <p>
* This method assumes externalvocabularyvalue entries have been filtered and that they contain a single JsonObject.
* Cases handled: a String; an Array of Strings; an Array of Objects with "value" or "content" keys; an Object whose entries have String values or Arrays of Strings as values.
* The string(s), or the "value/content"s for each language are added to the set.
* Retrieved string values are indexed in the term-uri-field (parameter defined in CVOC configuration) by default, or in the field specified by an optional "indexIn" parameter in the retrieval-filtering defined in the CVOC configuration.
* <p>
* Any parsing error results in no entries (there can be unfiltered entries with
* unknown structure - getting some strings from such an entry could give fairly
* random info that would be bad to add for searches, etc.)
*
* @param termUri
*
* @param termUri unique identifier to search in database
* @param cvocEntry related cvoc configuration
* @param indexingField name of the Solr field that will be populated with the returned strings during indexing
* @return - a set of indexable strings
*/
public Set<String> getStringsFor(String termUri) {
Set<String> strings = new HashSet<String>();
public Set<String> getIndexableStringsByTermUri(String termUri, JsonObject cvocEntry, String indexingField) {
Set<String> strings = new HashSet<>();
JsonObject jo = getExternalVocabularyValue(termUri);
JsonObject filtering = cvocEntry.getJsonObject("retrieval-filtering");
String termUriField = cvocEntry.getJsonString("term-uri-field").getString();

if (jo != null) {
try {
for (String key : jo.keySet()) {
if (key.equals("termName") || key.equals("personName")) {
String indexIn = filtering.getJsonObject(key).getString("indexIn", null);
// Either we are in mapping mode so indexingField (solr field) equals indexIn (cvoc config)
// Or we are in default mode indexingField is termUriField, indexIn is not defined then only termName and personName keys are used
if (indexingField.equals(indexIn) ||
(indexIn == null && termUriField.equals(indexingField) && (key.equals("termName") || key.equals("personName")))) {
JsonValue jv = jo.get(key);
if (jv.getValueType().equals(JsonValue.ValueType.STRING)) {
logger.fine("adding " + jo.getString(key) + " for " + termUri);
strings.add(jo.getString(key));
} else {
if (jv.getValueType().equals(JsonValue.ValueType.ARRAY)) {
JsonArray jarr = jv.asJsonArray();
for (int i = 0; i < jarr.size(); i++) {
} else if (jv.getValueType().equals(JsonValue.ValueType.ARRAY)) {
JsonArray jarr = jv.asJsonArray();
for (int i = 0; i < jarr.size(); i++) {
if (jarr.get(i).getValueType().equals(JsonValue.ValueType.STRING)) {
strings.add(jarr.getString(i));
} else if (jarr.get(i).getValueType().equals(ValueType.OBJECT)) { // This condition handles Skosmos format like [{"lang": "en","value": "non-apis bee"},{"lang": "fr","value": "abeille non apis"}]
JsonObject entry = jarr.getJsonObject(i);
if (entry.containsKey("value")) {
logger.fine("adding " + entry.getString("value") + " for " + termUri);
@@ -414,6 +426,22 @@ public Set<String> getStringsFor(String termUri) {
}
}
}
} else if (jv.getValueType().equals(JsonValue.ValueType.OBJECT)) {
JsonObject joo = jv.asJsonObject();
for (Map.Entry<String, JsonValue> entry : joo.entrySet()) {
if (entry.getValue().getValueType().equals(JsonValue.ValueType.STRING)) { // This condition handles format like { "fr": "association de quartier", "en": "neighborhood associations"}
logger.fine("adding " + joo.getString(entry.getKey()) + " for " + termUri);
strings.add(joo.getString(entry.getKey()));
} else if (entry.getValue().getValueType().equals(ValueType.ARRAY)) { // This condition handles format like {"en": ["neighbourhood societies"]}
JsonArray jarr = entry.getValue().asJsonArray();
for (int i = 0; i < jarr.size(); i++) {
if (jarr.get(i).getValueType().equals(JsonValue.ValueType.STRING)) {
logger.fine("adding " + jarr.getString(i) + " for " + termUri);
strings.add(jarr.getString(i));
}
}
}
}
}
}
}
@@ -425,7 +453,7 @@ public Set<String> getStringsFor(String termUri) {
}
logger.fine("Returning " + String.join(",", strings) + " for " + termUri);
return strings;
}
}

/**
* Perform a query to retrieve a cached value from the externalvocabularyvalue table
@@ -461,10 +489,11 @@ public void registerExternalTerm(JsonObject cvocEntry, String term, List<Dataset
String retrievalUri = cvocEntry.getString("retrieval-uri");
String termUriFieldName = cvocEntry.getString("term-uri-field");
String prefix = cvocEntry.getString("prefix", null);
if(term.isBlank()) {
if(StringUtils.isBlank(term)) {
logger.fine("Ignoring blank term");
return;
}

boolean isExternal = false;
JsonObject vocabs = cvocEntry.getJsonObject("vocabs");
for (String key: vocabs.keySet()) {
@@ -532,7 +561,7 @@ public void process(HttpResponse response, HttpContext context) throws HttpExcep
if (statusCode == 200) {
logger.fine("Returned data: " + data);
try (JsonReader jsonReader = Json.createReader(new StringReader(data))) {
String dataObj =filterResponse(cvocEntry, jsonReader.readObject(), term).toString();
String dataObj = filterResponse(cvocEntry, jsonReader.readObject(), term).toString();
evv.setValue(dataObj);
evv.setLastUpdateDate(Timestamp.from(Instant.now()));
logger.fine("JsonObject: " + dataObj);
@@ -574,7 +603,7 @@ private String replaceRetrievalUriParam(String retrievalUri, String paramName, S
* Parse the raw value returned by an external service for a given term uri and
* filter it according to the 'retrieval-filtering' configuration for this
* DatasetFieldType, creating a Json value with the specified structure
*
*
* @param cvocEntry - the config for this DatasetFieldType
* @param readObject - the raw response from the service
* @param termUri - the term uri
@@ -633,6 +662,8 @@ private JsonObject filterResponse(JsonObject cvocEntry, JsonObject readObject, S
if (pattern.equals("{0}")) {
if (vals.get(0) instanceof JsonArray) {
job.add(filterKey, (JsonArray) vals.get(0));
} else if (vals.get(0) instanceof JsonObject) {
job.add(filterKey, (JsonObject) vals.get(0));
} else {
job.add(filterKey, (String) vals.get(0));
}
@@ -670,7 +701,7 @@ Object processPathSegment(int index, String[] pathParts, JsonValue curPath, Stri
String[] keyVal = pathParts[index].split("=");
logger.fine("Looking for object where " + keyVal[0] + " is " + keyVal[1]);
String expected = keyVal[1];

if (!expected.equals("*")) {
if (expected.equals("@id")) {
expected = termUri;
@@ -699,7 +730,7 @@ Object processPathSegment(int index, String[] pathParts, JsonValue curPath, Stri
}
return parts.build();
}

} else {
curPath = ((JsonObject) curPath).get(pathParts[index]);
logger.fine("Found next Path object " + curPath.toString());