feat: index numerical and date fields in Solr with appropriate types + more targeted search result highlighting #10887

Open: 14 commits to merge into base branch develop

vera
Contributor

@vera vera commented Sep 27, 2024

What this PR does / why we need it:

Currently, all fields regardless of type are indexed in Solr as English text (text_en). With this PR, numerical and date fields are indexed in Solr with appropriate types:

| Field type defined in TSV | Field type indexed in Solr |
| --- | --- |
| int | plong |
| float | pdouble |
| date | date_range (solr.DateRangeField) |

I chose to index dates as DateRangeField because it can represent dates at any precision, e.g. a day YYYY-MM-DD, a month YYYY-MM, or a year YYYY. See: Date Formatting and Date Math :: Apache Solr Reference Guide

This matches the allowed formats in a date field as defined by Dataverse.

This means that range queries are now possible on numerical and date fields, e.g. exampleIntegerField:[25 TO 50] or exampleDateField:[2000-11-01 TO 2014-12-01].
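The range syntax above can be sketched as a small helper that builds the query string and a Search API URL. This is only an illustration: the helper names `range_query` and `search_url` and the `http://localhost:8080` base URL are assumptions, and the field names are the example fields from this PR.

```python
from urllib.parse import urlencode


def range_query(field: str, low, high) -> str:
    """Build an inclusive Solr range query such as field:[25 TO 50]."""
    return f"{field}:[{low} TO {high}]"


def search_url(base_url: str, query: str) -> str:
    """Build a search URL; base_url is an assumption for local testing."""
    return f"{base_url}/api/search?{urlencode({'q': query})}"


int_query = range_query("exampleIntegerField", 25, 50)
date_query = range_query("exampleDateField", "2000-11-01", "2014-12-01")
date_url = search_url("http://localhost:8080", date_query)

print(int_query)
print(date_url)
```

The same strings can be pasted into the search bar directly; the URL form is only useful for scripted checks.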

Which issue(s) this PR closes:

This PR implements ranged queries as discussed in #370 (issue was already closed)

This PR is related to #8813 and IQSS/dataverse-frontend#278 (the range queries that are now possible lay the groundwork for a nicer search facet UI)

Special notes for your reviewer:

For testing, I've created a sample TSV containing all relevant fields here.

Suggestions on how to test this:

Regression Testing

  • On a server running "develop", publish some datasets that use the affected fields mentioned in the release note (coverage.Depth, etc.). Be sure dates, integers and floats are all used by populating various different fields.
  • Deploy this branch
  • Test, test, test for any regressions

Feature testing

  1. Load sample TSV and update + reload Solr schema as described in docs
  2. In the UI:
    1. Activate metadata block
    2. Activate facets for all three fields
    3. Create dataset with values in all three fields
  3. Run test range queries via the search bar, e.g. exampleIntegerField:[25 TO 50] or exampleDateField:[2000-11-01 TO 2014-12-01]
  4. Check that facets are working correctly

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Facets still look the same as before. There is only a small change in the highlighting of search results; see my comment below.

Is there a release notes update needed for this change?:

Yes, there should be an info text describing the new feature + instructions for how to activate the feature:

  • the Solr schema.xml needs to be updated
  • all datasets need to be reindexed

Additional documentation:

/

@coveralls

coveralls commented Sep 27, 2024

Coverage Status

coverage: 22.568% (-0.003%) from 22.571%
when pulling 541946e on vera:feat/solr-field-types
into e3b5795 on IQSS:develop.

@vera
Contributor Author

vera commented Sep 27, 2024

Additionally, I've set hl.requireFieldMatch to true:

If false, all query terms will be highlighted for each field to be highlighted (hl.fl) no matter what fields the parsed query refer to. If set to true, only query terms aligning with the field being highlighted will in turn be highlighted.

https://solr.apache.org/guide/solr/latest/query-guide/highlighting.html

Two reasons:

  1. Querying Solr with a date range query while highlighting is active, using the default (unified) highlighter without requireFieldMatch, triggers a 500 error in Solr (see my post on the Solr mailing list). My guess is that Solr attempts to highlight the matched date range within fields in a nonsensical way, which triggers the error.
  2. I think this improves the highlighting of search results. Previously, a match for my search term was highlighted anywhere, even if I had limited my query to a specific field. For example, here "replication" is also highlighted in the title even though I limited my search specifically to the description:

image

With this change, the highlighting is limited to specific fields if the query is:

image
image
image
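To make the setting concrete, here is a minimal sketch of a Solr select request with field-matched highlighting switched on. `hl`, `hl.fl` and `hl.requireFieldMatch` are standard Solr highlighting parameters; the host, port, core name and the field-restricted query are assumptions for illustration, not values taken from this PR's code.

```python
from urllib.parse import urlencode

# Standard Solr highlighting parameters; the values are illustrative.
params = {
    "q": "dsDescriptionValue:replication",  # assumed field-restricted query
    "hl": "true",                           # enable highlighting
    "hl.fl": "*",                           # candidate fields to highlight
    "hl.requireFieldMatch": "true",         # highlight only in queried fields
}

# Host, port and core name are assumptions for a local development setup.
select_url = "http://localhost:8983/solr/collection1/select?" + urlencode(params)
print(select_url)
```

With `hl.requireFieldMatch` left at `false`, the same query would also highlight "replication" in other fields such as the title.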

@pdurbin pdurbin added the Size: 3 A percentage of a sprint. 2.1 hours. label Sep 27, 2024
@vera vera changed the title feat: index numerical and date fields in Solr with appropriate types feat: index numerical and date fields in Solr with appropriate types + more targeted search result highlighting Sep 27, 2024
@johannes-darms
Contributor

@qqmyers Would it make sense to include this feature in the next release? It's a rather small adaptation that improves the search experience.

@pdurbin
Member

pdurbin commented Oct 15, 2024

@johannes-darms it's a very cool feature that adds a lot of value, something I've wanted for years.

Let's see what @cmbz and @scolapasta think.

@pdurbin pdurbin added the Champion: pdurbin Championed by @pdurbin for inclusion in the next release label Oct 15, 2024
@qqmyers
Member

qqmyers commented Oct 15, 2024

One question - is it possible to enter or have legacy values that don't fit the new types that would break indexing?

@cmbz cmbz added the GREI 3 Search and Browse label Oct 15, 2024
@cmbz cmbz added this to the 6.5 milestone Oct 15, 2024
@cmbz

cmbz commented Oct 15, 2024

2024/10/15: Added to sprint ready after conversation with @pdurbin

@vera
Contributor Author

vera commented Oct 17, 2024

@qqmyers good question. When I tried to enter invalid data (non-integer in an integer field, non-float in a float field, or non-date in a date field), I got the following errors via the UI:

image

...and via the API:

{"status":"ERROR","message":"Validation Failed: Example integer field is not a valid integer. (Invalid value:edu.harvard.iq.dataverse.DatasetFieldValueValue[ id=null ]), Example floating point field is not a valid number. (Invalid value:edu.harvard.iq.dataverse.DatasetFieldValueValue[ id=null ]), Example date field is not a valid date. \"yyyy\" is a supported format. (Invalid value:edu.harvard.iq.dataverse.DatasetFieldValueValue[ id=null ]).java.util.stream.ReferencePipeline$3@6e9605fb"}

The validator code seems to be quite old, so I don't know if there could be any installations with legacy invalid values entered before it was added.

However, looking at the validator code, I found that date fields allow some formats which are not documented: YYYY followed by AD or BC, a "Bracket format" (not familiar with this), and datetime formats (yyyy-MM-dd'T'HH:mm:ss, yyyy-MM-dd'T'HH:mm:ss.SSS and yyyy-MM-dd HH:mm:ss).

When I add a dataset using one of those formats and try to index it, the indexing fails. The Dataverse log shows an error like dev_solr> org.apache.solr.common.SolrException: ERROR: [doc=dataset_2_draft] Error adding field 'exampleDateField'='2024-09-01 12:34:56' msg=Couldn't parse date because: Improperly formatted datetime: 2024-09-01 12:34:56 and the dataset is missing from the UI.

So, yes, there may be some installations with date field values using the above formats which would cause invisible datasets due to indexing errors. I am not sure how we should deal with this. Are those date formats intended to be officially fully supported/widely used?

If not, it might be OK to offer a workaround in the upgrade instructions like "Please ensure that all date fields containing legacy dates in formats other than YYYY, YYYY-MM, YYYY-MM-DD, or YYYY-MM-DDThh:mm:ssZ are not updated to the new date_range type in the Solr schema. Otherwise, datasets with these legacy dates will fail to index and disappear from your Dataverse page.".

If so, this feature becomes a bit more complicated, because Solr does not support those date formats and we would need to work around that somehow.
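As a rough way to spot values that would hit this problem, the four formats named in the proposed upgrade note can be checked with a regular expression. This is a sketch under stated assumptions: `is_solr_friendly_date` is a hypothetical helper, not part of Dataverse or Solr, and it only checks the shape of the string, not calendar validity.

```python
import re

# Shapes named in the proposed upgrade note: YYYY, YYYY-MM, YYYY-MM-DD and
# YYYY-MM-DDThh:mm:ssZ. Only the shape is checked, not calendar validity.
SOLR_FRIENDLY_DATE = re.compile(
    r"[0-9]{4}(-[0-9]{2}(-[0-9]{2}(T[0-9]{2}:[0-9]{2}:[0-9]{2}Z)?)?)?"
)


def is_solr_friendly_date(value: str) -> bool:
    """Return True if value has one of the four shapes listed above."""
    return SOLR_FRIENDLY_DATE.fullmatch(value) is not None


for candidate in ["2024", "2024-09", "2024-09-01", "2024-09-01 12:34:56", "2024BC"]:
    print(candidate, is_solr_friendly_date(candidate))
```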

@vera
Contributor Author

vera commented Oct 17, 2024

@pdurbin I've just added an API test for the range queries as you suggested.

@qqmyers
Member

qqmyers commented Oct 17, 2024

I don't know what is required but I wouldn't be surprised if BC dates are something people want and want to have indexed. Perhaps @jggautier would know more about what legacy values exist and what's required.
W.r.t. validation - I'd also make sure that the API calls don't allow bad values - I know there was newer validation code just added there. W.r.t. the code, I'd definitely suggest doing checks for int, float, date values when indexing or somehow assure that a bad legacy value can't break the overall submission and we just drop that field rather than have the dataset not index. (I didn't see such a check but I might have missed it).

@vera
Contributor Author

vera commented Oct 17, 2024

I wouldn't be surprised if BC dates are something people want and want to have indexed.

It wouldn't be a problem in general. Solr does support BC dates, however in a different format than YYYYBC:

-0009 – The year 10 BC. A 0 in the year position is 0 AD, and is also considered 1 BC.

https://solr.apache.org/guide/solr/latest/indexing-guide/date-formatting-math.html
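The offset described in the Solr guide (year 0 is 1 BC, so 10 BC is -0009) could be applied with a conversion along these lines. This is a sketch, not code from this PR: `era_year_to_solr` is a hypothetical helper, and the assumed input shape (a year number followed by "BC" or "AD", e.g. "10BC") is a guess at what the validator accepts.

```python
import re

# Assumed input shape: a year number followed by "BC" or "AD", e.g. "10BC".
ERA_YEAR = re.compile(r"([0-9]{1,4})\s*(BC|AD)")


def era_year_to_solr(value: str) -> str:
    """Convert e.g. '10BC' to '-0009' and '50AD' to '0050'."""
    match = ERA_YEAR.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"not a year with an era suffix: {value!r}")
    year = int(match.group(1))
    if year == 0:
        raise ValueError("BC/AD notation has no year 0")
    if match.group(2) == "AD":
        return f"{year:04d}"
    astronomical = year - 1  # 1 BC -> 0, 2 BC -> 1, 10 BC -> 9
    return "0000" if astronomical == 0 else f"-{astronomical:04d}"


print(era_year_to_solr("10BC"))
print(era_year_to_solr("1BC"))
```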

W.r.t. validation - I'd also make sure that the API calls don't allow bad values - I know there was newer validation code just added there.

Is the code doing checks for API-submitted datasets different from the code doing the UI checks? I assumed both used the same code I linked above, since the error messages are the same.

I'd definitely suggest doing checks for int, float, date values when indexing or somehow assure that a bad legacy value can't break the overall submission and we just drop that field rather than have the dataset not index.

Yes, that would be nice.

@jggautier
Contributor

jggautier commented Oct 17, 2024

Hi all. I haven't been following this issue closely enough to contribute and won't have the time to catch up. But I agree with Jim and encourage folks to look into how others have and are using these fields. My dataset at https://doi.org/10.7910/DVN/2SA6SN might be helpful for seeing who's using the fields in different ways. And the list of contacts in our spreadsheet of Dataverse installations might help for contacting particular installations to learn more.

@pdurbin
Member

pdurbin commented Oct 17, 2024

Here's the related issue about BC dates:

@vera
Contributor Author

vera commented Oct 18, 2024

I've just pushed a commit that implements the suggestion above (if encountering a bad legacy value in an int/float/date field, just drop that field, but index the rest of the dataset).

So, if a dataset contains a bad legacy value in an int/float/date field, this means that queries on that field will not yield that dataset, since the field hasn't been indexed. But for any other query, the dataset will still be found. (I've also added a test showing this)

I think this is a small limitation. And we could add BC support relatively easily in a future PR.
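The behaviour described here (drop the bad field, index the rest) can be illustrated with a small sketch. It is not the code from the commit: `build_solr_doc`, the `CHECKS` table and the `title` field are hypothetical, and the shape checks are simplified stand-ins for whatever the indexing code actually does.

```python
import re

# Hypothetical shape checks per TSV field type; not the Dataverse validator.
CHECKS = {
    "int": re.compile(r"[+-]?[0-9]+"),
    "float": re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+-]?[0-9]+)?"),
    "date": re.compile(r"[0-9]{4}(-[0-9]{2}(-[0-9]{2})?)?"),
}


def build_solr_doc(fields):
    """Build a document from (name, field_type, value) tuples.

    Values that fail their type check are left out of the document and
    their field names are collected, so the dataset itself still indexes.
    """
    doc, dropped = {}, []
    for name, field_type, value in fields:
        check = CHECKS.get(field_type)
        if check is not None and check.fullmatch(value) is None:
            dropped.append(name)  # skip only this field
            continue
        doc[name] = value
    return doc, dropped


doc, dropped = build_solr_doc([
    ("title", "text", "Replication data"),
    ("exampleIntegerField", "int", "22"),
    ("exampleFloatField", "float", "19"),
    ("exampleDateField", "date", "2024-09-01 12:34:56"),  # legacy format
])
print(doc)
print(dropped)
```

In this sketch the dataset stays findable through its other fields, while a query on exampleDateField would not return it.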

@cmbz cmbz added the FY25 Sprint 10 FY25 Sprint 10 (2024-11-06 - 2024-11-20) label Nov 7, 2024
@pdurbin pdurbin removed the Champion: pdurbin Championed by @pdurbin for inclusion in the next release label Nov 12, 2024
@pdurbin pdurbin self-assigned this Nov 15, 2024
@pdurbin pdurbin unassigned pdurbin and vera Nov 18, 2024
@ofahimIQSS ofahimIQSS self-assigned this Nov 19, 2024
@cmbz cmbz added the FY25 Sprint 11 FY25 Sprint 11 (2024-11-20 - 2024-12-04) label Nov 21, 2024
@ofahimIQSS
Contributor

ofahimIQSS commented Dec 3, 2024

I came across an issue while testing this in my local environment. After loading the sample TSV and updating + reloading the Solr schema as described in the docs, I did the following:

  1. Create a dataverse using the new Solr Metadata fields
  2. Create a dataset within the same collection, ensuring solr metadata fields are populated as follows:
    Integer: 22 Float: 19 Date: 2024
  3. Save and publish the dataset
  4. Go back to search and try to find the dataset

Issue: Dataset does not display on the UI after publishing

Screen.Recording.2024-12-03.at.1.29.58.PM.mov

server.log.txt

@pdurbin
Member

pdurbin commented Dec 3, 2024

@ofahimIQSS can you please provide more of server.log?

Also, let's add @vera to the PR to let her know you're having some trouble. Maybe she can help

@ofahimIQSS
Contributor

ofahimIQSS commented Dec 3, 2024

serverlog.txt
Attaching extended server.log file.

@vera One more note: this may be a problem with getting Solr configured rather than a code problem. I am still retesting on my end.

@pdurbin
Member

pdurbin commented Dec 3, 2024

@ofahimIQSS ah you're still re-testing. Yes, the error in your log...

dev_solr> org.apache.solr.common.SolrException: ERROR: [doc=dataset_65] unknown field 'exampleDateField'

... means that you need to update your schema.xml file.

I'm not sure if this helps, but I added this one-liner...

curl http://localhost:8080/api/admin/index/solr/schema | docker run -i --rm -v ./docker-dev-volumes/solr/data:/var/solr gdcc/configbaker:unstable update-fields.sh /var/solr/data/collection1/conf/schema.xml

... to a PR I'm working on at #11024.

@ofahimIQSS
Contributor

ofahimIQSS commented Dec 4, 2024

Hi Vera - I’ve tried to validate this ticket but could not see the datasets after I publish them on my local. Steps to reproduce:

  1. Build PR in local environment
  2. Load the solr tsv file
  3. Update the Solr schema: curl http://localhost:8080/api/admin/index/solr/schema | docker run -i --rm -v ./docker-dev-volumes/solr/data:/var/solr gdcc/configbaker:unstable update-fields.sh /var/solr/data/collection1/conf/schema.xml
  4. Restart Solr: in the local environment, go to the Solr container and run bin/solr restart
  5. Clear all data from Solr and start Async Reindex based on https://guides.dataverse.org/en/6.4/admin/solr-search-index.html
  6. Create a Collection with Solr Field Types Test Metadata
  7. Create a dataset within the collection and publish
    Issue: After publishing dataset, dataset is not appearing on the UI. Server.log file can be found below.
    server.log.txt

You can ignore this comment --- I have since resolved the issue from my end.

@ofahimIQSS
Contributor

Overall, the PR looks good. One observation I had was with the "Example Date Field": I can enter a two-digit value and save it as a date, even though the formats specified for that field are YYYY-MM-DD, YYYY-MM, or YYYY.

image

@cmbz cmbz added the FY25 Sprint 12 FY25 Sprint 12 (2024-12-04 - 2024-12-18) label Dec 5, 2024
@vera
Contributor Author

vera commented Dec 5, 2024

@ofahimIQSS thanks for testing! Yes, I am seeing the same behaviour. The document indexed in Solr for that metadata looks like this:

image

While exampleFloatField and exampleIntegerField are indexed, exampleDateField is not because "11" is an invalid date value (according to Solr).

I think this is a pre-existing bug/inconsistency in the date field validation code not caused by this PR. We might want to open an issue for that.

@pdurbin
Member

pdurbin commented Dec 5, 2024

Hmm, this makes me wonder about two and three digit years from ancient history.

For example, the philosopher Epictetus was born in 50 and died in 135. It sounds like Dataverse doesn't want to index these as real dates. Do we have to pad them as 0050 and 0135? (I haven't tried this.)

This issue about BCE dates is related:

To quote @vera from that issue, "Since the Solr search index underlying Dataverse supports BC dates (using the ISO 8601 format: 1 BC = +0000, 2 BC = -0001, and so on)."

@vera
Contributor Author

vera commented Dec 5, 2024

Yes, padding with zeroes seems to be the way intended by ISO8601/Solr. I just tried inputting "0011" instead of "11" and it indexes fine.
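The padding can be sketched as a one-step normalization. `pad_year` is a hypothetical helper for illustration, not something this PR adds.

```python
def pad_year(value: str) -> str:
    """Left-pad a 1-3 digit year with zeroes; leave other values unchanged."""
    if value.isascii() and value.isdigit() and len(value) < 4:
        return value.zfill(4)
    return value


for year in ["11", "50", "135", "2024"]:
    print(year, "->", pad_year(year))
```

Values that are not bare short years, such as 2024-11, pass through untouched in this sketch.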

Stop Solr (usually `service solr stop`, depending on Solr installation/OS, see the [Installation Guide](https://guides.dataverse.org/en/6.5/installation/prerequisites.html#solr-init-script)).

```shell
service solr stop
```

Review comment (Contributor), suggested change: replace `service solr stop` with `sudo service solr stop`.

## Upgrade Instructions

7\. Update Solr schema.xml file. Start with the standard v6.5 schema.xml, then, if your installation uses any custom or experimental metadata blocks, update it to include the extra fields (step 7a).

Review comment (Contributor), suggested change: "Run the commands below as a non-root user."


```shell
wget https://raw.githubusercontent.com/IQSS/dataverse/v6.5/conf/solr/schema.xml
cp schema.xml /usr/local/solr/solr-9.4.1/server/solr/collection1/conf
```

Review comment (Contributor), suggested change: replace `cp schema.xml /usr/local/solr/solr-9.4.1/server/solr/collection1/conf` with `sudo cp schema.xml /usr/local/solr/solr-9.4.1/server/solr/collection1/conf`.

Start Solr (but if you use any custom metadata blocks, perform the next step, 7a first).

```shell
service solr start
```

Review comment (Contributor), suggested change: replace `service solr start` with `sudo service solr start`.

```shell
wget https://raw.githubusercontent.com/IQSS/dataverse/v6.5/conf/solr/update-fields.sh
chmod +x update-fields.sh
curl "http://localhost:8080/api/admin/index/solr/schema" | ./update-fields.sh /usr/local/solr/solr-9.4.1/server/solr/collection1/conf/schema.xml
```

Review comment (Contributor), suggested change: replace `./update-fields.sh` with `sudo ./update-fields.sh` in the last command, i.e. `curl "http://localhost:8080/api/admin/index/solr/schema" | sudo ./update-fields.sh /usr/local/solr/solr-9.4.1/server/solr/collection1/conf/schema.xml`.

@ofahimIQSS
Contributor

I wanted to share another observation (an edge case) I found during testing, related to display. When entering very long values in the metadata fields, the numbers run past the border instead of wrapping.
image
To test, populate the Solr Field Types Test Metadata as follows:
  • Integer: many 0's followed by a 2, i.e. 0000...00002
  • Floating Point: 2.01000290329039, then copy and paste everything after the decimal point to make it long
  • Date: 2024-11-11 (or any valid date)

@pdurbin pdurbin mentioned this pull request Dec 5, 2024
@ofahimIQSS ofahimIQSS removed this from the 6.5 milestone Dec 6, 2024
@pdurbin pdurbin added this to the 6.6 milestone Dec 9, 2024
@pdurbin
Member

pdurbin commented Dec 9, 2024

@vera I'm afraid we had to bump the milestone to 6.6 but I think this feature will be very popular! Thanks again! ❤️

@cmbz cmbz added the FY25 Sprint 14 FY25 Sprint 14 (2025-01-02 - 2025-01-15) label Jan 2, 2025
Labels
  • FY25 Sprint 10 (2024-11-06 - 2024-11-20)
  • FY25 Sprint 11 (2024-11-20 - 2024-12-04)
  • FY25 Sprint 12 (2024-12-04 - 2024-12-18)
  • FY25 Sprint 14 (2025-01-02 - 2025-01-15)
  • GREI 3 Search and Browse
  • Size: 3 A percentage of a sprint. 2.1 hours.
  • Type: Feature a feature request
Projects
Status: QA ✅
8 participants