-
Notifications
You must be signed in to change notification settings - Fork 493
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: index numerical and date fields in Solr with appropriate types + more targeted search result highlighting #10887
base: develop
Are you sure you want to change the base?
Conversation
Additionally, I've set
https://solr.apache.org/guide/solr/latest/query-guide/highlighting.html Two reasons:
With this change, the highlighting is limited to specific fields if the query is: |
@qqmyers Would it make sense to include this feature in the next release? It's a rahter small adaption that improve the serach experience. |
@johannes-darms it's a very cool feature that adds a lot of value, something I've wanted for years. Let's see what @cmbz and @scolapasta think. |
One question - is it possible to enter or have legacy values that don't fit the new types that would break indexing? |
2024/10/15: Added to sprint ready after conversation with @pdurbin |
@qqmyers good question. When I tried to enter to invalid data (non-integer in an integer field, non-float in a float field, or non-date in a date field), I got the following errors via the UI: ...and via the API:
The validator code seems to be quite old, so I don't know if there could be any installations with legacy invalid values entered before it was added. However, looking at the validator code, I found that date fields allow some formats which are not documented: When I add a dataset using one of those formats and try to index it, the indexing fails. The Dataverse log shows an error like So, yes, there may be some installations with date field values using the above formats which would cause invisible datasets due to indexing errors. I am not sure how we should deal with this. Are those date formats intended to be officially fully supported/widely used? If no, it might be OK to offer a workaround in the upgrade instructions like "Please ensure that all date fields containing legacy dates in formats other than If yes, this feature becomes a bit more complicated, because Solr does not support those date formats and we would need to work around that somehow. |
@pdurbin I've just added an API test for the range queries as you suggested. |
I don't know what is required but I wouldn't be surprised if BC dates are something people want and want to have indexed. Perhaps @jggautier would know more about what legacy values exist and what's required. |
It wouldn't be a problem in general. Solr does support BC dates, however in a different format than
https://solr.apache.org/guide/solr/latest/indexing-guide/date-formatting-math.html
Is the code doing checks for API-submitted datasets different from the code doing the UI checks? I assumed it was both the same code I linked above, since the error messages are the same.
Yes, that would be nice. |
Hi all. I haven't been following this issue closely enough to contribute and won't have the time to catch up. But I agree with Jim and encourage folks to look into how others have and are using these fields. My dataset at https://doi.org/10.7910/DVN/2SA6SN might be helpful for seeing who's using the fields in different ways. And the list of contacts in our spreadsheet of Dataverse installations might help for contacting particular installations to learn more. |
Here's the related issue about BC dates: |
…ing invalid ints, floats or dates
I've just pushed a commit that implements the suggestion above (if encountering a bad legacy value in an int/float/date field, just drop that field, but index the rest of the dataset). So, if a dataset contains a bad legacy value in an int/float/date field, this means that queries on that field will not yield that dataset, since the field hasn't been indexed. But for any other query, the dataset will still be found. (I've also added a test showing this) I think this is a small limitation. And we could add BC support relatively easily in a future PR. |
I came across an issue while testing this in my local.
Issue: Dataset does not display on the UI after publishishing Screen.Recording.2024-12-03.at.1.29.58.PM.mov |
@ofahimIQSS can you please provide more of server.log? Also, let's add @vera to the PR to let her know you're having some trouble. Maybe she can help |
serverlog.txt @vera One more note it may be a problem getting solr configured/ not a code problem. --- I am still retesting on my end |
@ofahimIQSS ah you're still re-testing. Yes, the error in your log...
... means that you need to update your schema.xml file. I'm not sure if this helps, but I added this one-liner...
... to a PR I'm working on at #11024. |
Hi Vera - I’ve tried to validate this ticket but could not see the datasets after I publish them on my local. Steps to reproduce:
You can ignore this comment --- I have since resolved the issue from my end. |
@ofahimIQSS thanks for testing! Yes, I am seeing the same behaviour. The document indexed in Solr for that metadata looks like this: While I think this is a pre-existing bug/inconsistency in the date field validation code not caused by this PR. We might want to open an issue for that. |
Hmm, this makes me wonder about two and three digit years from ancient history. For example, the philosopher Epictetus was born in 50 and died in 135. It sounds like Dataverse doesn't want to index these as real dates. Do we have to pad them as 0050 and 0135? (I haven't tried this.) This issue about BCE dates is related: To quote @vera from that issue, "Since the Solr search index underlying Dataverse supports BC dates (using the ISO 8601 format: 1 BC = +0000, 2 BC = -0001, and so on)." |
Yes, padding with zeroes seems to be the way intended by ISO8601/Solr. I just tried inputting "0011" instead of "11" and it indexes fine. |
Stop Solr (usually `service solr stop`, depending on Solr installation/OS, see the [Installation Guide](https://guides.dataverse.org/en/6.5/installation/prerequisites.html#solr-init-script)). | ||
|
||
```shell | ||
service solr stop |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
service solr stop | |
sudo service solr stop |
## Upgrade Instructions | ||
|
||
7\. Update Solr schema.xml file. Start with the standard v6.5 schema.xml, then, if your installation uses any custom or experimental metadata blocks, update it to include the extra fields (step 7a). | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Run the commands below as a non-root user. |
|
||
```shell | ||
wget https://raw.githubusercontent.com/IQSS/dataverse/v6.5/conf/solr/schema.xml | ||
cp schema.xml /usr/local/solr/solr-9.4.1/server/solr/collection1/conf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cp schema.xml /usr/local/solr/solr-9.4.1/server/solr/collection1/conf | |
sudo cp schema.xml /usr/local/solr/solr-9.4.1/server/solr/collection1/conf |
Start Solr (but if you use any custom metadata blocks, perform the next step, 7a first). | ||
|
||
```shell | ||
service solr start |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
service solr start | |
sudo service solr start |
```shell | ||
wget https://raw.githubusercontent.com/IQSS/dataverse/v6.5/conf/solr/update-fields.sh | ||
chmod +x update-fields.sh | ||
curl "http://localhost:8080/api/admin/index/solr/schema" | ./update-fields.sh /usr/local/solr/solr-9.4.1/server/solr/collection1/conf/schema.xml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
curl "http://localhost:8080/api/admin/index/solr/schema" | ./update-fields.sh /usr/local/solr/solr-9.4.1/server/solr/collection1/conf/schema.xml | |
curl "http://localhost:8080/api/admin/index/solr/schema" | sudo ./update-fields.sh /usr/local/solr/solr-9.4.1/server/solr/collection1/conf/schema.xml |
@vera I'm afraid we had to bump the milestone to 6.6 but I think this feature will be very popular! Thanks again! ❤️ |
What this PR does / why we need it:
Currently, all fields regardless of type are indexed in Solr as English text (
text_en
). With this PR, numerical and date fields are indexed in Solr with appropriate types:int
plong
float
pdouble
date
date_range
(solr.DateRangeField
)I chose to index dates as
DateRangeField
because they can be used to represent dates to any precision, e.g. a day YYYY-MM-DD, a month YYYY-MM or a year YYYY. See: Date Formatting and Date Math :: Apache Solr Reference GuideThis matches the allowed formats in a date field as defined by Dataverse.
This means that range queries are now possible on numerical and date fields, e.g.
exampleIntegerField:[25 TO 50]
orexampleDateField:[2000-11-01 TO 2014-12-01]
.Which issue(s) this PR closes:
This PR implements ranged queries as discussed in #370 (issue was already closed)
This issue is related to #8813 and IQSS/dataverse-frontend#278 (the range queries that are now possible lay the groundwork for a nicer search facet UI)
Special notes for your reviewer:
For testing, I've created a sample TSV containing all relevant fields here.
Suggestions on how to test this:
Regression Testing
Feature testing
exampleIntegerField:[25 TO 50]
orexampleDateField:[2000-11-01 TO 2014-12-01]
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Facets still look the same as before. There is only a small change in the highlighting of search results, see my comment below
Is there a release notes update needed for this change?:
Yes, there should be an info text describing the new feature + instructions for how to activate the feature:
Additional documentation:
/