Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse datetimes and timestamps with leading and/or trailing whitespace #5544

Merged
merged 1 commit into from
Nov 26, 2024

Conversation

guilload
Copy link
Member

@guilload guilload commented Nov 6, 2024

Description

Per title. Request by Airmail. Supported by ES.

How was this PR tested?

  • Updated unit tests
  • make test-all

Copy link
Contributor

@trinity-1686a trinity-1686a left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if for Strptime patterns we might not want to trim the parser spec: if the parser is %Y (this pattern only parse years, but that's not the point), it won't parse 2024 as valid because it will expect those spaces we just removed

@guilload
Copy link
Member Author

guilload commented Nov 6, 2024

That makes sense. Let me add a commit on top of this PR.

@@ -36,22 +36,24 @@ pub fn parse_date_time_str(
date_time_str: &str,
date_time_formats: &[DateTimeInputFormat],
) -> Result<TantivyDateTime, String> {
let trimmed_date_time_str = date_time_str.trim();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let trimmed_date_time_str = date_time_str.trim();
let trimmed_date_time_str = date_time_str.trim_ascii();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is probably minor, but triming unicode is considerably more expensive than trimming ascii.
(It requires decoding utf-8, and check if each individual char is a whitespace or not. )

I'd feel safer if we restricted ourselves to ascii. It will just prevent us from trimming weird whitespace like the japanese " ".

@fulmicoton
Copy link
Contributor

@guilload can you merge this?

@trinity-1686a
Copy link
Contributor

it seems the strptime is already lenient to whitespaces inside the pattern (but not the parsed string, which we trim ourselves anyway). I updated the tests in consequence

the java-compatible parser used for range requests doesn't have access to that trim code I think, but it's not going to reject documents, so I don't think we necessarily care as much

@fulmicoton
Copy link
Contributor

@guilload can we merge this?

@guilload guilload force-pushed the guilload/parse-date-leading-trailing-whitespace branch 2 times, most recently from e0275e5 to 44e60e5 Compare November 22, 2024 21:52
@guilload
Copy link
Member Author

Yes, we should be good.

@guilload guilload force-pushed the guilload/parse-date-leading-trailing-whitespace branch from 44e60e5 to b119201 Compare November 25, 2024 17:03
@guilload guilload force-pushed the guilload/parse-date-leading-trailing-whitespace branch from 32af9e8 to 334b578 Compare November 26, 2024 14:04
@guilload guilload merged commit aa600c9 into main Nov 26, 2024
4 of 5 checks passed
@guilload guilload deleted the guilload/parse-date-leading-trailing-whitespace branch November 26, 2024 19:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants