Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve CAST(STRING AS TIMESTAMP) in Hive to Trino conversion. #543

Open
kupferk opened this issue Nov 19, 2024 · 1 comment
Open

Improve CAST(STRING AS TIMESTAMP) in Hive to Trino conversion. #543

kupferk opened this issue Nov 19, 2024 · 1 comment

Comments

@kupferk
Copy link

kupferk commented Nov 19, 2024

Background

We are using the built-in support of Trino to access existing Hive views. Under the hood, Trino uses Coral to perform this conversion. Syntactically, most queries work out perfectly fine. But now we encountered a case, where the behaviour of the same SQL expression differs between Trino and Hive/Impala/Spark.

Problem

Hive/Impala support a wider range of timestamp formats in a CAST(STRING AS TIMESTAMP) expression. Specifically, the following two CAST operations work in Hive/Impala/Spark, but not in Trino:

  • SELECT CAST('2020-08-27T13:58:02.483Z' AS TIMESTAMP);
  • SELECT CAST('2020-08-27T13:58:02.483' AS TIMESTAMP);

The problem is the "T" in between the date and the time. The following two queries work as expected:

  • SELECT CAST('2020-08-27 13:58:02.483Z' AS TIMESTAMP);
  • SELECT CAST('2020-08-27 13:58:02.483' AS TIMESTAMP);

Unfortunately, we have both datetime formats in our data and therefore cannot simply pick one of them.

Moreover, a failing conversion results in a query failure in Trino, while Hive/Impala/Spark will simply produce NULL values (quite debatable which behaviour is preferred in the context of a Hive to Trino conversion).

Suggested Solution

I was experimenting with different expressions in Trino for implementing a conversion, which supports both value types, and came up with the following expression:

  • SELECT COALESCE(TRY(CAST('2020-08-27T13:58:02.483Z' AS TIMESTAMP)), CAST(TRY(from_iso8601_timestamp('2020-08-27T13:58:02.483Z')) AS TIMESTAMP WITHOUT TIME ZONE));

Of course, the real SQL wouldn't use fixed strings, but a column from the source table instead. This expression provides a compatible behaviour in the sense that:

  • It supports two different time formats (with and without a "T")
  • It doesn't fail the query on failed conversions

Suggested Implementation

I know that this is a very special corner case, and I was wondering where this expression replacement should be implemented. I would not suggest to implement it in the Hive to Coral translation, and I also would not suggest to implement it in the Coral to Trino translation. Instead, I propose to implement that very special rule in the wrapper Hive to Trino translation class by implementing an additional translation pass.

If you agree on this approach, I am optimistic that I can provide a corresponding pull-request. But I first want to discuss the issue with you to avoid going into a wrong direction.

@hqbhoho
Copy link
Contributor

hqbhoho commented Dec 19, 2024

I file a PR use 'TRY_CAST' instead of 'CAST' #544

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants