Implicit cast change in DBR 13.3 can cause failures in Silver Spark modules #1311

Open
neilbest-db opened this issue Dec 5, 2024 · 0 comments
Labels: bug, data quality, schema change

neilbest-db commented Dec 5, 2024

Overwatch Version: 0.8.2.0

In raw Spark event logs, and therefore in Overwatch table spark_events_bronze, the field ExecutorID is usually a number but occasionally it gets the value 'driver'.

In Overwatch deployments where this special value is present in the first run for a given target storage location, Spark infers STRING type for that column and this issue never occurs.

In Overwatch deployments where no such special value is present in the first run for a given target storage location, Spark infers BIGINT/Long for that column and creates the spark_*_silver target tables with the same type. In some later Overwatch ETL run, when 'driver' shows up in that column in spark_events_bronze, one of two things can happen when persisting results to spark_*_silver tables, depending on the DBR version:

  • DBR 11.3: Spark silently converts 'driver' to NULL while implicitly casting values like '0' and '105' to BIGINTs. This behavior is available in later DBRs by setting configuration property spark.sql.storeAssignmentPolicy to legacy, but this is not explicitly set anywhere in the Overwatch code as of release 0.8.2.0.

  • DBR 13.3: By default, spark.sql.storeAssignmentPolicy is set to ANSI, which causes a runtime exception when attempting to implicitly cast 'driver' to BIGINT. For details, see "Safe casts enabled by default for Delta Lake operations" in the DBR 13.3 release notes and "ANSI compliance in Databricks Runtime" in the Databricks SQL language reference.
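
The difference between the two behaviors can be illustrated with a toy model in plain Python. This is only a sketch of the semantics described above, not Spark code: the function names `store_legacy` and `store_ansi` are hypothetical, Python's `int()` stands in for the implicit cast to BIGINT, and `ValueError` stands in for the runtime exception that Spark raises.

```python
from typing import List, Optional


def store_legacy(values: List[str]) -> List[Optional[int]]:
    """Toy model of the DBR 11.3 behavior: unparseable strings become None (NULL)."""
    result: List[Optional[int]] = []
    for value in values:
        try:
            result.append(int(value))
        except ValueError:
            result.append(None)  # 'driver' is silently lost here
    return result


def store_ansi(values: List[str]) -> List[int]:
    """Toy model of the DBR 13.3 default: an unparseable string aborts the write."""
    return [int(value) for value in values]  # int('driver') raises ValueError


executor_ids = ["0", "105", "driver"]

# Legacy policy: the write succeeds, but 'driver' turns into None.
legacy_rows = store_legacy(executor_ids)

# ANSI policy: the same input makes the whole write fail.
try:
    store_ansi(executor_ids)
    ansi_failed = False
except ValueError:
    ansi_failed = True
```

In this model `legacy_rows` ends up as `[0, 105, None]` while the ANSI variant fails outright, which mirrors the silent data loss on DBR 11.3 versus the hard failure on DBR 13.3.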

The Silver Spark modules should be future-proofed for DBR > 11.3 by explicitly designating STRING type for ExecutorID columns or some equivalent solution.

Two workarounds are available in the meantime:

  • Use DBR 11.3 for Overwatch ETL runs and accept this minor data loss in spark_*_silver tables, i.e. ExecutorID will be NULL wherever its upstream source column of type STRING in spark_events_bronze holds 'driver'.
  • Or manually adjust the target schemas according to the guidance in "Explicitly update schema to change column type or name".
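
A minimal PySpark sketch of the second workaround, following the read, cast, and overwrite pattern from that guidance. Everything specific here is an assumption: the helper name `rewrite_executor_id_as_string` is hypothetical, the table is assumed to be a Delta table registered in the metastore, and the column is assumed to be a top-level column named `ExecutorID` (the actual name and nesting in each spark_*_silver table should be checked first). Whether partitioning and table properties survive the overwrite should be verified on a copy of the table before running this against a real deployment.

```python
def rewrite_executor_id_as_string(spark, table_name, column="ExecutorID"):
    """Rewrite a Delta table so that `column` is stored as STRING.

    `spark` is an active SparkSession (predefined in Databricks notebooks).
    `table_name` and `column` are placeholders supplied by the caller.
    """
    df = spark.read.table(table_name)
    (
        df.withColumn(column, df[column].cast("string"))  # BIGINT -> STRING
        .write.format("delta")
        .mode("overwrite")
        .option("overwriteSchema", "true")  # allow the column type to change
        .saveAsTable(table_name)
    )
```

It would be called once per affected table while no Overwatch ETL run is writing to it, for example `rewrite_executor_id_as_string(spark, "etl_db.spark_tasks_silver")`, where the database and table names are placeholders.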

neilbest-db added the bug, data quality, and schema change labels Dec 5, 2024
neilbest-db added this to the 0.9.0.0 milestone Dec 5, 2024
neilbest-db self-assigned this Dec 5, 2024