Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Be looser about case when checking schemas #176

Merged
merged 1 commit into from
Feb 8, 2024
Merged

Be looser about case when checking schemas #176

merged 1 commit into from
Feb 8, 2024

Conversation

mikix
Copy link
Contributor

@mikix mikix commented Feb 7, 2024

This fixes a couple bugs:

  • We were not lower-casing the result of asking DuckDB for its schema.
    So we were searching inside its schema results for the exact camel
    case string in duckdb but Athena would need a lower case string.
    Impossible to work in both. So double-word depth-two-fields would
    never be matched.
  • For depth-two fields, we were also assuming that we would be given
    fields in the same order that the SQL table would return them.
    So if you didn't provide them in that precise order (like if you used
    start, end but the SQL table had end, start -- you'd miss one).
    This was the exact case for Encounter fields in duckdb.

This commit allows the following new features/conveniences:

  • Case-insensitive schema lookups (i.e. table name and column name
    arguments to get_column_datatype_query() can be any case)
  • Case-insensitive schema validation (i.e. the expected column names
    passed to validate_table_schema() can be any case)
  • The resulting validated schema will use the original case used by
    the expected-fields dictionary. So that the user will not be
    surprised and we'll use camel case if they do, lower case if they do.

Checklist

  • Consider if documentation (like in docs/) needs to be updated
  • Consider if tests should be added
  • Update template repo if there are changes to study configuration

Comment on lines -74 to +76
assert parsed, expected
assert parsed == expected
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AHEM. 😭

Fixing this bad assert (point in favor of assertEqual I suppose...) revealed that duckdb's version that I wrote is returning naive timestamps - which I think I actually did manually change from aware timestamps at one point... but Athena looks like it uses aware ones too.... Needs more investigation. I just fixed the tests to work here, but didn't change any business logic for now.

Added a TODO in the code though.

('83d0d564-3bbf-48eb-7445-bd2b81130671', {'id': None, 'code': 'EMER', 'display': None, 'system': 'http://terminology.hl7.org/CodeSystem/v3-ActCode', 'userSelected': None, 'version': None}, 'EMER', 'emergency', 'finished', None, None, None, None, None, None, '55680006', 'http://snomed.info/sct', 69, datetime.date(2018, 6, 16), datetime.date(2018, 6, 16), datetime.date(2018, 6, 11), datetime.date(2018, 6, 1), datetime.date(2018, 1, 1), 'Patient/16be855b-ece2-8b96-1ef9-a4d93adf3289', 'Encounter/83d0d564-3bbf-48eb-7445-bd2b81130671', 'male', 'white', 'not hispanic or latino', '660')
('83d0d564-3bbf-48eb-7445-bd2b81130671', {'id': None, 'code': 'EMER', 'display': None, 'system': 'http://terminology.hl7.org/CodeSystem/v3-ActCode', 'userSelected': None, 'version': None}, 'EMER', 'emergency', 'finished', None, None, None, None, None, None, '55680006', 'http://snomed.info/sct', 69, datetime.date(2018, 6, 16), datetime.date(2018, 6, 17), datetime.date(2018, 6, 11), datetime.date(2018, 6, 1), datetime.date(2018, 1, 1), 'Patient/16be855b-ece2-8b96-1ef9-a4d93adf3289', 'Encounter/83d0d564-3bbf-48eb-7445-bd2b81130671', 'male', 'white', 'not hispanic or latino', '660')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes in the encounter tables were because of the bug where after the old code split the schema string, future searches were only in the second half of the split.

If you gave an expected field list with a different order than DuckDB gives it back (in this case, the encounter builder was feeding (start, end) but the schema was giving back (end, start)), then the end field would never be detected as existing in the schema. So for DuckDB and Athena, the end field would always fall back to the start value.

col_schema = col_schema[0]
else:
output[column][field] = True
col_schema = col_schema[1]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This here was the source of one of the bugs - after splitting on each field, future scans would only look at the part after the split. So order of fields matters in a way that is hard to predict.

I've switched from splitting entirely and just search in the entire column schema for each field.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this does create a different kind of bug, though maybe note as critical as the found one: for columns where a object parameter occurs more than once, this may over-identify data as being present. specifically, id shows up in a lot of places.

if this :does: hit, it will prevent a table from being run, which is the spirit of the previous version - take a few false negatives for more true positives.

If this is blocking you now, that's fine, but we might need to more robustly solve this in the near future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We chatted about this in person - I think we agree it is maybe not a new bug but just another instance of the existing flakiness of "searching in a big string for a field". We need to move to per-db-backend schema parsing to fix this and to allow deeper-depth schema searching.

@@ -4,5 +4,5 @@ SELECT
FROM information_schema.columns
WHERE
table_schema = '{{ schema_name }}'
AND table_name = '{{ table_name }}'
AND LOWER(column_name) IN ('{{ column_names|join("', '") }}') --noqa: LT05
AND table_name = '{{ table_name|lower }}'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this same line with LOWER(table_name) and Athena took a minute and then errored out about DELTA_LAKE_INVALID_SCHEMA and Metadata not found in transaction log... etc. 🤷 - removing the lower fixes that oddity and doesn't super matter - we don't make mixed case tables today.

@mikix mikix changed the title WIP: Be looser about case when checking schemas Be looser about case when checking schemas Feb 8, 2024
@mikix mikix marked this pull request as ready for review February 8, 2024 15:48
@mikix mikix force-pushed the mikix/case-fixes branch 3 times, most recently from 59d82d5 to 3a34973 Compare February 8, 2024 16:22
This fixes a couple bugs:
- We were not lower-casing the result of asking DuckDB for its schema.
  So we were searching inside its schema results for the exact camel
  case string in duckdb but Athena would need a lower case string.
  Impossible to work in both. So double-word depth-two-fields would
  never be matched.
- For depth-two fields, we were also assuming that we would be given
  fields in the same order that the SQL table would return them.
  So if you didn't provide them in that precise order (like if you used
  start, end but the SQL table had end, start -- you'd miss one).
  This was the exact case for Encounter fields in duckdb.

This commit allows the following new features/conveniences:
- Case-insensitive schema lookups (i.e. table name and column name
  arguments to get_column_datatype_query() can be any case)
- Case-insensitive schema validation (i.e. the expected column names
  passed to validate_table_schema() can be any case)
- The resulting validated schema will use the original case used by
  the expected-fields dictionary. So that the user will not be
  surprised and we'll use camel case if they do, lower case if they do.
date_trunc('{{ col[3] }}', date(from_iso8601_timestamp({{ alias }}.{{ col[0] }}.{{ col[1] }})))
date_trunc('{{ col[3] }}', date(from_iso8601_timestamp({{ alias }}."{{ col[0] }}"."{{ col[1] }}")))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This came up because now that the regression test actually sees the period.end field, it tries to access it. But end is a keyword that can't just be injected like that.

So I took the safest but ugliest approach - quoting each arbitrary field access.

  • A less safe but prettier option is just quoting the last field access. That's where trickier words like end and date are likely to show up.
  • The least safe but prettiest option is to have the caller of this quote it. Like ('period', '"end"', 'period_end_day', 'day'), when calling truncate_date_cols

col_schema = col_schema[0]
else:
output[column][field] = True
col_schema = col_schema[1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this does create a different kind of bug, though maybe note as critical as the found one: for columns where a object parameter occurs more than once, this may over-identify data as being present. specifically, id shows up in a lot of places.

if this :does: hit, it will prevent a table from being run, which is the spirit of the previous version - take a few false negatives for more true positives.

If this is blocking you now, that's fine, but we might need to more robustly solve this in the near future.

@mikix mikix merged commit 78045a3 into main Feb 8, 2024
3 checks passed
@mikix mikix deleted the mikix/case-fixes branch February 8, 2024 17:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants