Be looser about case when checking schemas #176

mikix · 2024-02-07T21:40:00Z

This fixes a couple bugs:

We were not lower-casing the result of asking DuckDB for its schema.
So we were searching inside its schema results for the exact camel
case string in duckdb but Athena would need a lower case string.
Impossible to work in both. So double-word depth-two-fields would
never be matched.
For depth-two fields, we were also assuming that we would be given
fields in the same order that the SQL table would return them.
So if you didn't provide them in that precise order (like if you used
start, end but the SQL table had end, start -- you'd miss one).
This was the exact case for Encounter fields in duckdb.

This commit allows the following new features/conveniences:

Case-insensitive schema lookups (i.e. table name and column name
arguments to get_column_datatype_query() can be any case)
Case-insensitive schema validation (i.e. the expected column names
passed to validate_table_schema() can be any case)
The resulting validated schema will use the original case used by
the expected-fields dictionary. So that the user will not be
surprised and we'll use camel case if they do, lower case if they do.

Checklist

Consider if documentation (like in docs/) needs to be updated
Consider if tests should be added
Update template repo if there are changes to study configuration

mikix · 2024-02-07T21:44:18Z

tests/test_duckdb.py

-    assert parsed, expected
+    assert parsed == expected


AHEM. 😭

Fixing this bad assert (point in favor of assertEqual I suppose...) revealed that duckdb's version that I wrote is returning naive timestamps - which I think I actually did manually change from aware timestamps at one point... but Athena looks like it uses aware ones too.... Needs more investigation. I just fixed the tests to work here, but didn't change any business logic for now.

Added a TODO in the code though.

mikix · 2024-02-08T15:36:36Z

tests/test_data/core/core__encounter.txt

-('83d0d564-3bbf-48eb-7445-bd2b81130671', {'id': None, 'code': 'EMER', 'display': None, 'system': 'http://terminology.hl7.org/CodeSystem/v3-ActCode', 'userSelected': None, 'version': None}, 'EMER', 'emergency', 'finished', None, None, None, None, None, None, '55680006', 'http://snomed.info/sct', 69, datetime.date(2018, 6, 16), datetime.date(2018, 6, 16), datetime.date(2018, 6, 11), datetime.date(2018, 6, 1), datetime.date(2018, 1, 1), 'Patient/16be855b-ece2-8b96-1ef9-a4d93adf3289', 'Encounter/83d0d564-3bbf-48eb-7445-bd2b81130671', 'male', 'white', 'not hispanic or latino', '660')
+('83d0d564-3bbf-48eb-7445-bd2b81130671', {'id': None, 'code': 'EMER', 'display': None, 'system': 'http://terminology.hl7.org/CodeSystem/v3-ActCode', 'userSelected': None, 'version': None}, 'EMER', 'emergency', 'finished', None, None, None, None, None, None, '55680006', 'http://snomed.info/sct', 69, datetime.date(2018, 6, 16), datetime.date(2018, 6, 17), datetime.date(2018, 6, 11), datetime.date(2018, 6, 1), datetime.date(2018, 1, 1), 'Patient/16be855b-ece2-8b96-1ef9-a4d93adf3289', 'Encounter/83d0d564-3bbf-48eb-7445-bd2b81130671', 'male', 'white', 'not hispanic or latino', '660')


These changes in the encounter tables were because of the bug where after the old code split the schema string, future searches were only in the second half of the split.

If you gave an expected field list with a different order than DuckDB gives it back (in this case, the encounter builder was feeding (start, end) but the schema was giving back (end, start)), then the end field would never be detected as existing in the schema. So for DuckDB and Athena, the end field would always fall back to the start value.

mikix · 2024-02-08T15:37:53Z

cumulus_library/databases.py

-                            col_schema = col_schema[0]
-                        else:
-                            output[column][field] = True
-                            col_schema = col_schema[1]


This here was the source of one of the bugs - after splitting on each field, future scans would only look at the part after the split. So order of fields matters in a way that is hard to predict.

I've switched from splitting entirely and just search in the entire column schema for each field.

So, this does create a different kind of bug, though maybe note as critical as the found one: for columns where a object parameter occurs more than once, this may over-identify data as being present. specifically, id shows up in a lot of places.

if this :does: hit, it will prevent a table from being run, which is the spirit of the previous version - take a few false negatives for more true positives.

If this is blocking you now, that's fine, but we might need to more robustly solve this in the near future.

We chatted about this in person - I think we agree it is maybe not a new bug but just another instance of the existing flakiness of "searching in a big string for a field". We need to move to per-db-backend schema parsing to fix this and to allow deeper-depth schema searching.

mikix · 2024-02-08T15:46:12Z

cumulus_library/template_sql/column_datatype.sql.jinja

@@ -4,5 +4,5 @@ SELECT
 FROM information_schema.columns
 WHERE
    table_schema = '{{ schema_name }}'
-    AND table_name = '{{ table_name }}'
-    AND LOWER(column_name) IN ('{{ column_names|join("', '") }}') --noqa: LT05
+    AND table_name = '{{ table_name|lower }}'


I tried this same line with LOWER(table_name) and Athena took a minute and then errored out about DELTA_LAKE_INVALID_SCHEMA and Metadata not found in transaction log... etc. 🤷 - removing the lower fixes that oddity and doesn't super matter - we don't make mixed case tables today.

This fixes a couple bugs: - We were not lower-casing the result of asking DuckDB for its schema. So we were searching inside its schema results for the exact camel case string in duckdb but Athena would need a lower case string. Impossible to work in both. So double-word depth-two-fields would never be matched. - For depth-two fields, we were also assuming that we would be given fields in the same order that the SQL table would return them. So if you didn't provide them in that precise order (like if you used start, end but the SQL table had end, start -- you'd miss one). This was the exact case for Encounter fields in duckdb. This commit allows the following new features/conveniences: - Case-insensitive schema lookups (i.e. table name and column name arguments to get_column_datatype_query() can be any case) - Case-insensitive schema validation (i.e. the expected column names passed to validate_table_schema() can be any case) - The resulting validated schema will use the original case used by the expected-fields dictionary. So that the user will not be surprised and we'll use camel case if they do, lower case if they do.

mikix · 2024-02-08T16:33:07Z

cumulus_library/studies/core/core_templates/core_utils.jinja

-        date_trunc('{{ col[3] }}', date(from_iso8601_timestamp({{ alias }}.{{ col[0] }}.{{ col[1] }}))) 
+        date_trunc('{{ col[3] }}', date(from_iso8601_timestamp({{ alias }}."{{ col[0] }}"."{{ col[1] }}")))


This came up because now that the regression test actually sees the period.end field, it tries to access it. But end is a keyword that can't just be injected like that.

So I took the safest but ugliest approach - quoting each arbitrary field access.

A less safe but prettier option is just quoting the last field access. That's where trickier words like end and date are likely to show up.

The least safe but prettiest option is to have the caller of this quote it. Like ('period', '"end"', 'period_end_day', 'day'), when calling truncate_date_cols

dogversioning · 2024-02-08T17:12:04Z

cumulus_library/databases.py

-                            col_schema = col_schema[0]
-                        else:
-                            output[column][field] = True
-                            col_schema = col_schema[1]


So, this does create a different kind of bug, though maybe note as critical as the found one: for columns where a object parameter occurs more than once, this may over-identify data as being present. specifically, id shows up in a lot of places.

if this :does: hit, it will prevent a table from being run, which is the spirit of the previous version - take a few false negatives for more true positives.

If this is blocking you now, that's fine, but we might need to more robustly solve this in the near future.

mikix commented Feb 7, 2024

View reviewed changes

mikix force-pushed the mikix/case-fixes branch from 8818757 to e98aecd Compare February 8, 2024 15:31

mikix commented Feb 8, 2024

View reviewed changes

mikix force-pushed the mikix/case-fixes branch from e98aecd to b6782ee Compare February 8, 2024 15:42

mikix commented Feb 8, 2024

View reviewed changes

mikix changed the title ~~WIP: Be looser about case when checking schemas~~ Be looser about case when checking schemas Feb 8, 2024

mikix marked this pull request as ready for review February 8, 2024 15:48

mikix force-pushed the mikix/case-fixes branch 3 times, most recently from 59d82d5 to 3a34973 Compare February 8, 2024 16:22

mikix force-pushed the mikix/case-fixes branch from 3a34973 to dc5e294 Compare February 8, 2024 16:30

mikix commented Feb 8, 2024

View reviewed changes

dogversioning approved these changes Feb 8, 2024

View reviewed changes

mikix merged commit 78045a3 into main Feb 8, 2024
3 checks passed

mikix deleted the mikix/case-fixes branch February 8, 2024 17:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Be looser about case when checking schemas #176

Be looser about case when checking schemas #176

mikix commented Feb 7, 2024 •

edited

Loading

mikix Feb 7, 2024

mikix Feb 8, 2024

mikix Feb 8, 2024

dogversioning Feb 8, 2024

mikix Feb 8, 2024

mikix Feb 8, 2024

mikix Feb 8, 2024

dogversioning Feb 8, 2024

		('83d0d564-3bbf-48eb-7445-bd2b81130671', {'id': None, 'code': 'EMER', 'display': None, 'system': 'http://terminology.hl7.org/CodeSystem/v3-ActCode', 'userSelected': None, 'version': None}, 'EMER', 'emergency', 'finished', None, None, None, None, None, None, '55680006', 'http://snomed.info/sct', 69, datetime.date(2018, 6, 16), datetime.date(2018, 6, 16), datetime.date(2018, 6, 11), datetime.date(2018, 6, 1), datetime.date(2018, 1, 1), 'Patient/16be855b-ece2-8b96-1ef9-a4d93adf3289', 'Encounter/83d0d564-3bbf-48eb-7445-bd2b81130671', 'male', 'white', 'not hispanic or latino', '660')
		('83d0d564-3bbf-48eb-7445-bd2b81130671', {'id': None, 'code': 'EMER', 'display': None, 'system': 'http://terminology.hl7.org/CodeSystem/v3-ActCode', 'userSelected': None, 'version': None}, 'EMER', 'emergency', 'finished', None, None, None, None, None, None, '55680006', 'http://snomed.info/sct', 69, datetime.date(2018, 6, 16), datetime.date(2018, 6, 17), datetime.date(2018, 6, 11), datetime.date(2018, 6, 1), datetime.date(2018, 1, 1), 'Patient/16be855b-ece2-8b96-1ef9-a4d93adf3289', 'Encounter/83d0d564-3bbf-48eb-7445-bd2b81130671', 'male', 'white', 'not hispanic or latino', '660')

		date_trunc('{{ col[3] }}', date(from_iso8601_timestamp({{ alias }}.{{ col[0] }}.{{ col[1] }})))
		date_trunc('{{ col[3] }}', date(from_iso8601_timestamp({{ alias }}."{{ col[0] }}"."{{ col[1] }}")))

Be looser about case when checking schemas #176

Be looser about case when checking schemas #176

Conversation

mikix commented Feb 7, 2024 • edited Loading

Checklist

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mikix commented Feb 7, 2024 •

edited

Loading