-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Be looser about case when checking schemas #176
Conversation
assert parsed, expected | ||
assert parsed == expected |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AHEM. 😭
Fixing this bad assert (point in favor of assertEqual I suppose...) revealed that duckdb's version that I wrote is returning naive timestamps - which I think I actually did manually change from aware timestamps at one point... but Athena looks like it uses aware ones too.... Needs more investigation. I just fixed the tests to work here, but didn't change any business logic for now.
Added a TODO in the code though.
8818757
to
e98aecd
Compare
('83d0d564-3bbf-48eb-7445-bd2b81130671', {'id': None, 'code': 'EMER', 'display': None, 'system': 'http://terminology.hl7.org/CodeSystem/v3-ActCode', 'userSelected': None, 'version': None}, 'EMER', 'emergency', 'finished', None, None, None, None, None, None, '55680006', 'http://snomed.info/sct', 69, datetime.date(2018, 6, 16), datetime.date(2018, 6, 16), datetime.date(2018, 6, 11), datetime.date(2018, 6, 1), datetime.date(2018, 1, 1), 'Patient/16be855b-ece2-8b96-1ef9-a4d93adf3289', 'Encounter/83d0d564-3bbf-48eb-7445-bd2b81130671', 'male', 'white', 'not hispanic or latino', '660') | ||
('83d0d564-3bbf-48eb-7445-bd2b81130671', {'id': None, 'code': 'EMER', 'display': None, 'system': 'http://terminology.hl7.org/CodeSystem/v3-ActCode', 'userSelected': None, 'version': None}, 'EMER', 'emergency', 'finished', None, None, None, None, None, None, '55680006', 'http://snomed.info/sct', 69, datetime.date(2018, 6, 16), datetime.date(2018, 6, 17), datetime.date(2018, 6, 11), datetime.date(2018, 6, 1), datetime.date(2018, 1, 1), 'Patient/16be855b-ece2-8b96-1ef9-a4d93adf3289', 'Encounter/83d0d564-3bbf-48eb-7445-bd2b81130671', 'male', 'white', 'not hispanic or latino', '660') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes in the encounter tables were because of the bug where after the old code split
the schema string, future searches were only in the second half of the split.
If you gave an expected field list with a different order than DuckDB gives it back (in this case, the encounter builder was feeding (start, end) but the schema was giving back (end, start)), then the end
field would never be detected as existing in the schema. So for DuckDB and Athena, the end
field would always fall back to the start
value.
col_schema = col_schema[0] | ||
else: | ||
output[column][field] = True | ||
col_schema = col_schema[1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This here was the source of one of the bugs - after splitting on each field, future scans would only look at the part after the split. So order of fields matters in a way that is hard to predict.
I've switched from splitting entirely and just search in the entire column schema for each field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this does create a different kind of bug, though maybe note as critical as the found one: for columns where a object parameter occurs more than once, this may over-identify data as being present. specifically, id shows up in a lot of places.
if this :does: hit, it will prevent a table from being run, which is the spirit of the previous version - take a few false negatives for more true positives.
If this is blocking you now, that's fine, but we might need to more robustly solve this in the near future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We chatted about this in person - I think we agree it is maybe not a new bug but just another instance of the existing flakiness of "searching in a big string for a field". We need to move to per-db-backend schema parsing to fix this and to allow deeper-depth schema searching.
e98aecd
to
b6782ee
Compare
@@ -4,5 +4,5 @@ SELECT | |||
FROM information_schema.columns | |||
WHERE | |||
table_schema = '{{ schema_name }}' | |||
AND table_name = '{{ table_name }}' | |||
AND LOWER(column_name) IN ('{{ column_names|join("', '") }}') --noqa: LT05 | |||
AND table_name = '{{ table_name|lower }}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this same line with LOWER(table_name)
and Athena took a minute and then errored out about DELTA_LAKE_INVALID_SCHEMA
and Metadata not found in transaction log...
etc. 🤷 - removing the lower fixes that oddity and doesn't super matter - we don't make mixed case tables today.
59d82d5
to
3a34973
Compare
This fixes a couple bugs: - We were not lower-casing the result of asking DuckDB for its schema. So we were searching inside its schema results for the exact camel case string in duckdb but Athena would need a lower case string. Impossible to work in both. So double-word depth-two-fields would never be matched. - For depth-two fields, we were also assuming that we would be given fields in the same order that the SQL table would return them. So if you didn't provide them in that precise order (like if you used start, end but the SQL table had end, start -- you'd miss one). This was the exact case for Encounter fields in duckdb. This commit allows the following new features/conveniences: - Case-insensitive schema lookups (i.e. table name and column name arguments to get_column_datatype_query() can be any case) - Case-insensitive schema validation (i.e. the expected column names passed to validate_table_schema() can be any case) - The resulting validated schema will use the original case used by the expected-fields dictionary. So that the user will not be surprised and we'll use camel case if they do, lower case if they do.
3a34973
to
dc5e294
Compare
date_trunc('{{ col[3] }}', date(from_iso8601_timestamp({{ alias }}.{{ col[0] }}.{{ col[1] }}))) | ||
date_trunc('{{ col[3] }}', date(from_iso8601_timestamp({{ alias }}."{{ col[0] }}"."{{ col[1] }}"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This came up because now that the regression test actually sees the period.end
field, it tries to access it. But end
is a keyword that can't just be injected like that.
So I took the safest but ugliest approach - quoting each arbitrary field access.
- A less safe but prettier option is just quoting the last field access. That's where trickier words like
end
anddate
are likely to show up. - The least safe but prettiest option is to have the caller of this quote it. Like
('period', '"end"', 'period_end_day', 'day'),
when callingtruncate_date_cols
col_schema = col_schema[0] | ||
else: | ||
output[column][field] = True | ||
col_schema = col_schema[1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this does create a different kind of bug, though maybe note as critical as the found one: for columns where a object parameter occurs more than once, this may over-identify data as being present. specifically, id shows up in a lot of places.
if this :does: hit, it will prevent a table from being run, which is the spirit of the previous version - take a few false negatives for more true positives.
If this is blocking you now, that's fine, but we might need to more robustly solve this in the near future.
This fixes a couple bugs:
So we were searching inside its schema results for the exact camel
case string in duckdb but Athena would need a lower case string.
Impossible to work in both. So double-word depth-two-fields would
never be matched.
fields in the same order that the SQL table would return them.
So if you didn't provide them in that precise order (like if you used
start, end but the SQL table had end, start -- you'd miss one).
This was the exact case for Encounter fields in duckdb.
This commit allows the following new features/conveniences:
arguments to get_column_datatype_query() can be any case)
passed to validate_table_schema() can be any case)
the expected-fields dictionary. So that the user will not be
surprised and we'll use camel case if they do, lower case if they do.
Checklist
docs/
) needs to be updated