Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quote local parquet CTAS queries #315

Merged
merged 1 commit into from
Nov 13, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 4 additions & 2 deletions cumulus_library/template_sql/ctas_from_parquet.sql.jinja
Original file line number Diff line number Diff line change
Expand Up @@ -2,10 +2,12 @@
{%- if db_type == 'athena' -%}
CREATE EXTERNAL TABLE IF NOT EXISTS `{{ schema_name }}`.`{{ table_name }}` (
{%- elif db_type == 'duckdb' -%}
CREATE TABLE IF NOT EXISTS {{ table_name }} AS SELECT
CREATE TABLE IF NOT EXISTS "{{ table_name }}" AS SELECT
{%- endif %}
{%- for col in table_cols %}
{{ col }}{% if db_type == 'athena' %} {{ remote_table_cols_types[loop.index0] }}{%- endif -%}
{%- if db_type == 'athena' %} {{ col }} {{ remote_table_cols_types[loop.index0] }}
{%- elif db_type == 'duckdb' %} "{{ col }}"
Comment on lines +8 to +9
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why only quote for duckdb and not all the time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Athena didn't like that, irritatingly - the serde style queries follow a slightly different set of syntax rules, it seems?

{%- endif -%}
{{- syntax.comma_delineate(loop) }}
{%- endfor %}
{%- if db_type == 'athena' %}
Expand Down
8 changes: 2 additions & 6 deletions tests/test_base_templates.py
Original file line number Diff line number Diff line change
Expand Up @@ -545,9 +545,7 @@ def test_ctas_empty_query_creation(expected, schema, table, cols, types):
"expected,db_type,schema,table,cols,remote_types",
[
(
"""CREATE EXTERNAL TABLE IF NOT EXISTS `test_athena`.`remote_table` (
a String,
b Int
"""CREATE EXTERNAL TABLE IF NOT EXISTS `test_athena`.`remote_table` ( a String, b Int
)
STORED AS PARQUET
LOCATION 's3://bucket/data/'
Expand All @@ -559,9 +557,7 @@ def test_ctas_empty_query_creation(expected, schema, table, cols, types):
["String", "Int"],
),
(
"""CREATE TABLE IF NOT EXISTS local_table AS SELECT
a,
b
"""CREATE TABLE IF NOT EXISTS "local_table" AS SELECT "a", "b"
Comment on lines -562 to +560
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what is the risk you are guarding against in this PR? Like, that a table name wouldn't have a prefix and also be a reserved word? Cause I assume no table with a study prefix could be a problem. Should we instead/additionally enforce that there is a prefix (look for a dunder)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically, the LOINC groups dataset has a file just named 'Group', and columns named 'group'. I think you're right that, if a dunder was present, this would largely not be an issue. But since this is one of these static dataset situations that you want to use across multiple studies/datasources, I'm inclined to just live with the quoting here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah and the "dundering" there is done via a different namespace. Sure - quoting is fine, was just curious if we should go even further.

FROM read_parquet('./tests/test_data/*.parquet')""",
"duckdb",
"test_duckdb",
Expand Down
Loading