feat: add new DuckDB backend for reading ndjson directly #144

mikix · 2023-11-22T17:20:22Z

This adds some new arguments:
--db-type [athena,duckdb] (defaulting to athena)
--load-ndjson-dir DIR (tells DuckDB where to find source ndjsons)

A light abstraction layer has been added in databases.py to choose the correct backend based on the args.
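As a rough illustration of how the new arguments could drive backend selection, here is a minimal sketch. Only the two flags, their allowed values, the default, and the name DatabaseBackend appear in this PR conversation; the subclass names, the describe() method, and the create_backend() helper are hypothetical and are not taken from databases.py.

```python
import argparse
from abc import ABC, abstractmethod


class DatabaseBackend(ABC):
    """Common interface for backends (the describe() method is hypothetical)."""

    @abstractmethod
    def describe(self) -> str:
        """Return a short human-readable label for this backend."""


class AthenaBackend(DatabaseBackend):  # hypothetical class name
    def describe(self) -> str:
        return "athena"


class DuckDBBackend(DatabaseBackend):  # hypothetical class name
    def __init__(self, ndjson_dir=None):
        # Directory holding the source ndjson files, if one was given.
        self.ndjson_dir = ndjson_dir

    def describe(self) -> str:
        return f"duckdb (ndjson from {self.ndjson_dir})"


def build_parser() -> argparse.ArgumentParser:
    """Declare the two arguments described in the PR summary."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--db-type", choices=["athena", "duckdb"], default="athena")
    parser.add_argument("--load-ndjson-dir", metavar="DIR")
    return parser


def create_backend(args: argparse.Namespace) -> DatabaseBackend:
    """Pick a backend from the parsed args (hypothetical helper)."""
    if args.db_type == "duckdb":
        return DuckDBBackend(args.load_ndjson_dir)
    return AthenaBackend()
```

With this shape, callers only ever hold a DatabaseBackend and never branch on the backend type themselves.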

The SQL is mostly unchanged. A few light tweaks for standardization, plus some compatibility user-defined functions injected into DuckDB, allow both backends to run the same SQL.
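To show the general idea of a compatibility user-defined function, here is a small sketch. It is an assumption-heavy stand-in: it uses the standard-library sqlite3 module rather than DuckDB so that it runs without extra installs, and regexp_like is chosen only as an example of an Athena/Trino function that the local engine lacks. The functions this PR actually injects are not listed in the conversation.

```python
import re
import sqlite3


def regexp_like(value, pattern):
    """Approximate Athena/Trino regexp_like(string, pattern).

    Returns 1 if the pattern matches anywhere in the string, 0 if not,
    and None (SQL NULL) if either input is NULL. Python's regex dialect
    differs from Trino's Java dialect, so this is only an approximation.
    """
    if value is None or pattern is None:
        return None
    return int(re.search(pattern, value) is not None)


def connect_with_compat_functions() -> sqlite3.Connection:
    """Open an in-memory database and register the shim under the Athena name."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("regexp_like", 2, regexp_like)
    return conn


conn = connect_with_compat_functions()
# The same SQL text that would run on Athena now also runs locally.
rows = conn.execute(
    "SELECT regexp_like('abc123', '[0-9]+'), regexp_like('abc', '[0-9]+')"
).fetchall()
```

The point of the pattern is that the study SQL stays untouched and the backend absorbs the dialect difference at connection time.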

Checklist

Consider if documentation (like in docs/) needs to be updated
Consider if tests should be added
Run pylint if you're making changes beyond adding studies
Update template repo if there are changes to study configuration

dogversioning

Generally looks good, though I'm skipping the unit tests for now per your comments. Tacitly approved, but I'll look again when you're done.

cumulus_library/cli.py

cumulus_library/cli_parser.py

cumulus_library/databases.py

dogversioning · 2023-11-22T18:11:29Z

cumulus_library/template_sql/show_tables.sql.jinja

For this and show views: we could change the template var to database_name?

At this level of code, I think schema_name is fine -- at this (low) level, it is called a schema. And at the user (high) level, we call it a database, which I think makes sense.

It's some of the in-between where maybe we could/should futz with it. Like, from CLI to DatabaseBackend, maybe the name "database" should be preserved, and the backend turns that into whatever is appropriate for it (schema_name or filename)? But I'm not fighting for that -- an improvement maybe, but not a necessary one.

mikix · 2023-11-24T14:12:29Z

cumulus_library/template_sql/utils.py

-        if allow_partial:
-            required_fields + ["code", "system", "display"]
+        if not allow_partial:
+            required_fields += ["code", "system", "display"]


This was a real bug where allow_partial was being ignored. I think I fixed it correctly, and then I added some unit testing for it.
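The diff above fixes two separate problems: the condition was inverted, and the bare `+` expression built a new list that was immediately discarded, so required_fields never changed. A minimal sketch of the before and after behavior follows. The function names and the placeholder base field "id" are hypothetical; only the three appended field names and the allow_partial flag come from the diff.

```python
def required_fields_buggy(allow_partial: bool) -> list:
    """Old behavior: the flag has no effect on the result."""
    required_fields = ["id"]  # hypothetical placeholder base field
    if allow_partial:
        # `+` creates a new list and throws it away; nothing is stored.
        required_fields + ["code", "system", "display"]
    return required_fields


def required_fields_fixed(allow_partial: bool) -> list:
    """New behavior: all fields are required unless partial data is allowed."""
    required_fields = ["id"]  # hypothetical placeholder base field
    if not allow_partial:
        # `+=` extends the list in place, so the extra fields are kept.
        required_fields += ["code", "system", "display"]
    return required_fields
```

A unit test that compares the output for both values of allow_partial would catch this class of bug, since the old version returns the same list either way.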

tests/test_template_utils.py

cumulus_library/studies/core/documentreference.sql

dogversioning

🦆

cumulus_library/studies/core/documentreference.sql

docs/creating-studies.md

mikix · 2023-11-24T18:48:14Z

cumulus_library/template_sql/codeable_concept_denormalize.sql.jinja

@@ -54,6 +54,7 @@ CREATE TABLE {{ target_table }} AS (
            ROW_NUMBER()
                OVER (
                    PARTITION BY id
+                    ORDER BY priority ASC


This was a bug found during testing -- without this order-by, you can get inconsistent results run-to-run
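The effect of the added ORDER BY can be sketched with a tiny in-memory example. This is an illustration under assumptions: it uses the standard-library sqlite3 module (with a SQLite build new enough to support window functions) in place of Athena or DuckDB, and the table name, column values, and pick_preferred() helper are hypothetical. Only the PARTITION BY id and ORDER BY priority ASC clauses come from the diff.

```python
import sqlite3

# Keep one row per id, choosing the row with the lowest priority value.
QUERY = """
SELECT id, code FROM (
    SELECT
        id,
        code,
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY priority ASC) AS row_num
    FROM codings
) AS ranked
WHERE row_num = 1
ORDER BY id
"""


def pick_preferred(rows):
    """Load rows into a scratch table and return the top-priority code per id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE codings (id TEXT, code TEXT, priority INTEGER)")
    conn.executemany("INSERT INTO codings VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


rows = [("a", "fallback", 2), ("a", "preferred", 1), ("b", "only", 1)]
# The result no longer depends on the order in which rows were inserted.
same_either_way = pick_preferred(rows) == pick_preferred(list(reversed(rows)))
```

Without the ORDER BY inside the window, the engine is free to number rows within each partition in any order, so which row gets row number 1 can vary between runs. Rows that tie on priority would still be ordered arbitrarily; a secondary sort key would be needed to make those cases stable.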

dogversioning reviewed Nov 22, 2023

mikix force-pushed the mikix/duckdb branch from 7cb43da to 90abb0e Compare November 24, 2023 13:36

mikix commented Nov 24, 2023

mikix commented Nov 24, 2023

mikix force-pushed the mikix/duckdb branch from 90abb0e to 8e98c8e Compare November 24, 2023 14:15

dogversioning approved these changes Nov 24, 2023

mikix force-pushed the mikix/duckdb branch 2 times, most recently from 02180e0 to 53e2614 Compare November 24, 2023 16:05

mikix marked this pull request as ready for review November 24, 2023 16:05

mikix changed the title ~~WIP: feat: add new DuckDB backend for reading ndjson directly~~ feat: add new DuckDB backend for reading ndjson directly Nov 24, 2023

mikix force-pushed the mikix/duckdb branch from 53e2614 to 23ff5ee Compare November 24, 2023 18:47

mikix commented Nov 24, 2023

mikix force-pushed the mikix/duckdb branch from 23ff5ee to 20203c8 Compare November 24, 2023 18:54

mikix merged commit d65bbe0 into main Nov 27, 2023
3 checks passed

mikix deleted the mikix/duckdb branch November 27, 2023 13:00

dogversioning mentioned this pull request Dec 8, 2023

Research Spike: DuckDB #130

Closed
