Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add new DuckDB backend for reading ndjson directly #144

Merged
merged 1 commit into from
Nov 27, 2023
Merged

Conversation

mikix
Copy link
Contributor

@mikix mikix commented Nov 22, 2023

This adds some new arguments:
--db-type [athena,duckdb] (defaulting to athena)
--load-ndjson-dir DIR (tells DuckDB where to find source ndjsons)

A light abstraction layer has been added in databases.py to choose the correct backend based on the args.

Mostly the SQL is the same. Some light tweaks for standardization, plus some compatibility user-defined functions injected into duckdb allow both backends to work on the same SQL.

Checklist

  • Consider if documentation (like in docs/) needs to be updated
  • Consider if tests should be added
  • Run pylint if you're making changes beyond adding studies
  • Update template repo if there are changes to study configuration

Copy link
Contributor

@dogversioning dogversioning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good, though I'm skipping the unit tests for now per your comments - tacitly approved but i'll look again when you're done.

cumulus_library/cli.py Show resolved Hide resolved
cumulus_library/cli_parser.py Show resolved Hide resolved
cumulus_library/cli_parser.py Outdated Show resolved Hide resolved
cumulus_library/databases.py Outdated Show resolved Hide resolved
cumulus_library/databases.py Show resolved Hide resolved
cumulus_library/databases.py Outdated Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for this and show views: we :could: change the template var to database_name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this level of code, I think schema_name is fine -- at this (low) level, it is called a schema. And at the user (high) level, we call it a database, which I think makes sense.

It's some of the in-between where maybe we could/should futz with it. Like, from CLI to DatabaseBackend, maybe the name "database" should be preserved, and the backend turns that into whatever is appropriate for it (schema_name or filename)? But I'm not fighting for that -- an improvement maybe, but not a necessary one.

Comment on lines -167 to +168
if allow_partial:
required_fields + ["code", "system", "display"]
if not allow_partial:
required_fields += ["code", "system", "display"]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a real bug where allow_partial was being ignored. I think I fixed it correctly, and then I added some unit testing for it.

Copy link
Contributor

@dogversioning dogversioning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🦆

docs/creating-studies.md Outdated Show resolved Hide resolved
@mikix mikix force-pushed the mikix/duckdb branch 2 times, most recently from 02180e0 to 53e2614 Compare November 24, 2023 16:05
@mikix mikix marked this pull request as ready for review November 24, 2023 16:05
@mikix mikix changed the title WIP: feat: add new DuckDB backend for reading ndjson directly feat: add new DuckDB backend for reading ndjson directly Nov 24, 2023
@@ -54,6 +54,7 @@ CREATE TABLE {{ target_table }} AS (
ROW_NUMBER()
OVER (
PARTITION BY id
ORDER BY priority ASC
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a bug found during testing -- without this order-by, you can get inconsistent results run-to-run

This adds some new arguments:
--db-type [athena,duckdb] (defaulting to athena)
--load-ndjson-dir DIR (tells DuckDB where to find source ndjsons)

A light abstraction layer has been added in databases.py to choose
the correct backend based on the args.

Mostly the SQL is the same. Some light tweaks for standardization,
plus some compatibility user-defined functions injected into duckdb
allow both backends to work on the same SQL.
@mikix mikix merged commit d65bbe0 into main Nov 27, 2023
3 checks passed
@mikix mikix deleted the mikix/duckdb branch November 27, 2023 13:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants