-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add new DuckDB backend for reading ndjson directly #144
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks good, though I'm skipping the unit tests for now per your comments - tacitly approved but i'll look again when you're done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for this and show views: we :could: change the template var to database_name
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this level of code, I think schema_name
is fine -- at this (low) level, it is called a schema. And at the user (high) level, we call it a database, which I think makes sense.
It's some of the in-between where maybe we could/should futz with it. Like, from CLI to DatabaseBackend, maybe the name "database" should be preserved, and the backend turns that into whatever is appropriate for it (schema_name or filename)? But I'm not fighting for that -- an improvement maybe, but not a necessary one.
if allow_partial: | ||
required_fields + ["code", "system", "display"] | ||
if not allow_partial: | ||
required_fields += ["code", "system", "display"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a real bug where allow_partial
was being ignored. I think I fixed it correctly, and then I added some unit testing for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🦆
02180e0
to
53e2614
Compare
@@ -54,6 +54,7 @@ CREATE TABLE {{ target_table }} AS ( | |||
ROW_NUMBER() | |||
OVER ( | |||
PARTITION BY id | |||
ORDER BY priority ASC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a bug found during testing -- without this order-by, you can get inconsistent results run-to-run
This adds some new arguments: --db-type [athena,duckdb] (defaulting to athena) --load-ndjson-dir DIR (tells DuckDB where to find source ndjsons) A light abstraction layer has been added in databases.py to choose the correct backend based on the args. Mostly the SQL is the same. Some light tweaks for standardization, plus some compatibility user-defined functions injected into duckdb allow both backends to work on the same SQL.
This adds some new arguments:
--db-type [athena,duckdb] (defaulting to athena)
--load-ndjson-dir DIR (tells DuckDB where to find source ndjsons)
A light abstraction layer has been added in databases.py to choose the correct backend based on the args.
Mostly the SQL is the same. Some light tweaks for standardization, plus some compatibility user-defined functions injected into duckdb allow both backends to work on the same SQL.
Checklist
docs/
) needs to be updated