FEAT: Export compressed GTFS schedule to SQLITE db #388
Conversation
Force-pushed from 0c4dba4 to e70969b
Force-pushed from e70969b to 744c25e
one comment on the sqlite generation for you to take or leave.
lgtm.
@@ -57,6 +64,7 @@ def frame_parquet_diffs(
    how="anti",
    on=join_columns,
    join_nulls=True,
    coalesce=True,
do you have a sense of why this isn't the default behavior? i've found it non-intuitive in my work.
It was a recent change to the library, I think the PR talks about matching the behavior of all join operations: pola-rs/polars#13441
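For context, a minimal sketch of what the flag changes, assuming a recent polars version (the frames below are made up for illustration). The effect is most visible on full joins: with coalesce=False both key columns are kept, while coalesce=True merges them into one.

    import polars as pl

    left = pl.DataFrame({"k": [1, 2], "a": ["x", "y"]})
    right = pl.DataFrame({"k": [2, 3], "b": ["p", "q"]})

    # coalesce=False keeps both key columns: k, a, k_right, b
    left.join(right, on="k", how="full", coalesce=False)

    # coalesce=True merges the key columns into a single k: k, a, b
    left.join(right, on="k", how="full", coalesce=True)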
export_ds = pd.dataset((pd.dataset(tmp_path), pd.dataset(merge_df)))
with pq.ParquetWriter(export_path, schema=merge_df.schema) as writer:
    for batch in export_ds.to_batches(batch_size=batch_size):
i'm guessing these datasets aren't big enough to warrant partitioning them into row groups?
stop_times and maybe the trips table could benefit from that, but I think the most sensible partitioning would be quarterly (every 3 months), and the added complexity of that logic didn't seem worth it for now.
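For reference, a minimal sketch of how the batched write above maps onto row groups, assuming pyarrow (the path and batch size are made up): each write_batch call is written as its own row group, so batch_size effectively sets the row-group granularity.

    import pyarrow.dataset as pd  # aliased as in the diff above
    import pyarrow.parquet as pq

    # hypothetical path and batch size, for illustration only
    export_ds = pd.dataset("compressed_gtfs/2024/stop_times")
    with pq.ParquetWriter("stop_times.parquet", schema=export_ds.schema) as writer:
        for batch in export_ds.to_batches(batch_size=1_000_000):
            # each batch lands in its own row group unless row_group_size splits it further
            writer.write_batch(batch)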
for file in os.listdir(year_path):
    local_path = os.path.join(year_path, file)
    upload_path = os.path.join(GTFS_PATH, year, file)
    upload_file(local_path, upload_path)
wooohoo
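No claim about the repo's actual upload_file helper here; a sketch under the assumption that it is a thin wrapper around boto3's S3 upload (the bucket name is invented):

    import boto3

    def upload_file(local_path: str, upload_path: str) -> None:
        # hypothetical sketch only; bucket name and error handling are assumptions
        s3 = boto3.client("s3")
        s3.upload_file(Filename=local_path, Bucket="example-gtfs-bucket", Key=upload_path)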
""" | ||
return CREATE TABLE query for sqlite table from pyarrow schema | ||
""" | ||
logger = ProcessLogger("sqlite_create_table") |
seems like we're logging how long it takes to generate the query here, rather than how long it takes to create and populate the table, which might be more useful.
Dropped the logging of the full CREATE TABLE query. Should be able to track the duration of SQLITE table creation/insertion when pq_folder_to_sqlite calls logger.add_metadata(current_file=file).
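For reference, a rough sketch of the kind of schema-to-DDL mapping the docstring above describes; the helper name and type mapping are assumptions, not the repository's actual implementation:

    import pyarrow as pa

    def schema_to_create_table(table_name: str, schema: pa.Schema) -> str:
        # map each pyarrow field to a coarse SQLITE storage class
        columns = []
        for field in schema:
            if pa.types.is_integer(field.type):
                sqlite_type = "INTEGER"
            elif pa.types.is_floating(field.type):
                sqlite_type = "REAL"
            else:
                sqlite_type = "TEXT"
            columns.append(f"{field.name} {sqlite_type}")
        return f"CREATE TABLE IF NOT EXISTS {table_name} ({', '.join(columns)});"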
Will wait to merge this until https://github.com/mbta/devops/pull/1947 is applied.
This change adds the ability to export compressed GTFS schedule data to an SQLITE db file.
For each year partition folder in the compressed GTFS archives, one SQLITE db file will be produced that contains a table for each GTFS schedule file that has been compressed.
This implementation uses the Python built-in sqlite3 library. Locally, I am able to produce each SQLITE db file in about 2 minutes.
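To make that concrete, a minimal sketch of loading one parquet table into an SQLITE db with the built-in sqlite3 module; the function name, paths, and batch size are illustrative assumptions, not the PR's actual code:

    import sqlite3
    import pyarrow.dataset as pd  # pyarrow.dataset, as aliased in the diffs above

    def load_parquet_into_sqlite(parquet_path: str, table: str, db_path: str) -> None:
        ds = pd.dataset(parquet_path)
        placeholders = ", ".join("?" for _ in ds.schema.names)
        insert = f"INSERT INTO {table} VALUES ({placeholders})"
        with sqlite3.connect(db_path) as conn:
            # assumes the table was already created, e.g. from the pyarrow schema
            for batch in ds.to_batches(batch_size=500_000):
                rows = [tuple(row[name] for name in ds.schema.names) for row in batch.to_pylist()]
                conn.executemany(insert, rows)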
This change also includes all of the S3 sync and upload logic needed to fully implement the compressed GTFS process.
Asana Task: https://app.asana.com/0/1205827492903547/1207450430015372