FEAT: Export compressed GTFS schedule to SQLITE db #388

Merged
merged 4 commits into from
Jul 3, 2024

rymarczy
Collaborator

This change adds the ability to export compressed GTFS schedule data to an SQLITE db file.

For each year partition folder in the compressed GTFS archives, one SQLITE db file is produced, containing one table for each GTFS schedule file that has been compressed.

This implementation uses Python's built-in sqlite3 library. Locally, I am able to produce each SQLITE db file in about 2 minutes.

This change also has all of the S3 sync and upload logic to fully implement the compressed GTFS process.
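
For readers unfamiliar with the approach, here is a minimal sketch of building a table with the standard-library sqlite3 module. It is not the code from this PR: the real implementation derives the CREATE TABLE query from a pyarrow schema, while this sketch assumes a simplified `{column_name: type_name}` mapping. The names `SQLITE_TYPES`, `create_table_query`, and `load_table`, and the sample rows, are hypothetical.

```python
import sqlite3

# Assumed mapping from simplified type names to SQLite column types.
# The PR works from a pyarrow schema instead; this table is illustrative only.
SQLITE_TYPES = {"int": "INTEGER", "float": "REAL", "str": "TEXT"}


def create_table_query(table: str, columns: dict[str, str]) -> str:
    """Build a CREATE TABLE statement from a {column_name: type_name} mapping."""
    column_defs = ", ".join(
        f'"{name}" {SQLITE_TYPES[type_name]}' for name, type_name in columns.items()
    )
    return f'CREATE TABLE IF NOT EXISTS "{table}" ({column_defs});'


def load_table(conn, table, columns, rows) -> int:
    """Create the table, insert all rows in one transaction, return the row count."""
    placeholders = ", ".join("?" for _ in columns)
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(create_table_query(table, columns))
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders});', rows)
    return conn.execute(f'SELECT COUNT(*) FROM "{table}";').fetchone()[0]


# Made-up sample data standing in for one compressed GTFS schedule file.
conn = sqlite3.connect(":memory:")
stops_columns = {"stop_id": "str", "stop_name": "str", "stop_lat": "float"}
stops_rows = [("s1", "Example Stop A", 42.0), ("s2", "Example Stop B", 42.5)]
row_count = load_table(conn, "stops", stops_columns, stops_rows)
```

Inserting with `executemany` inside a single transaction is usually much faster than committing per row, which matters for larger tables.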

Asana Task: https://app.asana.com/0/1205827492903547/1207450430015372

Coverage of commit 744c25e

Summary coverage rate:
  lines......: 75.2% (2076 of 2759 lines)
  functions..: no data found
  branches...: no data found

Files changed coverage rate:
                                                                                     |Lines       |Functions  |Branches    
  Filename                                                                           |Rate     Num|Rate    Num|Rate     Num
  =========================================================================================================================
  src/lamp_py/ingestion/compress_gtfs/gtfs_to_parquet.py                             |68.3%    101|    -     0|    -      0
  src/lamp_py/ingestion/compress_gtfs/pq_to_sqlite.py                                |16.3%     49|    -     0|    -      0
  src/lamp_py/ingestion/compress_gtfs/schedule_details.py                            |89.5%     95|    -     0|    -      0
  src/lamp_py/ingestion/utils.py                                                     |57.4%    101|    -     0|    -      0

@rymarczy rymarczy requested a review from mzappitello June 28, 2024 19:58

github-actions bot commented Jul 1, 2024

Coverage of commit bedbe97

Summary coverage rate:
  lines......: 76.4% (2120 of 2774 lines)
  functions..: no data found
  branches...: no data found

Files changed coverage rate:
                                                                                     |Lines       |Functions  |Branches    
  Filename                                                                           |Rate     Num|Rate    Num|Rate     Num
  =========================================================================================================================
  src/lamp_py/ingestion/compress_gtfs/gtfs_to_parquet.py                             |68.3%    101|    -     0|    -      0
  src/lamp_py/ingestion/compress_gtfs/pq_to_sqlite.py                                |88.0%     50|    -     0|    -      0
  src/lamp_py/ingestion/compress_gtfs/schedule_details.py                            |87.6%     97|    -     0|    -      0
  src/lamp_py/ingestion/utils.py                                                     |61.9%    113|    -     0|    -      0

Contributor

@mzappitello mzappitello left a comment

one comment on the sqlite generation for you to take or leave.

lgtm.

@@ -57,6 +64,7 @@ def frame_parquet_diffs(
how="anti",
on=join_columns,
join_nulls=True,
coalesce=True,
Contributor

do you have a sense of why this isn't the default behavior? i've found it non-intuitive in my work.

Collaborator Author

It was a recent change to the library, I think the PR talks about matching the behavior of all join operations: pola-rs/polars#13441
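
As a rough illustration of what this anti join computes, here is a plain-Python sketch. It is not the polars implementation and does not model the `coalesce` option; it only shows rows being kept when their join key has no match, with nulls treated as equal (the behavior `join_nulls=True` requests). The function name `anti_join` and the sample records are hypothetical.

```python
def anti_join(new_rows, old_rows, join_columns):
    """Return rows of new_rows whose join key has no match in old_rows.

    Keys are tuples, and None == None in Python, so a null in a join
    column matches another null.
    """
    old_keys = {tuple(row[col] for col in join_columns) for row in old_rows}
    return [
        row
        for row in new_rows
        if tuple(row[col] for col in join_columns) not in old_keys
    ]


# Made-up records: "old" is what is already stored, "new" is the incoming schedule.
old = [{"trip_id": "t1", "block_id": None}, {"trip_id": "t2", "block_id": "b2"}]
new = [
    {"trip_id": "t1", "block_id": None},  # unchanged: null matches null
    {"trip_id": "t2", "block_id": "b9"},  # changed value, so no key match
    {"trip_id": "t3", "block_id": None},  # brand-new record
]
diff = anti_join(new, old, ["trip_id", "block_id"])
```

Here `diff` holds only the `t2` and `t3` records. If nulls did not compare equal, the unchanged `t1` record would also be reported as a difference.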

Comment on lines +122 to +124
export_ds = pd.dataset((pd.dataset(tmp_path), pd.dataset(merge_df)))
with pq.ParquetWriter(export_path, schema=merge_df.schema) as writer:
    for batch in export_ds.to_batches(batch_size=batch_size):
Contributor

i'm guessing these datasets aren't big enough to warrant partitioning them into row groups?

Collaborator Author

stop_times and maybe the trips table could benefit from that, but I think the most sensible partitioning would be quarterly (every 3 months), and the added complexity of that logic didn't currently seem worth the rub
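
To make the quarterly idea concrete, here is a small sketch of how rows could be bucketed by quarter before writing. Nothing like this exists in the PR. It assumes dates are stored as `YYYYMMDD` strings, and the column name `active_date` and the helpers `quarter_key` and `group_by_quarter` are hypothetical.

```python
from collections import defaultdict


def quarter_key(yyyymmdd: str) -> str:
    """Map a YYYYMMDD date string to a partition key such as '2024-Q3'."""
    year, month = yyyymmdd[:4], int(yyyymmdd[4:6])
    quarter = (month - 1) // 3 + 1  # months 1-3 -> Q1, 4-6 -> Q2, and so on
    return f"{year}-Q{quarter}"


def group_by_quarter(rows, date_column):
    """Bucket row dicts by the quarter of their date column."""
    groups = defaultdict(list)
    for row in rows:
        groups[quarter_key(row[date_column])].append(row)
    return dict(groups)


# Made-up rows; "active_date" is an assumed column name.
rows = [
    {"trip_id": "t1", "active_date": "20240215"},
    {"trip_id": "t2", "active_date": "20240703"},
    {"trip_id": "t3", "active_date": "20240930"},
]
partitions = group_by_quarter(rows, "active_date")
```

Each bucket could then be written as its own row group or file, at the cost of the extra bookkeeping mentioned above.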

Comment on lines +323 to +326
for file in os.listdir(year_path):
    local_path = os.path.join(year_path, file)
    upload_path = os.path.join(GTFS_PATH, year, file)
    upload_file(local_path, upload_path)
Contributor

wooohoo

"""
return CREATE TABLE query for sqlite table from pyarrow schema
"""
logger = ProcessLogger("sqlite_create_table")
Contributor

seems like we're logging how long it takes to generate the query here rather than how long it takes to create and populate the table which might be more useful.

Collaborator Author

Dropped the logging of the full CREATE TABLE query.

Should be able to track the duration of SQLITE table creation/insertion when pq_folder_to_sqlite calls logger.add_metadata(current_file=file)
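
The distinction raised in this thread, timing the table creation and population rather than the query generation, can be sketched as follows. This does not use the project's `ProcessLogger`; the `timed` context manager and the `durations` dict are hypothetical stand-ins, and the table contents are made up.

```python
import sqlite3
import time
from contextlib import contextmanager

# Collected timings in seconds, keyed by label (stand-in for real log output).
durations = {}


@contextmanager
def timed(label):
    """Record the wall-clock duration of the wrapped block under `label`."""
    start = time.monotonic()
    try:
        yield
    finally:
        durations[label] = time.monotonic() - start


conn = sqlite3.connect(":memory:")

# Generating the query string is cheap, so it is left outside the timed block.
query = 'CREATE TABLE "trips" ("trip_id" TEXT);'

# The timed block covers the work that actually takes time:
# creating the table and inserting its rows.
with timed("trips"):
    with conn:
        conn.execute(query)
        conn.executemany('INSERT INTO "trips" VALUES (?);', [("t1",), ("t2",)])
```

After this runs, `durations["trips"]` holds the time spent creating and populating the table, which is the number most useful for spotting slow files.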

github-actions bot commented Jul 2, 2024

Coverage of commit 0ec0e63

Summary coverage rate:
  lines......: 76.6% (2125 of 2773 lines)
  functions..: no data found
  branches...: no data found

Files changed coverage rate:
                                                                                     |Lines       |Functions  |Branches    
  Filename                                                                           |Rate     Num|Rate    Num|Rate     Num
  =========================================================================================================================
  src/lamp_py/ingestion/compress_gtfs/gtfs_to_parquet.py                             |76.2%    101|    -     0|    -      0
  src/lamp_py/ingestion/compress_gtfs/pq_to_sqlite.py                                |87.8%     49|    -     0|    -      0
  src/lamp_py/ingestion/compress_gtfs/schedule_details.py                            |85.6%     97|    -     0|    -      0
  src/lamp_py/ingestion/utils.py                                                     |61.9%    113|    -     0|    -      0

Collaborator Author

rymarczy commented Jul 2, 2024

Will wait to merge this until https://github.com/mbta/devops/pull/1947 is applied.

github-actions bot commented Jul 2, 2024

Coverage of commit 52abbaa

Summary coverage rate:
  lines......: 76.6% (2125 of 2773 lines)
  functions..: no data found
  branches...: no data found

Files changed coverage rate:
                                                                                     |Lines       |Functions  |Branches    
  Filename                                                                           |Rate     Num|Rate    Num|Rate     Num
  =========================================================================================================================
  src/lamp_py/ingestion/compress_gtfs/gtfs_to_parquet.py                             |76.2%    101|    -     0|    -      0
  src/lamp_py/ingestion/compress_gtfs/pq_to_sqlite.py                                |87.8%     49|    -     0|    -      0
  src/lamp_py/ingestion/compress_gtfs/schedule_details.py                            |85.6%     97|    -     0|    -      0
  src/lamp_py/ingestion/utils.py                                                     |61.9%    113|    -     0|    -      0

@rymarczy rymarczy merged commit a51c2a4 into main Jul 3, 2024
7 checks passed