FEAT: Export compressed GTFS schedule to SQLITE db #388

rymarczy · 2024-06-28T18:05:12Z

This change adds the ability to export compressed GTFS schedule data to an SQLITE db file.

For each year partition folder in the compressed GTFS archives, one SQLITE db file is produced, containing one table per compressed GTFS schedule file.

This implementation uses Python's built-in sqlite3 library. Locally, I am able to produce each SQLITE db file in about 2 minutes.

This change also includes the S3 sync and upload logic needed to fully implement the compressed GTFS process.
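
For readers who want a feel for the one-table-per-file export described above, here is a minimal sketch using only the built-in sqlite3 library. It is not the code in pq_to_sqlite.py: `SQLITE_TYPES`, `create_table_query`, and `load_table` are hypothetical names, the schema is passed as a plain dict of type names rather than a pyarrow schema, and the rows are hand-written tuples rather than parquet batches.

```python
import sqlite3
from typing import Dict, Iterable, Tuple

# Hypothetical mapping from Arrow-style type names to SQLite column types.
# Anything not listed falls back to TEXT.
SQLITE_TYPES = {
    "int64": "INTEGER",
    "double": "REAL",
    "string": "TEXT",
}


def create_table_query(table: str, schema: Dict[str, str]) -> str:
    """Build a CREATE TABLE statement from a {column: type name} mapping."""
    columns = ", ".join(
        f'"{name}" {SQLITE_TYPES.get(type_name, "TEXT")}'
        for name, type_name in schema.items()
    )
    return f'CREATE TABLE IF NOT EXISTS "{table}" ({columns});'


def load_table(
    conn: sqlite3.Connection,
    table: str,
    schema: Dict[str, str],
    rows: Iterable[Tuple],
) -> int:
    """Create the table, insert all rows, and return the resulting row count."""
    conn.execute(create_table_query(table, schema))
    placeholders = ", ".join("?" for _ in schema)
    # The connection context manager commits on success and rolls back on error.
    with conn:
        conn.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders});', rows
        )
    return conn.execute(f'SELECT COUNT(*) FROM "{table}";').fetchone()[0]


# Example: one in-memory db with one table standing in for one schedule file.
conn = sqlite3.connect(":memory:")
schema = {
    "stop_id": "string",
    "stop_sequence": "int64",
    "shape_dist_traveled": "double",
}
rows = [("70001", 1, 0.0), ("70002", 2, 1.25)]
count = load_table(conn, "stop_times", schema, rows)
```

In the real export, the same pattern would presumably be repeated once per schedule file inside a year folder, writing to a db file on disk instead of `:memory:`.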

Asana Task: https://app.asana.com/0/1205827492903547/1207450430015372

github-actions · 2024-06-28T19:18:53Z

Coverage of commit `744c25e`

Summary coverage rate:
  lines......: 75.2% (2076 of 2759 lines)
  functions..: no data found
  branches...: no data found

Files changed coverage rate:
                                                                                     |Lines       |Functions  |Branches    
  Filename                                                                           |Rate     Num|Rate    Num|Rate     Num
  =========================================================================================================================
  src/lamp_py/ingestion/compress_gtfs/gtfs_to_parquet.py                             |68.3%    101|    -     0|    -      0
  src/lamp_py/ingestion/compress_gtfs/pq_to_sqlite.py                                |16.3%     49|    -     0|    -      0
  src/lamp_py/ingestion/compress_gtfs/schedule_details.py                            |89.5%     95|    -     0|    -      0
  src/lamp_py/ingestion/utils.py                                                     |57.4%    101|    -     0|    -      0


github-actions · 2024-07-01T10:27:29Z

Coverage of commit `bedbe97`

Summary coverage rate:
  lines......: 76.4% (2120 of 2774 lines)
  functions..: no data found
  branches...: no data found

Files changed coverage rate:
                                                                                     |Lines       |Functions  |Branches    
  Filename                                                                           |Rate     Num|Rate    Num|Rate     Num
  =========================================================================================================================
  src/lamp_py/ingestion/compress_gtfs/gtfs_to_parquet.py                             |68.3%    101|    -     0|    -      0
  src/lamp_py/ingestion/compress_gtfs/pq_to_sqlite.py                                |88.0%     50|    -     0|    -      0
  src/lamp_py/ingestion/compress_gtfs/schedule_details.py                            |87.6%     97|    -     0|    -      0
  src/lamp_py/ingestion/utils.py                                                     |61.9%    113|    -     0|    -      0


mzappitello

one comment on the sqlite generation for you to take or leave.

lgtm.

mzappitello · 2024-07-01T13:57:28Z

src/lamp_py/ingestion/compress_gtfs/gtfs_to_parquet.py

@@ -57,6 +64,7 @@ def frame_parquet_diffs(
        how="anti",
        on=join_columns,
        join_nulls=True,
+        coalesce=True,


do you have a sense of why this isn't the default behavior? i've found it non-intuitive in my work.

rymarczy · 2024-07-01

It was a recent change to the library. I think the PR talks about matching the behavior of all join operations: pola-rs/polars#13441

mzappitello · 2024-07-01T14:28:12Z

src/lamp_py/ingestion/compress_gtfs/gtfs_to_parquet.py

+        export_ds = pd.dataset((pd.dataset(tmp_path), pd.dataset(merge_df)))
+        with pq.ParquetWriter(export_path, schema=merge_df.schema) as writer:
+            for batch in export_ds.to_batches(batch_size=batch_size):


i'm guessing these datasets aren't big enough to warrant partitioning them into row groups?

rymarczy · 2024-07-01

stop_times and maybe the trips table could benefit from that, but I think the most sensible partitioning would be quarterly (every 3 months), and the added complexity of that logic didn't currently seem worth the rub.
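
To illustrate the kind of extra logic quarterly partitioning would involve, here is a rough sketch of bucketing rows by calendar quarter. It is not part of this PR: `quarter_key`, `group_by_quarter`, and the `active_date` field are hypothetical names chosen for illustration, and the rows are plain dicts rather than parquet data.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, List


def quarter_key(day: date) -> str:
    """Return a partition label such as '2024-Q3' for a calendar date."""
    quarter = (day.month - 1) // 3 + 1
    return f"{day.year}-Q{quarter}"


def group_by_quarter(rows: Iterable[dict]) -> Dict[str, List[dict]]:
    """Bucket rows by the quarter of their (hypothetical) active_date field."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for row in rows:
        groups[quarter_key(row["active_date"])].append(row)
    return dict(groups)


rows = [
    {"trip_id": "a", "active_date": date(2024, 1, 15)},
    {"trip_id": "b", "active_date": date(2024, 3, 31)},
    {"trip_id": "c", "active_date": date(2024, 7, 1)},
]
groups = group_by_quarter(rows)
```

Each bucket could then be written as its own row group or file, which is the additional bookkeeping the reply above chose to defer.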

mzappitello · 2024-07-01T14:35:43Z

src/lamp_py/ingestion/compress_gtfs/gtfs_to_parquet.py

+        for file in os.listdir(year_path):
+            local_path = os.path.join(year_path, file)
+            upload_path = os.path.join(GTFS_PATH, year, file)
+            upload_file(local_path, upload_path)


mzappitello · 2024-07-01T14:53:49Z

src/lamp_py/ingestion/compress_gtfs/pq_to_sqlite.py

+    """
+    return CREATE TABLE query for sqlite table from pyarrow schema
+    """
+    logger = ProcessLogger("sqlite_create_table")


seems like we're logging how long it takes to generate the query here rather than how long it takes to create and populate the table which might be more useful.

rymarczy · 2024-07-02

Dropped the logging of the full CREATE TABLE query.

We should be able to track the duration of SQLITE table creation/insertion from when `pq_folder_to_sqlite` calls `logger.add_metadata(current_file=file)`.
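
As a rough illustration of timing the create-and-populate step per table, rather than timing query generation, here is a stdlib-only sketch. `timed_step` is a hypothetical stand-in and does not reproduce the `ProcessLogger` API; the table names and rows are made up.

```python
import sqlite3
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed_step(name: str, results: Dict[str, float]) -> Iterator[None]:
    """Record the wall-clock duration of the wrapped block under `name`."""
    start = time.monotonic()
    try:
        yield
    finally:
        results[name] = time.monotonic() - start


durations: Dict[str, float] = {}
conn = sqlite3.connect(":memory:")

# Made-up stand-ins for the per-file tables.
tables = {
    "routes": [("Red",), ("Orange",)],
    "stops": [("70001",)],
}

for table, rows in tables.items():
    # Time table creation and insertion together, one measurement per table.
    with timed_step(table, durations):
        conn.execute(f'CREATE TABLE "{table}" (id TEXT);')
        with conn:
            conn.executemany(f'INSERT INTO "{table}" VALUES (?);', rows)
```

After the loop, `durations` holds one elapsed-seconds entry per table, which is the per-file granularity the comment above is asking for.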

github-actions · 2024-07-02T14:32:38Z

Coverage of commit `0ec0e63`

Summary coverage rate:
  lines......: 76.6% (2125 of 2773 lines)
  functions..: no data found
  branches...: no data found

Files changed coverage rate:
                                                                                     |Lines       |Functions  |Branches    
  Filename                                                                           |Rate     Num|Rate    Num|Rate     Num
  =========================================================================================================================
  src/lamp_py/ingestion/compress_gtfs/gtfs_to_parquet.py                             |76.2%    101|    -     0|    -      0
  src/lamp_py/ingestion/compress_gtfs/pq_to_sqlite.py                                |87.8%     49|    -     0|    -      0
  src/lamp_py/ingestion/compress_gtfs/schedule_details.py                            |85.6%     97|    -     0|    -      0
  src/lamp_py/ingestion/utils.py                                                     |61.9%    113|    -     0|    -      0


rymarczy · 2024-07-02T14:37:52Z

Will wait to merge this until https://github.com/mbta/devops/pull/1947 is applied.

github-actions · 2024-07-02T19:14:48Z

Coverage of commit `52abbaa`

Summary coverage rate:
  lines......: 76.6% (2125 of 2773 lines)
  functions..: no data found
  branches...: no data found

Files changed coverage rate:
                                                                                     |Lines       |Functions  |Branches    
  Filename                                                                           |Rate     Num|Rate    Num|Rate     Num
  =========================================================================================================================
  src/lamp_py/ingestion/compress_gtfs/gtfs_to_parquet.py                             |76.2%    101|    -     0|    -      0
  src/lamp_py/ingestion/compress_gtfs/pq_to_sqlite.py                                |87.8%     49|    -     0|    -      0
  src/lamp_py/ingestion/compress_gtfs/schedule_details.py                            |85.6%     97|    -     0|    -      0
  src/lamp_py/ingestion/utils.py                                                     |61.9%    113|    -     0|    -      0


rymarczy force-pushed the feat-gtfs-sqlite branch from 0c4dba4 to e70969b on June 28, 2024 18:23

load gtfs parquet into sqlite

744c25e

rymarczy force-pushed the feat-gtfs-sqlite branch from e70969b to 744c25e on June 28, 2024 19:03

rymarczy requested a review from mzappitello June 28, 2024 19:58

gzip compress sqlite file

bedbe97

mzappitello approved these changes Jul 1, 2024


drop create table log

0ec0e63

add env var, run conversion

52abbaa

rymarczy merged commit a51c2a4 into main Jul 3, 2024
7 checks passed
