
FEAT: Ingestion Light Rail Raw GPS files #422

Merged: 3 commits merged into main on Sep 4, 2024

Conversation

rymarczy (Collaborator) commented:

This change allows the Ingestion app to ingest Light Rail RAW GPS files.

The process is currently configured to process a maximum batch size of ~5000 files on each Ingestion loop.

All processing is done in memory with polars; the dataset is relatively small, and 5,000 files' worth of data uses less than 2 GB of memory. A full event loop of 5,000 files takes ~2 minutes to process locally. The batch size can be adjusted if the event loop takes too long on AWS.

The process will also merge export files with existing files found in the public S3 bucket.

Asana Task: https://app.asana.com/0/1205827492903547/1207059327607972
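
A rough structural sketch of that loop, for orientation only. file_list_from_s3 and dataframe_from_gz are referenced elsewhere in this PR; the import paths, batch constant, and trailing steps here are assumptions, not the actual implementation:

# assumed import locations for the repo helpers referenced in this PR
from lamp_py.aws.s3 import file_list_from_s3
from lamp_py.ingestion.light_rail_gps import dataframe_from_gz

MAX_BATCH_SIZE = 5_000  # assumed constant; lower it if the AWS event loop runs long


def ingest_light_rail_gps(incoming_prefix: str) -> None:
    """Process one batch of RAW Light Rail GPS files per ingestion loop."""
    gps_files = file_list_from_s3(incoming_prefix)[:MAX_BATCH_SIZE]

    # conversion happens entirely in memory with polars (< 2 GB for ~5,000 files)
    dataframe, archive_files, error_files = dataframe_from_gz(gps_files)

    # remaining steps (not shown): write per-date parquet merged with existing
    # public S3 files, then archive the processed files and flag the errors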

Comment on lines 195 to 204
def object_exists(obj: str) -> bool:
    """
    check if s3 object exists

    :param obj - expected as 's3://my_bucket/object' or 'my_bucket/object'

    :return: True if object exists, otherwise False
    """
rymarczy (Collaborator, Author) commented:

Convenience function to check if an object exists on S3.
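
For reference, a minimal sketch of such a check using boto3's head_object; the real helper lives in src/lamp_py/aws/s3.py, so treat this as an illustration rather than the exact code:

import boto3
import botocore.exceptions


def object_exists(obj: str) -> bool:
    """Return True if 's3://my_bucket/object' or 'my_bucket/object' exists."""
    bucket, _, key = obj.replace("s3://", "").partition("/")
    try:
        boto3.client("s3").head_object(Bucket=bucket, Key=key)
        return True
    except botocore.exceptions.ClientError as exception:
        # HeadObject only documents "404" for a missing object (see discussion below)
        if exception.response["Error"]["Code"] == "404":
            return False
        raise exception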

Comment on lines +101 to +107
elif config_type == ConfigType.LIGHT_RAIL:
    raise IgnoreIngestion("Ignore LIGHT_RAIL files")
rymarczy (Collaborator, Author) commented:

This seemed like the least intrusive way to handle skipping these files in GTFS-RT processing. Raising this specific exception triggers the ignoring of this file type.
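
A self-contained sketch of the pattern. The real IgnoreIngestion class sits in src/lamp_py/ingestion/error.py per the coverage report below; the ConfigType stand-in and converter_for function here are illustrative only:

from enum import Enum, auto


class ConfigType(Enum):  # stand-in for the repo's ConfigType
    RT_VEHICLE_POSITIONS = auto()
    LIGHT_RAIL = auto()


class IgnoreIngestion(Exception):
    """Raised so GTFS-RT processing skips a file type instead of erroring it."""


def converter_for(config_type: ConfigType) -> str:
    if config_type == ConfigType.LIGHT_RAIL:
        raise IgnoreIngestion("Ignore LIGHT_RAIL files")
    return "gtfs-rt converter"


# caller side: ignored files are skipped outright, not moved to the error path
for config in (ConfigType.RT_VEHICLE_POSITIONS, ConfigType.LIGHT_RAIL):
    try:
        converter_for(config)
    except IgnoreIngestion:
        continue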

Comment on lines +104 to +103
except IgnoreIngestion:
    continue
rymarczy (Collaborator, Author) commented:

Skip ignored files; don't move them.

Comment on lines +49 to +58
def thread_gps_to_frame(path: str) -> Tuple[Optional[pl.DataFrame], str]:
    """
    gzip to dataframe converter function meant to be run in ThreadPool

    :param path: path to gzip that will be converted to polars dataframe
    """
    file_system = current_thread().__dict__["file_system"]
    path = path.replace("s3://", "")

    logger = ProcessLogger(process_name="light_rail_gps_to_frame", path=path)
    logger.log_start()
rymarczy (Collaborator, Author) commented:

This is similar to our gzip process for RT files. In testing, about 16 threads produced the maximum throughput with minimal resource usage.
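
Roughly how such a pool can be wired up. The initializer hook matches the current_thread().__dict__["file_system"] lookup in the snippet above; the exact setup in this PR is assumed:

from concurrent.futures import ThreadPoolExecutor
from threading import current_thread
from typing import Iterator, List, Optional, Tuple

import polars as pl
from pyarrow import fs


def init_s3_thread() -> None:
    # give each worker thread its own S3 filesystem handle
    current_thread().__dict__["file_system"] = fs.S3FileSystem()


def frames_from_paths(paths: List[str]) -> Iterator[Tuple[Optional[pl.DataFrame], str]]:
    # ~16 threads gave the best throughput with minimal resource usage in testing
    with ThreadPoolExecutor(max_workers=16, initializer=init_s3_thread) as pool:
        yield from pool.map(thread_gps_to_frame, paths)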

else:
    error_files.append(path)

dataframe: pl.DataFrame = pl.concat(dfs)
rymarczy (Collaborator, Author) commented:

Gather all records from all files into a single dataframe.

Comment on lines +163 to +166
for date in dataframe.get_column("date").unique():
    logger = ProcessLogger(
        process_name="light_rail_write_parquet", date=date
    )
    logger.log_start()
rymarczy (Collaborator, Author) commented:

Write parquet files by date, pulled from the "updated_at" column.

This will merge with existing parquet files from S3, if they exist, and de-dupe the entire dataframe.
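
A sketch of that per-date merge and de-dupe; object_exists is the helper added in this PR, while the prefix/output arguments are hypothetical:

import datetime

import polars as pl


def merge_day_parquet(
    dataframe: pl.DataFrame, date: datetime.date, public_prefix: str, out_path: str
) -> None:
    # new records for this date only
    day_frame = dataframe.filter(pl.col("date") == date)

    export_path = f"{public_prefix}/{date}.parquet"  # hypothetical object layout
    if object_exists(export_path):
        # stack the existing public file under the new records before de-duping
        day_frame = pl.concat([pl.read_parquet(f"s3://{export_path}"), day_frame])

    # drop duplicate rows, then write the merged day file
    day_frame.unique().write_parquet(out_path)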


Coverage of commit 0246a97

Summary coverage rate:
  lines......: 75.9% (2447 of 3224 lines)
  functions..: no data found
  branches...: no data found

Files changed coverage rate:
                                                                                     |Lines       |Functions  |Branches    
  Filename                                                                           |Rate     Num|Rate    Num|Rate     Num
  =========================================================================================================================
  src/lamp_py/aws/s3.py                                                              |48.6%    278|    -     0|    -      0
  src/lamp_py/ingestion/convert_gtfs_rt.py                                           |49.3%    213|    -     0|    -      0
  src/lamp_py/ingestion/error.py                                                     | 100%     16|    -     0|    -      0
  src/lamp_py/ingestion/light_rail_gps.py                                            |55.1%     78|    -     0|    -      0


Comment on lines +16 to +20
def test_light_rail_gps() -> None:
    """
    test gtfs_events_for_date pipeline
    """
    dataframe, archive_files, error_files = dataframe_from_gz(mock_file_list)
rymarczy (Collaborator, Author) commented:

Not totally sure if it's worth testing the rest of the process, which just includes a file_list_from_s3 call and the parquet writing.

rymarczy requested a review from mzappitello on August 15, 2024 at 12:55
mzappitello (Contributor) left a comment:

One question on the new AWS utility.

Comment on lines 218 to 222
except botocore.exceptions.ClientError as exception:
    if exception.response["Error"]["Code"] == "404":
        return False

    return False
mzappitello (Contributor) commented:

Do we want to catch a more generic exception here as well?

If something in our try raises, it'll propagate up to whatever called this function.

rymarczy (Collaborator, Author) commented:

Per this documentation: https://docs.aws.amazon.com/AmazonS3/latest/API/API_HeadObject.html#API_HeadObject_Errors

the "404" error is the only error that should be returned if the object does not exist in the bucket. Any other error, such as a permissions error, would be undefined behavior and should probably raise so that it's not masked by the function.

Comment on lines +62 to +85
with file_system.open_input_stream(path, compression="gzip") as f:
    df = (
        pl.read_json(f.read())
        .transpose(
            include_header=True,
            header_name="serial_number",
            column_names=("data",),
        )
        .select(
            pl.col("serial_number").cast(pl.String),
            pl.col("data").struct.field("speed").cast(pl.Float64),
            (
                pl.col("data")
                .struct.field("updated_at")
                .str.slice(0, length=10)
                .str.to_date()
                .alias("date")
            ),
            pl.col("data").struct.field("updated_at").cast(pl.String),
            pl.col("data").struct.field("bearing").cast(pl.Float64),
            pl.col("data").struct.field("latitude").cast(pl.String),
            pl.col("data").struct.field("longitude").cast(pl.String),
            pl.col("data").struct.field("car").cast(pl.String),
        )
    )
mzappitello (Contributor) commented:

This seems to be a nicer way to do this than using pyarrow.

rymarczy (Collaborator, Author) commented:

Yeah, I also think most conversions between pyarrow and polars are zero-copy operations. The only tricky thing is making sure we're handling any type conversions correctly.
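
A quick illustration of that round trip; note the dtype mapping (polars strings surface as Arrow large_string), which is the kind of conversion worth double-checking:

import polars as pl
import pyarrow as pa

table = pa.table({"car": ["3800", "3801"], "speed": [12.5, 0.0]})

df = pl.from_arrow(table)   # Arrow -> polars, typically zero-copy
round_trip = df.to_arrow()  # polars -> Arrow

print(df.dtypes)            # polars dtypes for the two columns
print(round_trip.schema)    # car comes back as large_string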

except botocore.exceptions.ClientError as exception:
    if exception.response["Error"]["Code"] == "404":
        return False
    raise exception
rymarczy (Collaborator, Author) commented:

Raise all other exceptions. The 404 error code is the only one that should be returned if the object does not exist; any other exception would be undefined behavior.


github-actions bot commented Sep 4, 2024

Coverage of commit 20568b0

Summary coverage rate:
  lines......: 75.8% (2455 of 3239 lines)
  functions..: no data found
  branches...: no data found

Files changed coverage rate:
                                                                                     |Lines       |Functions  |Branches    
  Filename                                                                           |Rate     Num|Rate    Num|Rate     Num
  =========================================================================================================================
  src/lamp_py/aws/s3.py                                                              |48.6%    278|    -     0|    -      0
  src/lamp_py/ingestion/convert_gtfs_rt.py                                           |49.5%    214|    -     0|    -      0
  src/lamp_py/ingestion/error.py                                                     | 100%     16|    -     0|    -      0
  src/lamp_py/ingestion/light_rail_gps.py                                            |55.1%     78|    -     0|    -      0


rymarczy merged commit a6f817f into main on Sep 4, 2024
5 checks passed