FEAT: S3 remote file paths in one location #423

Merged: rymarczy merged 6 commits into main from feat_remote_locations_everywhere on Aug 20, 2024

rymarczy
Collaborator

This change attempts to consolidate all references to S3 paths in one location.

The original version of remote_files.py was modified to reduce the verbosity of calling the resources it defines. All bucket references and LAMP file prefix constants were also added to the file.

Hopefully I found all the places where we reference S3 file locations.

Asana Task: https://app.asana.com/0/1205827492903547/1207771349226027

Comment on lines -67 to -92
tm_stop_crossing = LocalS3Location(
    bucket_name=springboard_bucket,
    file_prefix=os.path.join(tm_prefix, "STOP_CROSSING"),
)
tm_geo_node_file = LocalS3Location(
    bucket_name=springboard_bucket,
    file_prefix=os.path.join(tm_prefix, "TMMAIN_GEO_NODE.parquet"),
)
tm_route_file = LocalS3Location(
    bucket_name=springboard_bucket,
    file_prefix=os.path.join(tm_prefix, "TMMAIN_ROUTE.parquet"),
)
tm_trip_file = LocalS3Location(
    bucket_name=springboard_bucket,
    file_prefix=os.path.join(tm_prefix, "TMMAIN_TRIP.parquet"),
)
tm_vehicle_file = LocalS3Location(
    bucket_name=springboard_bucket,
    file_prefix=os.path.join(tm_prefix, "TMMAIN_VEHICLE.parquet"),
)

# output of public bus events published by LAMP
bus_events = LocalS3Location(
    bucket_name=public_bucket,
    file_prefix="bus_vehicle_events",
)
Collaborator Author

Doesn't seem like this stuff was being used

Comment on lines +33 to +36
rt_vehicle_positions = S3Location(
    bucket=S3_SPRINGBOARD,
    prefix=os.path.join(LAMP, "RT_VEHICLE_POSITIONS"),
)
Collaborator Author

Pulled all of these declarations out of the class to reduce the verbosity of referencing them in other parts of the application.

Contributor

makes sense. it was too long and too gross already.
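
The change discussed in this thread, module-level declarations instead of class attributes, might look roughly like the sketch below. Only the `bucket` and `prefix` arguments and the `s3_uri` attribute appear in this PR; the body of `S3Location` and the value of `LAMP` are assumptions made for illustration.

```python
import os
from dataclasses import dataclass

# Bucket name read from the environment, as elsewhere in this PR.
S3_SPRINGBOARD: str = os.environ.get("SPRINGBOARD_BUCKET", "")
LAMP = "lamp"  # assumed value of the LAMP prefix constant


@dataclass(frozen=True)
class S3Location:
    """Hypothetical stand-in for the S3Location class in remote_files.py."""

    bucket: str
    prefix: str

    @property
    def s3_uri(self) -> str:
        """Full s3:// URI for this location."""
        return f"s3://{self.bucket}/{self.prefix}"


# Declared at module level, so callers can import the name directly
# instead of reaching through a container class.
rt_vehicle_positions = S3Location(
    bucket=S3_SPRINGBOARD,
    prefix=os.path.join(LAMP, "RT_VEHICLE_POSITIONS"),
)
```

With this layout a caller would write `from lamp_py.runtime_utils.remote_files import rt_vehicle_positions` and use `rt_vehicle_positions.s3_uri`, with no intermediate class name.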

)


class GTFSArchive(S3Location):
Collaborator Author

Subclass with an additional function for compressed GTFS archive files.

rymarczy force-pushed the feat_remote_locations_everywhere branch from 9c8aa4a to 1f52ba3 on August 16, 2024 16:13
rymarczy force-pushed the feat_remote_locations_everywhere branch from 1f52ba3 to c60cb65 on August 16, 2024 17:09

Coverage of commit c60cb65

Summary coverage rate:
  lines......: 76.7% (2409 of 3140 lines)
  functions..: no data found
  branches...: no data found

Files changed coverage rate:
                                                                                     |Lines       |Functions  |Branches    
  Filename                                                                           |Rate     Num|Rate    Num|Rate     Num
  =========================================================================================================================
  src/lamp_py/bus_performance_manager/gtfs.py                                        |90.9%     77|    -     0|    -      0
  src/lamp_py/bus_performance_manager/gtfs_utils.py                                  | 100%      9|    -     0|    -      0
  src/lamp_py/ingestion/compress_gtfs/gtfs_to_parquet.py                             |76.5%    102|    -     0|    -      0
  src/lamp_py/ingestion/compress_gtfs/schedule_details.py                            |85.4%     96|    -     0|    -      0
  src/lamp_py/ingestion/convert_gtfs.py                                              |79.6%     49|    -     0|    -      0
  src/lamp_py/ingestion/convert_gtfs_rt.py                                           |48.8%    211|    -     0|    -      0
  src/lamp_py/ingestion/utils.py                                                     |61.6%    112|    -     0|    -      0
  src/lamp_py/ingestion_tm/jobs/parition_table.py                                    |45.0%     80|    -     0|    -      0
  src/lamp_py/ingestion_tm/jobs/whole_table.py                                       |75.8%     91|    -     0|    -      0
  src/lamp_py/performance_manager/alerts.py                                          |53.3%    199|    -     0|    -      0
  src/lamp_py/performance_manager/flat_file.py                                       |82.7%    104|    -     0|    -      0
  src/lamp_py/performance_manager/l0_gtfs_static_load.py                             |88.1%    135|    -     0|    -      0
  src/lamp_py/runtime_utils/remote_files.py                                          | 100%     43|    -     0|    -      0

@@ -54,6 +54,7 @@ pylint = "^3.2.2"
pytest = "^8.2.1"
pytest-cov = "^5.0.0"
types-python-dateutil = "^2.9.0.20240316"
pytest-env = "^1.1.3"
Collaborator Author

Use this package for env vars during testing to avoid any ENV VAR mocking requirements.

rymarczy requested a review from mzappitello on August 16, 2024 17:18
Comment on lines -307 to -314
@pytest.fixture(name="_set_env_vars")
def fixture_set_env_vars() -> None:
    """setup bucket names for this test"""
    os.environ["SPRINGBOARD_BUCKET"] = "springboard"
    os.environ["ERROR_BUCKET"] = "error"
    os.environ["ARCHIVE_BUCKET"] = "archive"


Collaborator Author

shouldn't need this with pytest-env package
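
One likely reason the fixture becomes unnecessary (this is a reading of the change, not something stated in the PR): once bucket names are module-level constants, they are read when the module is imported, so a fixture that sets `os.environ` later cannot affect them. pytest-env, as documented, applies variables from the pytest configuration at startup, before test modules are imported. The sketch below imitates import-time evaluation with a hypothetical `read_bucket_constants` function.

```python
import os


def read_bucket_constants() -> dict[str, str]:
    """Imitate module-level `os.environ.get(...)` calls evaluated at import time."""
    return {
        "S3_SPRINGBOARD": os.environ.get("SPRINGBOARD_BUCKET", ""),
        "S3_ERROR": os.environ.get("ERROR_BUCKET", ""),
    }


# Case 1: the variable is set only after the "import", like a fixture would do.
os.environ.pop("SPRINGBOARD_BUCKET", None)
constants_before = read_bucket_constants()
os.environ["SPRINGBOARD_BUCKET"] = "springboard"
# constants_before was captured earlier, so it still holds the "" default.

# Case 2: the variable is already set when the "import" happens,
# which is the situation pytest-env is meant to create.
constants_after = read_bucket_constants()
```

Here `constants_before["S3_SPRINGBOARD"]` stays empty while `constants_after["S3_SPRINGBOARD"]` picks up the configured value.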

Coverage of commit 2927301

Summary coverage rate:
  lines......: 76.3% (2396 of 3140 lines)
  functions..: no data found
  branches...: no data found

Files changed coverage rate:
                                                                                     |Lines       |Functions  |Branches    
  Filename                                                                           |Rate     Num|Rate    Num|Rate     Num
  =========================================================================================================================
  src/lamp_py/bus_performance_manager/gtfs.py                                        |90.9%     77|    -     0|    -      0
  src/lamp_py/bus_performance_manager/gtfs_utils.py                                  | 100%      9|    -     0|    -      0
  src/lamp_py/ingestion/compress_gtfs/gtfs_to_parquet.py                             |67.6%    102|    -     0|    -      0
  src/lamp_py/ingestion/compress_gtfs/schedule_details.py                            |81.2%     96|    -     0|    -      0
  src/lamp_py/ingestion/convert_gtfs.py                                              |79.6%     49|    -     0|    -      0
  src/lamp_py/ingestion/convert_gtfs_rt.py                                           |48.8%    211|    -     0|    -      0
  src/lamp_py/ingestion/utils.py                                                     |61.6%    112|    -     0|    -      0
  src/lamp_py/ingestion_tm/jobs/parition_table.py                                    |45.0%     80|    -     0|    -      0
  src/lamp_py/ingestion_tm/jobs/whole_table.py                                       |75.8%     91|    -     0|    -      0
  src/lamp_py/performance_manager/alerts.py                                          |53.3%    199|    -     0|    -      0
  src/lamp_py/performance_manager/flat_file.py                                       |82.7%    104|    -     0|    -      0
  src/lamp_py/performance_manager/l0_gtfs_static_load.py                             |88.1%    135|    -     0|    -      0
  src/lamp_py/runtime_utils/remote_files.py                                          | 100%     43|    -     0|    -      0

Contributor

mzappitello left a comment

i have a few comments, mostly around whether we should be making more constants for remote locations in remote_files.py or generating those paths, based on the constants we have, in the files that use them.

my intuition says everything should be defined there, but i would concede it's unlikely we'd need some of these paths outside of the context they're defined in (like hyper file publication).

@@ -77,6 +78,13 @@ log_cli = true
log_cli_level = "DEBUG"
verbose = true

[tool.pytest.ini_options]
Contributor

that's a cool addition.

Comment on lines 6 to 10
S3_SPRINGBOARD: str = os.environ.get("SPRINGBOARD_BUCKET", "")
S3_PUBLIC: str = os.environ.get("PUBLIC_ARCHIVE_BUCKET", "")
S3_INCOMING: str = os.environ.get("INCOMING_BUCKET", "")
S3_ARCHIVE: str = os.environ.get("ARCHIVE_BUCKET", "")
S3_ERROR: str = os.environ.get("ERROR_BUCKET", "")
Contributor

i would suggest changing the defaults from "" to "<bucket_name>_unset" to help with debugging?

Contributor

there is the issue in publishing/performancedata.py around this though where we string match the bucket, but we could handle that differently with a block in the run_on_app_start similarly to how we run the alembic_upgrade_to_head using an env var directly.
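
The suggestion above about `"<bucket_name>_unset"` defaults could be sketched as follows. `bucket_from_env` is a hypothetical helper, not part of lamp_py, and deriving the placeholder from the variable name is one possible reading of `<bucket_name>`.

```python
import os


def bucket_from_env(var_name: str) -> str:
    """Read a bucket name, falling back to a recognizable placeholder.

    Hypothetical helper: an unset variable yields e.g.
    "springboard_bucket_unset" instead of an empty string.
    """
    return os.environ.get(var_name, f"{var_name.lower()}_unset")


S3_SPRINGBOARD: str = bucket_from_env("SPRINGBOARD_BUCKET")
S3_ERROR: str = bucket_from_env("ERROR_BUCKET")
```

A path built from an unset bucket would then show up in logs as something like `s3://springboard_bucket_unset/...` rather than `s3:///...`, which is easier to trace back to a missing variable. Any code that string-matches on the bucket name, as mentioned for publishing/performancedata.py, would need to account for the placeholder.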

-gtfs_tmp_folder = GTFS_PATH.replace(
-    os.getenv("PUBLIC_ARCHIVE_BUCKET"), "/tmp"
-)
+gtfs_tmp_folder = compressed_gtfs.s3_uri.replace(S3_PUBLIC, "/tmp")
Contributor

does it make sense to build in the /tmp local file locations into the S3Location class as well, since it comes up as a pattern somewhat often and the location info is always the same?

Contributor

looking at it more, that looks like a hairy mess that's outside the scope here.

Comment on lines 23 to 35
from lamp_py.runtime_utils.remote_files import (
    S3_PUBLIC,
    LAMP,
)

from .gtfs_utils import BOSTON_TZ


class AlertsS3Info:
    """S3 Constant info for Alerts Parquet File"""

    bucket_name: str = os.environ.get("PUBLIC_ARCHIVE_BUCKET", "")
    s3_path: str = "s3://" + os.path.join(
-        bucket_name, "lamp", "tableau", "alerts", "LAMP_RT_ALERTS.parquet"
+        S3_PUBLIC, LAMP, "tableau", "alerts", "LAMP_RT_ALERTS.parquet"
Contributor

i think this should be another instance of the S3Location class rather than building up the path in this file.

Comment on lines 23 to 27
-remote_parquet_path=f"s3://{os.getenv('PUBLIC_ARCHIVE_BUCKET')}/lamp/tableau/rail/LAMP_ALL_RT_fields.parquet",
+remote_parquet_path=f"s3://{S3_PUBLIC}/{LAMP}/tableau/rail/LAMP_ALL_RT_fields.parquet",
Contributor

again it feels like something that could be a constant in remote_files.py rather than building it up here.
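
What the reviewer is asking for, in this comment and the similar one on the alerts path, might look like the sketch below: the Tableau paths become constants in remote_files.py and callers use `.s3_uri`. The constant names `tableau_alerts` and `tableau_rail` are hypothetical, `S3Location` is a minimal stand-in rather than the project's class, and `LAMP = "lamp"` is assumed.

```python
import os
from typing import NamedTuple

S3_PUBLIC: str = os.environ.get("PUBLIC_ARCHIVE_BUCKET", "")
LAMP = "lamp"  # assumed value of the LAMP prefix constant


class S3Location(NamedTuple):
    """Minimal stand-in for the S3Location class in remote_files.py."""

    bucket: str
    prefix: str

    @property
    def s3_uri(self) -> str:
        """Full s3:// URI for this location."""
        return f"s3://{self.bucket}/{self.prefix}"


# Hypothetical constant names; the paths themselves come from the diffs above.
tableau_alerts = S3Location(
    bucket=S3_PUBLIC,
    prefix=os.path.join(LAMP, "tableau", "alerts", "LAMP_RT_ALERTS.parquet"),
)
tableau_rail = S3Location(
    bucket=S3_PUBLIC,
    prefix=os.path.join(LAMP, "tableau", "rail", "LAMP_ALL_RT_fields.parquet"),
)
```

The call sites would then pass `tableau_rail.s3_uri` or `tableau_alerts.s3_uri` instead of assembling an f-string locally.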

rymarczy force-pushed the feat_remote_locations_everywhere branch from dcce7c1 to 6e32feb on August 20, 2024 16:37

Coverage of commit 6e32feb

Summary coverage rate:
  lines......: 76.7% (2412 of 3143 lines)
  functions..: no data found
  branches...: no data found

Files changed coverage rate:
                                                                                     |Lines       |Functions  |Branches    
  Filename                                                                           |Rate     Num|Rate    Num|Rate     Num
  =========================================================================================================================
  src/lamp_py/bus_performance_manager/gtfs.py                                        |90.9%     77|    -     0|    -      0
  src/lamp_py/bus_performance_manager/gtfs_utils.py                                  | 100%      9|    -     0|    -      0
  src/lamp_py/ingestion/compress_gtfs/gtfs_to_parquet.py                             |76.5%    102|    -     0|    -      0
  src/lamp_py/ingestion/compress_gtfs/schedule_details.py                            |85.4%     96|    -     0|    -      0
  src/lamp_py/ingestion/convert_gtfs.py                                              |79.6%     49|    -     0|    -      0
  src/lamp_py/ingestion/convert_gtfs_rt.py                                           |48.8%    211|    -     0|    -      0
  src/lamp_py/ingestion/utils.py                                                     |61.6%    112|    -     0|    -      0
  src/lamp_py/ingestion_tm/jobs/parition_table.py                                    |45.0%     80|    -     0|    -      0
  src/lamp_py/ingestion_tm/jobs/whole_table.py                                       |75.8%     91|    -     0|    -      0
  src/lamp_py/performance_manager/alerts.py                                          |53.3%    199|    -     0|    -      0
  src/lamp_py/performance_manager/flat_file.py                                       |82.7%    104|    -     0|    -      0
  src/lamp_py/performance_manager/l0_gtfs_static_load.py                             |88.1%    135|    -     0|    -      0
  src/lamp_py/runtime_utils/remote_files.py                                          | 100%     46|    -     0|    -      0

rymarczy requested a review from mzappitello on August 20, 2024 18:08

mzappitello
Contributor

LGTM. Merge away.

rymarczy merged commit b39c127 into main on Aug 20, 2024
5 checks passed
rymarczy added a commit that referenced this pull request Aug 20, 2024
PR #423 re-factor broke compressed GTFS process by prefixing the gtfs_temp_folder declaration with s3://
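
A plausible illustration of that regression, assuming the original `GTFS_PATH` had no `s3://` scheme while `s3_uri` does: replacing only the bucket name leaves the scheme in front of the local path. The bucket value and path below are made up, and the second replacement is one possible remedy, not necessarily the fix that was committed.

```python
S3_PUBLIC = "public-bucket"  # example value, not a real bucket name
s3_uri = f"s3://{S3_PUBLIC}/lamp/gtfs_archive"  # hypothetical path

# Replacing only the bucket name keeps the "s3://" scheme in place.
broken_tmp_folder = s3_uri.replace(S3_PUBLIC, "/tmp")

# Replacing the scheme and bucket together yields a plain local path.
fixed_tmp_folder = s3_uri.replace(f"s3://{S3_PUBLIC}", "/tmp")

print(broken_tmp_folder)  # s3:///tmp/lamp/gtfs_archive
print(fixed_tmp_folder)   # /tmp/lamp/gtfs_archive
```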