From b1ffb4912d0d5819e603799a85ff657eb7cec1ed Mon Sep 17 00:00:00 2001
From: James Lott
Date: Tue, 6 Sep 2022 15:53:57 -0400
Subject: [PATCH] Deploy elavon SFTP credentials into production (#1749)

* airtable: start renaming int to base
* airtable: refactor staging tables to be historical; refactor get latest macro to enable daily extract selection
* airtable: convert staging to views rather than tables
* airtable: convert intermediate mapping tables to base
* always compile, but only check dbt run success after docs/metabase
* run tests even if run failed
* airtable: define key as metabase PK
* airtable: add equal row count tests for models with id mapping
* airtable: rename map to bridge
* update poetry.lock for dbt-metabase
* airtable: latest-only-ify bridge tables
* missed a couple
* airtable: make mart latest-only
* airtable: refactor dim service components
* airtable: specify metabase FK columns
* airtable: new fields & tables to address #1630
* airtable: make bridge tables date-aware and assorted small fixes
* get us going!
* airtable: address failing dbt tests -- minor tweaks
* airtable: more failing dbt tests
* airtable: refactor service components to handle duplicates
* airtable: fix legacy airtable source definition to reference views
* airtable: remove redundant metabase FK metadata
* airtable: fix test syntax
* airtable: use QUALIFY to simplify ranked queries
* fix: make airtable gcs operator use timestamps rather than time string
* fix(timestamp partitions): update calitp version to get schedule partition updates
* warehouse (payments): migrated payments_views_staging cleaned dags to models as well as validation tables to tests
* use new calitp version
* fix(timestamp partitions): explicitly use isoformat string
* style: rename CTEs to be more specific
* farm surrogate key macro: coalesce nulls in macro itself
* add notebook used to re-name a partition
* chore: remove pyup config file no longer in use
* chore: remove pyup ignore statement
* airtable: use ts instead of time
* add airtable mart to list of things synced to metabase
* update metabase database names again
* warehouse(payments_views_staging): split yml files into staging and source, added documentation for cleaned files, deleted old validation tables
* warehouse(payments_views_staging): added generic tests, added composite unique tests from dbt_packages, added docs file with references, materialized staging tables as views
* warehouse(payments_views_staging): added configuration to persist singular tests as tables in the warehouse
* warehouse(payments_views): migrated airflow dags for payments views to its own model in dbt, added metadata and generic tests, added dbt references
* print message if deploy is not set
* round lat/lons, specify 4m accuracy, add new resources
* print the documentation file being written
* add coord system, disable shapes for now due to size limit
* fix(fact daily trips timeout): wip incremental table
* update to good stable version of sqlfluff
* fix: make fact daily trips incremental -- WIP
* pass and/or ignore new rules
* linter
* fact daily trips: remove dev incremental check
* docs: update airtable prod maintenance instructions
* docs: add new dags to dependency diagram
* docs: add spacing to help w line wrapping
* docs: more spaces for line wrapping...
* dbt-metabase: update version in poetry; comment out failing relationship tests
* warehouse(payments_views): got payments_rides working and migrated, added yml and metadata, added payments_views validation tests and persisted tables, added payments_views_refactored with intermedite tables and got that to work
* get new calitp version
* import gcs models from calitp-py!
* missed a couple
* get us going!
* fix: make airtable gcs operator use timestamps rather than time string
* fix(timestamp partitions): update calitp version to get schedule partition updates
* fix(timestamp partitions): explicitly use isoformat string
* use new calitp version
* start experimenting with task queue options and metrics
* get this working and test performance with greenlets
* couple more metrics
* wip testing with multiple consumers at high volume
* start optimizing for lots of small tasks; have to make redis interaction fast
* fix key str format
* couple more libs
* wip
* wip on discussed changes
* get the keys from environ for now
* use new calitp py
* print a bit more
* we are just gonna get stuff in the env
* commit this before I break anything
* fmt
* bump calitp-py
* lint
* rename v2 to v3 since 2.X tags already exist
* kinda make this runnable
* new node pool just dropped
* get running in docker compose to kick the tires
* start on RT v3 k8s
* get the consumer working mostly?
* label redis pod appropriately
* tell consumer about temp rt secrets
* that was dumb
* ticker k8s!
* set expire time on the huey instance
* point consumer at svc account json
* avoid pulling the stacktrace in
* scrape on 9102
* bump to 16 workers per consumer
* bump jupyterhub storage to 32gi
* add these back!
* add comment
* bring in new calitp and fix tick rounding
* improve metrics and labels
* warehouse(payments): removed payemnts_rides_refactor from yml file
* clean up labels
* get secrets from secret manager sdk before the consumer starts...
* missed this
* fix secrets volume and adjust affinities
* warehouse(payments): removed the airflow dags for the payments_views that were migrated, as well as the two test tables
* warehouse(payments): removed the old intermediate tables from the dbt project yaml file
* add content type header to bytes
* ugh whitespace
* warehouse: fixing linting error
* warehouse: fixing linting error again
* warehouse(dbt_project): added to-do comments in project config to remind where to move model schemas in the future
* fix: update Mountain Transit URL
* remove celery and gevent from pyproject deps
Co-authored-by: Mjumbe Poe
* we might as well specify huey app name by env as well just in case we end up on the same redis in the future
* write to the prod bucket!
* create a preprod version and deploy it
* run fewer workers in preprod
* move pull policies to patches, and only run 1 dev consumer
* add redis considerations to readme
* docs(datasets and tables): revised informationon dbt docs for views tables based on PR review
* docs(datasets and tables): revised for readability
* docs(datasets and tables): revised docs information for gtfs schedule based on PR review
* docs(datasets and tables): fixed readability
* docs(datasets and tables): added new formatting, added gtfs rt dbt docs instructions
* docs(datasets and tables): revamped the overview page for datasets and tables
* docs(datasets and tables): cleaned up readability
* bump version and start adding more logging context
* specifically log request errors that do not come from raise_for_status
* set v3 image versions separately
* bump to 8 workers and improve log formatting
* formatting
* fix string representation of exception type in logs
* bump prod to 3.1
* oops
* hotfix version
* bump to 30m
* warehouse(airflow): deleted the empty payments_views_staging dag directory
* warehouse(airflow): deleted dummy_staging airflow task, removed gusty dependencies from other tables that relied on that task
* docs(airflow): edited the production dags docs to reflect changes in payments staging views dags
* docs(airflow): revised docs based on lauries comment re only listing enfoorced dependencies
* Update new-team-member.md
  Fixed added missing meetings, deleted old meetings. deleted auto-assign
* docs(datasets ans tables): reconfigured some pages for readability
* docs(datasets and tables): re-reviewed and added clarity
* fix (open data): align column publish metadata with open data dictionary -- suppress calitp hash, synthetic keys, and extraction date, add calitp_itp_id and url_number
* docs(production maintenance): added n/a for dependencies for payments_views
* docs(datasets and tables): created new page with content on how to use dbt docs, added to toc
* docs(datasets and tables): removed information on how to navigate dbt docs in favor of the new page created, added info to warehouse schema sections, created dbt project cirectory sections
* (analyst_docs): update gcloud commands
* fix(open data): make test_metadata attribute optional to account for singular tests
* docs(datasets and tables): reformatted for readability and conciseness
* docs(datasets and tables): revisions based on Laurie's review
* docs(datasets and tables): revised PR to put gtfs views tables used by ckan under the views doc
* fix(open data): suppress publishing stop_times because of size limit issue
* agencies.yml: update FCRTA and add Escalon Transit
* agencies.yml: rename escalon transit to etrans
* fix(airflow/gtfs_loader): replace non-utf-8 characters
* feat(airtable): add new columns per request #1674
* fix(airtable data): address review comments PR #1677
* fix: add WeHo RT URLs
* fix(ckan publishing): only add columns to data dictionary if they don't have publish.ignore set
* update calitp py and change log
* make docker compose work
* specify buckets and bump version in dev
* now do prod
* change logging
* add weho key
* bump gtfs rt v3 version
* bump calitp py
* deploy new image to dev
* get dev and prod working with bucket env vars
* bump calitp py and expire cache every 5 minutes
* deploy new cache clearing to prod/dev
* make sure calitp is updated, load secrets in ticker too
* fix docker compose, use new flags, deploy new image to dev
* bump prod
* add airtable age metric, bump version, scrape ticker
* delete experimental fact_daily_trips_inc incremental table that was not functioning correctly (#1681)
* docs: correct Transit Technology Stacks title (#1565)
  The Transit Technology Stacks header was not properly being linked to in the overview table. This fixes that.
* fix: update GRaaS URLs (#1690)
* New schedule pipeline validation job (#1648)
* wip on validation in new schedule pipeline
* bring in stuff from calitp storage, work on saving validations/outcomes
* wip getting this working
* use new calitp prerelease, fix filenames/content, remove break
* oops
* working!
* update lockfile
* unzip/validate schedule dag
* remove this
* bring in latest calitp-py
* extra print
* pass env vars into pod
* fix lint
* add readme
* bring in latest calitp
* fix print and formatting
* bring the outcome-only classes over, and use env var for bucket
* filter out nones for RT airtable records
* bring in latest calitp py
* get latest calitp
* use new env var and rename validation job results
* start updating airflow with new calitp py and using bucket env vars
* test schedule downloader with new calitp
* new calitp
* handle new calitp, better logging
* add env vars for new calitp
* put prefix_bucket back for parse_and_validate_rt and document env var configuration
* comments
* use new version of caltip py with good gcsfs (#1693)
* use new version of caltip py with good gcsfs
* use the regular release
* docs(agency): adding reference table for analysts to define agency, reference for pre-commit hooks (#1430)
* docs(agency): adding reference table for analysts to define agency in their research
* docs(agency): fixed table formatting error
* docs(agency): fixed table formatting error plus pre-commit hooks
* docs(pre-commit hooks): added information for using and troubleshooting pre-commit hooks
* docs: formatting errors, added missing capitalization
* docs: formatting table with list
* docs: formatting table with no line break - attempt 1
* docs: clarified language and spacing in table
* docs: clarified language in table
* docs: removing extra information from agency table
* docs: removing extra information from agency table pt 2
* docs: removing extra information from agency table pt 3
* docs: reworked table to include gtfs-provider-service relationships
* docs: added space for the gtfs provider's services section
* docs: added space for the gtfs provider's services section syntax corrections
* docs: added space for the gtfs provider's services section syntax corrections again
* docs: clarified information arounf gtfs provider relationships
* docs: clarified information around gtfs provider relationships and intro content
* docs: agency table revisions based on call with E
* docs(agency reference): incorporated E's feedback in the copy, added warehouse table instead of airtable table
* docs(agency reference): reformatted table
* docs(warehouse): added new table information for analyst agency reference now that the airtable migration is complete and the table was created. added css styling to prevent table scrolling
* docs: renamed python library file h1 to be more intuitive
* docs(conf): added comments explaining the added css preventing horizontal scroll in markdown tables
* docs(add to what_is_agency)
* docs(warehouse): fixed some typos, errors, and formatting issues
Co-authored-by: natam1
Co-authored-by: Charles Costanzo
* we also have to pin a specific fsspec version directly in the requirements (#1694)
* Create SFTP ingest component for Elavon data (#1692)
* kubernetes: sftp-ingest-elavon: add server component
* kubernetes: sftp-server: add sshd configuration
  This enables functionality like chroot'd logins and disabling of shell logins.
* kubernetes: sftp-server: add readinessProbe
  Since the container is essentially built at startup, there is a sizeable time delta between container startup and ssh server startup. This addition helps the operator easily detect when installation is complete and the service is running.
* kubernetes: sftp-server: add cluster service
  This enables cluster workloads to login using a DNS names.
* kubernetes: sftp-server: refactor bootstrap script for better DRY
* kubernetes: prod-sftp-ingest-elavon: create production localization
* kubernetes: prod-sftp-ingest-elavon: add internet-service.yaml
  This exposes the SFTP port for inbound connections from the vendor.
* ci: prod-sftp-ingest-elavon.env: enable prod deployment
* Fix typo in `what is agency` (#1698)
  it's --> it's
* limit schedule validation jobs with a pool (#1700)
* Created new row-level access policy macro and applied it to payments_rides (#1697)
* created new row-level access policy and applied it to payments rides with newly generated service accounts
* ran pre-commit hooks to fix failing actions
Co-authored-by: Charles Costanzo
* deploy voila fix (#1702)
* disable autodetect if schema is specified (#1704)
* Create v2 RT parsing and validation jobs in Airflow and creates external tables (#1691)
* start on new parsing job
* comment and fmt
* wip getting parsing working
* fmt
* get parsing working!
* save outcomes file properly
* remove old validator and dupe log
* this is only jsonl right now, so this workaround is bad
* wip on validation
* wip
* get parsing working, start simplifying
* get validation working with schedules referenced by airtable!
* missed this
* get the actual rt v2 airflow jobs mostly working
* missed this
* run v2 RT jobs at :15 instead of :30
* convert metadata field names to bq-safe
* fix being able to template bucket and test out rt_service_alerts_v2 external table
* add outcomes external table to test
* wip trying to get a debugger to test pydantic custom serialization
* fix rt outcome serialization to be bq safe
* create rest of rt v2 external tables
* couple small fixes
* start addressing PR comments
* address PR comment
* add ci/cd action to build gtfs-rt-parser-v2 image
* Fix: skip amplitude_benefits DAG if 404 (#1705)
* fix(amplitude): mark skip when 404 is encountered
* chore(amplitude): add some logging statements around API call
* Gtfs schedule unzip v2 (#1696)
* gtfs loader v2 wip
* gtfs unzipper v2: semi-working WIP -- can unzip at least one zipfile
* address initial review comments
* bump calitp version
* gtfs unzipper v2: working version with required functionality
* update calitp and make the downloader run with it
* gtfs unzipper v2: get working in airflow; use logging
* rename to distinguish zipfile from extracted files within zipfile
* resolve reviewer comments
* gtfs unzipper v2: refactor to raise exceptions on unparseable zips
* gtfs unzipper: further simplify exception handling
* final tweaks -- refactor of checking for invalid zip structure, tighten up processing of valid files
* comment typos/clarifications
Co-authored-by: Andrew Vaccaro
* warehouse: added fare_systems transit database mart table (#1701)
* warehouse: added fare_systems transit database mart table
* warehouse: fixed duplicate doc macro issue for fare_systems
* explicitly declared schema
* removed columns no longer relevant
* warehouse: added bridge table for fare_systems x services
* warehouse: added bridge table for fare_systems x services to yaml
* Clean up RT outcomes (#1709)
* remove unnecessary json_encoders
* just save the extract path
* add a dockerignore
* grant access to payments_rides for non-agency users (#1714)
* grant access to payments_rides for non-agency users
* just use calitp domain and add a couple other users
* Run CKAN weekly, with multipart uploads as needed (#1710)
* wip getting multipart upload to ckan working
* remove before I forget
* mirror the example script... we get 500s with too many chunks
* commit this while it is working
* add this back
* allow env var to control target and bucket
* create weekly task to run publish california_open_data
* allow manifest to be in gcs
* get this actually working...
* dockerignore
* clean up names, add resource requests, make work in pod operator
* address PR comments
* load this from a secret (#1717)
* Initial dbt models to support GTFS guidelines checks (#1712)
* initial work towards #1688
* gtfs guidelines initial implementation: tweaks & improvements
* gtfs guidelines: add metabase semantic type for calitp agency name
* sync new dataset to metabase
* gtfs guidelines: rename table, formatting updates
* rename compliance gtfs feature per PR review
* Add RT VP vs Sched Table (#1708)
* add table
* add table
* add operator
* fix sql syntax
* fix failing indentations
* add unique test
* fix .yml test
* Create local Dockerfile and bash script for dbt development (#1711)
* start on local dev dockerfile
* handle local profiles dir
* make dbt docker work with local google credentials
* add build-essentials per recommendation
* update poetry install method and add libgdal-dev
* poetry changed its bin location
* Improvements to dbt artifacts and publish workflow (#1726)
* add ts partition to publish artifacts
* also save artifacts with timestamps vs just latest
* start simplifying publish script, proper dry runs, reading manifest from gcs
* fix publish assert, use env vars, simplify logging
* allow resource descriptions in publishing, allow direct remote writing
* ugh
* need to be utc
* bring in simplified descriptions
* missed bucket
* upload metadata/dictionary to gcs for ckan; also fix bug
* update ckan docs to reflect publishing changes
* actually these should always get written
* env vars not templating
* fix timestamped artifact names
* pretty print
* address pr comments
* update ckan publishing docs
* actually set ckan precision fields and use them
* uppercase field types and allow specifying a model to publish
* bad dict key
* these are length 7
* lats are only 6 digits
* warehouse documentation: add calitp_itp_id and calitp_url_number metadata to several dimensional columns (#1733)
* airtable organizations: define external table schemas (#1734)
* Upgrade schedule validator and save version as metadata (#1729)
* update to v3 validator, fix dockerfile
* finally deploy the schedule validator image through github actions
* bring in latest calitp
* use new calitp, simplify metadata, add version to notice rows, couple qol improvements
* change flag per v3
* use poetry export install here too
* lock
* export install here too
* add verbose, just copy jar instead of download
* use environ directly
* Set RT validator version as metadata and fix a bug (#1732)
* set rt validator version as metadata
* add validator version in metadata and put extract under a key
* fix schedule data exception string representation and assert after outcomes upload
* fix poetry in docker, lock
* use export and install
* update typer
* fix schedule downloading... also add url filter to cli
* get latest validator from github just in case, and keep name
* rename this here too
* address PR comments
* add pool for airtable (#1743)
* deprecate airtable v1 extracts (#1699)
* deprecate airtable v1 extracts
* delete v1 airtable operator
* Change column name to fix run error (#1730)
* change date col name
* fix service_date col
* chore: remove evansiroky from most CODEOWNERS items (#1735)
* kubernetes: prod-sftp-ingest-elavon: add elavon ssh public key (#1742)

Co-authored-by: Laurie Merrell
Co-authored-by: Andrew Vaccaro
Co-authored-by: Andrew Vaccaro
Co-authored-by: Charlie Costanzo
Co-authored-by: Kegan Maher
Co-authored-by: Laurie <55149902+lauriemerrell@users.noreply.github.com>
Co-authored-by: evansiroky
Co-authored-by: Mjumbe Poe
Co-authored-by: tiffanychu90
Co-authored-by: tiffanychu90 <49657200+tiffanychu90@users.noreply.github.com>
Co-authored-by: natam1
Co-authored-by: Charles Costanzo
Co-authored-by: Angela Tran
Co-authored-by: natam1 <72096633+natam1@users.noreply.github.com>
Co-authored-by: Github Action build-release-candidate
---
 kubernetes/apps/charts/jupyterhub/values.yaml                   | 2 +-
 .../apps/overlays/prod-sftp-ingest-elavon/sftp-user-config.yaml | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kubernetes/apps/charts/jupyterhub/values.yaml b/kubernetes/apps/charts/jupyterhub/values.yaml
index 69eb8e1200..364ae92340 100644
--- a/kubernetes/apps/charts/jupyterhub/values.yaml
+++ b/kubernetes/apps/charts/jupyterhub/values.yaml
@@ -8,7 +8,7 @@ jupyterhub:
     defaultUrl: "/lab"
     image:
       name: ghcr.io/cal-itp/calitp-py
-      tag: hub-v14
+      tag: hub-v15
     memory:
       # Much more than 10 and we risk bumping up against the actual capacity of e2-highmem-2
       limit: 10G
diff --git a/kubernetes/apps/overlays/prod-sftp-ingest-elavon/sftp-user-config.yaml b/kubernetes/apps/overlays/prod-sftp-ingest-elavon/sftp-user-config.yaml
index 8eefa36acf..9f51c85130 100644
--- a/kubernetes/apps/overlays/prod-sftp-ingest-elavon/sftp-user-config.yaml
+++ b/kubernetes/apps/overlays/prod-sftp-ingest-elavon/sftp-user-config.yaml
@@ -3,4 +3,4 @@ kind: ConfigMap
 metadata:
   name: sftp-user-config
 data:
-  authorized_keys: ''
+  authorized_keys: 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDDo2kni8Bu16miTauaZfvZlIDj/90t9XIr7PP03SjTQb6bzQpioOBGENKcO2eOCbyYxFfTP1jwIUFElHgsY5OQy7LUbywbzIiZRCE0kU5B7O8uNwUY7kl7nZYHYzFccDug+czfkoUBZEHVj1pnVejgHjKEomp8XFRnaeBmpQm46A0IptM+AT0u3mNkJ7kt5RRC0BwKCD2a3Nn61gD37HEjqMK8seqw/c5i1UZ2EdDEFQXoiMH2P95JxyshRv0mpa8vVBdEjmOlDQXfNarWhDcll2an3h3dm0sAtbiTPdktRl2DC1pZeiWAiitqJ6f0g+YFfC5AwX+/4m/anlK8JnH7FTuiI1dSHukf98OutWMsBWl0huuC/bO9qfQTJkqcHsmCibkRujuHCP6FXNPmHwN1FFK3AYADeEiQ5nq4QRGtN1zOLX2jz21ylpgtK8V8LOxpu/r38OqkuzEh48n3v6YrqGY58w0P+z3ywQWAzNDLr0c05Q1kU9m4YOg8NkgAU/vilUXDNfjgBWsYJHyTQQQjavj7NGfuoFgItXTki5y+ccPFiU99YU+gbL6iqJvC8qhqY8H1fafWM0tx9i4TvPirrNxXty8mS1zw9eERtDb17SkNS794ydtZ3Ohui5L78Uo4Z/WTRKHmupuBP3oFLT+tZhYBwKgnG+Y/tFrz5ov6+w== USBank' # elavon