2070 add backup of source data to nssp #2072

Merged
merged 16 commits into from
Nov 7, 2024

Conversation

minhkhul

Description

Context at #2070

Changelog

  • Make the nssp pull use delphi_utils.export.create_backup_csv() to save a backup CSV whenever it pulls new data in a prod run.
  • Adjust run, prod params.json, and params.json.template accordingly.
  • Add a test case accordingly.
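
As a rough illustration of the backup step described in the changelog, here is a minimal stdlib sketch. It is not the delphi_utils implementation: `write_backup_sketch`, its parameters, and the `%Y%m%d` date format are assumptions made for illustration. Only the `<date>_<geo_res>_<metric>_<sensor>.csv.gz` naming pattern and the "Backup file created" log message are taken from the test excerpt quoted later in this thread.

```python
import csv
import gzip
import logging
import os
import tempfile
from datetime import date


def write_backup_sketch(rows, fieldnames, backup_dir, geo_res, metric, sensor,
                        logger, issue=None):
    """Hypothetical stand-in: write rows to a gzipped CSV backup file."""
    # Assumption: fall back to today's date when no issue date is given.
    issue = issue or date.today().strftime("%Y%m%d")
    path = os.path.join(backup_dir, f"{issue}_{geo_res}_{metric}_{sensor}.csv.gz")
    # Text mode with newline="" lets the csv module control line endings.
    with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    logger.info("Backup file created")
    return path


# Demo with made-up values; none of these names come from the real pipeline.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = write_backup_sketch(
        rows=[{"geo_value": "01", "val": 1.5}],
        fieldnames=["geo_value", "val"],
        backup_dir=tmp_dir,
        geo_res="state",
        metric="demo_metric",
        sensor="demo_sensor",
        logger=logging.getLogger("backup_demo"),
        issue="20241106",
    )
    # Read the backup back while the temporary directory still exists.
    with gzip.open(demo_path, "rt", newline="", encoding="utf-8") as handle:
        demo_rows = list(csv.DictReader(handle))
```

Keeping the issue date first in the file name means the backups sort chronologically in a directory listing.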

@minhkhul minhkhul linked an issue Oct 29, 2024 that may be closed by this pull request
8 tasks
print(result)

# Check logger used:
mock_logger.info.assert_called()

You can also assert on the actual log message by using caplog (https://pytest-with-eric.com/fixtures/built-in/pytest-caplog/):

def test_merge_backfill_file(self, caplog, monkeypatch):
    caplog.set_level(logging.INFO)
    logger = get_structured_logger()

    fn = "quidel_covidtest_202008.parquet"
    assert fn not in os.listdir(backfill_dir)

    # Check when no daily file stored
    today = datetime(2020, 8, 20)
    merge_backfill_file(backfill_dir, today, logger, test_mode=True)
    assert fn not in os.listdir(backfill_dir)
    assert "No new files to merge; skipping merging" in caplog.text
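
For reference, the same caplog pattern in a self-contained form might look like the sketch below. `merge_new_files` and the logger name are hypothetical stand-ins for the function under test, and the sketch uses a plain stdlib logger rather than `get_structured_logger()`. The `caplog` argument is supplied by pytest when the test runs.

```python
import logging


def merge_new_files(new_files, logger):
    """Hypothetical function under test: logs what it decided to do."""
    if not new_files:
        logger.info("No new files to merge; skipping merging")
        return False
    logger.info("Merged %d files", len(new_files))
    return True


def test_logs_skip_message(caplog):
    # Capture INFO and above; pytest injects the caplog fixture.
    caplog.set_level(logging.INFO)
    logger = logging.getLogger("backfill_demo")

    assert merge_new_files([], logger) is False
    # Assert on the message text, not just that logger.info was called.
    assert "No new files to merge; skipping merging" in caplog.text
```

Compared with `mock_logger.info.assert_called()`, this checks which message was emitted, so the test fails if the code takes a different branch that also happens to log at INFO level.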
       

@nmdefries nmdefries left a comment

looks good

create_backup_csv(df=self.DF, backup_dir=tmp_path, custom_run=False, issue=None, geo_res=geo_res, metric=metric, sensor=sensor, logger=logger)
assert "Backup file created" in caplog.text

actual = pd.read_csv(join(tmp_path, f"{today}_{geo_res}_{metric}_{sensor}.csv.gz"), dtype=dtypes, parse_dates=["timestamp"])

@nmdefries this gets me a bit worried... when I was writing this, I saw that writing to csv does not preserve the columns' datatypes. Is there a particular reason we want to use csv.gz rather than parquet or another data format?
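
To make the concern concrete, here is a small stdlib-only sketch of a CSV round trip (not the pipeline's pandas code; the column names and values are made up for illustration). CSV stores only text, so every value comes back as a string unless the reader is told how to convert it, which is presumably why the test above passes `dtype` and `parse_dates` to `pd.read_csv`.

```python
import csv
import io
from datetime import datetime

# One typed record: a datetime, a zero-padded code, and a float.
rows = [{"timestamp": datetime(2024, 11, 6), "geo_value": "01", "val": 1.5}]

# Write the record to an in-memory CSV.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["timestamp", "geo_value", "val"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the csv module returns every field as a string.
buf.seek(0)
restored = list(csv.DictReader(buf))

original_types = {key: type(value).__name__ for key, value in rows[0].items()}
restored_types = {key: type(value).__name__ for key, value in restored[0].items()}

print(original_types)
print(restored_types)
```

A reader that infers types instead of returning strings has the opposite problem: it has to guess, and a zero-padded code such as "01" can be guessed as a number.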

Basically, status quo. We use CSV for all of our other data storage (meaning we don't preserve data types anywhere else). Since these files are intended to be backups/not in regular use, having them mesh well with other systems doesn't matter so much. I don't have a strong preference. I'm worried, though, that using a different less-standard format will cause more engineering effort here. One of our goals for this was quick rollout.

The pro of using a data-type preserving format is that it lets us read these files in and continue the pipeline as if we were pulling from the source and doing our normal processing.

@minhkhul thoughts? Maybe y'all could briefly talk to Adam about this.

Talked with Adam and decided on storing both csv and parquet for now; that shouldn't be a problem for a while, as both nssp and nchs are small.

@aysim319 aysim319 force-pushed the 2070-add-backup-of-source-data-to-nchs-and-nssp branch from a614b5f to 870024e Compare November 6, 2024 17:21

aysim319 commented Nov 6, 2024

Daily run size result:

  • nssp: 7MB (5MB for csv, 2MB for parquet)
  • nchs: 600KB (400KB for csv, 200KB for parquet)

@minhkhul minhkhul merged commit 9dbbc59 into main Nov 7, 2024
16 checks passed

minhkhul commented Nov 7, 2024

@aysim319 🙏

aysim319 commented Nov 7, 2024

Thanks for the double check :) @minhkhul

@nmdefries nmdefries deleted the 2070-add-backup-of-source-data-to-nchs-and-nssp branch December 4, 2024 03:01
Development

Successfully merging this pull request may close these issues.

Add backup of source data to nchs and nssp
3 participants