2070 add backup of source data to nssp #2072

minhkhul · 2024-10-29T19:56:46Z

Description

Context at #2070

Changelog

Make the nssp pull step call delphi_utils.export.create_backup_csv() to save a backup CSV whenever it pulls new data in a prod run.
Adjust run, the prod params.json, and params.json.template accordingly.
Add a test case for the new behavior.
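
A minimal sketch of the behavior described in the changelog, using only the standard library. `write_backup` is a hypothetical stand-in, not the `delphi_utils.export.create_backup_csv()` implementation. Its parameter names mirror the call in this PR's test, and the file name follows the `{today}_{geo_res}_{metric}_{sensor}.csv.gz` pattern from that test. The `YYYYMMDD` date format, the skipping of unset name parts, and the skip-on-custom-run behavior are assumptions.

```python
import csv
import gzip
from datetime import datetime
from pathlib import Path


def write_backup(rows, backup_dir, custom_run, issue=None, geo_res=None,
                 metric=None, sensor=None, logger=None):
    """Hypothetical stand-in: dump freshly pulled rows to a gzipped CSV.

    rows is a non-empty list of dicts that all share the same keys.
    Returns the path written, or None when no backup is made.
    """
    if custom_run:
        # Assumption: only regular (prod) runs are backed up.
        return None

    # Assumption: the issue defaults to today's date as YYYYMMDD.
    issue = issue or datetime.today().strftime("%Y%m%d")

    # Build the file name, skipping any part that was not supplied.
    name = "_".join(part for part in (issue, geo_res, metric, sensor) if part)
    path = Path(backup_dir) / f"{name}.csv.gz"

    with gzip.open(path, "wt", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

    if logger is not None:
        logger.info("Backup file created: %s", path)
    return path
```

Writing the backup immediately after the pull, before any processing, keeps the file a faithful copy of what the source returned on that day.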

ansible/templates/nssp-params-prod.json.j2

aysim319 · 2024-10-30T15:51:55Z

nssp/tests/test_pull.py

        print(result)

+        # Check logger used:
+        mock_logger.info.assert_called()


You can also assert on the actual log message by using pytest's caplog fixture (https://pytest-with-eric.com/fixtures/built-in/pytest-caplog/):

    def test_merge_backfill_file(self, caplog, monkeypatch):
        caplog.set_level(logging.INFO)
        logger = get_structured_logger()
        fn = "quidel_covidtest_202008.parquet"
        assert fn not in os.listdir(backfill_dir)

        # Check when no daily file stored
        today = datetime(2020, 8, 20)
        merge_backfill_file(backfill_dir, today, logger, test_mode=True)
        assert fn not in os.listdir(backfill_dir)
        assert "No new files to merge; skipping merging" in caplog.text
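
For comparison, a minimal self-contained sketch of both styles: asserting on a mocked logger's call arguments, and asserting on captured log text via `caplog`. `save_backup` and its log messages are hypothetical stand-ins, not code from this repo, and the second test assumes it is run under pytest.

```python
import logging
from unittest.mock import MagicMock


def save_backup(rows, logger):
    """Hypothetical function under test: logs what it did with the pulled rows."""
    if not rows:
        logger.info("No new data; skipping backup")
        return False
    logger.info("Backup file created")
    return True


def test_with_mock_logger():
    # Stricter than assert_called(): also checks the message that was logged.
    mock_logger = MagicMock()
    assert save_backup([{"val": 1}], mock_logger)
    mock_logger.info.assert_called_once_with("Backup file created")


def test_with_caplog(caplog):
    # caplog is pytest's built-in fixture; it captures records from real loggers.
    caplog.set_level(logging.INFO)
    assert save_backup([{"val": 1}], logging.getLogger("backup_demo"))
    assert "Backup file created" in caplog.text
```

The mock variant stays independent of logging configuration, while the `caplog` variant exercises the real logger, so it also catches a message that is emitted at the wrong level.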

nssp/delphi_nssp/pull.py

nmdefries

looks good

aysim319 · 2024-11-06T17:19:45Z

_delphi_utils_python/tests/test_export.py

+        create_backup_csv(df=self.DF, backup_dir=tmp_path, custom_run=False, issue=None, geo_res=geo_res, metric=metric, sensor=sensor, logger=logger)
+        assert "Backup file created" in caplog.text
+
+        actual = pd.read_csv(join(tmp_path, f"{today}_{geo_res}_{metric}_{sensor}.csv.gz"), dtype=dtypes, parse_dates=["timestamp"])


@nmdefries this gets me a bit worried... While writing this test I noticed that when we write to CSV, the column data types are not preserved. Is there a particular reason we want to use csv.gz rather than parquet or another data format?

nmdefries · 2024-11-06

Basically, status quo. We use CSV for all of our other data storage (meaning we don't preserve data types anywhere else). Since these files are intended to be backups and not in regular use, having them mesh well with other systems doesn't matter so much. I don't have a strong preference. I'm worried, though, that using a different, less standard format will cause more engineering effort here. One of our goals for this was quick rollout.

The pro of using a data-type preserving format is that it lets us read these files in and continue the pipeline as if we were pulling from the source and doing our normal processing.

@minhkhul thoughts? Maybe y'all could briefly talk to Adam about this.

aysim319 · 2024-11-06

Talked with Adam and decided on storing both CSV and parquet for now; storage shouldn't be a problem for a while, as both nssp and nchs are small.
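
To make the dtype concern concrete, here is a small standard-library sketch (not the pandas code from this PR). After a CSV round trip every value comes back as a string, so the reader has to re-apply types itself, which is the role `dtype=` and `parse_dates=` play in the `pd.read_csv` call in the test above. The column names and the `CONVERTERS` mapping are illustrative assumptions.

```python
import csv
import io
from datetime import datetime

rows = [{"geo_id": "01001", "timestamp": datetime(2024, 10, 26), "val": 1.5}]

# Write to an in-memory CSV; every value is serialized as text.
buf = io.StringIO(newline="")
writer = csv.DictWriter(buf, fieldnames=["geo_id", "timestamp", "val"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the types are gone, every value is a str.
buf.seek(0)
raw = list(csv.DictReader(buf))

# Restore the types explicitly, one converter per column (illustrative).
CONVERTERS = {"geo_id": str, "timestamp": datetime.fromisoformat, "val": float}
restored = [
    {col: CONVERTERS[col](value) for col, value in row.items()}
    for row in raw
]
```

A typed format such as parquet stores the schema alongside the data, so this conversion step is not needed when reading a backup back in.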

aysim319 · 2024-11-06T21:51:35Z

Daily run size results:
nssp: 7MB total (5MB CSV, 2MB parquet)
nchs: 600KB total (400KB CSV, 200KB parquet)
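
A back-of-the-envelope projection from these numbers, assuming one run per day and file sizes that stay constant (both are assumptions; real files will likely grow), with decimal units (1 MB = 1000 KB):

```python
# Daily backup sizes reported in this comment, in KB.
DAILY_KB = {
    "nssp": {"csv": 5_000, "parquet": 2_000},
    "nchs": {"csv": 400, "parquet": 200},
}

# Sum both formats for both indicators, then scale to a year.
daily_total_kb = sum(sum(sizes.values()) for sizes in DAILY_KB.values())
yearly_total_gb = daily_total_kb * 365 / 1_000_000

print(f"{daily_total_kb / 1000:.1f} MB per day, about {yearly_total_gb:.2f} GB per year")
```

Under those assumptions, keeping both formats costs under 3 GB per year across the two indicators if no backups are pruned.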

minhkhul · 2024-11-07T23:18:54Z

@aysim319 🙏

aysim319 · 2024-11-07T23:19:32Z

Thanks for the double check :) @minhkhul

minhkhul added 4 commits October 29, 2024 11:41

base changes

db561a3

backup dir

b9f5754

add test

3cbec82

lint

1071cb3

minhkhul linked an issue Oct 29, 2024 that may be closed by this pull request

Add backup of source data to nchs and nssp #2070

Closed

8 tasks

minhkhul requested review from nmdefries and aysim319 October 29, 2024 20:38

aysim319 reviewed Oct 29, 2024

ansible/templates/nssp-params-prod.json.j2 (resolved)

aysim319 reviewed Oct 30, 2024

nssp/delphi_nssp/pull.py (resolved)

nmdefries approved these changes Nov 5, 2024

aysim319 reviewed Nov 6, 2024

adding tests for create_backup_csv

870024e

aysim319 force-pushed the 2070-add-backup-of-source-data-to-nchs-and-nssp branch from a614b5f to 870024e Compare November 6, 2024 17:21

aysim319 added 2 commits November 6, 2024 14:18

also writing into parquet

cbc458b

adding pyarrow as dependency

5901c79

aysim319 and others added 9 commits November 6, 2024 17:00

clean test

97a2eff

adjusting logic to match new naming format and chunking

b24b4bf

moving dependencies

78ace13

lint

b3da58b

made test more robust

c471c5f

fix test

188a0af

clean up

4e47004

adding parqut into gitignore

e2aa3e0

placate the linter

f525727

minhkhul merged commit 9dbbc59 into main Nov 7, 2024
16 checks passed

nmdefries deleted the 2070-add-backup-of-source-data-to-nchs-and-nssp branch December 4, 2024 03:01

minhkhul mentioned this pull request Dec 11, 2024

Release covidcast-indicators 0.3.57 #2086

Merged
