2070 add backup of source data to nssp #2072
Conversation
nssp/tests/test_pull.py
Outdated
```python
print(result)

# Check logger used:
mock_logger.info.assert_called()
```
you can also assert on the actual log message by using caplog: (https://pytest-with-eric.com/fixtures/built-in/pytest-caplog/)
```python
import logging
import os
from datetime import datetime

# get_structured_logger, merge_backfill_file, and backfill_dir come from the
# indicator package and its test fixtures.

def test_merge_backfill_file(self, caplog, monkeypatch):
    caplog.set_level(logging.INFO)
    logger = get_structured_logger()
    fn = "quidel_covidtest_202008.parquet"
    assert fn not in os.listdir(backfill_dir)

    # Check that merging is skipped when no daily file is stored
    today = datetime(2020, 8, 20)
    merge_backfill_file(backfill_dir, today, logger, test_mode=True)
    assert fn not in os.listdir(backfill_dir)
    assert "No new files to merge; skipping merging" in caplog.text
```
looks good
```python
create_backup_csv(
    df=self.DF, backup_dir=tmp_path, custom_run=False, issue=None,
    geo_res=geo_res, metric=metric, sensor=sensor, logger=logger,
)
assert "Backup file created" in caplog.text

actual = pd.read_csv(
    join(tmp_path, f"{today}_{geo_res}_{metric}_{sensor}.csv.gz"),
    dtype=dtypes, parse_dates=["timestamp"],
)
```
@nmdefries this gets me a bit worried... when I was writing this, I found that when we write to CSV, the column datatypes are not preserved. Is there a particular reason we want to use csv.gz rather than Parquet or another data format?
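For concreteness, here is a minimal sketch of the dtype concern; it assumes pandas with pyarrow installed, and the column names are illustrative rather than taken from the nssp pipeline:

```python
import pandas as pd

df = pd.DataFrame({
    "geo_id": pd.Series(["01", "02"], dtype="string"),  # leading zeros are meaningful
    "val": [1.5, 2.5],
    "timestamp": pd.to_datetime(["2020-08-20", "2020-08-21"]),
})

# CSV round trip: dtypes must be re-specified on read, or pandas infers them.
df.to_csv("backup.csv.gz", index=False)  # compression inferred from the .gz suffix
from_csv = pd.read_csv("backup.csv.gz")
print(from_csv.dtypes)  # geo_id inferred as int64 (leading zeros lost), timestamp as object

# Parquet round trip: dtypes come back as written.
df.to_parquet("backup.parquet", index=False)
from_parquet = pd.read_parquet("backup.parquet")
print(from_parquet.dtypes)  # string, float64, datetime64[ns] preserved
```

This is why the test above has to pass explicit `dtype` and `parse_dates` arguments when reading the csv.gz back.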
Basically, status quo. We use CSV for all of our other data storage (meaning we don't preserve data types anywhere else). Since these files are intended as backups, not for regular use, having them mesh well with other systems doesn't matter so much. I don't have a strong preference. I'm worried, though, that using a different, less-standard format will cause more engineering effort here. One of our goals for this was quick rollout.
The pro of using a data-type preserving format is that it lets us read these files in and continue the pipeline as if we were pulling from the source and doing our normal processing.
@minhkhul thoughts? Maybe y'all could briefly talk to Adam about this.
Talked with Adam and decided on storing both CSV and Parquet for now; shouldn't be a problem for a while, as both nssp and nchs are small.
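A hedged sketch of what the dual-format decision could look like in the backup helper; `create_backup` and its parameters are hypothetical names for illustration (the function in this PR is `create_backup_csv`), and it assumes a structlog-style logger such as the one from `get_structured_logger`:

```python
from os.path import join

import pandas as pd

def create_backup(df: pd.DataFrame, backup_dir: str, name: str, logger) -> None:
    """Write one backup in both formats: CSV for status-quo tooling,
    Parquet so column dtypes survive the round trip."""
    csv_path = join(backup_dir, f"{name}.csv.gz")
    parquet_path = join(backup_dir, f"{name}.parquet")
    df.to_csv(csv_path, index=False)  # compression inferred from .gz suffix
    df.to_parquet(parquet_path, index=False)
    # Log message matches the one asserted in the test above; field names are assumed.
    logger.info("Backup file created", csv_path=csv_path, parquet_path=parquet_path)
```

Writing both formats keeps the status-quo CSV tooling working while the Parquet copy preserves dtypes, so the pipeline can be replayed from a backup as if pulling from the source.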
(force-pushed from a614b5f to 870024e)
daily run size result:
Thanks for the double check :) @minhkhul
Description
Context at #2070
Changelog