-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a configurable filter list to HSC data set #53
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## issue/35/cutout-interface-cleanup #53 +/- ##
=====================================================================
+ Coverage 44.08% 47.08% +3.00%
=====================================================================
Files 16 16
Lines 549 584 +35
=====================================================================
+ Hits 242 275 +33
- Misses 307 309 +2 ☔ View full report in Codecov by Sentry. |
Given dataloader:filters as config, the dataloader will: - Only scan files which are part of its filter set - Prune objects where the full list of filters provided are not present on the filesystem.
7693a07
to
5277c7f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! One minor comment.
m = re.match(full_regex, filename) | ||
|
||
# Skip files that don't match the pattern. | ||
if m is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it doesn't make the process super slow, can we log the name of the file being skipped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm more worried about log spam. Adding a debug or info level log here shouldn't slow things down unless the log is being emitted to a console.
I am thinking though that the better solution is to output that manifest fits table, which will have all the skipped files explicitly and not create a potential foot-gun for people changing the logging level to info/debug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would advocate for @mtauraso's approach here. Perhaps a middle ground would be logging some summary metrics at the end along with a message saying to look in the manifest fits table for skipped files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, putting the info in the manifest table sounds good! I also like @drewoldag's idea of some summary metrics at the end if it's easy to implement!
This is based on PR #49 , and its important for @mtauraso to change the base branch to main before merging.
I've set the base branch properly so the change is just what's added to the PR #49 branch.
Given dataloader:filters as config, the dataloader will:
are not present on the filesystem.