Add a configurable filter list to HSC data set #53

Merged
merged 1 commit into from
Aug 29, 2024

Conversation

mtauraso
Collaborator

@mtauraso mtauraso commented Aug 27, 2024

This is based on PR #49, and it's important for @mtauraso to change the base branch to main before merging.

I've set the base branch properly so the change is just what's added to the PR #49 branch.

Given dataloader:filters as config, the dataloader will:

  • Only scan files which are part of its filter set
  • Prune objects for which any filter in the provided list
    is missing from the filesystem.
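
The two behaviors above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the filename pattern `<object_id>_<filter>.fits`, the function name `select_objects`, and the filter names in the usage note are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical filename pattern: <object_id>_<filter>.fits.
# The pattern the HSC data loader actually uses may differ.
FILENAME_RE = re.compile(r"^(?P<object_id>\d+)_(?P<filter>[A-Za-z0-9-]+)\.fits$")


def select_objects(filenames, filters):
    """Return {object_id: {filter: filename}} for objects that have every requested filter."""
    wanted = set(filters)
    found = defaultdict(dict)
    for name in filenames:
        m = FILENAME_RE.match(name)
        # Only scan files that match the pattern and belong to the filter set.
        if m is None or m["filter"] not in wanted:
            continue
        found[m["object_id"]][m["filter"]] = name
    # Prune objects that are missing at least one requested filter.
    return {oid: files for oid, files in found.items() if wanted.issubset(files)}
```

For example, with files for object 1 in two filters and object 2 in only one of them, `select_objects(names, ["HSC-G", "HSC-R"])` would keep object 1 and drop object 2.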

@mtauraso mtauraso self-assigned this Aug 27, 2024
@mtauraso mtauraso linked an issue Aug 27, 2024 that may be closed by this pull request
Before [79293c8]   After [78c2f2c]   Ratio   Benchmark (Parameter)
3.39±0.7s          1.19±0.9s         ~0.35   benchmarks.time_computation
320                3.63k             11.35   benchmarks.mem_list

codecov bot commented Aug 27, 2024

Codecov Report

Attention: Patch coverage is 94.73684% with 4 lines in your changes missing coverage. Please review.

Project coverage is 47.08%. Comparing base (3bffb95) to head (7693a07).
Report is 11 commits behind head on issue/35/cutout-interface-cleanup.

Files                                       Patch %   Lines
src/fibad/data_loaders/hsc_data_loader.py   94.73%    4 Missing ⚠️
Additional details and impacted files
@@                          Coverage Diff                          @@
##           issue/35/cutout-interface-cleanup      #53      +/-   ##
=====================================================================
+ Coverage                              44.08%   47.08%   +3.00%     
=====================================================================
  Files                                     16       16              
  Lines                                    549      584      +35     
=====================================================================
+ Hits                                     242      275      +33     
- Misses                                   307      309       +2     

@mtauraso mtauraso changed the base branch from main to issue/35/cutout-interface-cleanup August 27, 2024 23:25
@mtauraso mtauraso force-pushed the issue/34/filter-list branch from 7693a07 to 5277c7f Compare August 27, 2024 23:27
Collaborator

@aritraghsh09 aritraghsh09 left a comment

Looks good! One minor comment.

m = re.match(full_regex, filename)

# Skip files that don't match the pattern.
if m is None:
Collaborator

If it doesn't make the process super slow, can we log the name of the file being skipped?
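
As a rough sketch of what such logging could look like (the helper name `match_or_skip` and the message text are hypothetical, not taken from this PR), the skipped filename can be passed as a logging argument so that the message is only formatted if a handler actually accepts the record:

```python
import logging
import re

logger = logging.getLogger(__name__)


def match_or_skip(full_regex, filename):
    """Return the regex match for filename, logging at DEBUG level when there is none."""
    m = re.match(full_regex, filename)
    if m is None:
        # %-style arguments defer string formatting; when DEBUG is disabled
        # the call returns after a cheap level check.
        logger.debug("Skipping file that does not match the pattern: %s", filename)
    return m
```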

Collaborator Author

@mtauraso mtauraso Aug 29, 2024

I'm more worried about log spam. Adding a debug or info level log here shouldn't slow things down unless the log is being emitted to a console.

I am thinking, though, that the better solution is to output that manifest FITS table, which will list all the skipped files explicitly and won't create a potential foot-gun for people changing the logging level to info/debug.
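
A minimal sketch of the manifest idea, under stated assumptions: the function name `scan_with_manifest` and the row fields `filename` and `skipped` are hypothetical, and the rows are kept as plain dictionaries here rather than written to a FITS table, since the actual manifest schema is not shown in this thread.

```python
import re


def scan_with_manifest(filenames, full_regex):
    """Return (matched, manifest_rows); every scanned file gets one manifest row."""
    pattern = re.compile(full_regex)
    matched, manifest = [], []
    for name in filenames:
        ok = pattern.match(name) is not None
        if ok:
            matched.append(name)
        # Record skipped files explicitly instead of emitting one log line per file.
        manifest.append({"filename": name, "skipped": not ok})
    return matched, manifest
```

Because every file appears in the manifest, skipped files stay discoverable regardless of the configured logging level.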

Collaborator

I would advocate for @mtauraso's approach here. Perhaps a middle ground would be logging some summary metrics at the end along with a message saying to look in the manifest fits table for skipped files?
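
One way the summary-metrics idea could look, as a sketch only: the function name `summarize_scan` and the outcome labels (`"kept"`, `"no_match"`, `"wrong_filter"`) are assumptions for illustration. It emits a single INFO line at the end of a scan instead of one line per skipped file.

```python
import logging
from collections import Counter

logger = logging.getLogger(__name__)


def summarize_scan(outcomes):
    """Log one summary line for an iterable of per-file outcome labels.

    Returns (total, skipped), where every label other than "kept" counts as skipped.
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    skipped = total - counts.get("kept", 0)
    logger.info(
        "Scanned %d files, skipped %d (%s). See the manifest table for the skipped files.",
        total,
        skipped,
        dict(counts),
    )
    return total, skipped
```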

Collaborator

Yes, putting the info in the manifest table sounds good! I also like @drewoldag's idea of some summary metrics at the end if it's easy to implement!

Base automatically changed from issue/35/cutout-interface-cleanup to main August 29, 2024 18:12
@mtauraso mtauraso merged commit 2fb04de into main Aug 29, 2024
@mtauraso mtauraso deleted the issue/34/filter-list branch August 29, 2024 18:17
Development

Successfully merging this pull request may close these issues.

Make HSC data loader accept a list of filters
3 participants