
[ENH] v2 Work on tracking constant columns efficiently #121

Closed · wants to merge 12 commits

Conversation

@adam2392 (Collaborator) commented Sep 2, 2023

Closes: #115

Changes proposed in this pull request:

  • Enables tracking of feature columns that are constant at a given split node. Since the tree is built sequentially, the number of known constants is passed down to lower levels, which use this information to skip those columns (see the sketch after this list).
  • According to my benchmarks, this does not improve runtime much, but I am still unable to run the asv benchmarks, so there may be room for improvement.
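
For context, here is a minimal Python sketch of the bookkeeping described above, assuming a scikit-learn-style splitter (the actual implementation is Cython; the function and names below are hypothetical):

```python
import numpy as np

def split_node(X, sample_indices, features, n_known_constants):
    """Hypothetical sketch: `features` is a permutation of column indices
    whose first `n_known_constants` entries are columns already known to be
    constant for every sample reaching this node."""
    n_total_constants = n_known_constants
    # Scan only columns not already known to be constant.
    for i in range(n_known_constants, len(features)):
        col = features[i]
        values = X[sample_indices, col]
        if values.max() - values.min() <= 1e-12:  # simplified constancy check
            # Newly found constant: swap it into the constant prefix so that
            # child nodes skip it without rechecking.
            features[i], features[n_total_constants] = (
                features[n_total_constants], features[i])
            n_total_constants += 1
        # ... otherwise, evaluate candidate splits on `col` here ...
    # Children inherit the enlarged constant count.
    return n_total_constants
```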

Some ideas for the next PR:

  1. Sample each projection vector inside the while loop instead of sampling the whole matrix at once. This lets us hash each sampled projection, so we can skip duplicates and ignore any constants found in earlier stages of the split (see the sketch below).
  2. Explore inlining methods.
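
A rough sketch of idea 1, assuming an oblique-style splitter that draws sparse two-column projections (all names here are illustrative, not the actual API):

```python
import numpy as np

def sample_projections(n_features, constant_mask, max_tries, rng):
    """Yield sparse projection vectors one at a time, hashing each one so
    duplicates are skipped, and never touching known-constant columns."""
    seen = set()
    valid_cols = np.flatnonzero(~constant_mask)  # exclude constant columns
    for _ in range(max_tries):
        cols = rng.choice(valid_cols, size=2, replace=False)
        signs = rng.choice([-1.0, 1.0], size=2)
        key = (tuple(cols), tuple(signs))
        if key in seen:  # this exact projection was already evaluated
            continue
        seen.add(key)
        proj = np.zeros(n_features)
        proj[cols] = signs
        yield proj  # caller evaluates the split on X @ proj

# Usage:
# rng = np.random.default_rng(0)
# for proj in sample_projections(10, np.zeros(10, dtype=bool), 25, rng):
#     ...
```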

Before submitting

  • I've read and followed all steps in the Making a pull request
    section of the CONTRIBUTING docs.
  • I've updated or added any relevant docstrings following the syntax described in the
    Writing docstrings section of the CONTRIBUTING docs.
  • If this PR fixes a bug, I've added a test that will fail without my fix.
  • If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.

After submitting

  • All GitHub Actions jobs for my pull request have passed.

@adam2392 marked this pull request as ready for review September 11, 2023 18:50
codecov bot commented Sep 11, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: -0.01% ⚠️

Comparison is base (b582895) 87.68% compared to head (7efaee4) 87.68%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #121      +/-   ##
==========================================
- Coverage   87.68%   87.68%   -0.01%     
==========================================
  Files          28       28              
  Lines        2323     2322       -1     
==========================================
- Hits         2037     2036       -1     
  Misses        286      286              
| Files Changed | Coverage | Δ |
|---|---|---|
| sktree/tests/test_supervised_forest.py | 99.41% <100.00%> | (ø) |
| sktree/tree/tests/test_tree.py | 99.51% <100.00%> | (-0.01%) ⬇️ |
| sktree/tree/tests/test_utils.py | 98.79% <100.00%> | (ø) |


@sampan501 (Member) left a comment

Why do you try to avoid splitting on constant features in node_split?

Wouldn't it be easier to preprocess X by just removing the constant features from it, before running any of the tree-building code?

@adam2392 (Collaborator, Author)

It is actually to track constants that arise after a certain point. E.g., after you split 4 times, the remaining samples of columns 10 and 20 may be constant, so at any node underneath there is no point in including column 10 or 20 in the oblique combination. A global preprocessing pass only removes columns that are constant over the whole dataset; see the toy example below.
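
A toy illustration of this (hypothetical data): column 1 is not constant over the full dataset, so a preprocessing pass would keep it, yet it becomes constant for the samples reaching a deeper node:

```python
import numpy as np

X = np.array([[0.0, 5.0],
              [1.0, 5.0],
              [2.0, 5.0],
              [3.0, 9.0]])

node_samples = np.array([0, 1, 2])   # samples routed to some deeper node
print(np.ptp(X[:, 1]))               # 4.0 -> column 1 non-constant globally
print(np.ptp(X[node_samples, 1]))    # 0.0 -> constant at this node
```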

@PSSF23 (Member) left a comment

How do we test that this feature works properly? The predictive performance should stay much the same, so do we look for wall-time differences? I also wonder how patch oblique works with it.

@adam2392 (Collaborator, Author)

Perhaps compare the depth of the tree on a very simple setup; see the rough test sketch below.
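
Something along these lines, perhaps (a rough sketch only; the estimator import is assumed, and the exact assertion would need tuning since the random projection sampling differs between the two fits):

```python
import numpy as np
from sktree.tree import ObliqueDecisionTreeClassifier  # assumed import path

def test_constant_columns_do_not_deepen_tree():
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 5))
    y = (X[:, 0] > 0).astype(int)
    X_padded = np.hstack([X, np.ones((200, 5))])  # append constant columns

    depth = ObliqueDecisionTreeClassifier(random_state=0).fit(X, y).get_depth()
    depth_padded = (
        ObliqueDecisionTreeClassifier(random_state=0)
        .fit(X_padded, y)
        .get_depth()
    )
    # With constants tracked, padding should not inflate tree depth much.
    assert depth_padded <= depth + 2
```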

@adam2392 (Collaborator, Author)

As of now, we would not use this feature for patch oblique.

@adam2392 (Collaborator, Author)

> How do we test that this feature works properly? The predictive performance should stay much the same, so do we look for wall-time differences? I also wonder how patch oblique works with it.

Does anyone in the lab have experience benchmarking and profiling compiled code? Heh. It would be useful to determine whether this branch is faster than main via one of the benchmarks in benchmarks_nonasv/ or benchmarks/.

@PSSF23 (Member) left a comment

I used time.perf_counter() to measure wall times, as in the example here:

https://github.com/neurodata/SDTF/blob/eb2545b8cd50503723d619497059510e37b3e7ad/benchmarks/code/cifar10.py#L35-L38
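
That pattern boils down to this (the timed workload below is a placeholder for the actual fit call):

```python
import time

start = time.perf_counter()
sum(i * i for i in range(10**6))  # placeholder for the code under measurement
elapsed = time.perf_counter() - start
print(f"wall time: {elapsed:.3f} s")
```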
