Voting methods for feature ranking in efs #112

Open
wants to merge 75 commits into
base: main
Changes from 54 commits
a5d1b38
add stability selection article
bblodfon Jul 31, 2024
4cc3815
add Rcpp code for approval voting feature ranking method
bblodfon Jul 31, 2024
21ae7d7
add citation
bblodfon Jul 31, 2024
ccffa4b
extra check during init()
bblodfon Jul 31, 2024
108ddc2
update doc + use the Rcpp interface for approval voting
bblodfon Jul 31, 2024
589df2e
add templates for params in ArchiveBatchFSelect + updocs
bblodfon Jul 31, 2024
e520c77
use testthat expectations (not checkmate ones!)
bblodfon Jul 31, 2024
0ecc618
add test for newly implemented voting methods
bblodfon Jul 31, 2024
2622c96
update test for av
bblodfon Jul 31, 2024
97f21c4
fix note
bblodfon Jul 31, 2024
f84f91c
refactor AV_rcpp, add SAV_rcpp
bblodfon Aug 1, 2024
3614d93
add norm_score, and SAV R function
bblodfon Aug 1, 2024
0a1eb49
add sav, improve doc
bblodfon Aug 1, 2024
fc5d24d
fix efs test
bblodfon Aug 1, 2024
6df3bbd
update and improve test for AV
bblodfon Aug 1, 2024
fc86503
add sav test
bblodfon Aug 1, 2024
0d9eccf
Merge branch 'main' into voting_methods
bblodfon Aug 7, 2024
87d68d4
add borda score
bblodfon Aug 7, 2024
fa05f09
update tests
bblodfon Aug 7, 2024
6a89966
add seq and revseq PAV Rcpp methods
bblodfon Aug 12, 2024
5c09975
add R functions for the PAV methods
bblodfon Aug 12, 2024
103bf45
comment printing
bblodfon Aug 12, 2024
ff17d11
add tests for PAV methods
bblodfon Aug 12, 2024
b6f4b5e
add PAV methods to efs
bblodfon Aug 12, 2024
3a248cf
refactor: do not use C++ RNGs
bblodfon Aug 13, 2024
92ce0df
fix startsWith
bblodfon Aug 13, 2024
283003e
updocs
bblodfon Aug 13, 2024
567f456
fix data.table note
bblodfon Aug 13, 2024
e55ae24
add committee_size parameter, refactor borda score
bblodfon Aug 19, 2024
9a37e60
add large data test for seq pav
bblodfon Aug 19, 2024
58ab928
refactor C++ code, add optimized PAV
bblodfon Aug 21, 2024
61c0907
remove revseq-PAV method, use optimized seqPAV
bblodfon Aug 21, 2024
8654a38
update tests
bblodfon Aug 21, 2024
47e3dcf
remove suboptimal seqPAV function
bblodfon Aug 23, 2024
b369c6e
shuffle candidates outside Rcpp functions (same tie-breaking)
bblodfon Aug 23, 2024
6b7fb03
optimize Phragmen a bit => do not randomly select the candidate with …
bblodfon Aug 23, 2024
60065f9
add phragmen's rule in efs
bblodfon Aug 23, 2024
8ffa44f
correct borda score + use phragmens rule
bblodfon Aug 23, 2024
852ff35
add tests for Phragmen's rule
bblodfon Aug 23, 2024
5623812
correct weighted Phragmen's rule
bblodfon Sep 18, 2024
7e3be3e
add specific test for phragmen's rule
bblodfon Sep 18, 2024
25387c4
Merge branch 'main' into voting_methods
bblodfon Sep 19, 2024
1eef6c6
run document()
bblodfon Sep 19, 2024
f2ccbda
show data.table result after using ':='
bblodfon Oct 17, 2024
bea5e39
add n_resamples field + nicer obj print
bblodfon Oct 17, 2024
2d21fc7
cover edge case (eg lasso resulted in no features getting selected)
bblodfon Oct 24, 2024
ad9fd2e
Merge branch 'main' into voting_methods
bblodfon Oct 25, 2024
7f3ab3b
updocs
bblodfon Oct 25, 2024
4137404
small styling fix
bblodfon Oct 25, 2024
d151303
add Stabl ref
bblodfon Oct 31, 2024
83529b6
more descriptive name
bblodfon Oct 31, 2024
49bb097
add embedded ensemble feature selection
bblodfon Oct 31, 2024
6f3923f
remove print()
bblodfon Nov 1, 2024
123624e
add TOCHECK comment on benchmark design
bblodfon Nov 5, 2024
0581cdc
use internal valid task
be-marc Nov 11, 2024
14acd73
simplify
be-marc Nov 11, 2024
81b475d
...
be-marc Nov 11, 2024
79747ad
store_models = FALSE
be-marc Nov 11, 2024
331f231
...
be-marc Nov 11, 2024
081acc8
separate the use of inner_measure and measure used in the test sets
bblodfon Nov 18, 2024
efc0155
updocs
bblodfon Nov 18, 2024
0e2f93f
update tests
bblodfon Nov 18, 2024
3bca203
Merge branch 'main' into voting_methods
bblodfon Nov 18, 2024
d457221
refactor: expect_vector => expect_numeric
bblodfon Nov 18, 2024
9cb56b1
fix partial arg match
bblodfon Nov 18, 2024
cc36179
fix example
bblodfon Nov 18, 2024
816376a
use fastVoteR for feature ranking
bblodfon Nov 23, 2024
3dae249
pass named list to callback parameter
be-marc Nov 25, 2024
fd5afbc
skip test if fastVoteR is not available
bblodfon Nov 25, 2024
c937024
refactor: better handling of inner measure
bblodfon Nov 26, 2024
8e506c8
add tests for embedded_ensemble_fselect()
bblodfon Nov 26, 2024
3bd1772
update NEWs
bblodfon Nov 26, 2024
9e05dca
add active_measure field
bblodfon Nov 26, 2024
832bd7f
remove Remotes as fastVoteR is now on CRAN :)
bblodfon Nov 27, 2024
8c0d73f
refine doc
bblodfon Nov 29, 2024
6 changes: 6 additions & 0 deletions DESCRIPTION
@@ -34,6 +34,7 @@ Imports:
mlr3misc (>= 0.15.1),
paradox (>= 1.0.0),
R6,
Rcpp,
stabm
Suggests:
e1071,
@@ -73,9 +74,11 @@ Collate:
'FSelectorBatchShadowVariableSearch.R'
'ObjectiveFSelect.R'
'ObjectiveFSelectBatch.R'
'RcppExports.R'
'assertions.R'
'auto_fselector.R'
'bibentries.R'
'embedded_ensemble_fselect.R'
'ensemble_fselect.R'
'extract_inner_fselect_archives.R'
'extract_inner_fselect_results.R'
@@ -85,4 +88,7 @@ Collate:
'mlr_callbacks.R'
'reexports.R'
'sugar.R'
'voting_methods.R'
'zzz.R'
LinkingTo:
Rcpp
3 changes: 3 additions & 0 deletions NAMESPACE
@@ -36,6 +36,7 @@ export(auto_fselector)
export(callback_batch_fselect)
export(clbk)
export(clbks)
export(embedded_ensemble_fselect)
export(ensemble_fselect)
export(extract_inner_fselect_archives)
export(extract_inner_fselect_results)
@@ -56,6 +57,7 @@ import(mlr3)
import(mlr3misc)
import(paradox)
importFrom(R6,R6Class)
importFrom(Rcpp,sourceCpp)
importFrom(bbotk,mlr_terminators)
importFrom(bbotk,trm)
importFrom(bbotk,trms)
@@ -67,3 +69,4 @@ importFrom(utils,bibentry)
importFrom(utils,combn)
importFrom(utils,head)
importFrom(utils,packageVersion)
useDynLib(mlr3fselect, .registration = TRUE)
102 changes: 82 additions & 20 deletions R/EnsembleFSResult.R
@@ -16,7 +16,7 @@
#' Whether to add the learner, task and resampling information from the benchmark result.
#'
#' @references
#' `r format_bib("das1999")`
#' `r format_bib("das1999", "meinshausen2010")`
#'
#' @export
#' @examples
@@ -82,6 +82,10 @@ EnsembleFSResult = R6Class("EnsembleFSResult",
private$.result = result
private$.features = assert_character(features, any.missing = FALSE, null.ok = FALSE)
private$.minimize = assert_logical(minimize, null.ok = FALSE)

# check that all feature sets are subsets of the task features
assert_subset(unlist(result$features), private$.features)

self$benchmark_result = if (!is.null(benchmark_result)) assert_benchmark_result(benchmark_result)

self$man = "mlr3fselect::ensemble_fs_result"
@@ -99,7 +103,8 @@ EnsembleFSResult = R6Class("EnsembleFSResult",
#'
#' @param ... (ignored).
print = function(...) {
catf(format(self))
catf("%s with %s learners and %s initial resamplings",
format(self), self$n_learners, self$n_resamples)
print(private$.result[, c("resampling_iteration", "learner_id", "n_features"), with = FALSE])
},

@@ -113,37 +118,85 @@ EnsembleFSResult = R6Class("EnsembleFSResult",
#' Calculates the feature ranking.
#'
#' @details
#' The feature ranking process is built on the following framework: models act as voters, features act as candidates, and voters select certain candidates (features).
#' The feature ranking process is built on the following framework: models act as *voters*, features act as *candidates*, and voters select certain candidates (features).
#' The primary objective is to compile these selections into a consensus ranked list of features, effectively forming a committee.
#' Currently, only `"approval_voting"` method is supported, which selects the candidates/features that have the highest approval score or selection frequency, i.e. appear the most often.
#'
#' For every feature a score is calculated, which depends on the `"method"` argument.
#' The higher the score, the higher the rank of the feature.
#' Every method has a `"*_weighted"` variant that outputs a weighted score.
#' The weight of each voter/model is its performance score (or the inverse of the score if the measure is minimized).
#' The unweighted methods give every voter the same weight (equal to 1).
#'
#' Note that some methods output a feature ranking instead of a score per feature.
#' Therefore we also calculate **Borda's score**:
#' \eqn{s_{borda} = (p-i)/(p-1)}, where \eqn{p} is the total number of features, and \eqn{i} is the rank of the feature (\eqn{i = 1} for the best feature).
#' So the best feature gets a Borda score of \eqn{1} and the worst-ranked feature a Borda score of \eqn{0}.
#' This score is method-agnostic, i.e. it can be used to compare the feature rankings across different methods.
#'
#' The input candidates/features are randomly shuffled so that the same tie-breaking mechanism is enforced for all available methods.
#' Users should set the same random seed before each call, both for reproducibility and for a consistent comparison between the different feature ranking methods.
#'
#' The following methods are currently supported:
#'
#' - `"av"|"av_weighted"` (approval voting) selects the candidates that have the highest approval score, i.e. the features that appear the most often.
#' This is the default feature ranking method.
#' - `"sav"|"sav_weighted"` (satisfaction approval voting) selects the candidates that have the highest satisfaction score, where each voter's contribution is divided by the size of that voter's approval set.
#' Voters who approve more candidates therefore contribute a smaller score to each individual approved candidate.
#' - `"seq_pav"|"seq_pav_weighted"` (sequential proportional approval voting) sequentially builds a committee by iteratively selecting the candidate that maximizes the PAV score when added, ensuring proportional representation.
#' The **PAV score** (Proportional Approval Voting score) is a metric that calculates the weighted sum of harmonic numbers corresponding to the number of elected candidates supported by each voter, reflecting the overall satisfaction of voters in a committee selection process.
#' - `"seq_phragmen"|"seq_phragmen_weighted"` (sequential Phragmen's rule) distributes "loads" equally among voters for each candidate added to the committee.
#' The rule iteratively selects the candidate that results in the smallest increase in voter load.
#' This approach is suitable for scenarios where a balanced representation is desired, as it seeks to evenly distribute the "burden" of representation among all voters.
#'
#' @param method (`character(1)`)\cr
#' The method to calculate the feature ranking.
#' @param committee_size (`integer(1)`)\cr
#' Number of top selected features in the output ranking.
#' This parameter can be used to speed up methods that build a committee sequentially (`"seq_pav"`, `"seq_phragmen"`), by requesting only the top N selected candidates/features and not the complete feature ranking.
#'
#' @return A [data.table::data.table] listing all the features, ordered by decreasing scores (depends on the `"method"`).
#' An extra column `"norm_score"` is produced for methods for which the original scores (i.e. approval counts in the case of approval voting) can be normalized and interpreted as **selection probabilities**, see Meinshausen and Buhlmann (2010).
#' The `"borda_score"` column is always included to incorporate feature ranking methods that don't output per-feature scores but only rankings.
#'
#' @return A [data.table::data.table] listing all the features, ordered by decreasing inclusion probability scores (depending on the `method`)
feature_ranking = function(method = "approval_voting") {
assert_choice(method, choices = "approval_voting")
feature_ranking = function(method = "av", committee_size = NULL) {
assert_choice(method, choices = c("av", "av_weighted", "sav", "sav_weighted",
"seq_pav", "seq_pav_weighted", "seq_phragmen",
"seq_phragmen_weighted"))
assert_int(committee_size, lower = 1, null.ok = TRUE)

# cached results
if (!is.null(private$.feature_ranking[[method]])) {
return(private$.feature_ranking[[method]])
}

count_tbl = sort(table(unlist(private$.result$features)), decreasing = TRUE)
features_selected = names(count_tbl)
features_not_selected = setdiff(private$.features, features_selected)
# candidates => all features, voters => list of selected (best) features sets
candidates = private$.features
voters = private$.result$features

res_fs = data.table(
feature = features_selected,
inclusion_probability = as.vector(count_tbl) / nrow(private$.result)
)

res_fns = data.table(
feature = features_not_selected,
inclusion_probability = 0
)
# calculate weights
use_weights = grepl(pattern = "weighted", x = method)
if (use_weights) {
# voter weights are the (inverse) scores
scores = private$.result[, get(private$.measure_id)]
weights = if (private$.minimize) 1 / scores else scores
} else {
# all voters are equal
weights = rep(1, length(voters))
}

res = rbindlist(list(res_fs, res_fns))
# shuffle candidates (force same tie-breaking between methods)
candidates = sample(candidates)

# calculate scores
if (startsWith(method, "av")) {
res = approval_voting(voters, candidates, weights)
} else if (startsWith(method, "sav")) {
res = satisfaction_approval_voting(voters, candidates, weights)
} else if (startsWith(method, "seq_pav")) {
res = seq_proportional_approval_voting(voters, candidates, weights, committee_size)
} else if (startsWith(method, "seq_phragmen")) {
res = seq_phragmen_rule(voters, candidates, weights, committee_size)
}

private$.feature_ranking[[method]] = res
private$.feature_ranking[[method]]
@@ -261,6 +314,8 @@ EnsembleFSResult = R6Class("EnsembleFSResult",
# Transform the data (x => 1/x)
n_features_inv = NULL
pf[, n_features_inv := 1 / n_features]
# remove edge cases where no features were selected
pf = pf[n_features > 0]

# Fit the linear model
form = mlr3misc::formulate(lhs = measure_id, rhs = "n_features_inv")
@@ -351,6 +406,13 @@ EnsembleFSResult = R6Class("EnsembleFSResult",
measure = function(rhs) {
assert_ro_binding(rhs)
private$.measure_id
},

#' @field n_resamples (`integer(1)`)\cr
#' Returns the number of times the task was initially resampled in the ensemble feature selection.
n_resamples = function(rhs) {
assert_ro_binding(rhs)
uniqueN(self$result$resampling_iteration)
}
),

Expand Down
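The scoring rules described in the `feature_ranking()` documentation above can be illustrated in plain R. This is a minimal sketch under stated assumptions, not the package's implementation (the PR delegates scoring to Rcpp and, in later commits, to fastVoteR). The helper names `av_scores()`, `sav_scores()` and `borda_scores()` and the toy voters are hypothetical and exist only for this example.

```r
# Toy input (hypothetical): 3 voters (models), each approving a subset
# of 4 candidates (features).
candidates = c("f1", "f2", "f3", "f4")
voters = list(c("f1", "f2"), c("f1"), c("f1", "f2", "f3"))
weights = c(1, 1, 1) # unweighted case: every voter counts equally

# Approval voting: sum the weights of the voters that approve each candidate.
av_scores = function(voters, candidates, weights) {
  vapply(candidates, function(cand) {
    approves = vapply(voters, function(v) cand %in% v, logical(1))
    sum(weights[approves])
  }, numeric(1))
}

# Satisfaction approval voting: a voter's weight is split evenly across
# the candidates in that voter's approval set.
sav_scores = function(voters, candidates, weights) {
  vapply(candidates, function(cand) {
    contrib = vapply(seq_along(voters), function(i) {
      v = voters[[i]]
      if (cand %in% v) weights[i] / length(v) else 0
    }, numeric(1))
    sum(contrib)
  }, numeric(1))
}

# Borda score from a ranking (rank 1 = best): (p - i) / (p - 1).
borda_scores = function(ranks) {
  p = length(ranks)
  if (p == 1) return(1)
  (p - ranks) / (p - 1)
}

av = av_scores(voters, candidates, weights)
sav = sav_scores(voters, candidates, weights)

# Normalized approval score: fraction of the total voter weight per feature.
norm_av = av / sum(weights)

# Rank by decreasing approval score, then convert ranks to Borda scores.
ranking = rank(-av, ties.method = "first")
borda = borda_scores(ranking)
```

With equal weights, the normalized approval score is the fraction of voters that selected a feature, which is the quantity the documentation interprets as a selection probability. In the weighted variants, `weights` would instead hold the performance scores of the models (or their inverses for minimized measures).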
19 changes: 19 additions & 0 deletions R/RcppExports.R
@@ -0,0 +1,19 @@
# Generated by using Rcpp::compileAttributes() -> do not edit by hand
# Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393

AV_rcpp <- function(voters, candidates, weights) {
.Call(`_mlr3fselect_AV_rcpp`, voters, candidates, weights)
}

seq_PAV_rcpp <- function(voters, candidates, weights, committee_size) {
.Call(`_mlr3fselect_seq_PAV_rcpp`, voters, candidates, weights, committee_size)
}

seq_Phragmen_rcpp <- function(voters, candidates, weights, committee_size) {
.Call(`_mlr3fselect_seq_Phragmen_rcpp`, voters, candidates, weights, committee_size)
}

SAV_rcpp <- function(voters, candidates, weights) {
.Call(`_mlr3fselect_SAV_rcpp`, voters, candidates, weights)
}
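
The wrappers above only forward their arguments to compiled code, so the algorithm itself is not visible in this file. As an illustration of what a sequential PAV rule does, here is a minimal pure-R sketch of the standard greedy procedure. It is an assumption about the general technique, not a transcription of `seq_PAV_rcpp`; the function name `seq_pav_sketch()` and the toy data are hypothetical.

```r
# Greedy sequential PAV: in every round, add the candidate with the largest
# marginal gain in PAV score. A voter who already has k approved candidates
# in the committee contributes weight / (k + 1) to each further approved one.
seq_pav_sketch = function(voters, candidates, weights,
                          committee_size = length(candidates)) {
  committee = character(0)
  # number of already elected candidates approved by each voter
  n_elected = integer(length(voters))
  remaining = candidates

  while (length(committee) < committee_size && length(remaining) > 0) {
    gains = vapply(remaining, function(cand) {
      approves = vapply(voters, function(v) cand %in% v, logical(1))
      sum(weights[approves] / (n_elected[approves] + 1))
    }, numeric(1))

    # ties are broken by candidate order (first maximum wins)
    winner = remaining[which.max(gains)]
    committee = c(committee, winner)

    approves_winner = vapply(voters, function(v) winner %in% v, logical(1))
    n_elected[approves_winner] = n_elected[approves_winner] + 1L
    remaining = setdiff(remaining, winner)
  }

  committee
}

# Toy input (hypothetical): three voters approve {a, b}, two approve {c}.
voters = list(c("a", "b"), c("a", "b"), c("a", "b"), c("c"), c("c"))
candidates = c("a", "b", "c")
weights = rep(1, length(voters))

full_ranking = seq_pav_sketch(voters, candidates, weights)
top_two = seq_pav_sketch(voters, candidates, weights, committee_size = 2)
```

In this toy example plain approval counts would rank `a` and `b` ahead of `c`, whereas the sequential PAV sketch elects `c` second, because the voters who approve `b` are already represented by `a`. Since ties go to the first candidate in the input order, shuffling the candidates before the call changes how ties are broken, and stopping at `committee_size` skips the remaining rounds.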

41 changes: 29 additions & 12 deletions R/bibentries.R
@@ -9,7 +9,6 @@ bibentries = c(
title = "ecr 2.0",
booktitle = "Proceedings of the Genetic and Evolutionary Computation Conference Companion"
),

bergstra_2012 = bibentry("article",
title = "Random Search for Hyper-Parameter Optimization",
author = "James Bergstra and Yoshua Bengio",
@@ -20,8 +19,7 @@ bibentries = c(
pages = "281--305",
url = "https://jmlr.csail.mit.edu/papers/v13/bergstra12a.html"
),

thomas2017 = bibentry("article",
thomas2017 = bibentry("article",
doi = "10.1155/2017/1421409",
year = "2017",
publisher = "Hindawi Limited",
@@ -31,8 +29,7 @@ bibentries = c(
title = "Probing for Sparse and Fast Variable Selection with Model-Based Boosting",
journal = "Computational and Mathematical Methods in Medicine"
),

wu2007 = bibentry("article",
wu2007 = bibentry("article",
doi = "10.1198/016214506000000843",
year = "2007",
month = "3",
@@ -44,8 +41,7 @@ bibentries = c(
title = "Controlling Variable Selection by the Addition of Pseudovariables",
journal = "Journal of the American Statistical Association"
),

guyon2002 = bibentry("article",
guyon2002 = bibentry("article",
title = "Gene Selection for Cancer Classification using Support Vector Machines",
volume = "46",
issn = "1573-0565",
@@ -56,7 +52,6 @@ bibentries = c(
author = "Isabelle Guyon and Jason Weston and Stephen Barnhill and Vladimir Vapnik",
year = "2002"
),

kuhn2013 = bibentry("Inbook",
author = "Kuhn, Max and Johnson, Kjell",
chapter = "Over-Fitting and Model Tuning",
@@ -67,7 +62,6 @@ bibentries = c(
pages = "61--92",
isbn = "978-1-4614-6849-3"
),

saeys2008 = bibentry("article",
author = "Saeys, Yvan and Abeel, Thomas and Van De Peer, Yves",
doi = "10.1007/978-3-540-87481-2_21",
@@ -79,7 +73,6 @@ bibentries = c(
volume = "5212 LNAI",
year = "2008"
),

abeel2010 = bibentry("article",
author = "Abeel, Thomas and Helleputte, Thibault and Van de Peer, Yves and Dupont, Pierre and Saeys, Yvan",
doi = "10.1093/BIOINFORMATICS/BTP630",
@@ -92,7 +85,6 @@ bibentries = c(
volume = "26",
year = "2010"
),

pes2020 = bibentry("article",
author = "Pes, Barbara",
doi = "10.1007/s00521-019-04082-3",
@@ -106,7 +98,6 @@ bibentries = c(
volume = "32",
year = "2020"
),

das1999 = bibentry("article",
author = "Das, I",
issn = "09344373",
@@ -118,5 +109,31 @@ bibentries = c(
title = "On characterizing the 'knee' of the Pareto curve based on normal-boundary intersection",
volume = "18",
year = "1999"
),
meinshausen2010 = bibentry("article",
author = "Meinshausen, Nicolai and Buhlmann, Peter",
doi = "10.1111/J.1467-9868.2010.00740.X",
eprint = "0809.2932",
issn = "1369-7412",
journal = "Journal of the Royal Statistical Society Series B: Statistical Methodology",
month = "sep",
number = "4",
pages = "417--473",
publisher = "Oxford Academic",
title = "Stability Selection",
volume = "72",
year = "2010"
),
hedou2024 = bibentry("article",
author = "Hedou, Julien and Maric, Ivana and Bellan, Gregoire and Einhaus, Jakob and Gaudilliere, Dyani K. and Ladant, Francois Xavier and Verdonk, Franck and Stelzer, Ina A. and Feyaerts, Dorien and Tsai, Amy S. and Ganio, Edward A. and Sabayev, Maximilian and Gillard, Joshua and Amar, Jonas and Cambriel, Amelie and Oskotsky, Tomiko T. and Roldan, Alennie and Golob, Jonathan L. and Sirota, Marina and Bonham, Thomas A. and Sato, Masaki and Diop, Maigane and Durand, Xavier and Angst, Martin S. and Stevenson, David K. and Aghaeepour, Nima and Montanari, Andrea and Gaudilliere, Brice", #nolint
doi = "10.1038/s41587-023-02033-x",
issn = "1546-1696",
journal = "Nature Biotechnology",
month = "jan",
pages = "1--13",
publisher = "Nature Publishing Group",
title = "Discovery of sparse, reliable omic biomarkers with Stabl",
url = "https://www.nature.com/articles/s41587-023-02033-x",
year = "2024"
)
)