Problem Description
The CategoricalCAP metric is currently available in SDMetrics. It is based on a well-researched and cited methodology for measuring the risk of data disclosure, and it acts as a strong measure of privacy for particular sensitive columns.
However, a number of issues make this metric hard to use and interpret. In #675, we are addressing many of the quality-related issues via a new metric called DisclosureProtection. However, performance issues still remain.
In this issue, we will address the performance issues in DisclosureProtection by creating a new metric called DisclosureProtectionEstimate.
Expected behavior
The DisclosureProtectionEstimate metric should wrap around the DisclosureProtection metric. It should estimate the original metric's score by subsampling the data, iterating over many subsamples, and returning the average score.
Parameters: All the same parameters as DisclosureProtection, plus the following (see the usage sketch after this list):
num_rows_subsample: An integer describing the number of rows to subsample from each of the real and synthetic datasets. This subsampling occurs with replacement during every iteration (that is to say, every iteration should start subsampling from the same original dataset).
(default) 1000: Subsample 1000 rows in both the real and synthetic data
<int>: Subsample the number of rows provided
None: Do not subsample
num_iterations: The number of iterations to run on different subsamples. The final score will be the average across iterations.
(default) 10: Do 10 iterations of the subsamples.
<int>: Perform the number of iterations provided
verbose: A boolean describing whether to show progress.
(default) True: Print the steps being run and the progress bar
False: Do not print anything out
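For illustration, here is a hypothetical usage sketch. The module path, the compute signature, and the known/sensitive column parameters are assumptions modeled on DisclosureProtection, not a confirmed API:

```python
# Hypothetical usage sketch; assumes DisclosureProtectionEstimate will live
# alongside DisclosureProtection and accept the parameters described above.
from sdmetrics.single_table import DisclosureProtectionEstimate

score = DisclosureProtectionEstimate.compute(
    real_data=real_data,                      # pandas.DataFrame of real rows
    synthetic_data=synthetic_data,            # pandas.DataFrame of synthetic rows
    known_column_names=['age', 'zip_code'],   # columns an attacker is assumed to know
    sensitive_column_names=['diagnosis'],     # columns whose disclosure we measure
    num_rows_subsample=1000,  # rows drawn with replacement per iteration
    num_iterations=10,        # number of subsample iterations to average
    verbose=True,             # print steps and show a progress bar
)
```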
Computation:
The baseline_protection score computation is exactly the same as in DisclosureProtection, and it only needs to be computed once.
The cap_protection score will instead be replaced by a cap_protection_estimate, computed by running through the desired number of iterations and averaging the results.
The final score will then be `score = min(cap_protection_estimate / baseline_protection, 1)`
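A minimal sketch of this computation follows; the helpers `_compute_baseline_protection` and `_compute_cap_protection` are hypothetical stand-ins for the corresponding DisclosureProtection internals:

```python
import numpy as np

def disclosure_protection_estimate(real_data, synthetic_data,
                                   num_rows_subsample=1000, num_iterations=10):
    # The baseline only needs to be computed once, on the full data.
    baseline_protection = _compute_baseline_protection(real_data, synthetic_data)

    cap_scores = []
    for _ in range(num_iterations):
        if num_rows_subsample is None:
            real_sub, synthetic_sub = real_data, synthetic_data  # no subsampling
        else:
            # Subsample with replacement, always starting from the original data.
            real_sub = real_data.sample(n=num_rows_subsample, replace=True)
            synthetic_sub = synthetic_data.sample(n=num_rows_subsample, replace=True)

        cap_scores.append(_compute_cap_protection(real_sub, synthetic_sub))

    cap_protection_estimate = np.mean(cap_scores)
    return min(cap_protection_estimate / baseline_protection, 1)
```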
Verbosity/Progress Bar:
If verbose is turned on, show a progress bar that increments once per iteration. The progress bar should be updated with the overall score so far (the running average), rounded to 4 decimal places; see the sketch below.
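Extending the loop from the sketch above, the progress bar could be rendered with tqdm; the exact description format here is an assumption, not a prescribed one:

```python
from tqdm import tqdm

progress_bar = tqdm(range(num_iterations), disable=not verbose)
for _ in progress_bar:
    # ... subsample and compute the per-iteration cap score as above ...
    cap_scores.append(_compute_cap_protection(real_sub, synthetic_sub))
    running_score = min(np.mean(cap_scores) / baseline_protection, 1)
    # Display the running average score, rounded to 4 decimal places.
    progress_bar.set_description(f'DisclosureProtectionEstimate: {running_score:.4f}')
```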