Add DisclosureProtectionEstimate metric #676

npatki opened this issue Nov 22, 2024 · 0 comments
Labels: feature request

npatki commented Nov 22, 2024

Problem Description

The CategoricalCAP metric is currently available in SDMetrics. It is based on a well-researched and cited methodology for measuring the risk of data disclosure. It acts as a great measure of privacy for particular sensitive columns.

However, there are a number of issues that make this metric hard to use and interpret. In #675, we are addressing many of the quality-related issues via a new metric called DisclosureProtection. However, performance issues still remain.

In this issue, we will address performance issues in DisclosureProtection by creating a new metric called DisclosureProtectionEstimate.

Expected behavior

The DisclosureProtectionEstimate metric should wrap around the DisclosureProtection metric. It should estimate the original metric's score by subsampling the data, iterating over many subsamples, and returning the average score.

Parameters: All the same parameters as DisclosureProtection, plus the following (a usage sketch is shown after this list):

  • num_rows_subsample: An integer describing the number of rows to subsample from each of the real and synthetic datasets. Subsampling occurs with replacement during every iteration (that is, every iteration subsamples from the same original dataset).
    • (default) 1000: Subsample 1000 rows from both the real and synthetic data
    • <int>: Subsample the number of rows provided
    • None: Do not subsample
  • num_iterations: The number of subsampling iterations to run. The final score is the average across iterations.
    • (default) 10: Run 10 iterations of subsampling
    • <int>: Run the number of iterations provided
  • verbose: A boolean describing whether to show progress
    • (default) True: Print the steps being run and show a progress bar
    • False: Do not print anything
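
For illustration, a call combining the existing DisclosureProtection parameters with the new ones might look as follows. This is only a sketch of the proposed interface; real_table and synthetic_table are assumed to be pandas DataFrames loaded elsewhere.

from sdmetrics.single_table import DisclosureProtectionEstimate

DisclosureProtectionEstimate.compute(
    real_data=real_table,
    synthetic_data=synthetic_table,
    known_columns=['age', 'gender'],
    sensitive_column=['political_affiliation'],
    num_rows_subsample=1000,  # rows drawn (with replacement) per iteration
    num_iterations=10,        # number of subsampled iterations to average over
    verbose=True,             # print steps and show a progress bar
)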

Computation:

  • The baseline_protection score computation is exactly the same as in DisclosureProtection, and it only needs to be computed once.
  • The cap_protection score will instead be replaced by a cap_protection_estimate, computed by running the desired number of iterations and averaging the results.
  • The final score will then be score = min(cap_protection_estimate/baseline_protection, 1), as in the sketch below.
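
A minimal sketch of this computation, assuming pandas DataFrames and hypothetical helpers _compute_baseline_protection and _compute_cap_protection standing in for the existing DisclosureProtection internals:

import numpy as np

def _estimate_score(real_data, synthetic_data, num_rows_subsample=1000, num_iterations=10):
    # The baseline only needs to be computed once, on the full data.
    baseline_protection = _compute_baseline_protection(real_data, synthetic_data)

    cap_scores = []
    for _ in range(num_iterations):
        # Every iteration subsamples (with replacement) from the same original data.
        real_sample = real_data.sample(n=num_rows_subsample, replace=True)
        synthetic_sample = synthetic_data.sample(n=num_rows_subsample, replace=True)
        cap_scores.append(_compute_cap_protection(real_sample, synthetic_sample))

    cap_protection_estimate = np.mean(cap_scores)
    # (Edge cases such as baseline_protection == 0 are ignored in this sketch.)
    return min(cap_protection_estimate / baseline_protection, 1)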

Compute breakdown:

from sdmetrics.single_table import DisclosureProtectionEstimate

DisclosureProtectionEstimate.compute_breakdown(
    real_data=real_table,
    synthetic_data=synthetic_table,
    known_columns=['age', 'gender'],
    sensitive_column=['political_affiliation'],
    columns_to_discretize=['age'],
)

{
    'score': 0.912731436159061,
    'cap_protection_estimate': 0.782341231,
    'baseline_protection': 0.85714285715
}
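
For reference, the score in this example follows the formula above: min(0.782341231 / 0.85714285715, 1) ≈ 0.9127.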

Verbosity/Progress Bar:
If verbose is turned on, the metric should show a progress bar that increments once per iteration. The progress bar should display the overall score so far (the running average), rounded to 4 decimal places.

>>> DisclosureProtectionEstimate.compute(real_data, synthetic_data, verbose=True)
Estimating Disclosure Protection (Score=0.8744): 100%|██████████| 10/10 [00:34<00:00,  3.42s/it]
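
A minimal sketch of that behavior, assuming tqdm is used for the progress bar and a hypothetical per-iteration helper _iteration_score:

from tqdm import tqdm

scores = []
progress_bar = tqdm(range(num_iterations), desc='Estimating Disclosure Protection')
for _ in progress_bar:
    scores.append(_iteration_score(real_data, synthetic_data))
    running_average = sum(scores) / len(scores)
    # Show the running average score, rounded to 4 decimal places.
    progress_bar.set_description(f'Estimating Disclosure Protection (Score={round(running_average, 4)})')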