Add DisclosureProtectionEstimate metric #676

npatki opened this issue Nov 22, 2024 · 0 comments
Labels: feature request

npatki commented Nov 22, 2024

Problem Description

The CategoricalCAP metric is currently available in SDMetrics. It is based on a well-researched and cited methodology for measuring the risk of data disclosure. It acts as a great measure of privacy for particular sensitive columns.

However, there are a number of issues that make this metric hard to use and interpret. In #675, we are addressing many of the quality-related issues via a new metric called DisclosureProtection. However, performance issues still remain.

In this issue, we will address performance issues in DisclosureProtection by creating a new metric called DisclosureProtectionEstimate.

Expected behavior

The DisclosureProtectionEstimate metric should wrap around the DisclosureProtection metric. It should estimate the original metric's score by subsampling the data, iterating over many subsamples, and returning the average score.

Parameters: All the same parameters as DisclosureProtection, plus the following (a usage sketch is shown after this list):

  • num_rows_subsample: An integer describing the number of rows to subsample from each of the real and synthetic datasets. Subsampling occurs with replacement during every iteration (that is, every iteration subsamples from the same original dataset).
    • (default) 1000: Subsample 1000 rows from both the real and synthetic data
    • <int>: Subsample the number of rows provided
    • None: Do not subsample
  • num_iterations: The number of subsampling iterations to run. The final score is the average across iterations.
    • (default) 10: Run 10 iterations of subsampling
    • <int>: Run the number of iterations provided
  • verbose: A boolean describing whether to show progress
    • (default) True: Print the steps being run and show a progress bar
    • False: Do not print anything
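
For illustration, a call combining the existing DisclosureProtection parameters with the new ones might look as follows. This is only a sketch of the proposed interface; real_table and synthetic_table are assumed to be pandas DataFrames loaded elsewhere.

from sdmetrics.single_table import DisclosureProtectionEstimate

DisclosureProtectionEstimate.compute(
    real_data=real_table,
    synthetic_data=synthetic_table,
    known_columns=['age', 'gender'],
    sensitive_column=['political_affiliation'],
    num_rows_subsample=1000,  # rows drawn (with replacement) per iteration
    num_iterations=10,        # number of subsampled iterations to average over
    verbose=True,             # print steps and show a progress bar
)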

Computation:

  • The baseline_protection score computation is exactly the same as in DisclosureProtection, and it only needs to be computed once.
  • The cap_protection score will instead be replaced by a cap_protection_estimate, computed by running the desired number of iterations and averaging the results.
  • The final score will then be score = min(cap_protection_estimate/baseline_protection, 1), as in the sketch below.
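
A minimal sketch of this computation, assuming pandas DataFrames and hypothetical helpers _compute_baseline_protection and _compute_cap_protection standing in for the existing DisclosureProtection internals:

import numpy as np

def _estimate_score(real_data, synthetic_data, num_rows_subsample=1000, num_iterations=10):
    # The baseline only needs to be computed once, on the full data.
    baseline_protection = _compute_baseline_protection(real_data, synthetic_data)

    cap_scores = []
    for _ in range(num_iterations):
        # Every iteration subsamples (with replacement) from the same original data.
        real_sample = real_data.sample(n=num_rows_subsample, replace=True)
        synthetic_sample = synthetic_data.sample(n=num_rows_subsample, replace=True)
        cap_scores.append(_compute_cap_protection(real_sample, synthetic_sample))

    cap_protection_estimate = np.mean(cap_scores)
    # (Edge cases such as baseline_protection == 0 are ignored in this sketch.)
    return min(cap_protection_estimate / baseline_protection, 1)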

Compute breakdown:

from sdmetrics.single_table import DisclosureProtectionEstimate

DisclosureProtectionEstimate.compute_breakdown(
    real_data=real_table,
    synthetic_data=synthetic_table,
    known_columns=['age', 'gender'],
    sensitive_column=['political_affiliation'],
    columns_to_discretize=['age'],
)

{
    'score': 0.912731436159061,
    'cap_protection_estimate': 0.782341231,
    'baseline_protection': 0.85714285715
}
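
For reference, the score in this example follows the formula above: min(0.782341231 / 0.85714285715, 1) ≈ 0.9127.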

Verbosity/Progress Bar:
If verbose is turned on, the metric should show a progress bar that increments once per iteration. The progress bar should display the overall score so far (the running average), rounded to 4 decimal places.

>>> DisclosureProtectionEstimate.compute(real_data, synthetic_data, verbose=True)
Estimating Disclosure Protection (Score=0.8744): 100%|██████████| 10/10 [00:34<00:00,  3.42s/it]
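
A minimal sketch of that behavior, assuming tqdm is used for the progress bar and a hypothetical per-iteration helper _iteration_score:

from tqdm import tqdm

scores = []
progress_bar = tqdm(range(num_iterations), desc='Estimating Disclosure Protection')
for _ in progress_bar:
    scores.append(_iteration_score(real_data, synthetic_data))
    running_average = sum(scores) / len(scores)
    # Show the running average score, rounded to 4 decimal places.
    progress_bar.set_description(f'Estimating Disclosure Protection (Score={round(running_average, 4)})')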