ClusterBasedNormalizer vs GaussianNormalizer vs PowerTransformer #613

candalfigomoro · 2023-02-13T11:11:07Z

When using CTGAN, data is normalized using ClusterBasedNormalizer.

In RDT, GaussianNormalizer is also implemented.

What are the advantages of ClusterBasedNormalizer and GaussianNormalizer compared to using sklearn's PowerTransformer (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html) with the Yeo-Johnson method? Couldn't a power transform be used instead (which would perhaps be faster than ClusterBasedNormalizer)?

Thank you
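For reference, the power-transform alternative raised in the question can be sketched roughly as follows. This is only an illustration on a made-up log-normal column (the data, seed and variable names are assumptions, not anything taken from CTGAN or RDT); it uses scikit-learn's `PowerTransformer` with `method="yeo-johnson"` and checks that the transform can be inverted.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Hypothetical skewed numeric column; sklearn expects shape (n_samples, n_features)
column = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

# Yeo-Johnson power transform, followed by zero-mean / unit-variance scaling
pt = PowerTransformer(method="yeo-johnson", standardize=True)
transformed = pt.fit_transform(column)

# Map the normalized values back to the original scale
recovered = pt.inverse_transform(transformed)

print("fitted lambda:", pt.lambdas_[0])
print("mean / std after transform:", transformed.mean(), transformed.std())
print("max round-trip error:", np.abs(recovered - column).max())
```

Note that a power transform is a single monotonic mapping per column, so it reshapes skew but leaves the number of modes unchanged; whether that matters for synthetic data quality is exactly what would need to be measured.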

npatki · 2023-03-29T20:58:02Z

Hi @candalfigomoro, thanks for the feedback. We'll keep this issue open to share any information as we investigate the specifics of this transformer.

Some considerations:

Quality: Does this significantly improve quality when used to create synthetic data? To evaluate quality, we use the SDMetrics quality report.
Performance: How quickly is this transformer able to fit, transform and reverse transform compared to the others?
Memory: What would be the overall file size if you were to save a synthesizer that used this transformer vs. others?

If you have done any exploration yourself along these lines, we'd be very eager to see it!
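A minimal sketch of how the performance and memory considerations above might be measured is shown below. It times fit, transform and inverse transform for scikit-learn's `PowerTransformer` and uses the pickled size of the fitted transformer as a rough proxy for saved-file size. The `benchmark` helper, the synthetic data and the result keys are all hypothetical. RDT transformers such as ClusterBasedNormalizer would presumably need a thin adapter to fit this helper (an assumption, not shown here), and the size of a whole saved synthesizer would have to be measured separately.

```python
import pickle
import time

import numpy as np
from sklearn.preprocessing import PowerTransformer


def benchmark(transformer, data):
    """Time fit, transform and inverse transform; report pickled size in bytes."""
    results = {}

    start = time.perf_counter()
    transformer.fit(data)
    results["fit_seconds"] = time.perf_counter() - start

    start = time.perf_counter()
    transformed = transformer.transform(data)
    results["transform_seconds"] = time.perf_counter() - start

    start = time.perf_counter()
    transformer.inverse_transform(transformed)
    results["reverse_seconds"] = time.perf_counter() - start

    # Serialized size of the fitted transformer only (a proxy, not a full synthesizer)
    results["pickled_bytes"] = len(pickle.dumps(transformer))
    return results


rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 1))  # hypothetical numeric column
report = benchmark(PowerTransformer(method="yeo-johnson"), data)
print(report)
```

Single-run timings like these are noisy, so repeating each measurement and comparing medians across transformers on the same columns would give a fairer picture.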

candalfigomoro added the new and question labels Feb 13, 2023

candalfigomoro mentioned this issue Feb 13, 2023

DataTransformer init parameters sdv-dev/CTGAN#146

Open

npatki added the under discussion label and removed the new label Mar 29, 2023

npatki removed the under discussion label Mar 6, 2024
