Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exposing Helpful Anomaly Detection Metadata from Anomaly Strategies (ie Anomaly Check Range/Thresholds) through backwards compatible function #593

Open
wants to merge 25 commits into
base: master
Choose a base branch
from

Conversation

arsenalgunnershubert777
Copy link

Description of changes:

This PR adds functionality to expose anomaly detection metadata for anomaly checks.

There is a legacy PR here, and this PR addresses feedback to make the anomaly detection metadata functionality backwards compatible.

Let me know if anyone has any questions/feedback, thanks!

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

* Add Spark 3.5 support

* Replace with DataTypeUtils.fromAttributes

* Remove unintended new line
…mpleteness (awslabs#532)

* Modified Completeness analyzer to label filtered rows as null for row-level results

* Modified GroupingAnalyzers and Uniqueness analyzer to label filtered rows as null for row-level results

* Adjustments for modifying the calculate method to take in a filterCondition

* Add RowLevelFilterTreatement trait and object to determine how filtered rows will be labeled (default True)

* Modify VerificationRunBuilder to have RowLevelFilterTreatment as variable instead of extending, create RowLevelAnalyzer trait

* Do row-level filtering in AnalyzerOptions rather than with RowLevelFilterTreatment trait

* Modify computeStateFrom to take in optional filterCondition
…um (awslabs#535)

* Address comments on PR awslabs#532

* Add filtered row-level result support for Minimum, Maximum, Compliance, PatternMatch, MinLength, MaxLength analyzers

* Refactored criterion for MinLength and MaxLength analyzers to separate rowLevelResults logic
…wslabs#537)

* Add analyzerOption to add filteredRowOutcome for isPrimaryKey Check
* Add analyzerOption to add filteredRowOutcome for hasUniqueValueRatio Check
…vior.EmptyString option, the where filter wasn't properly applied (awslabs#538)
…slabs#543)

* [Min/Max] Apply filtered row behavior at the row level evaluation

- This changes from applying the behavior at the analyzer level. It allows us to prevent the usage of MinValue/MaxValue as placeholder values for filtered rows.

* Improved the separation of null rows, based on their source

- Whether the outcome for a row is null because of being filtered out or due to the target column being null, is now stored in the outcome column itself.
- We could have reused the placeholder value to find out if a row was originally filtered out, but that would not work if the actual value in the row was the same originally.

* Mark filtered rows as true

We recently fixed the outcome of filtered rows and made them default to true instead of false, which was a bug earlier. This change maintains that behavior.

* Added null behavior - empty string to match block

Not having it can cause match error.
…aluation (awslabs#547)

* [MinLength/MaxLength] Apply filtered row behavior at the row level evaluation

- For certain scenarios, the filtered row behavior for MinLength and MaxLength was not working correctly.
- For example, when using both minLength and maxLength constraints in a single check, and with both using == <value> as an assertion. This was resulting in the row level outcome of the filtered rows to be false. This was because we were replacing values for filtered rows for Min to MaxValue and for Max to MinValue. But a number could not equal both at the same time.
- Updated the logic of the row level assertion to MinLength/MaxLength to match what was done for Min/Max.
- The satisfies constraint was incorrectly using the provided assertion to evaluate the row level outcomes. The assertion should only be used to evaluate the final outcome.
- As part of this change, we have updated the row level results to return a true/false. The cast to an integer happens as part of the aggregation result.
- Added a test to verify the row level results using checks made up of different assertions.
* Added RatioOfSums analyzer and tests

* Unit test for divide by zero and code cleanup.

* More detailed Scaladoc

* Fixed docs to include Double.NegativeInfinity

* Add copyright to new file
* Fix flaky KLL test

* Move CustomSql state to CustomSql analyzer

* Implement new Analyzer to count columns

* Improve documentation, remove unused parameter, replace if/else with map

---------

Co-authored-by: Yannis Mentekidis <[email protected]>
* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const
* Configurable RetainCompletenessRule

* Add doc string

* Add default completeness const

* Add ConfidenceIntervalStrategy

* Add Separate Wilson and Wald Interval Test

* Add License information, Fix formatting

* Add License information

* formatting fix

* Update documentation

* Make WaldInterval the default strategy for now

* Formatting import to per line

* Separate group import to per line import
* Add support for EntityTypes dqdl rule

* Add support for Conditional Aggregation Analyzer

---------

Co-authored-by: Joshua Zexter <[email protected]>
* Generate row-level results with withColumns

Iteratively using withColumn (singular) causes performance
issues when iterating over a large sequence of columns.

* Add back UNIQUENESS_ID
… for the bound value and isThresholdInclusive, also adding anomaly detection with extended results README, and adding anomaly detection test with 2 anomaly checks on the same suite
@rdsharma26
Copy link
Contributor

@arsenalgunnershubert777 Thanks for the PR! Please allow for additional time to review the PR given the amount of changes. In the meantime, would it be possible to rebase your branch off of latest master so that it reduces the number of commits in the PR?

@arsenalgunnershubert777
Copy link
Author

Hey @rdsharma26 thanks for reaching out. I think I have some issues with the conflicts (and I also can improve my understanding of Git)
So for now I may just create a clean branch and apply the changes on top of that manually and open a new PR, so there's just 1 commit

@rdsharma26
Copy link
Contributor

rdsharma26 commented Nov 13, 2024

Thanks @arsenalgunnershubert777
After syncing your forked repository with the main repository, you could try:

git checkout master
git pull
git checkout exposeAnomalyThresholdBackwardsCompatible
git rebase master
git push origin exposeAnomalyThresholdBackwardsCompatible

@arsenalgunnershubert777
Copy link
Author

@rdsharma26 got it yea when I run that it has more conflicts, and I notice some classes have reverted to previous states

@arsenalgunnershubert777
Copy link
Author

arsenalgunnershubert777 commented Dec 11, 2024

Hey @rdsharma26 hope you are well, just checking in if you or anyone else has any comments so far. After your review/comments, I will also push another branch to have less commits and create another PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.