
Show sample data for failed checks #14

Open
ghost opened this issue Jul 16, 2015 · 8 comments
@ghost

ghost commented Jul 16, 2015

If a data quality check fails, show some sample rows that break the rule. E.g. for a unique-key check, show the column values that occur in more than one row. Switch it on and off via some (Object?) variable; the number of sample rows should be configurable, too.

@FRosner FRosner modified the milestone: Backlog Jul 16, 2015
@FRosner
Owner

FRosner commented Jul 16, 2015

@mfsny we need to think about how to make this easy to use. I think for now we can add it as a parameter to the constructor of the Check object and give it a default of disabled. We may then think about refactoring all the parameters into some configuration object that can be passed in or read from a properties file.
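
A minimal sketch of the idea above, using hypothetical names (`CheckConfig`, `displaySamples`, `sampleRows` are illustrative, not the actual DDQ API): the sampling switches live in a small configuration object passed to the check's constructor, with the feature disabled by default so existing call sites keep their behaviour.

```scala
// Hypothetical configuration object for the proposed feature.
// Names are assumptions for illustration, not the real DDQ API.
case class CheckConfig(
  displaySamples: Boolean = false, // switch sample output on or off
  sampleRows: Int = 5              // how many violating rows to show
)

// The check would receive the config through its constructor:
case class Check(tableName: String, config: CheckConfig = CheckConfig()) {
  def describe: String =
    if (config.displaySamples)
      s"Check($tableName) showing up to ${config.sampleRows} failing rows"
    else
      s"Check($tableName) without sample output"
}
```

Usage: `Check("customers")` keeps today's behaviour, while `Check("customers", CheckConfig(displaySamples = true, sampleRows = 10))` opts in. Reading the same fields from a properties file would be a straightforward later refactoring.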

@FRosner FRosner modified the milestones: Backlog, 3.1.0 Mar 28, 2016
@FRosner
Owner

FRosner commented May 7, 2016

Makes sense to implement #95 first, I guess.

@Rahumathulla

Rahumathulla commented Apr 27, 2017

Hi @FRosner, I tried to run the DDQ code in my local Spark environment and did some quality checks on sample data, which worked perfectly.
Now I would like to save the records that satisfy the check criteria to "success.txt" and those that don't to "fail.txt".

Consider the example below: Check(customers).isAnyOf("gender", Set("m", "f")).run()

Suppose the DF 'customers' has 1000 records, with 800 records having the gender 'm' or 'f' and the remaining 200 records having some other gender value. Treating those 200 as invalid records, I would like to save them to "fail.txt" and the 800 valid records to "success.txt".

Could you please suggest an option to meet this requirement?

@FRosner
Owner

FRosner commented Apr 27, 2017

Hi @Rahumathulla,

thanks for asking :)

Currently this is not possible, as you have no programmatic access to the resulting data frame (#95). If you had that, you could just take the resulting dataframe and write it to a file using Spark.

However, it is not trivial to design the API, for two main reasons:

  1. It is not general enough to be applied to all checks, as it only works for row-wise checks.
  2. Due to the sad fact of not having binary logic (we have to deal with null), it is not trivial to compute the other side if you have the failing or satisfying rows.

Taking your example from above and the following example data set:

id | gender
-- + ------
 1 |      M
 2 |      x
 3 |   null

What would you expect to be failing and succeeding when checking for isAnyOf("gender", Set("m", "f"))?

  • select * from customers where gender = M or gender = W would show row 1 to succeed
  • select * from customers where not (gender = M or gender = W) would show row 2 to fail

What about row 3? The problem is that null is neither equal nor not-equal to M. Of course you can just say that you are going to use all the other records by building the complement, but this has to be a conscious decision. Maybe you'd like to ignore the cases where it is null?
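
The three-valued logic problem above can be sketched with plain Scala, using Option[String] to stand in for a nullable SQL column (the allowed set here follows the select statements above; all names are illustrative, not DDQ's API). A predicate over a nullable value can be true, false, or unknown (None), and the unknown rows land on neither side:

```scala
// A membership check over a nullable value: None (the "null" case)
// stays None, so the row neither succeeds nor fails.
def isAnyOf(value: Option[String], allowed: Set[String]): Option[Boolean] =
  value.map(allowed.contains)

// The example data set: (id, gender)
val rows = Seq(1 -> Some("M"), 2 -> Some("x"), 3 -> None)
val allowed = Set("M", "W")

val succeeding = rows.filter { case (_, g) => isAnyOf(g, allowed) == Some(true) }
val failing    = rows.filter { case (_, g) => isAnyOf(g, allowed) == Some(false) }
val unknown    = rows.filter { case (_, g) => isAnyOf(g, allowed).isEmpty }
```

Row 1 succeeds, row 2 fails, and row 3 ends up in the unknown bucket — exactly the case the API design has to decide how to handle.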

We currently have a proposal for how to implement it at #99, and some discussion has started on how to do it. Feel free to comment there as well if you want to contribute.

Does it make sense?

@Rahumathulla

Rahumathulla commented Apr 27, 2017

Hi @FRosner,
Thank you very much for your quick response.
I got your point. Can we think of doing this another way? I am adding one more record to the table to explain my view.

id | gender
-- + ------
1 | M
2 | X
3 | W
4 | null

Step 1: eliminate the null values and write them to an intermediate file called "temp.txt": Check(customers).isNeverNull("gender").run()

Step 2: use the data in temp.txt, run the check, and write the result to "success.txt": Check(customers).isAnyOf("gender", Set("M", "W")).run()

Will this produce my expected result of fetching the valid records (1,3 from the above example)?
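
The two-step pipeline described above can be sketched with plain Scala collections (the real thing would run on DataFrames once #95/#99 provide programmatic access; the names here are illustrative):

```scala
// The four-row example: (id, gender), with row 4 carrying a null gender.
val customers: Seq[(Int, Option[String])] =
  Seq(1 -> Some("M"), 2 -> Some("X"), 3 -> Some("W"), 4 -> None)

// Step 1 (isNeverNull("gender")): drop rows without a gender value,
// i.e. the intermediate "temp.txt" content.
val nonNull: Seq[(Int, String)] =
  customers.collect { case (id, Some(g)) => (id, g) }

// Step 2 (isAnyOf("gender", Set("M", "W"))): split the remainder into
// the "success.txt" and "fail.txt" sides.
val (success, fail) =
  nonNull.partition { case (_, g) => Set("M", "W").contains(g) }
```

With the nulls filtered out first, the membership check is a clean two-way split: rows 1 and 3 succeed, row 2 fails, and row 4 never reaches step 2.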

@FRosner
Owner

FRosner commented Apr 27, 2017

Yes, what you describe will work. But for this you need to be able to get the actual dataframe containing the failing / successful results. This is what #95 is about, and #99 shows how it could be done.

Unfortunately it is not available in the current release.

@Rahumathulla

Rahumathulla commented May 3, 2017

Hi FRosner,
I am using Maven instead of SBT to build DDQ 4.1.0 with Spark version 2.1.0 and Scala version 2.12.1.

When I try to build the code, I get the errors below. Could you please help me figure out the issue?

[INFO] D:\Projects\DDQ\drunken-data-quality\src\main\scala: -1: info: compiling
[INFO] Compiling 32 source files to D:\Projects\DDQ\drunken-data-quality\target\classes at 1493817340627
[ERROR] D:\Projects\DDQ\drunken-data-quality\src\main\scala\de\frosner\ddq\constraints\FunctionalDependencyConstraint.scala:19: error: value =!= is not a member of org.apache.spark.sql.Column
[INFO] val maybeViolatingDeterminantValuesCount = maybeDeterminantValueCounts.map(_.filter(new Column("count") =!= 1).count)
[INFO] ^
[ERROR] D:\Projects\DDQ\drunken-data-quality\src\main\scala\de\frosner\ddq\core\Check.scala:11: error: object SparkSession is not a member of package org.apache.spark.sql
[INFO] import org.apache.spark.sql.{Column, DataFrame, SparkSession}
[INFO] ^
[ERROR] D:\Projects\DDQ\drunken-data-quality\src\main\scala\de\frosner\ddq\core\Check.scala:227: error: not found: type SparkSession
[INFO] def sqlTable(spark: SparkSession,
[INFO] ^
[ERROR] D:\Projects\DDQ\drunken-data-quality\src\main\scala\de\frosner\ddq\core\Check.scala:249: error: not found: type SparkSession
[INFO] def hiveTable(spark: SparkSession,
[INFO] ^
[ERROR] D:\Projects\DDQ\drunken-data-quality\src\main\scala\de\frosner\ddq\reporters\EmailReporter.scala:7: error: not found: object courier
[INFO] import courier._
[INFO] ^
[ERROR] D:\Projects\DDQ\drunken-data-quality\src\main\scala\de\frosner\ddq\reporters\EmailReporter.scala:148: error: not found: value Mailer
[INFO] val mailer = Mailer(smtpServer, smtpPort).startTtls(true)
[INFO] ^
[ERROR] D:\Projects\DDQ\drunken-data-quality\src\main\scala\de\frosner\ddq\reporters\EmailReporter.scala:153: error: not found: value Envelope
[INFO] val envelope = Envelope.from(from.addr)
[INFO] ^
[ERROR] D:\Projects\DDQ\drunken-data-quality\src\main\scala\de\frosner\ddq\reporters\EmailReporter.scala:153: error: value addr is not a member of String
[INFO] val envelope = Envelope.from(from.addr)
[INFO] ^
[ERROR] D:\Projects\DDQ\drunken-data-quality\src\main\scala\de\frosner\ddq\reporters\EmailReporter.scala:154: error: value addr is not a member of String
[INFO] .to(to.map(_.addr).toSeq: _*)
[INFO] ^
[ERROR] D:\Projects\DDQ\drunken-data-quality\src\main\scala\de\frosner\ddq\reporters\EmailReporter.scala:155: error: value addr is not a member of String
[INFO] .cc(cc.map(_.addr).toSeq: _*)
[INFO] ^
[ERROR] D:\Projects\DDQ\drunken-data-quality\src\main\scala\de\frosner\ddq\reporters\EmailReporter.scala:157: error: not found: value Multipart
[INFO] .content(Multipart().html(contentString))
[INFO] ^
[ERROR] 11 errors found

My pom file is attached: ddq_pom.txt

@FRosner
Owner

FRosner commented May 3, 2017

@Rahumathulla thanks for asking. I replied on Gitter. Let's move the discussion there, as it is not really related to this ticket.
