
Show sample data for failed checks #14

Open
ghost opened this issue Jul 16, 2015 · 8 comments
@ghost

ghost commented Jul 16, 2015

If a data quality check fails, show some sample rows that break the rule. E.g. for a unique-key check, show the column values that occur in more than one row. Switch it on and off via some (Object?) variable; the number of sample rows should be configurable, too.

@FRosner FRosner modified the milestone: Backlog Jul 16, 2015
@FRosner
Owner

FRosner commented Jul 16, 2015

@mfsny we need to think about how to make this easy to use. I think for now we can add it as a parameter to the constructor of the Check object and give it a default of disabled. We may then think about refactoring all the parameters into some configuration object that can be passed in or read from a properties file.
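
A minimal sketch of the idea above, using hypothetical names (`CheckConfig`, `displaySamples`, `sampleRows` are illustrative, not the actual DDQ API): the sampling switches live in a small configuration object passed to the check's constructor, with the feature disabled by default so existing call sites keep their behaviour.

```scala
// Hypothetical configuration object for the proposed feature.
// Names are assumptions for illustration, not the real DDQ API.
case class CheckConfig(
  displaySamples: Boolean = false, // switch sample output on or off
  sampleRows: Int = 5              // how many violating rows to show
)

// The check would receive the config through its constructor:
case class Check(tableName: String, config: CheckConfig = CheckConfig()) {
  def describe: String =
    if (config.displaySamples)
      s"Check($tableName) showing up to ${config.sampleRows} failing rows"
    else
      s"Check($tableName) without sample output"
}
```

Usage: `Check("customers")` keeps today's behaviour, while `Check("customers", CheckConfig(displaySamples = true, sampleRows = 10))` opts in. Reading the same fields from a properties file would be a straightforward later refactoring.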

@FRosner FRosner modified the milestones: Backlog, 3.1.0 Mar 28, 2016
@FRosner
Owner

FRosner commented May 7, 2016

Makes sense to implement #95 first, I guess.

@Rahumathulla

Rahumathulla commented Apr 27, 2017

Hi @FRosner, I tried to run the DDQ code in my local Spark environment and did some quality checks on sample data, which worked perfectly.
Now I would like to save the records that satisfy the check criteria to "success.txt" and those that don't to "fail.txt".

Consider the example below: Check(customers).isAnyOf("gender", Set("m", "f")).run()

Suppose the DF 'customers' has 1000 records, with 800 records having the gender 'm' or 'f' and the remaining 200 records having some other gender value. Treating those 200 as invalid records, I would like to save them to "fail.txt" and the 800 valid records to "success.txt".

Could you please suggest an option to meet this requirement?

@FRosner
Owner

FRosner commented Apr 27, 2017

Hi @Rahumathulla,

thanks for asking :)

Currently this is not possible, as you have no programmatic access to the resulting data frame (#95). If you had that, you could just take the resulting dataframe and write it to a file using Spark.

However, it is not trivial to design the API, for two main reasons:

  1. It is not general enough to be applied to all checks, as it only works for row-wise checks.
  2. Due to the sad fact of not having binary logic (we have to deal with null), it is not trivial to compute the other side if you have the failing or satisfying rows.

Taking your example from above and the following example data set:

id | gender
-- + ------
 1 |      M
 2 |      x
 3 |   null

What would you expect to be failing and succeeding when checking for isAnyOf("gender", Set("m", "f"))?

  • select * from customers where gender = M or gender = W would show row 1 to succeed
  • select * from customers where not (gender = M or gender = W) would show row 2 to fail

What about row 3? The problem is that null is neither equal nor not-equal to M. Of course you can just say that you are going to use all the other records by building the complement, but this has to be a conscious decision. Maybe you'd like to ignore the cases where it is null?
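
The three-valued logic problem above can be sketched with plain Scala, using Option[String] to stand in for a nullable SQL column (the allowed set here follows the select statements above; all names are illustrative, not DDQ's API). A predicate over a nullable value can be true, false, or unknown (None), and the unknown rows land on neither side:

```scala
// A membership check over a nullable value: None (the "null" case)
// stays None, so the row neither succeeds nor fails.
def isAnyOf(value: Option[String], allowed: Set[String]): Option[Boolean] =
  value.map(allowed.contains)

// The example data set: (id, gender)
val rows = Seq(1 -> Some("M"), 2 -> Some("x"), 3 -> None)
val allowed = Set("M", "W")

val succeeding = rows.filter { case (_, g) => isAnyOf(g, allowed) == Some(true) }
val failing    = rows.filter { case (_, g) => isAnyOf(g, allowed) == Some(false) }
val unknown    = rows.filter { case (_, g) => isAnyOf(g, allowed).isEmpty }
```

Row 1 succeeds, row 2 fails, and row 3 ends up in the unknown bucket — exactly the case the API design has to decide how to handle.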

We currently have a proposal for how to implement it at #99, and some discussion has started on how to do it. Feel free to comment there as well if you want to contribute.

Does it make sense?

@Rahumathulla

Rahumathulla commented Apr 27, 2017

Hi @FRosner,
Thank you very much for your quick response.
I got your point. Can we think of doing this another way? I am adding one more record to the table to explain my view.

id | gender
-- + ------
1 | M
2 | X
3 | W
4 | null

Step 1: eliminate the null values and write them to an intermediate file called "temp.txt": Check(customers).isNeverNull("gender").run()

Step 2: use the data in temp.txt, run the check, and write the result to "success.txt": Check(customers).isAnyOf("gender", Set("M", "W")).run()

Will this produce my expected result of fetching the valid records (1,3 from the above example)?
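
The two-step pipeline described above can be sketched with plain Scala collections (the real thing would run on DataFrames once #95/#99 provide programmatic access; the names here are illustrative):

```scala
// The four-row example: (id, gender), with row 4 carrying a null gender.
val customers: Seq[(Int, Option[String])] =
  Seq(1 -> Some("M"), 2 -> Some("X"), 3 -> Some("W"), 4 -> None)

// Step 1 (isNeverNull("gender")): drop rows without a gender value,
// i.e. the intermediate "temp.txt" content.
val nonNull: Seq[(Int, String)] =
  customers.collect { case (id, Some(g)) => (id, g) }

// Step 2 (isAnyOf("gender", Set("M", "W"))): split the remainder into
// the "success.txt" and "fail.txt" sides.
val (success, fail) =
  nonNull.partition { case (_, g) => Set("M", "W").contains(g) }
```

With the nulls filtered out first, the membership check is a clean two-way split: rows 1 and 3 succeed, row 2 fails, and row 4 never reaches step 2.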

@FRosner
Owner

FRosner commented Apr 27, 2017

Yes, what you describe will work. But for this you need to be able to get the actual dataframe containing the failing / successful results. This is what #95 is about, and #99 shows how it could be done.

Unfortunately it is not available in the current release.

@Rahumathulla

Rahumathulla commented May 3, 2017

Hi FRosner,
I am using Maven instead of SBT to build DDQ 4.1.0 with Spark version 2.1.0 and Scala version 2.12.1.

When I try to build the code, I get the errors below. Could you please help me figure out the issue?

[INFO] D:\Projects\DDQ\drunken-data-quality\src\main\scala: -1: info: compiling
[INFO] Compiling 32 source files to D:\Projects\DDQ\drunken-data-quality\target\classes at 1493817340627
[ERROR] D:\Projects\DDQ\drunken-data-quality\src\main\scala\de\frosner\ddq\constraints\FunctionalDependencyConstraint.scala:19: error: value =!= is not a member of org.apache.spark.sql.Column
[INFO] val maybeViolatingDeterminantValuesCount = maybeDeterminantValueCounts.map(_.filter(new Column("count") =!= 1).count)
[INFO] ^
[ERROR] D:\Projects\DDQ\drunken-data-quality\src\main\scala\de\frosner\ddq\core\Check.scala:11: error: object SparkSession is not a member of package org.apache.spark.sql
[INFO] import org.apache.spark.sql.{Column, DataFrame, SparkSession}
[INFO] ^
[ERROR] D:\Projects\DDQ\drunken-data-quality\src\main\scala\de\frosner\ddq\core\Check.scala:227: error: not found: type SparkSession
[INFO] def sqlTable(spark: SparkSession,
[INFO] ^
[ERROR] D:\Projects\DDQ\drunken-data-quality\src\main\scala\de\frosner\ddq\core\Check.scala:249: error: not found: type SparkSession
[INFO] def hiveTable(spark: SparkSession,
[INFO] ^
[ERROR] D:\Projects\DDQ\drunken-data-quality\src\main\scala\de\frosner\ddq\reporters\EmailReporter.scala:7: error: not found: object courier
[INFO] import courier._
[INFO] ^
[ERROR] D:\Projects\DDQ\drunken-data-quality\src\main\scala\de\frosner\ddq\reporters\EmailReporter.scala:148: error: not found: value Mailer
[INFO] val mailer = Mailer(smtpServer, smtpPort).startTtls(true)
[INFO] ^
[ERROR] D:\Projects\DDQ\drunken-data-quality\src\main\scala\de\frosner\ddq\reporters\EmailReporter.scala:153: error: not found: value Envelope
[INFO] val envelope = Envelope.from(from.addr)
[INFO] ^
[ERROR] D:\Projects\DDQ\drunken-data-quality\src\main\scala\de\frosner\ddq\reporters\EmailReporter.scala:153: error: value addr is not a member of String
[INFO] val envelope = Envelope.from(from.addr)
[INFO] ^
[ERROR] D:\Projects\DDQ\drunken-data-quality\src\main\scala\de\frosner\ddq\reporters\EmailReporter.scala:154: error: value addr is not a member of String
[INFO] .to(to.map(_.addr).toSeq: _*)
[INFO] ^
[ERROR] D:\Projects\DDQ\drunken-data-quality\src\main\scala\de\frosner\ddq\reporters\EmailReporter.scala:155: error: value addr is not a member of String
[INFO] .cc(cc.map(_.addr).toSeq: _*)
[INFO] ^
[ERROR] D:\Projects\DDQ\drunken-data-quality\src\main\scala\de\frosner\ddq\reporters\EmailReporter.scala:157: error: not found: value Multipart
[INFO] .content(Multipart().html(contentString))
[INFO] ^
[ERROR] 11 errors found

My pom file is attached: ddq_pom.txt

@FRosner
Owner

FRosner commented May 3, 2017

@Rahumathulla thanks for asking. I replied on Gitter. Let's move the discussion there, as it is not really related to this ticket.
