Show sample data for failed checks #14
Comments
@mfsny we need to think about how to make this easy to use. I think for now we can add it as a parameter to the constructor of the |
Makes sense to implement #95 first, I guess. |
Hi @FRosner, I ran the DDQ code in my local Spark environment and did some quality checks on sample data, which worked perfectly. Consider the following example: Check(customers).isAnyOf("gender", Set("m", "f")).run(). Suppose the DataFrame 'customers' has 1000 records, 800 of which have the gender 'm' or 'f', while the remaining 200 have some other value. Treating those 200 as invalid records, I would like to save them to "fail.txt" and the 800 valid records to "success.txt". Could you suggest a way to meet this requirement? |
Hi @Rahumathulla, thanks for asking :) Currently this is not possible, as you have no programmatic access to the resulting data frame (#95). If you had that, you could just take the resulting data frame and write it to a file using Spark. However, it is not trivial to design the API, for two main reasons:
Taking your example from above and the following example data set:
What would you expect to be failing and succeeding when checking for
What about row 3? The problem is that We currently have a proposal for implementing this at #99, and some discussion has started on how to do it. Feel free to comment there too if you want to contribute. Does that make sense? |
Hi @FRosner, id | gender Step 1: eliminate the null values and write the result to an intermediate file called "temp.txt": Check(customers).isNeverNull("gender").run(). Step 2: use the data in temp.txt, run the check, and write the result to "success.txt": Check(customers).isAnyOf("gender", Set("M", "W")).run(). Will this produce my expected result of fetching the valid records (1,3 from the above example)? |
Hi FRosner, when I try to build the code, I get the exception below. Could you please help me figure out the issue? [INFO] D:\Projects\DDQ\drunken-data-quality\src\main\scala: -1: info: compiling [My pom file is attached] |
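To make the two-step idea above concrete, here is a minimal sketch in plain Scala collections. It is not DDQ or Spark API: the `Customer` row type, the `SplitSketch.split` helper, and the use of `Option` to stand in for a null column value are all assumptions made for illustration. It shows that an "is any of" rule naturally produces three groups (valid, invalid, and missing), so any API that exposes failing rows has to decide where the missing ones go.

```scala
// Hypothetical row type; the thread's actual data set is not shown.
// None stands in for a null value in the "gender" column.
final case class Customer(id: Int, gender: Option[String])

object SplitSketch {
  // Splits rows into (valid, invalid, missing) for an "is any of" rule.
  def split(
      rows: Seq[Customer],
      allowed: Set[String]
  ): (Seq[Customer], Seq[Customer], Seq[Customer]) = {
    // Step 1: separate rows that have no value at all.
    val (present, missing) = rows.partition(_.gender.isDefined)
    // Step 2: among the rows with a value, apply the membership rule.
    val (valid, invalid) = present.partition(c => c.gender.exists(allowed.contains))
    (valid, invalid, missing)
  }
}
```

With rows `(1, "M")`, `(2, null)`, `(3, "W")`, `(4, "x")` and the allowed set `Set("M", "W")`, this returns rows 1 and 3 as valid, row 4 as invalid, and row 2 as missing. Each group could then be written to its own file.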
@Rahumathulla thanks for asking. I replied on gitter. Let's move the discussion there as it is not really related to the ticket? |
If a data quality check fails, show some sample rows that break the rule. E.g. for a unique key check, show the key values that occur in more than one row. It should be possible to switch this on and off using some (Object?) variable, and the number of sample rows should be configurable, too.
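As a rough illustration of the request, the following plain Scala sketch collects sample violations for a unique key rule. It is not the DDQ implementation: the `SampleSketch` object, the `duplicateKeySamples` helper, and its `showSamples` and `sampleSize` parameters are hypothetical names chosen here to mirror the on/off switch and the configurable sample count.

```scala
object SampleSketch {
  // Returns up to `sampleSize` pairs of (key value, number of occurrences)
  // for keys that occur more than once. Returns nothing when `showSamples`
  // is false. The order of the returned samples is not guaranteed.
  def duplicateKeySamples[K](
      keys: Seq[K],
      showSamples: Boolean = true,
      sampleSize: Int = 10
  ): Seq[(K, Int)] =
    if (!showSamples) Seq.empty
    else
      keys
        .groupBy(identity)                                  // key -> all its occurrences
        .collect { case (k, vs) if vs.size > 1 => (k, vs.size) }
        .toSeq
        .take(sampleSize)
}
```

For example, `SampleSketch.duplicateKeySamples(Seq(1, 2, 2, 3, 3, 3))` yields the pairs `(2, 2)` and `(3, 3)` in some order, while passing `showSamples = false` yields an empty result.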