This repository has been archived by the owner on Aug 5, 2024. It is now read-only.

Hands-on - Part 1 (Discover, Prepare, and Enrich)

As a Business Analyst or Data Steward, you need to understand and gain insight into your data. After accessing data from various sources, profiling it, and reviewing the profiling results, you will be able to add a glossary term to a column in your dataset, as well as provide a rating and comment on the dataset.

In this exercise, you will discover and interact with various connected systems, upload a dataset, profile the data, rate the dataset, and create a relationship between the data and a glossary term.

Dataset Overview

You will interact with two different datasets:

  • A table stored in an SAP HANA database.
  • A flat file (CSV) which you will upload to a cloud data lake repository.

The first dataset, the SAP HANA table, contains information about pharmaceutical claims for an insurance company.

It contains 8 fields:

  • RECORD_ID (Unique identifier associated with the claim)
  • INSURANCE (Name of the insurance company)
  • PLAN (Plan offered by the insurance company)
  • PATIENT_ID (Unique identifier of the patient this claim is for)
  • OUTSTANDING (Outstanding amount for the drug)
  • CO_PAY (Co-pay amount, if any)
  • VISIT (Date of the visit associated with this claim)
  • DRUG_NAME (Drug name for the claim)

The second dataset, the flat file you retrieved from the main page of this hands-on, contains the list of drugs which are supported by this insurance company.

It contains 6 fields:

  • ORIG_PRODUCT (Original, unsplit entry)
  • DRUG_NAME (Drug name)
  • POTENCY (Potency associated with the drug)
  • DOSAGE (Dosage for the drug)
  • ROUTE_ADMINISTERED (How the drug is administered)
  • NOTES (Additional notes, if any)

This hands-on will focus on discovering this data, finding patterns and data quality issues, and fixing them.
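The two structures above can be sketched as tables. This is an illustrative sketch only (expressed here in SQLite via Python); the column types are assumptions, since the hands-on does not specify them:

```python
import sqlite3

# Illustrative sketch of the two dataset structures.
# The column types are assumptions, not taken from the hands-on.
conn = sqlite3.connect(":memory:")

# The SAP HANA claims table
conn.execute("""
    CREATE TABLE PHARMA_CLAIMS (
        RECORD_ID   INTEGER PRIMARY KEY, -- unique identifier of the claim
        INSURANCE   TEXT,                -- insurance company name
        "PLAN"      TEXT,                -- plan offered by the company
        PATIENT_ID  TEXT,                -- patient this claim is for
        OUTSTANDING REAL,                -- outstanding amount for the drug
        CO_PAY      REAL,                -- co-pay amount, if any
        VISIT       TEXT,                -- date of the visit
        DRUG_NAME   TEXT                 -- drug name for the claim
    )
""")

# The uploaded CSV of supported drugs
conn.execute("""
    CREATE TABLE DRUGS (
        ORIG_PRODUCT       TEXT, -- original, unsplit entry
        DRUG_NAME          TEXT, -- drug name
        POTENCY            TEXT, -- potency associated with the drug
        DOSAGE             TEXT, -- dosage for the drug
        ROUTE_ADMINISTERED TEXT, -- how the drug is administered
        NOTES              TEXT  -- additional notes, if any
    )
""")
```

Note that DRUG_NAME appears in both datasets; it is the field the two will later be joined on.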

Log Into SAP Data Intelligence

After completing these steps, you will have logged into SAP Data Intelligence.

  1. Open Chrome and go to the SAP Data Intelligence URL you were provided. You might need to use the URL and credentials from the Getting Started guide.

  2. Enter 'dat163-1' or 'dat163-2' for the Tenant Name, depending on your session, and click 'Proceed'.

Note:

  • The first session (November 17, 2021, 05:30 AM UTC) uses 'dat163-1'.
  • The second session (November 17, 2021, 10:00 PM UTC) uses 'dat163-2'.
  3. Enter the Username that was assigned to you (e.g. 'teched-dat163-##').

Note:

  • where ## is the number assigned to you.
  • If your user number is 01, then your login is 'teched-dat163-01'.
  4. Enter the Password that was assigned to you.

  5. Click 'Sign In'.

  6. You are now signed in to the application.

You have now logged into SAP Data Intelligence.

Browse data in a Connected System (Database)

After completing these steps, you will have discovered a dataset stored in a database.

  1. Click on 'Metadata Explorer'.

  2. Click on 'Browse Connections'.

  3. Click 'List View'.

  4. The connections are now shown as a list.

  5. Select 'View Capabilities' for the 'HANA_DEMO' or 'HANA_LOCALHOST' connection.

  6. This lists all the features supported by the given connected system.

  7. Click 'Grid View' to return to the tile view.

  8. Click on the 'HANA_DEMO' or 'HANA_LOCALHOST' tile.

  9. Select 'TECHED_DAT163' or 'TECHED'.

  10. The list of all available tables within the schema appears. (Depending on your landscape, you might see the 'PHARMA_CLAIMS' and 'QMTICKET' tables instead.)

  11. Type 'PHARMA_CLAIMS_##' in the 'Filter items' text field (where ## is your user number; for example, if your user number is 01, then type 'PHARMA_CLAIMS_01'). You might need to use the 'PHARMA_CLAIMS' table instead.

  12. Click 'View Fact Sheet' on the 'PHARMA_CLAIMS_##' database table tile (where ## is your user number).

  13. This shows the 'Fact Sheet'.

The 'Fact Sheet' is the central place in the SAP Data Intelligence Metadata Explorer to find information about your data.

You can easily profile the data and get access to metadata information. The fact sheet also contains links and information about business terms and tags associated with the dataset or its columns. Users can describe, rate, and comment on the data collaboratively. You can also prepare the data for downstream usage.

  1. Click 'Start Profiling'.

  2. Click 'Yes'.

  3. Click 'Notification'.

  4. This shows the details of the notifications. Click anywhere outside the notification window to continue interacting with the application.

  5. Wait for the profiling task to finish, and click 'Refresh' (note: this task can take some time).

  6. Once the task is done, the fact sheet is updated with the profiling information.

  7. Click 'Columns'.

  8. Select the Line for 'DRUG_NAME'.

  9. Observe the Data Preview and the 'Top 10 Distinct Values'.

  10. We can see there are data quality issues, such as spelling mistakes in the drug names.

  11. We can also see there is a significant number of null values.

  12. Click 'Data Intelligence Metadata Explorer'.

  13. Click 'Home'.

You have now discovered a table in a database, profiled the data, and found some data quality issues.
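The two statistics you just reviewed, the 'Top 10 Distinct Values' and the null count, can be illustrated with a small sketch. The sample values below are hypothetical; the actual profiling runs on the full table inside the Metadata Explorer:

```python
from collections import Counter

# Hypothetical sample of DRUG_NAME values; the real profiling runs on the
# full PHARMA_CLAIMS table inside the Metadata Explorer.
drug_names = ["Aspirin", "Asprin", "Aspirin", None, "Ibuprofen", None, "Ibuprofen"]

# Null count: how many records have no drug name at all
null_count = sum(1 for v in drug_names if v is None)

# Top 10 distinct values: frequency of each non-null value
top_distinct = Counter(v for v in drug_names if v is not None).most_common(10)

print(null_count)    # number of null values in this sample
print(top_distinct)  # 'Asprin' stands out as a likely misspelling of 'Aspirin'
```

Rare values sitting next to frequent, similarly spelled ones (like 'Asprin' next to 'Aspirin') are exactly the kind of data quality issue the distinct-value view surfaces.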

Upload your Dataset

After completing these steps, you will have uploaded a dataset from a flat file to a cloud data lake repository using SAP Data Intelligence.

  1. Click on 'Browse Connections'.

  2. Click on the 'DI_DATA_LAKE' tile.

  3. Click on the 'shared' tile.

  4. Click on the New Folder icon (folder with a +).

  5. Enter 'TechEd_DAT163_##' for folder name (where ## is the number assigned to you).

  6. Click 'OK'.

  7. Search for 'TechEd_DAT163_##' to isolate your newly created folder, then click on your newly added 'TechEd_DAT163_##' folder (where ## is the number assigned to you).

  8. To upload a file, click the 'Upload Files' icon on the toolbar.

  9. Click 'Upload' in the upper-right corner of the Upload Files pop-up window.

  10. Browse to the Sample Data folder where you downloaded and extracted 'DRUG_##.csv' (where ## is the number assigned to you) and select it.

  11. Click 'Open'.

  12. Click 'Upload'.

  13. The file uploads to the data lake.

  14. After the upload is complete, click 'Close'.

  15. The file is now uploaded and available in the data lake.

You have now uploaded a dataset from a flat file in your local folder to a cloud data lake repository using SAP Data Intelligence.

Enrich Dataset and Isolate Data Quality Issues

After completing these steps, you will have created a new dataset using self-service data preparation. This new dataset will help to easily isolate invalid claims. Additionally, you will profile this dataset, add a rating and description, and publish it to the catalog so it can be easily retrieved.

  1. Click 'More Actions'.

  2. Select 'Prepare Data'.

  3. The self-service data preparation room shows up.

  4. The first record of the data is actually the column header.

  5. Check 'Use first row as header'.

  6. Click 'Continue'.

  7. The application will automatically recreate a new sample with the updated metadata structure.

  8. The dataset now has the proper header.

  9. Click 'Actions'.

  10. Click 'Enrich Preparation'.

  11. The Enrich Preparation main user interface appears.

  12. Click '+' to add a new source of data to merge with.

  13. Click 'Browse'.

  14. Select 'HANA_DEMO' as the 'Connection'.

  15. Click 'TECHED_DAT163'.

  16. Type 'CLAIMS_##' in the 'Filter items' text field (where ## is your user number).

  17. Select 'PHARMA_CLAIMS_##' (where ## is your user number).

  18. Click 'OK'.

  19. The application acquires a sample of the newly selected dataset.

  20. The newly selected dataset can now be used to merge data.

  21. Drag and drop 'PHARMA_CLAIMS' onto the cell on the left-hand side of the main dataset.

  22. Select 'Left Join'.

  23. Scroll down the list of output columns and uncheck 'ORIG_PRODUCT', 'POTENCY', 'DOSAGE', 'ROUTE_ADMINISTERED', 'NOTES'.

  24. Click 'Apply'.

  25. The application displays a preview of the merged data.

  26. The merged data now shows a null value in the column 'DRUG_NAME_0' when a record from the claims data is for a drug that is not listed in the list of supported drugs.

  27. Click 'Apply Enrichment'.

  28. The main self-service data preparation room now shows the enriched dataset.

The enriched dataset now contains null values in the field 'DRUG_NAME_0' for the records in the claims dataset whose drug name does not exist in our reference list.

There are multiple potential reasons for this: some records might contain spelling mistakes in the drug name, others might reference drugs that are not covered by the insurance company, and in some the drug name in the claim might simply be null.

You can now use this enriched dataset to isolate the data quality issues to further understand the data.
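The left-join behavior described above can be sketched with a few illustrative rows (expressed here in SQLite via Python; the drug names and row values are hypothetical):

```python
import sqlite3

# Hypothetical rows; the real join runs on your PHARMA_CLAIMS_## data and
# the uploaded drug reference file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (record_id INTEGER, drug_name TEXT)")
conn.execute("CREATE TABLE drugs (drug_name TEXT)")
conn.executemany("INSERT INTO claims VALUES (?, ?)",
                 [(1, "Aspirin"), (2, "Asprin"), (3, None)])
conn.execute("INSERT INTO drugs VALUES ('Aspirin')")  # reference list

# A LEFT JOIN keeps every claim; drugs not found in the reference list
# (misspellings, unsupported drugs, null names) yield NULL on the right side.
rows = conn.execute("""
    SELECT c.record_id, c.drug_name, d.drug_name AS drug_name_0
    FROM claims c
    LEFT JOIN drugs d ON c.drug_name = d.drug_name
    ORDER BY c.record_id
""").fetchall()
print(rows)  # [(1, 'Aspirin', 'Aspirin'), (2, 'Asprin', None), (3, None, None)]
```

Only record 1 matches the reference; the misspelled and null drug names both come back with a null 'drug_name_0', which is what makes the invalid claims easy to isolate.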

  1. Click 'Actions'.

  2. Click 'Add Columns'.

  3. Type 'ValidClaim' for the 'Column Name'.

  4. Click 'Expression'.

  5. Type the following expression: 'CASE WHEN "DRUG_NAME_0" IS NULL THEN 'NO' ELSE 'YES' END'.

  6. Click 'OK'.

  7. Click 'Apply'.

  8. A new column is now created.

  9. Select the column 'DRUG_NAME_0'.

  10. Click 'Remove'.

  11. The column 'DRUG_NAME_0' has been deleted.

  12. Click '<' to navigate back to the 'Actions' menu.

  13. Click 'Run Preparation'.

  14. Type 'PHARMA_CLAIMS_ENRICHED_##' (Where ## is your user number) for the 'Dataset Name'.

  15. Click 'Apply'.

  16. Click 'Data Intelligence Metadata Explorer'.

  17. Select 'Monitor' and click 'Monitor Tasks'.

  18. The 'Monitoring' application shows the current running tasks. Wait for your task to complete.

  19. The task is completed.

  20. Click 'Data Intelligence Metadata Explorer', and click 'Home'.

  21. Click 'Browse Connections'.

  22. Click 'DI_DATA_LAKE'.

  23. Click 'shared'.

  24. Type 'TechEd_DAT163_##' (where ## is your user number) in the Filter field.

  25. Click TechEd_DAT163_## (where ## is your user number).

  26. Click 'More Actions' on the newly created dataset named 'PHARMA_CLAIMS_ENRICHED_##' (where ## is your user number).

  27. Select 'View Fact Sheet', then click 'Overview'.

  28. The fact sheet shows that the dataset is neither profiled nor published.

  29. Click the 'Profiling' icon.

  30. Click 'Yes'.

  31. Wait for the profiling to be executed (there will be two notifications, which you can check by clicking the notification icon). Then click 'Refresh'.

  32. The dataset is now profiled.

  33. Click '<' to come back to the connection browser.

  34. Click 'More Actions'.

  35. Click 'New Publication'.

  36. Type 'Pharma Claims Publication ##' (where ## is your user number) in the 'Name' text field. Type 'Publication for enriched claims data' in the 'Description' text field.

  37. Click 'Publish'.

  38. The application sends a notification when the publication task is triggered.

  39. The application sends another notification when the publication task is finished.

  40. Click 'Refresh'.

  41. The application now indicates that the dataset is both profiled and published in the application catalog.

  42. Click 'View Fact Sheet'.

  43. Click 'Reviews'.

  44. Click the pencil icon to post a rating.

  45. Click to define a rating (for example, a 4-star rating is set by clicking the fourth star).

  46. Add a comment: 'This dataset helps to easily identify claims for drugs that are not compliant'.

  47. Click 'OK'.

  48. The dataset has been enriched with a rating and a comment.

  49. Click 'Data Intelligence Metadata Explorer' and Click 'Home'.

  50. You have returned to the Metadata Explorer home page.

You have now created a new dataset using self-service data preparation. This new dataset helps to easily isolate invalid claims. You also profiled this dataset, added a rating and a comment, and published it in the catalog so it can be easily retrieved.
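The 'ValidClaim' calculated column you added can be illustrated with a minimal sketch (again in SQLite via Python; the sample rows are hypothetical, but the CASE expression is the one typed into the preparation):

```python
import sqlite3

# Hypothetical enriched rows; the real preparation produced
# PHARMA_CLAIMS_ENRICHED_##.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enriched (record_id INTEGER, drug_name_0 TEXT)")
conn.executemany("INSERT INTO enriched VALUES (?, ?)",
                 [(1, "Aspirin"), (2, None)])

# The same CASE expression used for the 'ValidClaim' calculated column:
# a null 'DRUG_NAME_0' means the drug was not found in the reference list.
rows = conn.execute("""
    SELECT record_id,
           CASE WHEN drug_name_0 IS NULL THEN 'NO' ELSE 'YES' END AS ValidClaim
    FROM enriched
    ORDER BY record_id
""").fetchall()
print(rows)  # [(1, 'YES'), (2, 'NO')]
```

Once the flag is materialized, the join column itself ('DRUG_NAME_0') carries no extra information, which is why the preparation removes it before running.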

Summary

You've now used Metadata Explorer to connect and interact with different data repositories (Databases, Cloud Data Lake, Local File System). You profiled and discovered data to identify data quality issues. You created a new enriched dataset to isolate these data quality issues. You published this dataset to the catalog.

Continue to - Hands-on - Part 2