As a Business Analyst or Data Steward, you need to understand and gain insight into your data. After accessing various data sources, profiling them, and reviewing the profiling results, you will be able to add a glossary term to a column in your dataset, as well as rate and comment on the dataset.
In this exercise, you will discover and interact with various connected systems, upload a dataset, profile the data, rate a dataset, and create a relationship between data and a glossary term.
You will interact with two different datasets:
- A table stored in an SAP HANA database.
- A flat file (CSV) which you will upload to a cloud data lake repository.
The first dataset, the SAP HANA table, contains information about pharmaceutical claims for an insurance company.
It contains 8 fields:
- RECORD_ID (Unique identifier associated with a claim)
- INSURANCE (Name of the insurance company)
- PLAN (Plan associated with the insurance company)
- PATIENT_ID (Unique identifier of the patient this claim is for)
- OUTSTANDING (Outstanding amount for the drug)
- CO_PAY (Co-pay amount, if any)
- VISIT (Date of the visit associated with this claim)
- DRUG_NAME (Drug name for the claim)
The second dataset, the flat file you retrieved on the main page of this hands-on, contains a list of drugs which are supported by this insurance company.
It contains 6 fields:
- ORIG_PRODUCT (Original, non-split entry)
- DRUG_NAME (Drug name)
- POTENCY (Potency associated with the drug)
- DOSAGE (Dosage for the drug)
- ROUTE_ADMINISTERED (How the drug is administered)
- NOTES (Additional notes, if any)
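As a reference, the flat file's header row uses the six field names listed above. The sketch below reads a hypothetical sample of the file with Python's standard `csv` module; the data row shown is purely illustrative and not taken from the real dataset.

```python
import csv
import io

# Header of DRUG_##.csv, using the six field names listed above.
# The single data row is a hypothetical illustration, not real content.
sample_csv = """\
ORIG_PRODUCT,DRUG_NAME,POTENCY,DOSAGE,ROUTE_ADMINISTERED,NOTES
Aspirin 100mg tablet,Aspirin,100mg,tablet,oral,
"""

# DictReader maps each row to the column names from the header line.
rows = list(csv.DictReader(io.StringIO(sample_csv)))
print(rows[0]["DRUG_NAME"])
```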
This hands-on will focus on discovering this data, finding patterns and data quality issues, and fixing them.
After completing these steps, you will have logged into SAP Data Intelligence.
-
Open Chrome and go to the SAP Data Intelligence URL you were provided. You might need to use the URL and credentials from the Getting Started guide.
-
Enter 'dat163-1' or 'dat163-2' for the Tenant Name, depending on your session, and click 'Proceed'.
Note:
- The first session (November 17, 2021, 05:30 AM UTC) uses 'dat163-1'.
- The second session (November 17, 2021, 10:00 PM UTC) uses 'dat163-2'.
Note:
- Your login is 'teched-dat163-##', where ## is the number assigned to you.
- If your user number is 01, then your login is 'teched-dat163-01'.
You have now logged into SAP Data Intelligence.
After completing these steps you will have discovered a dataset stored in a database.
-
Select 'View Capabilities' for the 'HANA_DEMO' or 'HANA_LOCALHOST' connection.
-
This lists all the features supported for a given connected system.
-
The list of all available tables within the schema shows up. Alternatively, you might see the 'PHARMA_CLAIMS' and 'QMTICKET' tables instead.
-
Type 'PHARMA_CLAIMS_##' in the 'Filter items' text field (where ## is your user number; for example, if your user number is 01, then type 'PHARMA_CLAIMS_01'). Alternatively, you might need to use the 'PHARMA_CLAIMS' table instead.
-
Click 'View FactSheet' on the 'PHARMA_CLAIMS_##' database table tile (where ## is your user number).
-
The 'Fact Sheet' is the central place in the SAP Data Intelligence Metadata Explorer to find information about your data.
You can easily profile the data and get access to metadata information. It also contains links and information about business terms and tags associated with the dataset or its columns. Users can describe, rate, and comment on the data collaboratively. You can also prepare the data for downstream usage.
-
This shows the details of the notifications. Click anywhere outside the notification window to continue interacting with the application.
-
Wait for the profiling task to finish, then click 'Refresh' (note: this action can take some time).
-
The fact sheet is updated with the profiling information once the task is done.
-
We can see there are data quality issues such as spelling mistakes on the drug names.
-
We can also see there is a significant number of null values.
You have now discovered a table in a database, profiled the data and found some data quality issues.
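The kinds of checks the profiling task runs, such as counting null values and distinct values per column, can be approximated with a short script. This is only an illustrative sketch; the sample claim rows below are hypothetical and not real records from PHARMA_CLAIMS.

```python
# Approximate two of the profiling checks (null counts and distinct
# value counts per column) on a tiny, hypothetical sample of claims.
sample_claims = [
    {"RECORD_ID": 1, "DRUG_NAME": "Aspirin", "CO_PAY": 10.0},
    {"RECORD_ID": 2, "DRUG_NAME": "Asprin",  "CO_PAY": None},  # spelling mistake
    {"RECORD_ID": 3, "DRUG_NAME": None,      "CO_PAY": 5.0},   # null drug name
]

def profile(rows):
    """Return per-column null counts and distinct non-null value counts."""
    report = {}
    for col in rows[0].keys():
        values = [row[col] for row in rows]
        non_null = [v for v in values if v is not None]
        report[col] = {
            "nulls": len(values) - len(non_null),
            "distinct": len(set(non_null)),
        }
    return report

print(profile(sample_claims))
```

Note how both issues found above show up: 'DRUG_NAME' has a null value, and the misspelled 'Asprin' inflates its distinct count.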
After completing these steps you will have uploaded a dataset from a flat file to a cloud data lake repository using SAP Data Intelligence.
-
Enter 'TechEd_DAT163_##' for folder name (where ## is the number assigned to you).
-
Search for 'TechEd_DAT163_##' to isolate your newly created folder, then click on your newly added 'TechEd_DAT163_##' folder (where ## is the number assigned to you).
-
To upload a file, click the 'Upload Files' icon on the toolbar.
-
Click on 'Upload' in the upper right hand corner of the Upload Files pop-up window.
-
Browse to the Sample Data folder where you downloaded and extracted 'DRUG_##.csv' (where ## is the number assigned to you) and select the file.
You have now uploaded a dataset from a flat file in a local folder to a cloud data lake repository using SAP Data Intelligence.
After completing these steps you will have created a new dataset using self-service data preparation. This new dataset will help you easily isolate invalid claims. Additionally, you will profile this dataset, add a rating and description, and publish it in the catalog so it can be easily retrieved.
-
The application will automatically recreate a new sample with the updated metadata structure.
-
Type 'CLAIMS_##' in the 'Filter items' text field (where ## is your user number).
-
The application is acquiring a sample of the new selected dataset.
-
Drag and drop 'PHARMA_CLAIMS' onto the cell on the left-hand side of the main dataset.
-
Scroll down the list of output columns and uncheck 'ORIG_PRODUCT', 'POTENCY', 'DOSAGE', 'ROUTE_ADMINISTERED', 'NOTES'.
-
The merged data now shows a null value in the column 'DRUG_NAME_0' when a record from the claims data refers to a drug that is not in the list of supported drugs.
-
The main self-service data preparation screen now shows the enriched dataset.
The enriched dataset now contains null values in the 'DRUG_NAME_0' field for those records in the claims dataset whose drug name does not exist in our reference list.
There are multiple potential reasons for this: some may be spelling mistakes in drug names, others may be drugs that the insurance company does not cover, and some claims may simply have a null drug name.
You can now use this enriched dataset to isolate the data quality issues and further understand the data.
-
Type the following expression: CASE WHEN "DRUG_NAME_0" IS NULL THEN 'NO' ELSE 'YES' END
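The enrichment and flag logic can be sketched in a few lines: a left join of the claims against the supported-drug list by DRUG_NAME, followed by the same YES/NO test as the CASE expression. This is an illustrative sketch only; the sample rows and the 'SUPPORTED' column name are hypothetical (the tutorial does not name the computed column).

```python
# Hypothetical sketch of the enrichment step: a left join of claims on
# the supported-drug list by DRUG_NAME, then the same flag logic as the
# CASE WHEN "DRUG_NAME_0" IS NULL THEN 'NO' ELSE 'YES' END expression.
claims = [
    {"RECORD_ID": 1, "DRUG_NAME": "Aspirin"},
    {"RECORD_ID": 2, "DRUG_NAME": "Asprin"},  # misspelled, not in reference
]
supported_drugs = {"Aspirin", "Ibuprofen"}    # DRUG_NAME values from DRUG_##.csv

enriched = []
for claim in claims:
    # DRUG_NAME_0 is null (None) when the drug is not in the reference list
    drug_name_0 = claim["DRUG_NAME"] if claim["DRUG_NAME"] in supported_drugs else None
    enriched.append({
        **claim,
        "DRUG_NAME_0": drug_name_0,
        "SUPPORTED": "NO" if drug_name_0 is None else "YES",  # hypothetical column name
    })

print(enriched)
```

A left join is used (rather than an inner join) precisely so that unmatched claims survive with a null 'DRUG_NAME_0', which is what makes the invalid claims easy to isolate.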
-
Type 'PHARMA_CLAIMS_ENRICHED_##' (Where ## is your user number) for the 'Dataset Name'.
-
The 'Monitoring' application shows the current running tasks. Wait for your task to complete.
-
Click 'Data Intelligence Metadata Explorer', and click 'HOME'.
-
Type 'TechEd_DAT163_##' (where ## is your user number) in the Filter field.
-
Click 'More Actions' on the newly created dataset named 'PHARMA_CLAIMS_ENRICHED_##' (where ## is your user number).
-
The fact sheet shows that the dataset is neither profiled nor published.
-
Wait for the profiling to be executed (there will be two notifications, which you can check by clicking the notification icon). Then click 'Refresh'.
-
Type 'Pharma Claims Publication ##' (where ## is your user number) for the 'Name' text field. Type 'Publication for enriched claimed data' for the 'Description' text field.
-
The application sends a notification when the publication task is triggered.
-
The application sends another notification when the publication task is finished.
-
The application now indicates that the dataset is both profiled and published in the application catalog.
-
Click to define a rating (for example, a 4-star rating is given by clicking the fourth star).
-
Add a comment: 'This dataset helps to easily identify claims for drugs that are not compliant'.
-
Click 'Data Intelligence Metadata Explorer' and Click 'Home'.
You have now created a new dataset using self-service data preparation. This new dataset helps you easily isolate invalid claims. You also profiled this dataset, added a rating and a description, and published it in the catalog so it can be easily retrieved.
You've now used Metadata Explorer to connect and interact with different data repositories (Databases, Cloud Data Lake, Local File System). You profiled and discovered data to identify data quality issues. You created a new enriched dataset to isolate these data quality issues. You published this dataset to the catalog.
Continue to - Hands-on - Part 2