
#Data Enrichment


Adding new and supplemental data to existing data is a frequent need over the course of software development. Data enrichment is a value-adding process in which external data from one or more sources is merged into existing data to enhance its quality. The end result is richer, more useful data for the users.

To do that, however, one must consider how, when, and where the data enrichment will occur within the system's data flow. These considerations and decisions can significantly impact the quality, and potentially the integrity, of the final data. This page attempts to give some insight into the breadth of considerations needed in order to perform data enrichment.

##Design Considerations

During the course of Iteration 2, a key question was how to add supplemental data to an incoming source. In this case, we wanted to add ATC class codes to the drug information found within each patient record. From Iteration 1, we already had drug codes coming through the E2E documents in the form of DINs, or Drug Identification Numbers. Although that was enough to answer the polypharmacy question in IT1, it is not enough data to answer the medication classes question posed in Iteration 2. ATC codes, from the Anatomical Therapeutic Chemical Classification System, have enough resolution to specify not only the drug substance but also its drug class.

Since the E2E document did not come with ATC codes, we would have to find a method of adding that type of data into the record to satisfy IT2. The questions become: where in the data flow do we supplement the record with ATC codes, how would we bring in this new external data, and which option is best given the current constraints of maintaining privacy and keeping the resources we use open source.

###Where to Enrich

After some discussion on the scoophealth mailing list, there were four major options to consider for where in the pipeline to add the ATC data into the patient records:

  1. Add at the EMR
  2. Add at the End Point
  3. Add at the Hub
  4. Add at the Researcher

These four options each have their advantages and disadvantages outlined below.

1 - EMR

Here, the EMR would export the ATC data within the E2E export. Given that the E2E specification allows this (and it does), only minor gateway changes would be required. Of course, both the DIN and ATC have to be stored in the gateway database, but emitting ATC categories is not difficult.
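
As a rough illustration of how that could look, the sketch below uses the CDA-style `translation` pattern (E2E is CDA-based) to carry an ATC code alongside the DIN on a medication's coded element. The element layout, code-system identifiers, and code values are placeholders for illustration, not taken from the E2E specification.

```python
# Illustrative sketch only: one plausible way for an E2E (CDA-based) export to
# carry an ATC code alongside the DIN, using the CDA "translation" pattern.
# OIDs, element layout, and code values are placeholders, not taken from the
# E2E specification.
import xml.etree.ElementTree as ET

def medication_code(din, atc):
    # Primary coding: the DIN the EMR already emits.
    code = ET.Element("code", {
        "code": din,
        "codeSystem": "DIN-OID-PLACEHOLDER",
        "displayName": "Drug Identification Number",
    })
    # Additional coding: the ATC classification riding alongside it.
    ET.SubElement(code, "translation", {
        "code": atc,
        "codeSystem": "ATC-OID-PLACEHOLDER",
        "displayName": "ATC classification",
    })
    return code

# Example values only.
print(ET.tostring(medication_code("01234567", "N02AA01"), encoding="unicode"))
```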

Advantages

  • Minimal changes to the Gateway and Hub. The data enrichment step is reduced to merely adding more data into the E2E document, which is easy to do.
  • ATC codes already exist in our EMR (OSCAR) so we would simply be leveraging that piece of information.

Disadvantages

  • We are in effect changing the E2E output, or, in the worst case, breaking the specification. This will also create problems when we expand to other types of EMRs beyond OSCAR, because there is no guarantee that a new EMR system will already store both DIN and ATC codes.

2 - End Point

Here, the Endpoint would have to be modified to have the ability to map DIN to ATC. This can be done passively or actively.

  1. Passive - For every record entering the gateway, the gateway reads each DIN, looks up the corresponding ATC code, and stores both in the DB. This may have scalability issues with a large enough data set.
  2. Active - The DIN to ATC lookup is performed only when a query requires that information. It will, however, slow down the response time of the query result.
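
To make the trade-off concrete, here is a minimal sketch of both strategies in Python. The `din_to_atc` table, record layout, and field names are hypothetical stand-ins, not actual Gateway code.

```python
# Minimal sketch of the two Endpoint strategies. The din_to_atc table, record
# layout, and field names are hypothetical stand-ins, not actual Gateway code.
din_to_atc = {"01234567": "N02AA01"}   # example values only

def store_record_passive(record, db):
    """Passive: enrich on ingest, paying the compute and storage cost up front."""
    for med in record["medications"]:
        med["atc"] = din_to_atc.get(med["din"])   # stored alongside the DIN
    db.append(record)                             # db is just a list here

def query_active(db, atc_prefix):
    """Active: resolve DIN -> ATC only when a query asks, paying at query time."""
    matches = []
    for record in db:
        for med in record["medications"]:
            atc = din_to_atc.get(med["din"], "")
            if atc.startswith(atc_prefix):
                matches.append(record)
                break
    return matches
```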

Advantages

  • We do not need to touch the IT1 E2E exporter from the EMR. We know we have an exporter that works, so we can remain confident that it continues to work if there are no changes.

Disadvantages

  • The workload on the gateway increases with either method.
  • Passive mapping creates extra computation and storage (the ATC is calculated and stored in parallel with the DIN), while active mapping at query time slows down the return of the query result to the Hub.

3 - Hub

Here, the Hub could (at the UI level) allow the Question Encoder to pick classes, etc. When the question is submitted, it could translate the categories/classes into the specific codes to search for: all opioids get listed by DIN, for example. The endpoint is unaware of the coding schemes.
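
A minimal sketch of that idea, assuming a hypothetical class-to-DIN table held at the Hub: the encoder's chosen class is expanded into an explicit DIN list before the query is sent, so the Endpoint only ever sees plain DINs.

```python
# Sketch of Hub-side expansion: the Question Encoder picks a drug class and the
# Hub translates it into an explicit DIN list before the query goes out. The
# mapping table is hypothetical; a real one would come from the DPD or Drugref.
atc_class_to_dins = {
    "N02A": ["01234567", "07654321"],   # example: opioids -> example DINs
}

def expand_query(selected_class):
    dins = atc_class_to_dins.get(selected_class, [])
    # The query shipped to the Endpoints is just a list of DINs;
    # the Endpoints never see the ATC class itself.
    return {"medication_dins": dins}

print(expand_query("N02A"))
```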

Advantages

  • Endpoint does not need to know about the coding schemes. It only has to return them.

Disadvantages

  • The queries end up being a list of coded elements (i.e. the hub breaks down the class).
  • It may be difficult to answer questions like which ATC classes are most prevalent.

4 - Researcher

Here, the researcher would have to perform the DIN to ATC mapping in some form. There are two main ways to go about this:

  1. The user writes a custom query that categorizes the DIN information in the output, spelling out an explicit list of DINs and the ATC categories they fall into.
  2. The user simply gets a count of all the existing DINs and how many times each shows up, then performs the DIN to ATC mapping manually (or via automation) to get the answer.
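
The second approach amounts to a post-hoc roll-up on the researcher's side. The sketch below assumes a hypothetical DIN-to-ATC lookup table and example counts; it simply aggregates the returned DIN frequencies into ATC class counts.

```python
# Sketch of the researcher-side roll-up: aggregate the DIN frequencies returned
# by the system into ATC class counts. The lookup table and counts are example
# values only.
from collections import Counter

din_to_atc = {"01234567": "N02AA01", "07654321": "N02AB03"}

def rollup(din_counts):
    class_counts = Counter()
    for din, count in din_counts.items():
        atc = din_to_atc.get(din)
        if atc:
            # Group at the ATC pharmacological-subgroup level (first 4 characters).
            class_counts[atc[:4]] += count
    return class_counts

print(rollup({"01234567": 12, "07654321": 5}))
```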

Advantages

  • IT1's design already supports this functionality. Only the query needs to be changed.

Disadvantages

  • This puts a very heavy load on the user: either they write a very large query to handle this, or something is needed to allow the DIN to ATC mapping to occur after the DIN list is retrieved.
  • If the system just outputs a list of DINs and their frequencies, this may provide more information than we would like - we want to know which categories are most frequent, not which drugs make up the most frequent categories.

###External Data Sources

Enriching incoming data sources is great because it adds value. However, we also need to make sure the sources we are drawing from are reliable and standardized.

The problem here becomes identifying what source we wish to use for performing the DIN to ATC mapping. Currently we have two reliable options.

  1. Canada's DPD - this is the list of DIN and ATC codes issued by the Canadian government, updated about once a month. For all intents and purposes, this is the end-all, be-all resource for the mapping.
  2. Drugref2/3 - this is the library that our EMR (OSCAR) uses to draw its DIN and ATC data from. Drugref draws its data from Canada's DPD, so there is a level of indirection. It is updated on a semi-regular basis.
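
Either source ultimately boils down to a DIN-to-ATC lookup table. The sketch below shows one way such a table could be built by joining two files of the kind found in a DPD-style extract on a shared drug-code key; the file and column names are assumptions for illustration, not verified against the actual DPD layout.

```python
# Illustrative only: build a DIN -> ATC lookup by joining two CSV-style files
# on a shared drug-code key. File and column names are assumptions about a
# DPD-like extract, not verified against the actual DPD format.
import csv

def build_din_to_atc(drug_file, ther_file):
    # Map the extract's internal drug code to its DIN.
    code_to_din = {}
    with open(drug_file, newline="") as f:
        for row in csv.DictReader(f):
            code_to_din[row["DRUG_CODE"]] = row["DIN"]

    # Join the therapeutic-class file on the same drug code to attach the ATC.
    din_to_atc = {}
    with open(ther_file, newline="") as f:
        for row in csv.DictReader(f):
            din = code_to_din.get(row["DRUG_CODE"])
            if din:
                din_to_atc[din] = row["ATC_NUMBER"]
    return din_to_atc
```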

The problem with bringing in any external data source is whether we can rely on the new source or not. In our case, the EMR depends on a local copy of Drugref which may or may not be up to date. This can cause problems: once there are multiple practices with multiple EMRs and paired Gateways, the Drugref sources in each practice may differ from one another, which can lead to conflicts in data quality and integrity.

Because of this, we need to consider how to make all the involved EMRs draw from the same data source, as well as how to pick a data source that allows us to enrich the incoming E2E data.

  1. It could be possible to force all EMRs to draw from a centralized Drugref that is under SCOOP control. This would let us know with relative confidence that all data being saved into the E2E document carries the same type of information, and we would not need to worry about time-based discrepancies.
  2. We could allow the EMRs to have their own Drugrefs, and thus allow discrepancies to show up. To correct for that, we would instead control the mapping enrichment at the Gateway. In this case, all Gateways would be under SCOOP control, and we would force them all to use the same mapping source. This leaves the EMR untouched, makes the Gateway the final say on which data is correct, and allows us to measure incoming data quality.
  3. We could combine approaches 1 and 2 and make both the EMRs and the Gateways use the same external source. With this option, we know the data will be consistent because everything draws from the same resource, and that resource would be under SCOOP control.

There are a few ways to approach where to draw the data from. Data quality becomes a very prominent issue, so the sources we choose to draw from are a very important aspect of the architecture.

##Design Decision

At least for Iteration 2, since we wanted a relatively rapid cycle, we decided to go with option 1 - drawing the ATC data directly from the EMR. These were the reasons behind the decision:

  • It was proven to be easy to implement and deploy
  • The ATC data exists within the EMR already
  • This would allow us to later add data quality verification steps into future iterations and have something to compare against
  • The other options were significantly more involved and would take a decent amount of time to implement
  • It would for the time being allow us to draw from a "reliable" source. In this case, we can currently use the assumption that all data coming from the EMR is "correct"
  • There would be minimal changes to hQuery's Gateway and Hub. As we are still learning about the tool, small changes are easier to deal with than large changes.

##Notes

By no means will the data enrichment question disappear. If anything, this will be a recurring topic as we expand and continue our development. Data enrichment will appear again and again because it is inevitable that data coming from a source may not be sufficient to solve the problem posed. By enriching the data, you are able to expand the quality of the data being used and can solve more problems, but one must be careful in choosing the right sources of data to draw from.

We fully expect to see this question reappear in a different form in subsequent iterations, and we will have to weigh the advantages and disadvantages of each potential approach as we encounter them.
