A data incident is an event in which a problem in any aspect of the data can affect business outcomes. In a fast-paced environment, the conditions under which data is generated, transformed, and analyzed change rapidly, which makes data incidents almost unavoidable.
Knowing the root cause, impact, and recurrence of data incidents helps us mitigate them, learn from them, and avoid them in the future. Proper data incident management is needed to achieve this.
Problem 1
Several data-related issues are identified and reported by the users themselves. This gives a bad user experience and can lead to other issues that affect business processes (e.g. blocking user productivity, serving stale data, etc.).
There are two ways to solve this: remediate and avoid. To remediate, we need a system that detects data issues as early as possible and notifies the responsible team, so that the fix can be executed early and the Mean Time To Respond is reduced. To avoid some issues, users should also be notified of important data-related events (e.g. schema changes) so that they stay up to date with specific data changes.
Problem 2
Several data-related issues are sometimes not tracked in a centralized way. This scatters the knowledge gained during an incident and reduces the team's awareness of what was learned. In this case, the issues are likely to recur if the learnings are not analyzed and shared properly with other teams (especially with new joiners who have just onboarded).
We need a system that organizes all incidents properly so that every team member can revisit and learn from them. The aim is that the incident does not happen again or, if it is unavoidable, that the team already has a proper plan to mitigate it.
🧠 Assumptions
Data Incident Management
Data Incident Management is an end-to-end solution covering observability, alerts, and an incident catalog. Observability covers both the data itself and the systems that process the data. Data Incident Management has 4 main steps:
🔎 Observe
We need to define a Signal to observe.
For observability, we could follow the CNCF standard OpenTelemetry.
✔️ Detect
We define specific threshold rules that translate into an Alert. Although an Incident does not necessarily have to be generated from an Alert, the Signals collected during observation are aggregated and can lead to an Alert, and an Incident will be generated if the computed Signals exceed a specific threshold.
SLO and alert policy definition could follow OpenSLO.
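The Detect step above can be sketched as a rule that aggregates collected Signals and compares the result to a threshold. This is only a minimal illustration, not a proposed implementation: `ThresholdRule`, `evaluate`, the rule name, and the sample values are all hypothetical, and the sketch does not use the OpenSLO format.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass(frozen=True)
class ThresholdRule:
    """Hypothetical rule: aggregate the Signals, then compare to a threshold."""
    name: str
    aggregate: Callable[[Sequence[float]], float]
    threshold: float


def evaluate(rule: ThresholdRule, signals: Sequence[float]) -> bool:
    """Return True when the aggregated Signals exceed the rule's threshold."""
    if not signals:
        # No Signals collected yet, so there is nothing to alert on.
        return False
    return rule.aggregate(signals) > rule.threshold


# Assumed example: alert when the average ratio of stale rows is above 5%.
rule = ThresholdRule(name="stale_rows_ratio", aggregate=mean, threshold=0.05)
fires = evaluate(rule, [0.02, 0.09, 0.10])  # the average is above 0.05
```

Keeping the aggregation function as a parameter lets the same rule shape cover averages, maxima, or counts without changing the evaluation logic.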
🔔 Notify
Some events are just an INFO and not an Alert. They are caused by important events, e.g. schema changes.
A Notification could be sent to one or more Channels. Users could subscribe to a specific alert through one or several Channels. Once an incident is generated, it will trigger an Alert that generates several Notifications.
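The relationship between Alerts, Channels, and Notifications described above could look roughly like the following sketch. The `Notifier` class, its methods, and the alert and channel names are hypothetical and only illustrate the fan-out of one Alert into one Notification per subscribed Channel.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class Notifier:
    """Hypothetical in-memory registry of alert subscriptions per Channel."""

    def __init__(self) -> None:
        # Maps an alert name to the set of Channel names subscribed to it.
        self._subscriptions: Dict[str, Set[str]] = defaultdict(set)

    def subscribe(self, alert_name: str, channel: str) -> None:
        self._subscriptions[alert_name].add(channel)

    def notify(self, alert_name: str, message: str) -> List[Tuple[str, str]]:
        """Build one (channel, message) Notification per subscribed Channel."""
        channels = self._subscriptions.get(alert_name, set())
        return [(channel, message) for channel in sorted(channels)]


notifier = Notifier()
notifier.subscribe("schema_change", "slack:#data-team")
notifier.subscribe("schema_change", "email:owner@example.com")
notifications = notifier.notify("schema_change", "Column `price` changed type")
```

A real system would deliver each Notification through the Channel's own transport. Here the method only returns the list so that the fan-out is easy to inspect.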
💻 Analyze
The details of alerts and incidents can be revisited, and further details about an incident can be added for analysis and lessons learned.
Not all Alerts go through all 4 steps above. There are 3 types of Alert, based on which steps apply:
Standard alerts: All 4 steps will be applicable
Data changes: Last 2 steps will be applicable
Manually reported: Only the last step will be applicable
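The three Alert types above can be written down as a simple mapping from type to applicable steps. The enum and variable names below are hypothetical. Only the step names and the type-to-step assignment come from the text.

```python
from enum import Enum
from typing import Dict, List


class Step(Enum):
    OBSERVE = 1
    DETECT = 2
    NOTIFY = 3
    ANALYZE = 4


class AlertType(Enum):
    STANDARD = "standard"
    DATA_CHANGE = "data_change"
    MANUALLY_REPORTED = "manually_reported"


# Which of the 4 steps each Alert type goes through, in order.
APPLICABLE_STEPS: Dict[AlertType, List[Step]] = {
    AlertType.STANDARD: [Step.OBSERVE, Step.DETECT, Step.NOTIFY, Step.ANALYZE],
    AlertType.DATA_CHANGE: [Step.NOTIFY, Step.ANALYZE],
    AlertType.MANUALLY_REPORTED: [Step.ANALYZE],
}
```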
Issues Classification
Systems (Data System): response time, and more... (TBA)
Data-related: data quality, cloud-data service performance, user reported issue, and more... (TBA)
📜 Requirements
Capability to define alert policy
User Story: User could create an alert policy and notification channels
An SLI is a kind of signal read from the data source.
Higher abstraction of signals.
Capability to observe and auto detect possible data-related issues
User Story: User is able to define the provider/data source from which to observe signals, and to use the defined alert policy to auto-trigger an alert when the alert condition is met
Importance: Nice to Have
Notes:
Data-related issues = some issues that are not system issues (e.g. Data quality, BQ slots performance, User reported issue).
Can be push-based/pull-based.
How to observe and detect.
Not all the cases could be covered to automatically generate alerts.
Capability for a system to generate an incident
User Story: User could generate an incident
Importance: Must Have
Notes:
Define incident data model and features.
Can start with manually generating an incident out of an alert/SLA breach.
Classification of issues (like tags).
Details of the feature will be discussed later.
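As a starting point for the discussion, one possible shape of the incident data model is sketched below. The data model has not been defined yet, so every field and function name here (`Incident`, `source_alert`, `tags`, `notes`, `incident_from_alert`) is an assumption made for illustration only.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Iterable, List, Optional, Set


@dataclass
class Incident:
    """Hypothetical incident record; all fields are assumptions."""
    title: str
    source_alert: Optional[str] = None  # None when reported manually
    tags: Set[str] = field(default_factory=set)  # issue classification
    notes: List[str] = field(default_factory=list)  # details added later
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def incident_from_alert(alert_name: str, tags: Iterable[str] = ()) -> Incident:
    """Manually generate an incident out of an alert or SLA breach."""
    return Incident(
        title=f"Incident for alert {alert_name}",
        source_alert=alert_name,
        tags=set(tags),
    )


incident = incident_from_alert("stale_rows_ratio", tags=["data-quality"])
incident.notes.append("Root cause: upstream job was delayed.")
```

Tags are kept as a plain set so that the issue classification can grow without schema changes. A fixed taxonomy could replace it once the classification is settled.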
Capability for a system to revisit/browse all incidents
User Story: User could revisit a specific incident
Importance: Must Have
Notes:
Classification of issues (like tags).
Capability for a system to add details on the related incident
User Story: User could revisit a specific incident