[Requirements] Data Incident Management #15

mabdh · 2022-08-29T06:12:35Z

⚠️ Problems

A data incident is an event in which a problem with any aspect of the data can affect business outcomes. In a fast-paced environment, the conditions under which data is generated, transformed, and analyzed change rapidly, which makes data incidents almost unavoidable.

Knowing the root cause, impact, and recurrence of data incidents helps us mitigate them, learn from them, and avoid them in the future. Proper data incident management is needed to achieve this.

Problem 1

Several data-related issues are identified and reported by the users themselves. This gives a bad user experience and could lead to other issues that affect business processes (e.g. blocked user productivity, stale data being served, etc).

There are two ways to solve this: remediation and avoidance. To remediate, we need a system that detects data issues as early as possible and notifies the responsible team, so the fix can be executed early and the Mean Time To Respond is reduced. To avoid some issues, users should also be notified of important data-related events (e.g. schema changes) so they stay up to date with specific data changes.

Problem 2

Several data-related issues are sometimes not tracked in a centralized way. This scatters the knowledge gained during an incident and reduces the team's awareness of the lessons learned. Issues are then likely to recur if the lessons are not analyzed and shared properly with other teams (especially with new joiners who have just been onboarded).

We need a system that organizes all incidents properly so every team member can revisit and learn from them. The hope is that an incident will not happen again or, if it is unavoidable, that the team already has a proper plan to mitigate it.

🧠 Assumptions

Data Incident Management

Data Incident Management is the end-to-end solution covering observability, alerts, and an incident catalog. Observability covers both the data itself and the systems that process the data. The Data Incident Management process has 4 main steps:

🔎 Observe
- We need to define a Signal to observe.
- For observability, we could follow the CNCF standard OpenTelemetry.
✔️ Detect
- We define specific threshold rules that translate into an Alert. An Incident does not necessarily have to be generated from an Alert, but the Signals collected during observation are aggregated and could lead to an Alert, and an Incident will be generated if the computed Signals exceed a specific threshold.
- SLO and alert policy definition could follow OpenSLO.
🔔 Notify
- Some important events, e.g. schema changes, might produce just an INFO and not an Alert.
- A Notification could be sent to one or more Channels. Users could subscribe to a specific alert through one or several Channels.
- Once an incident is generated, it will trigger an Alert that would generate several Notifications.
💻 Analyze
- The details of alerts and incidents can be revisited, and further details about an incident can be added for analysis and lessons learned.
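
To make the Detect step more concrete, here is a minimal sketch of how aggregated Signals could be compared against a threshold rule to produce an Alert. The names `Signal`, `ThresholdRule`, `Alert`, `detect`, and the `freshness_lag_min` signal are all hypothetical and only serve as an illustration; the actual rule format is expected to follow OpenSLO.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Signal:
    """One observed measurement (hypothetical shape)."""
    name: str
    value: float


@dataclass(frozen=True)
class ThresholdRule:
    """Hypothetical rule: aggregate one signal and compare it to a threshold."""
    signal_name: str
    aggregate: Callable[[Sequence[float]], float]
    threshold: float


@dataclass(frozen=True)
class Alert:
    rule: ThresholdRule
    computed: float


def detect(signals: Sequence[Signal], rule: ThresholdRule) -> Optional[Alert]:
    """Return an Alert when the aggregated signal exceeds the rule's threshold."""
    values = [s.value for s in signals if s.name == rule.signal_name]
    if not values:
        return None  # nothing observed for this rule, so nothing to evaluate
    computed = rule.aggregate(values)
    return Alert(rule, computed) if computed > rule.threshold else None


# Example: three freshness-lag observations, averaged and compared to 30 minutes.
observed = [Signal("freshness_lag_min", v) for v in (20.0, 45.0, 70.0)]
rule = ThresholdRule("freshness_lag_min", aggregate=mean, threshold=30.0)
alert = detect(observed, rule)
```

In this sketch the aggregation function is a parameter of the rule, so the same `detect` function could cover averages, maxima, or counts without changes.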

Not all Alerts will go through all 4 steps mentioned above. There are 3 types of Alert, based on which steps apply:

Standard alerts: All 4 steps will be applicable
Data changes: Last 2 steps will be applicable
Manually reported: Only the last step will be applicable
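
The mapping between the 3 alert types and the 4 steps could be expressed as a small lookup table. This is only a sketch; the enum names `Step` and `AlertType` and the table `APPLICABLE_STEPS` are assumptions made for illustration.

```python
from enum import Enum


class Step(Enum):
    OBSERVE = 1
    DETECT = 2
    NOTIFY = 3
    ANALYZE = 4


class AlertType(Enum):
    STANDARD = "standard"
    DATA_CHANGE = "data_change"
    MANUALLY_REPORTED = "manually_reported"


# Enum members iterate in definition order, so this is Observe..Analyze.
ALL_STEPS = list(Step)

APPLICABLE_STEPS = {
    AlertType.STANDARD: ALL_STEPS,                # all 4 steps
    AlertType.DATA_CHANGE: ALL_STEPS[-2:],        # last 2 steps
    AlertType.MANUALLY_REPORTED: ALL_STEPS[-1:],  # only the last step
}
```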

Issues Classification

Systems (Data System)
- Response time
- and more... (TBA)
Data-related
- Data quality
- Cloud-data service performance
- User reported issue
- and more... (TBA)

📜 Requirements

Capability to define alert policy
- User Story: User could create alert policies and notification channels
- Importance: Must Have
- Notes:
  - Use OpenSLO format to define.
  - Need to consider these features:
    - Grouping - categorizing alerts of a similar nature into a single notification.
    - Inhibition - suppressing notifications for certain alerts if certain other alerts are already firing.
    - Silences - muting alerts for a given time.
  - Separate alerts (critical) and info notifications.
  - Responsibility of alerts (cortex alert manager?).
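
The three features in the notes above (grouping, inhibition, silences) can be sketched together as a single notification-planning step. This is an illustration under stated assumptions, not a proposal for the final design: the `dataset` label, the two severity values, and the rule "a critical alert inhibits warnings on the same dataset" are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Alert:
    name: str
    severity: str  # "critical" or "warning" in this sketch
    dataset: str   # hypothetical label used for grouping and inhibition


@dataclass(frozen=True)
class Silence:
    alert_name: str
    start: float   # any consistent clock, e.g. epoch seconds
    end: float

    def mutes(self, alert: Alert, now: float) -> bool:
        return alert.name == self.alert_name and self.start <= now < self.end


def is_inhibited(alert: Alert, firing: Sequence[Alert]) -> bool:
    """A warning is suppressed while a critical alert fires for the same dataset."""
    if alert.severity != "warning":
        return False
    return any(o.severity == "critical" and o.dataset == alert.dataset for o in firing)


def plan_notifications(
    firing: Sequence[Alert], silences: Sequence[Silence], now: float
) -> Dict[str, List[str]]:
    """Return one notification per dataset, listing the alert names it carries."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for alert in firing:
        if any(s.mutes(alert, now) for s in silences):
            continue  # silenced for the given time window
        if is_inhibited(alert, firing):
            continue  # a more severe alert already covers this dataset
        groups[alert.dataset].append(alert.name)  # grouping by dataset
    return dict(groups)


firing = [
    Alert("pipeline_down", "critical", "orders"),
    Alert("stale_data", "warning", "orders"),
    Alert("row_count_drop", "warning", "users"),
    Alert("null_ratio_high", "warning", "users"),
    Alert("schema_drift", "warning", "payments"),
]
silences = [Silence("schema_drift", start=0, end=100)]
plan = plan_notifications(firing, silences, now=50)
```

Here `stale_data` is inhibited by `pipeline_down`, `schema_drift` is silenced, and the two `users` alerts are grouped into one notification.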
Capability to silence the alerts
- User Story: User could silence triggered alerts
- Importance: Must Have
- Notes:
Capability to manually trigger an alert/notification and subscribe to an alert
- User Story: User could create/trigger an alert and notify the subscribers through all channels
- Importance: Must Have
- Notes:
  - Need to define which channels are currently supported and which channels would be supported in the near future.
  - Figure out what identifier should be used to link with the defined alert.
    - So far we are defining receivers and label matchers. Each alert will have labels, and receivers whose matchers match those labels will get notifications.
  - Might want to separate alerts (critical) and info notifications based on their criticality/severity.
  - Subscription:
    - alert to subscribe
      - one alert or multiple
    - channel receivers
      - one channel or multiple receivers
    - could be individual or teams
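
The receiver and label matcher idea from the notes above could look roughly like this. It is a minimal sketch: the `Receiver` type, the exact-equality matching rule, and the channel names `slack`, `pagerduty`, and `email` are assumptions for illustration, since the supported channels are still to be defined.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Receiver:
    """A subscriber (individual or team) reachable through one channel."""
    name: str
    channel: str               # hypothetical channel identifier
    matchers: Dict[str, str]   # label name -> required label value


def matches(receiver: Receiver, alert_labels: Dict[str, str]) -> bool:
    """True when every matcher equals the alert's label of the same name.

    A receiver with no matchers matches every alert in this sketch.
    """
    return all(alert_labels.get(k) == v for k, v in receiver.matchers.items())


def route(alert_labels: Dict[str, str], receivers: Sequence[Receiver]) -> List[str]:
    """Return the names of the receivers that should be notified."""
    return [r.name for r in receivers if matches(r, alert_labels)]


receivers = [
    Receiver("data-platform-team", "slack", {"team": "data-platform"}),
    Receiver("oncall-critical", "pagerduty", {"severity": "critical"}),
    Receiver("analyst-jane", "email", {"dataset": "orders", "severity": "info"}),
]
labels = {"team": "data-platform", "severity": "critical", "dataset": "orders"}
notified = route(labels, receivers)
```

With this shape, a subscription to one or multiple alerts is just a receiver whose matchers select those alerts' labels.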
Capability to track alert & subscription policy changes
- User Story: System could track alert & subscription policy changes
- Importance: Must Have
- Notes:
  - Something like audit history
Capability to define SLI and SLO
- User Story: User could create an SLO that consists of alert policies
- Importance: Nice to Have
- Notes:
  - Use OpenSLO format to define.
  - SLO is a superset of alert policies.
  - An SLI is a kind of signal read from the data source.
  - Higher abstraction of signals.
Capability to observe and auto detect possible data-related issues
- User Story: User is able to define the provider/data source from which to observe signals and use the defined alert policy to auto-trigger an alert if the alert condition is met
- Importance: Nice to Have
- Notes:
  - Data-related issues = some issues that are not system issues (e.g. Data quality, BQ slots performance, User reported issue).
  - Can be push-based/pull-based.
  - How to observe and detect.
  - Not all cases can be covered by automatically generated alerts.
Capability for a system to generate an incident
- User Story: User could generate an incident
- Importance: Must Have
- Notes:
  - Define incident data model and features.
  - Can start with manually generating an incident out of an alert/SLA breach.
  - Classification of issues (like tags).
  - Feature details will be discussed later.
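
As a starting point for the discussion, a very small incident data model might look like the sketch below. Every field and function name here (`Incident`, `source_alert`, `tags`, `notes`, `incident_from_alert`, `find_by_tag`) is hypothetical; the actual data model and features are still to be defined.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Sequence


@dataclass
class Incident:
    title: str
    source_alert: Optional[str] = None             # alert the incident came from, if any
    tags: List[str] = field(default_factory=list)  # issue classification
    notes: List[str] = field(default_factory=list) # details added during analysis
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def incident_from_alert(alert_name: str, tags: Sequence[str]) -> Incident:
    """Manually generate an incident out of an alert, with de-duplicated tags."""
    return Incident(
        title=f"Incident from alert: {alert_name}",
        source_alert=alert_name,
        tags=sorted(set(tags)),
    )


def find_by_tag(incidents: Sequence[Incident], tag: str) -> List[Incident]:
    """Browse incidents by classification tag."""
    return [i for i in incidents if tag in i.tags]


incident = incident_from_alert("stale_orders_table", ["freshness", "data-quality", "freshness"])
incident.notes.append("Upstream job was paused during a schema migration.")
```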
Capability for a system to revisit/browse all incidents
- User Story: User could revisit a specific incident
- Importance: Must Have
- Notes:
  - Classification of issues (like tags).
Capability for a system to add details on the related incident
- User Story: User could add details to a specific incident
- Importance: Nice to Have
- Notes:
  - Enrichment features.
