[ResponseOps] research making alerting more observable in cloud deployments #124382

Closed · pmuellr opened this issue Feb 2, 2022 · 5 comments
Labels: Feature:Alerting/RulesFramework · research · Team:ResponseOps

Comments

pmuellr (Member) commented Feb 2, 2022

from [Meta][ResponseOps] An alerting rule can negatively impact the alerting system and the overall Kibana health #119653

Since Cloud already surfaces a bit of diagnostic information for customers, and even more for Elastic support engineers, it seems worth an effort to see whether we could make alerting a little more visible there. For example, on this customer-viewable page:
(screenshot: customer-viewable Cloud deployment diagnostics page)

This would obviously involve some work on the Cloud side to make the new data visible. Then we'd need to figure out what we want to expose, and make sure it's available.

So, lots of research to do.

Some other thoughts:

  • could we make the event log searchable by support engineers? Or even just downloadable? Seems like there are PII issues there ... (see the query sketch after this list)
  • can we extend the Kibana support diagnostics tool to provide more interesting data (presumably "problem"-related)?
  • provide some API which would have alerting dump more log data into the Kibana log, presumably time-boxed ("be verbose for the next 10 minutes")
  • perhaps alerting could log a summary every hour of "interesting" things it's found
  • could we extend the Kibana health API to provide some of this info?
  • could we get the event log docs indexed directly into the Kibana log, so they'd be more visible?
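A minimal sketch of that kind of event log search, assuming the .kibana-event-log-* index pattern and the event.provider / event.action / event.outcome fields the event log plugin writes (exact field names can vary by stack version):

```
# failed rule executions over the last hour, newest first
GET .kibana-event-log-*/_search
{
  "size": 20,
  "sort": [{ "@timestamp": "desc" }],
  "query": {
    "bool": {
      "filter": [
        { "term": { "event.provider": "alerting" } },
        { "term": { "event.action": "execute" } },
        { "term": { "event.outcome": "failure" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "_source": ["@timestamp", "rule.id", "rule.name", "error.message", "event.duration"]
}
```
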
elasticmachine (Contributor) commented

Pinging @elastic/response-ops (Team:ResponseOps)

stefnestor (Contributor) commented

> could we make the event log searchable by support engineers? Or even just downloadable? Seems like there are PII issues there ...

System indices are searchable-ish (with redactions, and only where justified); where data is redacted, we provide users the queries to pull it themselves. See e.g. my recent redundant PR to make sure this was externally documented: #122613.

> provide some API which would have alerting dump more log data into the Kibana log, presumably time-boxed ("be verbose for the next 10 minutes")

If we're willing to consider UI build-out, the SIEM > Rules > Rule Monitoring tab has been helpful in SIEM cases for discovering expensive rules, and could be expanded into Stack Management > Alerts and Insights > Rules and Connectors > Rules. This would help users self-diagnose rather than requiring support, though both could use it.

> could we get the event log docs indexed directly into the Kibana log, so they'd be more visible?

I'd taken a note of what I thought would answer this (which may be wrong): use your event log with xpack.eventLog.logEntries: true in kibana.yml, which would emit events into the Kibana server log on top of indexing them, no? Found via #72058, but maybe that isn't intended to carry forward.

> perhaps alerting could log a summary every hour of "interesting" things it's found

Could we consider a small cut of enabling by default (or recommending users set up) slow logs on .kibana-event-log* via template, say at event.duration thresholds of warn:30s / error:1min? As Kibana degrades it may just fill up, but reviewing initial failures is exactly the ballpark I currently introspect, and this would make it easier for on-prem users to see events across time while keeping log volume down.
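For illustration, a rough sketch of the kind of query those thresholds map to, assuming event.duration is recorded in nanoseconds (so 30s = 30,000,000,000) and that rule.id is available in the event log mapping for the stack version in question:

```
# rule executions over the last 24h that took longer than 30s,
# grouped by rule, with each rule's worst-case duration
GET .kibana-event-log-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "event.provider": "alerting" } },
        { "term": { "event.action": "execute" } },
        { "range": { "event.duration": { "gte": 30000000000 } } },
        { "range": { "@timestamp": { "gte": "now-24h" } } }
      ]
    }
  },
  "aggs": {
    "slow_rules": {
      "terms": { "field": "rule.id", "size": 10 },
      "aggs": {
        "max_duration_ns": { "max": { "field": "event.duration" } }
      }
    }
  }
}
```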

stefnestor (Contributor) commented

👋🏼 IMO, before improving things here, I'd like to recommend this mitigation: elastic/support-diagnostics#578. Either Kibana enables a single endpoint to pull all Kibana (SIEM) rules, or we extend the diagnostic to be able to read from all spaces.

mikecote (Contributor) commented

cc @shanisagiv1

pmuellr (Member, Author) commented Jan 3, 2023

During a backlog grooming session we decided to close this issue, primarily due to the serverless effort, where things like this will presumably be changing anyway ...
