[ResponseOps] research making alerting more observable in cloud deployments #124382

Closed · pmuellr opened this issue Feb 2, 2022 · 5 comments
Labels: Feature:Alerting/RulesFramework · research · Team:ResponseOps

Comments

pmuellr (Member) commented Feb 2, 2022

from [Meta][ResponseOps] An alerting rule can negatively impact the alerting system and the overall Kibana health #119653

Since Cloud already surfaces a bit of diagnostic information for customers, and even more for Elastic support engineers, it seems worth an effort to see whether we could make alerting a little more visible there. For example, on this customer-viewable page:
(screenshot: customer-viewable Cloud deployment diagnostics page)

This would obviously involve some work on the Cloud side to make the new data visible. Then we'd need to figure out what we want to expose, and make sure it's available.

So, lots of research to do.

Some other thoughts:

  • could we make the event log searchable by support engineers? Or even just downloadable? Seems like there are PII issues there ... (see the query sketch after this list)
  • can we extend the Kibana support diagnostics tool to provide more interesting data (presumably "problem"-related)?
  • provide some API which would have alerting dump more log data into the Kibana log, presumably time-boxed ("be verbose for the next 10 minutes")
  • perhaps alerting could log a summary every hour of "interesting" things it's found
  • could we extend the Kibana health API to provide some of this info?
  • could we get the event log docs indexed directly into the Kibana log, so they'd be more visible?
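A minimal sketch of that kind of event log search, assuming the .kibana-event-log-* index pattern and the event.provider / event.action / event.outcome fields the event log plugin writes (exact field names can vary by stack version):

```
# failed rule executions over the last hour, newest first
GET .kibana-event-log-*/_search
{
  "size": 20,
  "sort": [{ "@timestamp": "desc" }],
  "query": {
    "bool": {
      "filter": [
        { "term": { "event.provider": "alerting" } },
        { "term": { "event.action": "execute" } },
        { "term": { "event.outcome": "failure" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "_source": ["@timestamp", "rule.id", "rule.name", "error.message", "event.duration"]
}
```
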
elasticmachine (Contributor) commented

Pinging @elastic/response-ops (Team:ResponseOps)

stefnestor (Contributor) commented

> could we make the event log searchable by support engineers? Or even just downloadable? Seems like there are PII issues there ...

System indices are searchable-ish (with redactions, and only where justified); where data is redacted, we provide users the queries to pull it themselves. See e.g. my recent redundant PR to make sure this was externally documented: #122613.

> provide some API which would have alerting dump more log data into the Kibana log, presumably time-boxed ("be verbose for the next 10 minutes")

If we're willing to consider UI build-out, the SIEM > Rules > Rule Monitoring tab has been helpful in SIEM cases for discovering expensive rules, and could be expanded into Stack Management > Alerts and Insights > Rules and Connectors > Rules. This would help users self-diagnose rather than requiring support, though both could use it.

> could we get the event log docs indexed directly into the Kibana log, so they'd be more visible?

I'd taken a note of what I thought would answer this (which may be wrong): use your event log with xpack.eventLog.logEntries: true in kibana.yml, which would emit events into the Kibana server log on top of indexing them, no? Found via #72058, but maybe that isn't intended to carry forward.

> perhaps alerting could log a summary every hour of "interesting" things it's found

Could we consider a small cut of enabling by default (or recommending users set up) slow logs on .kibana-event-log* via template, say at event.duration thresholds of warn:30s / error:1min? As Kibana degrades it may just fill up, but reviewing initial failures is exactly the ballpark I currently introspect, and this would make it easier for on-prem users to see events across time while keeping log volume down.
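For illustration, a rough sketch of the kind of query those thresholds map to, assuming event.duration is recorded in nanoseconds (so 30s = 30,000,000,000) and that rule.id is available in the event log mapping for the stack version in question:

```
# rule executions over the last 24h that took longer than 30s,
# grouped by rule, with each rule's worst-case duration
GET .kibana-event-log-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "event.provider": "alerting" } },
        { "term": { "event.action": "execute" } },
        { "range": { "event.duration": { "gte": 30000000000 } } },
        { "range": { "@timestamp": { "gte": "now-24h" } } }
      ]
    }
  },
  "aggs": {
    "slow_rules": {
      "terms": { "field": "rule.id", "size": 10 },
      "aggs": {
        "max_duration_ns": { "max": { "field": "event.duration" } }
      }
    }
  }
}
```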

stefnestor (Contributor) commented

👋🏼 IMO, before improving things here, I'd like to recommend this mitigation: elastic/support-diagnostics#578. Either Kibana enables a single endpoint to pull all Kibana (SIEM) rules, or we extend the diagnostic to be able to read from all spaces.

mikecote (Contributor) commented

cc @shanisagiv1

pmuellr (Member, Author) commented Jan 3, 2023

During a backlog grooming session we decided to close this issue, primarily due to the serverless effort, where things like this will presumably be changing anyway ...
