The config.RuntimeConfig struct ensures data integrity through its mu sync.RWMutex.
When creating lots of events in bulk, e.g., after loading multiple incidents on a fresh Icinga configuration, this mutex results in long locking times and high CPU usage.
I experienced this bug after starting with a fresh Icinga 2 configuration that included lots of firing checks.
After configuring a new notification channel, it still had not appeared after several minutes. Taking a look at the process, I saw huge CPU utilization and, after enabling debug logs, missing configuration reloads.
There is a delay of over four minutes between lines 144 and 220 of the log for a job that should be executed every second. Between those log messages, only Locking the RuntimeConfig's mutex mu takes place.
A small patch was applied to log mutex usage, the call origin, and timing.
Logging Patch
diff --git a/internal/config/runtime.go b/internal/config/runtime.go
index ad6dd2e..af70651 100644
--- a/internal/config/runtime.go
+++ b/internal/config/runtime.go
@@ -13,6 +13,7 @@ import (
 	"github.com/jmoiron/sqlx"
 	"go.uber.org/zap"
 	"golang.org/x/crypto/bcrypt"
+	"runtime/debug"
 	"strconv"
 	"strings"
 	"sync"
@@ -86,13 +87,19 @@ func (r *RuntimeConfig) PeriodicUpdates(ctx context.Context, interval time.Durat
 }
 
 // RLock locks the config for reading.
-func (r *RuntimeConfig) RLock() {
+func (r *RuntimeConfig) RLock(start time.Time) {
 	r.mu.RLock()
+	r.logger.Debugw("RLock",
+		zap.String("trace", string(debug.Stack())),
+		zap.Duration("took", time.Since(start)))
 }
 
 // RUnlock releases a lock obtained by RLock().
-func (r *RuntimeConfig) RUnlock() {
+func (r *RuntimeConfig) RUnlock(start time.Time) {
 	r.mu.RUnlock()
+	r.logger.Debugw("RUnlock",
+		zap.String("trace", string(debug.Stack())),
+		zap.Duration("took", time.Since(start)))
 }
 
 func (r *RuntimeConfig) GetRecipient(k recipient.Key) recipient.Recipient {
@@ -147,8 +154,9 @@ func (r *RuntimeConfig) GetContact(username string) *recipient.Contact {
 // This method returns either a *Source or a nil pointer and logs the cause to the given logger. This is in almost all
 // cases a debug logging message, except when something server-side is wrong, e.g., the hash is invalid.
 func (r *RuntimeConfig) GetSourceFromCredentials(user, pass string, logger *logging.Logger) *Source {
-	r.RLock()
-	defer r.RUnlock()
+	start := time.Now()
+	r.RLock(start)
+	defer r.RUnlock(start)
 
 	sourceIdRaw, sourceIdOk := strings.CutPrefix(user, "source-")
 	if !sourceIdOk {
@@ -220,8 +228,14 @@ func (r *RuntimeConfig) fetchFromDatabase(ctx context.Context) error {
 }
 
 func (r *RuntimeConfig) applyPending() {
+	lockStart := time.Now()
 	r.mu.Lock()
-	defer r.mu.Unlock()
+	r.logger.Debugw("Lock", zap.Duration("took", time.Since(lockStart)))
+
+	defer func() {
+		r.mu.Unlock()
+		r.logger.Debugw("Unlock", zap.Duration("took", time.Since(lockStart)))
+	}()
 
 	r.logger.Debug("applying pending configuration")
 	start := time.Now()
diff --git a/internal/config/verify.go b/internal/config/verify.go
index d6b7bf4..446ba5d 100644
--- a/internal/config/verify.go
+++ b/internal/config/verify.go
@@ -7,12 +7,14 @@ import (
 	"github.com/icinga/icinga-notifications/internal/recipient"
 	"github.com/icinga/icinga-notifications/internal/rule"
 	"github.com/icinga/icinga-notifications/internal/timeperiod"
+	"time"
 )
 
 // debugVerify performs a set of config validity/consistency checks that can be used for debugging.
 func (r *RuntimeConfig) debugVerify() error {
-	r.RLock()
-	defer r.RUnlock()
+	start := time.Now()
+	r.RLock(start)
+	defer r.RUnlock(start)
 
 	if r.Channels == nil {
 		return errors.New("RuntimeConfig.Channels is nil")
diff --git a/internal/incident/incident.go b/internal/incident/incident.go
index b0d60aa..a55f515 100644
--- a/internal/incident/incident.go
+++ b/internal/incident/incident.go
@@ -107,8 +107,9 @@ func (i *Incident) ProcessEvent(ctx context.Context, ev *event.Event, created bo
 	i.Lock()
 	defer i.Unlock()
 
-	i.runtimeConfig.RLock()
-	defer i.runtimeConfig.RUnlock()
+	start := time.Now()
+	i.runtimeConfig.RLock(start)
+	defer i.runtimeConfig.RUnlock(start)
 
 	tx, err := i.db.BeginTxx(ctx, nil)
 	if err != nil {
@@ -187,8 +188,9 @@ func (i *Incident) RetriggerEscalations(ev *event.Event) {
 	i.Lock()
 	defer i.Unlock()
 
-	i.runtimeConfig.RLock()
-	defer i.runtimeConfig.RUnlock()
+	start := time.Now()
+	i.runtimeConfig.RLock(start)
+	defer i.runtimeConfig.RUnlock(start)
 
 	if !i.RecoveredAt.IsZero() {
 		// Incident is recovered in the meantime.
@@ -636,8 +638,9 @@ func (i *Incident) processAcknowledgementEvent(ctx context.Context, tx *sqlx.Tx,
 
 // RestoreEscalationStateRules restores this incident's rules based on the given escalation states.
 func (i *Incident) RestoreEscalationStateRules(states []*EscalationState) {
-	i.runtimeConfig.RLock()
-	defer i.runtimeConfig.RUnlock()
+	start := time.Now()
+	i.runtimeConfig.RLock(start)
+	defer i.runtimeConfig.RUnlock(start)
 
 	for _, state := range states {
 		escalation := i.runtimeConfig.GetRuleEscalation(state.RuleEscalationID)
The logs can be downloaded zipped here. A first inspection of the new logs shows locking times of over four minutes. Taking a look at the (R)Lock origins reveals only three places. The identical counts for UpdateFromDatabase reflect that both are called one after another from UpdateFromDatabase, as stated above.
However, most locks originate in the Listener. A second look reveals that there are two different locking points:
awk '
/runtime-updates.*RLock.*ProcessEvent/ {
match($0, /"trace": "(.*)",/, m)
trace = ""
split(m[1], p, "\\\\n")
for (i = 6; i < length(p); i++) {
if (index(p[i], "\\t"))
continue
if (index(p[i], "(*Listener).ProcessEvent"))
break
trace = trace " " p[i]
}
gsub(/0x[0-9a-f]+/, "?", trace)
print trace
}' runtime-log3 | sort | uniq -c
The (*RuntimeConfig).GetSourceFromCredentials method holds an RLock unnecessarily long, over the whole method call including the bcrypt verification. This could be narrowed down.
(*Incident).ProcessEvent holds its RLock even longer, over a SQL statement and sub-methods.
Taking the caller into account might unveil a pattern:
awk '
/runtime-updates[\t ]+(R|)Lock/ {
match($0, /"took": "(.*)"/, m)
took = m[1]
if ($4 == "Lock") {
origin = "UpdateFromDatabase->applyPending"
} else if (index($0, "debugVerify")) {
origin = "UpdateFromDatabase->debugVerify"
} else if (index($0, "GetSourceFromCredentials")) {
origin = "Listener->GetSourceFromCredentials"
} else if (index($0, "ProcessEvent")) {
origin = "Listener->ProcessEvent"
} else {
origin = $0
}
print NR, took, origin
}' runtime-log3
At least the first big spikes are related to applyPending and not so much to GetSourceFromCredentials, which is all over the place anyway. (Edit: Comparing only the times when a mutex lock was acquired does not make sense on its own.)