Context
Sometimes the Redis server that handles the ci-queue workload experiences a failover or some other availability issue.
When this happens, it breaks builds even though the server recovers quickly.
Examples
Error connecting to Redis on redacted.svc.cluster.local.:6379 (SocketError) (Redis::CannotConnectError)
./tmp/bundle/ruby/3.1.0/gems/redis-4.8.0/lib/redis/client.rb:162:in `call': MASTERDOWN Link with MASTER is down and replica-serve-stale-data is set to 'no'. (Redis::CommandError)
(the latter one needs to be better categorized by the redis gem, though)
Solution
Ideally we'd be resilient to these small transient errors. This means retrying all or most commands, possibly waiting a bit before each retry. The redis gem has the necessary elements for that; it's mostly just configuration.
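To make the "retry and wait a bit" idea concrete, here is a minimal sketch of retrying with exponential backoff. It is plain Ruby and does not use the redis gem: `TransientError`, `with_retries`, and its parameters are hypothetical names standing in for whatever the gem actually raises and exposes. As far as I know, redis-rb 4.x has `reconnect_attempts`, `reconnect_delay`, and `reconnect_delay_max` client options that cover the same ground, but that should be checked against the README for the version in use.

```ruby
# Hypothetical stand-in for transient failures such as the
# CannotConnectError / MASTERDOWN examples above.
class TransientError < StandardError; end

# Run the block, retrying on TransientError up to `attempts` times in total.
# The wait doubles after each failure and is capped at `max_delay` seconds.
# `sleeper` is injectable so the behaviour can be checked without sleeping.
def with_retries(attempts: 3, base_delay: 0.1, max_delay: 1.0,
                 sleeper: ->(seconds) { sleep(seconds) })
  tries = 0
  begin
    tries += 1
    yield
  rescue TransientError
    raise if tries >= attempts # out of attempts: surface the error

    sleeper.call([base_delay * (2**(tries - 1)), max_delay].min)
    retry
  end
end

# Usage: a command that fails twice during a simulated failover, then succeeds.
calls = 0
status = with_retries(attempts: 5) do
  calls += 1
  raise TransientError, "MASTERDOWN" if calls < 3

  :ok
end
puts "status=#{status} after #{calls} calls"
```

Capping the delay keeps a short failover from stalling a build for long, while the attempt limit still lets a real outage fail the build.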
However, we need to go over all the commands we emit and make sure they are idempotent; otherwise retrying could result in corrupted state, lost tests, etc.
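The idempotency concern can be illustrated with a small simulation. The sketch below assumes the failure mode where the server applies a command but the reply is lost, so the client retries something that already happened. `FakeServer`, `LostReply`, and `retry_once` are hypothetical helpers for illustration only; they are not the redis gem API and do not reflect the commands ci-queue actually emits.

```ruby
require "set"

# Hypothetical error: the command was applied, but the reply never arrived.
class LostReply < StandardError; end

# Hypothetical in-memory stand-in for a server with one list and one set.
class FakeServer
  attr_reader :list, :set

  def initialize
    @list = []
    @set = Set.new
    @drop_next_reply = false
  end

  # Make the next command succeed server-side but fail client-side.
  def drop_next_reply!
    @drop_next_reply = true
  end

  # Append, similar in spirit to RPUSH: applying it twice adds two entries.
  def push(value)
    @list << value
    reply(@list.size)
  end

  # Set insert, similar in spirit to SADD: applying it twice changes nothing.
  def add(value)
    @set << value
    reply(@set.size)
  end

  private

  def reply(value)
    if @drop_next_reply
      @drop_next_reply = false
      raise LostReply, "connection lost before reply"
    end
    value
  end
end

# Naive retry: run the block again if the reply was lost.
def retry_once
  yield
rescue LostReply
  yield
end

server = FakeServer.new

server.drop_next_reply!
retry_once { server.push("test_a") } # applied twice: duplicated entry
server.drop_next_reply!
retry_once { server.add("test_a") }  # applied twice: still one member

puts "list size: #{server.list.size}, set size: #{server.set.size}"
```

An audit along these lines would sort each emitted command into "safe to retry blindly" or "needs a guard", for example a uniqueness check or a script that makes the operation a no-op on replay.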
Acceptance Criteria
cc @ChrisBr