Fix indis dual-read lix switch issue #964
d2/src/main/java/com/linkedin/d2/balancer/dualread/DualReadStateManager.java
To confirm, is the 5-minute wait for the lix update to propagate to the lix client? Also, we should check the dual read modes of many downstream services, not just "accountBalances". The goal is to ensure that once the lix value is updated, all downstream services see the updated value (each through its own rate limiter). So a solid test would be: after X mins (once the lix update has propagated to the lix client), hit endpoints A, B, C, etc. at nearly the same time (in different terminal tabs, for example), each calling different downstream services, and verify that all downstream services' dual read modes are updated. You don't need to set breakpoints to show the updated value, which would cause hanging/delay; just show the app log, which logs the updated values.
Could you post the app log (or a snippet) about the updated dual read mode to the Test Done section?
Updated
lgtm!
lgtm
1. Background
For gcn-39955
After the indis-dual-read lix was switched to observer-only mode, job-postings-mt was not able to make downstream calls to d2 services under a symlink cluster (this issue has been fixed by #956). The D2 team then tried to switch the dual-read mode lix from `Observer-only` to `DUAL_READ`, but the switch failed for some downstream services.
2. Reproduce and investigation
How to reproduce this:
`si.indis.client_read_mode` to `indis_automation_test` in container
So we found that, after the lix switches from `Observer-only` to `DUAL_READ`, it takes a few minutes for some downstream services to update. The root cause is that all downstream services share a single global rate limiter when updating their dual-read mode, so some services can be starved and take a long time to update.
3. Solution
After discussing with @bohhyang, we create a map whose key is the downstream service name and whose value is the rate limiter for that service, so rate limiting is now at service granularity instead of global.
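The per-service map described above could look roughly like the following. This is a minimal sketch, not the actual `DualReadStateManager` change: the class names `PerServiceRateLimiters` and `IntervalLimiter`, the one-permit-per-interval policy, and the injectable clock are all assumptions made for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical sketch: one rate limiter per downstream service name.
// None of these names come from the real d2 code base.
class PerServiceRateLimiters {

  /** Allows at most one permit per interval; the time source is injectable for tests. */
  static final class IntervalLimiter {
    private final long intervalMs;
    private final LongSupplier clockMs;
    private final AtomicLong nextAllowedMs = new AtomicLong(Long.MIN_VALUE);

    IntervalLimiter(long intervalMs, LongSupplier clockMs) {
      this.intervalMs = intervalMs;
      this.clockMs = clockMs;
    }

    boolean tryAcquire() {
      long now = clockMs.getAsLong();
      long next = nextAllowedMs.get();
      // CAS so that only one concurrent caller wins the permit for this interval.
      return now >= next && nextAllowedMs.compareAndSet(next, now + intervalMs);
    }
  }

  private final ConcurrentMap<String, IntervalLimiter> limiters = new ConcurrentHashMap<>();
  private final long intervalMs;
  private final LongSupplier clockMs;

  PerServiceRateLimiters(long intervalMs, LongSupplier clockMs) {
    this.intervalMs = intervalMs;
    this.clockMs = clockMs;
  }

  /** Returns true if the named service may refresh its dual-read mode now. */
  boolean tryAcquire(String serviceName) {
    // Each service lazily gets its own limiter, so one service consuming
    // its permit never blocks a different service from updating.
    return limiters
        .computeIfAbsent(serviceName, name -> new IntervalLimiter(intervalMs, clockMs))
        .tryAcquire();
  }
}
```

With a single shared limiter, whichever service asks first in an interval takes the only permit and the others have to wait for a later interval. With the map, each service name is limited independently.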
4. Test
`./scripts/local-release -s` to create a snapshot
`mint build && mint snapshot && mint release`
`pegasus version` and `container version` in `job-postings-mt` and then `mint build-cfg -f qei-ltx1 && mint deploy -f qei-ltx1 --debug-app` in job-postings-mt
`DUAL_READ` mode)
`Observer-only` to `DUAL_READ`, after 5 minutes, call
We found all downstream services are in `DUAL_READ`