Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix indis dual-read lix switch issue #964

Merged
merged 5 commits into from
Jan 3, 2024

Conversation

brycezhongqing
Copy link
Collaborator

@brycezhongqing brycezhongqing commented Jan 3, 2024

1. Background

For gcn-39955
After indis-dual-read lix switched to observer-only mode, job-postings-mt was not able to make downstream calls to d2 services under a symlink cluster(This issue has been fixed by #956 ). Then D2 team tried to switch dual-read mode lix from Observer-only to DUAL_READ, but it failed for partial downstream services.

2. Reproduce and investigation

how to reproduce this :

So we found, after Lix switch from Observer-only to DUAL_READ, it will take a few minutes to update for partial downstream services. The root cause is all downstream services will use global rate-limit to update their dual-read mode which may cause partial services being starved and take a long time to update.

3. Solution

After discussing with @bohhyang . We create a map, key is the downstream service name and value is the relative rate limiter for each service. So rate limiter will be service granularity.

4. Test

  • For rest.li(pegasus) repo, ./scripts/local-release -s to create a snapshot
  • Bump up pegasus version in container and for container mint build && mint snapshot && mint release
  • Bump up pegasus version and container version in job-postings-mt and then mint build-cfg -f qei-ltx1 && mint deploy -f qei-ltx1 --debug-app in job-postings-mt
  • Check the service dual read mode(it's currently in DUAL_READ mode)
grep "Dual read mode for service " job-postings-mt-war.log
curli --fabric ei-ltx1 --pretty-print "https://zhonchen-mn2.linkedin.biz:9607/jobPostingsMt/resources/jobPostings/2147622124?action=purchase" --data '{"owner":"urn:li:contract:92952029", "isFreemiumPromotion": true,
"jobPurchaseBillingParams":{"dailyBudget":{"currencyCode":"USD", "amount":"116"}, "unitCost":{"currencyCode":"USD","amount":"3"},
"costType":"COST_PER_VIEW"}, "viewer":"urn:li:system:0"}' --dv-auth=SELF
image We found all the downstream services switch to `DUAL_READ` mode - switch the lix from `DUAL_READ` to `Observer-only`, after 5 minutes, call ``` curli --fabric ei-ltx1 --pretty-print "https://zhonchen-mn2.linkedin.biz:9607/jobPostingsMt/resources/jobPostings/2147622124?action=purchase" --data '{"owner":"urn:li:contract:92952029", "isFreemiumPromotion": true, "jobPurchaseBillingParams":{"dailyBudget":{"currencyCode":"USD", "amount":"116"}, "unitCost":{"currencyCode":"USD","amount":"3"}, "costType":"COST_PER_VIEW"}, "viewer":"urn:li:system:0"}' --dv-auth=SELF ``` We found the service is in `Observer-only` image
  • switch the lix from Observer-only to DUAL_READ, after 5 minutes, call
curli --fabric ei-ltx1 --pretty-print "https://zhonchen-mn2.linkedin.biz:9607/jobPostingsMt/resources/jobPostings/2147622124?action=purchase" --data '{"owner":"urn:li:contract:92952029", "isFreemiumPromotion": true,
"jobPurchaseBillingParams":{"dailyBudget":{"currencyCode":"USD", "amount":"116"}, "unitCost":{"currencyCode":"USD","amount":"3"},
"costType":"COST_PER_VIEW"}, "viewer":"urn:li:system:0"}' --dv-auth=SELF

We found all downstream services are in DUAL_READ
image

@brycezhongqing brycezhongqing force-pushed the zhonchen/fix_dual_read_mode_switch branch from 5b554dd to 1bde9b1 Compare January 3, 2024 03:33
@bohhyang
Copy link
Contributor

bohhyang commented Jan 3, 2024

switch the lix from xxx to xxx, after 5 minutes, call

To confirm, is the 5-min waiting for lix update to be propagated to lix client?

Also, we should check for many downstream services' dual read modes instead of just "accountBalances". The goal is to ensure once the lix value is updated, all downstream services should see the updated value (with their individual rate limiters). So a solid test should be, after X mins (once the lix update is propagated to the lix client), hitting endpoint A, B, C, etc (nearly the same time, in different terminal tabs for example) which call different downstream services, all downstream services' dual read modes are updated. You don't need to set breakpoints to show the updated value which will cause hanging/delay, but just show the app log which will log about the updated values.

@brycezhongqing
Copy link
Collaborator Author

switch the lix from xxx to xxx, after 5 minutes, call

To confirm, is the 5-min waiting for lix update to be propagated to lix client?

Also, we should check for many downstream services' dual read modes instead of just "accountBalances". The goal is to ensure once the lix value is updated, all downstream services should see the updated value (with their individual rate limiters). So a solid test should be, after X mins (once the lix update is propagated to the lix client), hitting endpoint A, B, C, etc (nearly the same time, in different terminal tabs for example) which call different downstream services, all downstream services' dual read modes are updated. You don't need to set breakpoints to show the updated value which will cause hanging/delay, but just show the app log which will log about the updated values.

switch the lix from xxx to xxx, after 5 minutes, call

To confirm, is the 5-min waiting for lix update to be propagated to lix client?

Also, we should check for many downstream services' dual read modes instead of just "accountBalances". The goal is to ensure once the lix value is updated, all downstream services should see the updated value (with their individual rate limiters). So a solid test should be, after X mins (once the lix update is propagated to the lix client), hitting endpoint A, B, C, etc (nearly the same time, in different terminal tabs for example) which call different downstream services, all downstream services' dual read modes are updated. You don't need to set breakpoints to show the updated value which will cause hanging/delay, but just show the app log which will log about the updated values.

  1. Yes. The 5-min waiting is lix update
  2. Yes. I have checked all downstream services, accountBalances is one of them. All the downstream services will be updated dual-read mode.

@bohhyang
Copy link
Contributor

bohhyang commented Jan 3, 2024

  1. Yes. I have checked all downstream services, accountBalances is one of them. All the downstream services will be updated dual-read mode.

Could you post the app log (or a snippet) about the updated dual read mode to the Test Done section?

@brycezhongqing
Copy link
Collaborator Author

  1. Yes. I have checked all downstream services, accountBalances is one of them. All the downstream services will be updated dual-read mode.

Could you post the app log (or a snippet) about the updated dual read mode to the Test Done section?

Updated

Copy link
Contributor

@bohhyang bohhyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@brycezhongqing brycezhongqing merged commit 0bed54f into master Jan 3, 2024
2 checks passed
@brycezhongqing brycezhongqing deleted the zhonchen/fix_dual_read_mode_switch branch January 3, 2024 19:55
@shivamgupta1
Copy link
Contributor

lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants