
fix/gql: Cache the SiteProductVersion query for up to 10 minutes #6111

Open · wants to merge 17 commits into main
Conversation

@dominiccooney (Contributor) commented Nov 12, 2024

We hammer the SiteProductVersion GraphQL endpoint really hard. This adds a cache in front of it to reduce the heavy traffic to that endpoint. The cache also makes the product faster.

Changes currentSiteVersion to use this cache directly instead of burying the caching in an observable, where it may cache transient failures such as network errors.

I tried writing an exponential-backoff-and-retry Observable. It is possible, but a bit tortured, because the downstream consumer must ping the upstream "(re)try" producer to make the Observable re-emit. I judged that caching in the GraphQL client and retrying when called was simpler.

One subtlety among many: Aborts cause exponential backoff. This is a feature: it prevents a misbehaving client (which we have suffered recently) from initiating and then immediately aborting many requests rapidly.
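The cache-plus-de-duplication strategy described above can be sketched roughly as follows. All names here are illustrative, not the PR's actual implementation:

```typescript
type Fetcher<T> = (signal: AbortSignal) => Promise<T>

// A single-value TTL cache that de-duplicates concurrent fetches:
// callers arriving while a fetch is in flight share its result.
class SingleValueCache<T> {
    private cached: { fetchedAt: number; value: T } | undefined
    private inFlight: Promise<T> | undefined

    constructor(
        private readonly ttlMsec: number,
        private readonly now: () => number = Date.now
    ) {}

    async get(fetcher: Fetcher<T>): Promise<T> {
        // Serve a cached value that is still within the TTL.
        if (this.cached && this.now() - this.cached.fetchedAt < this.ttlMsec) {
            return this.cached.value
        }
        // De-duplicate: concurrent callers share the single in-flight fetch.
        if (!this.inFlight) {
            const controller = new AbortController()
            this.inFlight = fetcher(controller.signal).then(
                value => {
                    this.cached = { fetchedAt: this.now(), value }
                    this.inFlight = undefined
                    return value
                },
                error => {
                    this.inFlight = undefined
                    throw error
                }
            )
        }
        return this.inFlight
    }
}
```

A sketch only: the real cache also tracks abort reference counts and failure backoff, which are elided here.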

Test plan

New automated test:

pnpm -C lib/shared unit -- cache.test.ts

Manual test:

  • Verify spans like graphql.fetch.SiteProductVersion.cached - 0.473875ms appear. For example, run the product in the debugger and check the Debug Console for those entries.
  • Change accounts, and verify cache miss spans like graphql.fetch.SiteProductVersion - 261.213334ms appear.

Changelog

  • SiteProductVersion, a heavily used GraphQL query, is cached for up to 10 minutes. Failures are retried with exponential backoff.

@dominiccooney (Contributor, Author)

@valerybugakov I would love your input on all of this. Two things I'm particularly uncertain about:

The cache de-duplicates concurrent requests, because we tend to send many copies of the same request at once. De-duplicated requests share a result, and we only abort the underlying request when the last concurrent client aborts. But this means it is possible to do:

const controller = new AbortController()
const version = client.getSiteVersion(controller.signal)
await mumble() // do some stuff; another client starts the same request here
controller.abort()
await version // returns a real result! you might expect it to throw

Of course, we could preserve the one-aborts-one semantics. Do you think it is important to do that?

Errors use exponential backoff up to 10 minutes. Do you think it is reasonable to treat all kinds of errors like that? If your internet dropped out for 30 minutes, it seems weak to make you wait an extra 10 minutes to get back online...
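The "exponential backoff up to 10 minutes" behavior can be sketched as a delay that doubles per consecutive failure, capped at the TTL. The base delay here is an assumption for illustration; the PR does not state it:

```typescript
// Capped exponential backoff: doubles per consecutive failure,
// never exceeding the cap (10 minutes, matching the cache TTL).
function backoffMsec(
    failureCount: number,
    baseMsec = 1_000,
    capMsec = 10 * 60 * 1_000
): number {
    return Math.min(capMsec, baseMsec * 2 ** failureCount)
}
```

Real implementations often add jitter to avoid synchronized retries; that is omitted here.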

@dominiccooney (Contributor, Author)

Note to self: needs updated snapshots

@valerybugakov (Member) left a comment


Of course, we could preserve the one-aborts-one semantics. Do you think it is important to do that?

The scenario you mentioned seems unlikely. In an ideal world, we wouldn't use promise-based APIs in user land but would rely on the siteVersion observable, which would handle this case for us and serve as a single source of truth. My vote goes to keeping the de-duplication logic as-is for now.

Errors use exponential backoff up to 10 minutes. Do you think it is reasonable to treat all kinds of errors like that? If your internet dropped out for 30 minutes, it seems weak to make you wait an extra 10 minutes to get back online...

With the current state of the product, if I needed an updated config, I'd probably just reload the whole VS Code instance. To be thorough, it seems better to have a complete solution across the extension so that after any long connectivity issue, I can trust that everything in Cody is up to date without needing manual checks.

So, I agree that it seems weak, but handling this edge case here won't make a difference. To address this, we need a centralized solution.


this.fetchTimeMsec = now

const thisFetch = fetcher(this.aborts.signal)
Member


Could fetcher throw before returning the promise? Do we want to handle it here too?

Contributor Author


The behavior as-is should be perfect for our GraphQL APIs, which lift network errors and so on into Promise resolutions, not rejections.

If we end up here, then we are starting a new fetch. If fetcher throws before returning, the current caller will get a rejected Promise. Because we do not update thisFetch, the next caller will start a new one.

Is the idea behind the comment that we should do exponential backoff in that case too?

Member


Is the idea behind the comment that we should do exponential backoff in that case too?

Yes. Would it be helpful to handle sync errors in the same way we handle async errors?

Contributor Author


I don't think so, because they're probably not the kind of transient failure we're interested in retrying. But for now I've implemented the suggestion... I think this might be making the tests flaky for me; let's see what CI thinks...

Contributor Author


That is in 7ebfde2 if you are interested, but for now I have rolled it back. It seems slower and flakier; I have not dug into why.
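For reference, the sync-error normalization being discussed can be done by wrapping the fetcher call so that a synchronous throw becomes an ordinary rejection. This is a sketch of the idea, not the PR's code:

```typescript
// Wrap the fetcher call so a synchronous throw from fetcher becomes a
// rejected Promise, letting one error path (and one backoff policy)
// handle both sync and async failures.
function startFetch<T>(
    fetcher: (signal: AbortSignal) => Promise<T>,
    signal: AbortSignal
): Promise<T> {
    try {
        return fetcher(signal)
    } catch (error) {
        return Promise.reject(error)
    }
}
```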

lib/shared/src/sourcegraph-api/graphql/cache.ts (outdated; resolved)
}

async get(
signal: AbortSignal | undefined,
Member


Are there cases where we want to pass undefined here? It would be strange to read cache.get(undefined, signal => fetch(...)). It may be worth changing the call signature or narrowing the first arg type.

Contributor Author

@dominiccooney commented Nov 18, 2024


Some callers to the GraphQL method don't care about aborts and pass undefined.

I like the callback in the end position because it can be a multiline expression without looking weird. setTimeout with the callback first makes my teeth itch.
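The trade-off being discussed can be sketched with a hypothetical signature (not the PR's exact code): signal first, fetcher last, so a multiline fetcher reads naturally while callers that do not care about aborts can still pass undefined:

```typescript
// Hypothetical shape of the cache accessor discussed above: the optional
// AbortSignal comes first, the (possibly multiline) fetcher comes last.
// A real implementation would consult the cache; this sketch just forwards.
async function get<T>(
    signal: AbortSignal | undefined,
    fetcher: (signal: AbortSignal | undefined) => Promise<T>
): Promise<T> {
    return fetcher(signal)
}
```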

vscode/src/auth/auth.ts (resolved)
@valerybugakov (Member) left a comment


Looks great!

@dominiccooney force-pushed the dpc/site-project-version branch 2 times, most recently from 7ebfde2 to 2f7de58 on November 18, 2024 at 13:36