
fix/gql: Cache the SiteProductVersion query for up to 10 minutes #6111

Open · wants to merge 17 commits into main
Conversation

@dominiccooney (Contributor) commented Nov 12, 2024

We hammer the SiteProductVersion GraphQL endpoint really hard. This adds a cache in front of it to reduce the heavy traffic to that endpoint. The cache also makes the product faster.

Changes currentSiteVersion to use this cache directly instead of burying the caching in an observable, where it may cache transient failures such as network errors.

I tried writing an exponential-backoff-and-retry Observable. It is possible, but a bit tortured, because the downstream consumer must ping the upstream "(re)try" producer to make the Observable re-emit. I judged that caching in the GraphQL client and retrying when called was simpler.

One subtlety among many: Aborts cause exponential backoff. This is a feature: it prevents a misbehaving client (which we have suffered recently) from initiating and then immediately aborting many requests rapidly.
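The cache-plus-de-duplication strategy described above can be sketched roughly as follows. All names here are illustrative, not the PR's actual implementation:

```typescript
type Fetcher<T> = (signal: AbortSignal) => Promise<T>

// A single-value TTL cache that de-duplicates concurrent fetches:
// callers arriving while a fetch is in flight share its result.
class SingleValueCache<T> {
    private cached: { fetchedAt: number; value: T } | undefined
    private inFlight: Promise<T> | undefined

    constructor(
        private readonly ttlMsec: number,
        private readonly now: () => number = Date.now
    ) {}

    async get(fetcher: Fetcher<T>): Promise<T> {
        // Serve a cached value that is still within the TTL.
        if (this.cached && this.now() - this.cached.fetchedAt < this.ttlMsec) {
            return this.cached.value
        }
        // De-duplicate: concurrent callers share the single in-flight fetch.
        if (!this.inFlight) {
            const controller = new AbortController()
            this.inFlight = fetcher(controller.signal).then(
                value => {
                    this.cached = { fetchedAt: this.now(), value }
                    this.inFlight = undefined
                    return value
                },
                error => {
                    this.inFlight = undefined
                    throw error
                }
            )
        }
        return this.inFlight
    }
}
```

A sketch only: the real cache also tracks abort reference counts and failure backoff, which are elided here.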

Test plan

New automated test:

pnpm -C lib/shared unit -- cache.test.ts

Manual test:

  • Verify spans like graphql.fetch.SiteProductVersion.cached - 0.473875ms appear. For example, run the product in the debugger and check the Debug Console for those entries.
  • Change accounts, and verify cache miss spans like graphql.fetch.SiteProductVersion - 261.213334ms appear.

Changelog

  • SiteProductVersion, a heavily used GraphQL query, is cached for up to 10 minutes. Failures are retried with exponential backoff.

@dominiccooney (Contributor, Author)

@valerybugakov I would love your input on all of this. Two things I'm particularly uncertain about:

The cache de-duplicates concurrent requests, because we tend to send many copies of the same request at once. De-duplicated requests share a result, and we only abort the underlying request when the last concurrent client aborts. But this means it is possible to do:

const controller = new AbortController()
const version = client.getSiteVersion(controller.signal)
await mumble() // do some stuff; another client starts the same request here
controller.abort()
await version // returns a real result! you might expect it to throw

Of course, we could preserve the one-aborts-one semantics. Do you think it is important to do that?

Errors use exponential backoff up to 10 minutes. Do you think it is reasonable to treat all kinds of errors like that? If your internet dropped out for 30 minutes, it seems weak to make you wait an extra 10 minutes to get back online...
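The "exponential backoff up to 10 minutes" behavior can be sketched as a delay that doubles per consecutive failure, capped at the TTL. The base delay here is an assumption for illustration; the PR does not state it:

```typescript
// Capped exponential backoff: doubles per consecutive failure,
// never exceeding the cap (10 minutes, matching the cache TTL).
function backoffMsec(
    failureCount: number,
    baseMsec = 1_000,
    capMsec = 10 * 60 * 1_000
): number {
    return Math.min(capMsec, baseMsec * 2 ** failureCount)
}
```

Real implementations often add jitter to avoid synchronized retries; that is omitted here.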

@dominiccooney (Contributor, Author)

Note to self: needs updated snapshots

@valerybugakov (Member) left a comment


Of course, we could preserve the one-aborts-one semantics. Do you think it is important to do that?

The scenario you mentioned seems unlikely. In an ideal world, we wouldn't use promise-based APIs in user land but would rely on the siteVersion observable, which would handle this case for us and serve as a single source of truth. My vote goes to keeping the de-duplication logic as-is for now.

Errors use exponential backoff up to 10 minutes. Do you think it is reasonable to treat all kinds of errors like that? If your internet dropped out for 30 minutes, it seems weak to make you wait an extra 10 minutes to get back online...

With the current state of the product, if I needed an updated config, I'd probably just reload the whole VS Code instance. To be thorough, it seems better to have a complete solution across the extension so that after any long connectivity issue, I can trust that everything in Cody is up to date without needing manual checks.

So, I agree that it seems weak, but handling this edge case here won't make a difference. To address this, we need a centralized solution.


this.fetchTimeMsec = now

const thisFetch = fetcher(this.aborts.signal)
Member


Could fetcher throw before returning the promise? Do we want to handle it here too?

Contributor Author


The behavior as-is should be perfect for our GraphQL APIs, which lift network errors and so on into Promise resolutions, not rejections.

If we end up here, then we are starting a new fetch. If fetcher throws before returning, the current caller will get a rejected Promise. Because we do not update thisFetch, the next caller will start a new one.

Is the idea behind the comment that we should do exponential backoff in that case too?

Member


Is the idea behind the comment that we should do exponential backoff in that case too?

Yes. Would it be helpful to handle sync errors in the same way we handle async errors?

Contributor Author


I don't think so, because they're probably not the kind of transient failure we're interested in retrying. But for now I've implemented the suggestion... I think this might be making the tests flaky for me; let's see what CI thinks...

Contributor Author


That is in 7ebfde2 if you are interested, but for now I have rolled it back. It seems slower and flakier; I have not dug into why.
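For reference, the sync-error normalization being discussed can be done by wrapping the fetcher call so that a synchronous throw becomes an ordinary rejection. This is a sketch of the idea, not the PR's code:

```typescript
// Wrap the fetcher call so a synchronous throw from fetcher becomes a
// rejected Promise, letting one error path (and one backoff policy)
// handle both sync and async failures.
function startFetch<T>(
    fetcher: (signal: AbortSignal) => Promise<T>,
    signal: AbortSignal
): Promise<T> {
    try {
        return fetcher(signal)
    } catch (error) {
        return Promise.reject(error)
    }
}
```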

lib/shared/src/sourcegraph-api/graphql/cache.ts (outdated; resolved)
}

async get(
signal: AbortSignal | undefined,
Member


Are there cases where we want to pass undefined here? It would be strange to read cache.get(undefined, signal => fetch(...)). It may be worth changing the call signature or narrowing the first arg type.

Contributor Author

@dominiccooney commented Nov 18, 2024


Some callers to the GraphQL method don't care about aborts and pass undefined.

I like the callback in the end position because it can be a multiline expression without looking weird. setTimeout with the callback first makes my teeth itch.
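The trade-off being discussed can be sketched with a hypothetical signature (not the PR's exact code): signal first, fetcher last, so a multiline fetcher reads naturally while callers that do not care about aborts can still pass undefined:

```typescript
// Hypothetical shape of the cache accessor discussed above: the optional
// AbortSignal comes first, the (possibly multiline) fetcher comes last.
// A real implementation would consult the cache; this sketch just forwards.
async function get<T>(
    signal: AbortSignal | undefined,
    fetcher: (signal: AbortSignal | undefined) => Promise<T>
): Promise<T> {
    return fetcher(signal)
}
```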

vscode/src/auth/auth.ts (resolved)
@valerybugakov (Member) left a comment


Looks great!

@dominiccooney force-pushed the dpc/site-project-version branch 2 times, most recently from 7ebfde2 to 2f7de58 on November 18, 2024 at 13:36