Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ConfigurationCallbackStorage Crash #1385

Merged
merged 15 commits into from
Aug 9, 2024
Merged

Conversation

scannillo
Copy link
Contributor

@scannillo scannillo commented Aug 6, 2024

Summary of changes

To Test

  • The configuration is being cached and we use the cache if available - verify via FPTI
  • When triggering a memory warning a crash is not present
  • There are no crashes when instantiating multiple instances at the same time

Checklist

  • Added a changelog entry

Authors

@scannillo @jaxdesmarais @KunJeongPark

@scannillo scannillo changed the title Fix ConfigurationCallbackStorage Crash [DO NOT REVIEW] Fix ConfigurationCallbackStorage Crash Aug 6, 2024
@scannillo scannillo force-pushed the fix-config-callback-crash branch from 25f95c3 to 8f8b5ab Compare August 7, 2024 20:35
@jaxdesmarais jaxdesmarais changed the title [DO NOT REVIEW] Fix ConfigurationCallbackStorage Crash Fix ConfigurationCallbackStorage Crash Aug 8, 2024
@jaxdesmarais jaxdesmarais marked this pull request as ready for review August 8, 2024 19:48
@jaxdesmarais jaxdesmarais requested a review from a team as a code owner August 8, 2024 19:48
}

if let error {
// TODO: - Consider updating all feature clients to use async version of this method?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created DTMOBILES-980 for this TODO. Want to focus on tackling the crash first, then we can follow up with this change

Copy link
Contributor

@richherrera richherrera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing!!!! 🚀 👏

if let error {
// TODO: - Consider updating all feature clients to use async version of this method?

Task { @MainActor in
Copy link
Contributor

@KunJeongPark KunJeongPark Aug 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just making sure that these don't cause issues with nested calls to fetchOrReturnConfig.
Checking if removing @mainactor here and in the async fetchConfiguration function and just running completion handler and setupHTTPCredentials on main would preserve all the analytics calls.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MainActor is equivalent to running on the main dispatch queue: https://developer.apple.com/documentation/swift/mainactor. So should behave the exact same.

Copy link
Contributor

@KunJeongPark KunJeongPark Aug 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think main difference for our purposes is that @mainactor can have suspension points so we wouldn't have deadlocks with nested completion handlers that call fetchOrReturnRemoteConfiguration

Copy link
Contributor

@KunJeongPark KunJeongPark Aug 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just puzzled why there are differences in analytics output with this commit as opposed to what's on main or previous commits on this PR.

Copy link
Contributor

@KunJeongPark KunJeongPark Aug 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we are just doing same on two call sites, so coordinating the two configurationLoader.getConfig calls.

Copy link
Contributor

@KunJeongPark KunJeongPark Aug 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That test you quoted above, testCallbacks_useMainDispatchQueue(), it expects the completion handler to return on main.
So entire Task block of fetchOrReturnRemoteConfiguration doesn't need to be annotated @mainactor,
just the setupHTTPCredentials and completion invocation need to be wrapped in await MainActor.run to make this test pass, as I mentioned at top of this thread.

Copy link
Contributor

@KunJeongPark KunJeongPark Aug 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't personally seen weirdness or difference in the FPTI between what was on main before, before this crash fix was implemented, after adding it, with or without MainActor annotations for fetchConfig functions.

But I trust that you saw something, there is a lot going on with sending analytics.
I do understand your concerns that fetchOrReturnRemoteConfiguration previously had expectation to be called from main. But I am not sure if this was true in case in analytics. In BTAnalyticsService, timer's event handler that calls sendQueuedAnalyticsEvents is awaited in a Task block and will run in the background thread.
This particular implementation detail was same before the crash fix.
Some completion handlers were returned on main in BTHTTP callCompletionAsync.

One small issue I can see right now is if network request fails, events will be dropped.
But these are not related to @mainactor annotation on fetchConfiguration.

My main concern is potential for more latency for fetching configuration for sendQueuedAnalyticsEvents with @mainactor annotation. But it's hard to tell.

Let's keep an eye on latency metrics and then we can revisit. What are your thoughts on that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That test you quoted above, testCallbacks_useMainDispatchQueue(), it expects the completion handler to return on main.
So entire Task block of fetchOrReturnRemoteConfiguration doesn't need to be annotated @mainactor,
just the setupHTTPCredentials and completion invocation need to be wrapped in await MainActor.run to make this test pass, as I mentioned at top of this thread.

imo, adding await MainActor.run for just the completions/setupHTTPCredentials complicates the callsite vs running the task on MainActor. There is no observed latency since getConfig is a global isolated actor already. Having a cleaner callsite is more important as there is no difference in wrapping the completions in a MainActor.run vs what exists today, but this method will also entirely be removed in a future PR.

I haven't personally seen weirdness or difference in the FPTI between what was on main before, before this crash fix was implemented, after adding it, with or without MainActor annotations for fetchConfig functions.

If you look at ticket DTMOBILES-998 you will see this same behavior where we make multiple network called and the config isn't cached properly. There are screenshots in that ticket with additional details from FPTI.

My main concerns are potential for more latency for fetching configuration for sendQueuedAnalyticsEvents with @mainactor annotation. But it's hard to tell. Let's keep an eye on latency metrics and then we can revisit.

There is clearly a need to annotate as is, removing it causes the issues we can clearly see in FPTI noted above. Removing the duplicate config call and just relying on the async method should resolve any remaining confusion. There is not additional latency added that did not exist previously.

Copy link
Contributor

@KunJeongPark KunJeongPark Aug 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the screenshots. Do these extra calls happen in 6.23.2? I can check.
I believe you that you observed these, I just didn't see it myself. I am trying to figure out why.
I want to make sure I understand what @mainactor annotation is fixing.

Looks like the a lot of extra v1/config are coming from analytics sent from fetchAPITiming function.
I'll look into this. But I'm ok to leave the @mainactor annotation and observing any effects on latency,
I just want to make sure we understand why this is happening.

This is strange, I got different results from FPTI. I will post on your PR after doing a few more runs.

Copy link
Contributor

@KunJeongPark KunJeongPark Aug 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made comments on your jira ticket with screenshots. I tested 6.23.3 with @mainactor annotation as it is currently on main wit no changes, 6.23.3 without @mainactor annotation but with await MainActor.run on setHTTPCredentials and completion calls and 6.23.2 with no changes.

I didn't incorporate your changes on this branch, I am going to try that as well.

I found that there are no duplicate calls to v1/configuration in any of them, they are all fetching from cache but 6.23.3 without @mainactor annotation results in analytic events being listed more out of order, this was also true for 6.23.2, the analytics were out of order.
I just ran through each once, so I will try several times for verification.

@jaxdesmarais
Copy link
Contributor

Ran through another set of manual tests on sand and prod while triggering memory warnings and all looks good from my side. Let me know once you complete going through the flows as well @KunJeongPark and I'll merge this in.

@KunJeongPark
Copy link
Contributor

Roger! running tests and will let you know.

@jaxdesmarais jaxdesmarais merged commit 6c2ddb8 into main Aug 9, 2024
8 checks passed
@jaxdesmarais jaxdesmarais deleted the fix-config-callback-crash branch August 9, 2024 20:40
@KunJeongPark
Copy link
Contributor

KunJeongPark commented Aug 12, 2024

I want to point out for future that ConfigurationCache, as it is a singleton class, might need to ensure thread-safety as well.

}
}

func fetchConfiguration() async throws -> BTConfiguration {
@MainActor func fetchConfiguration() async throws -> BTConfiguration {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar ❓ - why did we need these MainActor attributes?

@@ -1,15 +1,20 @@
import Foundation

@globalActor actor ConfigurationActor {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is neat! Can we add a docstring for what purpose this serves?

func getConfig(completion: @escaping (BTConfiguration?, Error?) -> Void) {
@ConfigurationActor
func getConfig() async throws -> BTConfiguration {
if let existingTask {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we check the cache before bothering checking if there are any pending tasks?

@jaxdesmarais @KunJeongPark

Copy link
Contributor

@KunJeongPark KunJeongPark Aug 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jax had the same question. I was thinking of scenario where there may be read/write cache race condition if there is a pending task. The task includes writing into the cache.

notifyCompletions(nil, BTAPIClientError.configurationUnavailable)
return

let task = Task { [weak self] in
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just do existingTask = Task { [weak self] in ... instead?

Line 75 is a little confusing why we're doing the await on the task constant and not the re-assigned existingTask variable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Crash in ConfigurationCallbackStorage After Upgrading to 6.23.2
4 participants