Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding async methods to serialization API #908

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

DanielBeckwith
Copy link

This draft is to bring up the question of adding Async versions of the Serialize and Deserialize APIs, and all of the complexity that that will entail. This was brought up originally in issue #430. So far my branch only contains more or less what the closed pull request #457 had.

Before I go any further I want to make sure that I address a series of architectural questions and concerns (which I'll put in the following comment) that might block the implementation due to how deeply the code will have to be modified.

@DanielBeckwith
Copy link
Author

There are several questions that I ran into right away.

First is that while adding the appropriate SerializeAsync and DeserializeAsync methods to the Serializer and Deserializer respectively is an expected change, their current setup will require *Async versions of methods in many interfaces. This includes (but isn't limited to) IEmitter, IObjectGraphTraversalStrategy, IObjectGraphVisitor, and IValueSerializer, which is a breaking change. Given that these interfaces (and others like them) are routinely used by library users to personalize the de/serialization process to meet their needs, this might become quite a burden and prevent them from updating to the newest version. An alternate solution could be providing a default implementation of the *Async method that just calls the synchronous version, but I think that could lead to some hard-to-catch foot-guns. Any guidance or advice on this would be helpful and welcome.

Second, following the original comments in pull request #457 before it was closed (and my own inspection of the code), it seems unavoidable that at least the Emitter will have to change. The changes in that class will revolve around the StateMachine method. That method handles the logic of serialization and writing the results (through the Write and WriteBreak methods). To make that work with the asynchronous API, I see two paths: refactoring or duplicating the code with a StateMachineAsync method. For the refactoring, my preliminary thought was to change StateMachine (and the functions that it calls) into a generator function (returning an IEnumerable<string>) that yields the results, and moving the Write (or WriteAsync) calls into a loop in the respective Emit method (effectively moving the responsibility for writing the results from StateMachine to Emit and EmitAsync). That seems like a rather large, invassive, and unclean change, though, with unknown performance implications to boot. The other option, duplicating the code into a StateMachineAsync version, carries all the normal maintainability issues that duplicate code has. Is there a preference for either option? Or a better one that I didn't think of?

Third, and most tricky as far as I can tell, is the ValueSerializer class. From what it looks like, the SerializeValue method and Emitter infrastructure (including logic that uses the Emitter like the Object Graph Traversal Strategy) needs to have an additional Async version, which seems like a pretty deep and annoying break to the API. However, I'm not well-versed enough in the library to know what's necessary or if I'm overthinking this.

Lastly, is it preferred to have the asynchronous methods return Task<T> or ValueTask<T>? Task is the default choice for asynchronous methods and (as far as I can tell) is supported by every build target. ValueTask is used to (hopefully) prevent unnecessary allocations (like in the built-in System.Text.Json API), but is (as far as I can tell) not supported by every build target for this library, and comes with some more complicated usage constraints that might lead to bad behavior. I went with Task in my branch and lean towards it myself, but I figured I'd ask early in case there was a preference for going the most performant path over other concerns.

While I haven't added any tests of substance in the branch yet, I plan on finishing those once the bigger issues are taken care of.

Of course I welcome any comments, concerns, questions, etc, including those that I haven't touched on.

@DanielBeckwith
Copy link
Author

The commit I just made updates the Emitter to use IEnumerable<string> generators internally instead of directly writing to the output so that the async path can reuse the logic (though I believe this results in a little bit of a performance hit). It passes the tests, but the changes hurt readability. I welcome suggestions for a more elegant solution.

@EdwardCooke
Copy link
Collaborator

One idea to help reduce the breaking change on the supporting interfaces like the value serializer. Create 2 interfaces. The first would be ivalueserializerbase and would have nothing in it. The old interface would inherit it. The new interface would also inherit it. The new interface would have the async methods. In the builders you change the method signatures and everywhere else to take the base interface. Where the current interface is called you would do a check. If the type is the old one, call the synchronous method, else call async. Hopefully that makes sense. It’s how I’ve done changes like this in the past and it worked well.

@DanielBeckwith
Copy link
Author

I've been going around in circles with implementing this, primarily because there's no safe way to call async methods in synchronous methods without risking a deadlock (if there is and I've missed it, I would love to be wrong). As a consequence, it's not safe to allow the user to register, for example, an IAsyncEventEmitter in the SerializerBuilder, because the user could then use the synchronous serialization path and get a surprise deadlock (which could even potentially be non-deterministic depending on the synchronization context). Because of that, I think I'm going to have to split the API into a synchronous side (which is just the API as it is now) and an asynchronous side (which can handle synchronous and asynchronous versions of most, but not all, interfaces) from the builders all the way down, expanding existing implementations in the library as necessary and possible (for example, the existing modifications that I did for Emitter, with the only addition being implementing a new IAsyncEmitter interface). For example, an AsyncSerializerBuilder that accepts both IEventEmitters and IAsyncEventEmitters (and other customizable parts that async makes sense with), and Builds an IAsyncSerializer.

@@ -77,11 +78,21 @@ public T Deserialize<T>(TextReader input)
return Deserialize<T>(new Parser(input));
}

public async Task<T> DeserializeAsync<T>(TextReader input)
{
return await DeserializeAsync<T>(new Parser(input));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of return await I would just return the task directly

@@ -402,6 +403,19 @@ public void Emit(ParsingEvent @event)

Assert.False(scalar.IsQuotedImplicit);
}

public Task EmitAsync(ParsingEvent @event)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to have a CancellationToken parameter in all async mathods

@zivillian
Copy link

Given that these interfaces (and others like them) are routinely used by library users to personalize the de/serialization process to meet their needs, this might become quite a burden and prevent them from updating to the newest version. An alternate solution could be providing a default implementation of the *Async method that just calls the synchronous version, but I think that could lead to some hard-to-catch foot-guns.

I would prefer extending the existing interfaces and create abstract base classes, which just call the synchronous versions. These can then be used by library users to easily upgrade without being required to implement all async methods. The base classes should not be used inside the YamlDotNet library and may also be marked as obsolete to clearly state that this is not the suggested way, but only a workaround to postpone the required changes.

@EdwardCooke
Copy link
Collaborator

Have you looked in to how system.text.json handles this?

@DanielBeckwith
Copy link
Author

@zivillian Thank you for the input and the suggestions. I got partially finished with another commit that addresses some of that, but it's been a while since I've touched it, so it'll be a while before I can complete and push it.

@EdwardCooke I looked into the source code of System.Text.Json (and plan to study it more) to see how they do it, but not much of it is transferable due to the vastly different infrastructure/internals.

The main infrastructure issue is that the Deserializer is a direct data pipeline from the Stream, to the components (NodeDeserializers, Converters, etc), to the completed object (and the reverse pathway for the Serializer). Because of that, if even one of the configurable components in the chain uses the synchronous path (e.g. if I included a default implementation of a DeserializeAsync method in the INodeDeserializer interface that is implemented as return Task.FromResult(this.Deserialize(...));, which ultimately uses the IParser class to interface with the Stream), the entire call chain will stop being asynchronous and will be locked into the synchronous path. That essentially makes the async path useless for anyone who customizes even one component without implementing the *Async version of the method. We'd have to reengineer how most of the components work in order to bypass this issue.

System.Text.Json appears to bypass this issue by effectively (and efficiently?) buffering the current stage with Utf8JsonReader (for deserialization) and Utf8JsonWriter (for serialization), but I haven't had time to figure out if that's possible to emulate in this library. If it is, it'll probably be by modifying the implementations of IParser and IEmitter (respectively) that are provided to the components so that the sync/async reading/writing is handled outside, in the method that provides them to the components.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants