-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix unit tests that relied on global stability callbacks #264
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Volatile variables are not meant to be used for synchronization between threads and are not guaranteed to work correctly in busy-waiting loops like we have in these tests. The correct way to ensure a variable in a busy-waiting loop is not cached, reordered, or optimized out is to use std::atomic, which we mostly do already in our newer test programs.
This has always been empty and we have never used it. The tests of scalability are in performance_tests anyway: running a test like bandwidth_test with larger and larger group sizes is how we test scalability.
Apparently, as of commit eb5b281 the global stability callback is disabled for "cooked" (RPC) messages, which means this test can't use a stability callback to detect when the last message has been received. I'm going to try this workaround, based on the shared-atomic-bool code I wrote for signed_store_test. I'm not sure if it will work but I have to commit this in order to test it on Fractus.
After talking with Weijia I think I understand how the vector of DeserializationContext pointers can be used to link replicated objects with state in the main thread. This is actually an important component of making Replicated Objects work in more complicated settings, so I added some more comments documenting what the DeserializationContexts are used for, and replaced the confusing name "rdv" with something more meaningful. I also updated the comments on global_stability_callback to note that it is not called unless you use raw messages, and removed the commented-out old code, since this design decision is not going to be changed.
Now that I know how to use DeserializationContexts to give deserialized objects pointers to local objects, this should be the proper way to make TestObject share state with the main method.
This test seems very similar to typed_subgroup_bw_test and was also relying on a global stability callback to determine when it was done. Tried a slight variation on the "TestState" object where the object has a "callback-like" method instead of just being a simple struct.
Tests are sometimes getting stuck on startup, during the construction of SST resources objects. I think the problem is within connect_endpoint but I can't tell exactly which LibFabric operation is blocking the "client" end from connecting to the "server" end.
This test also relied on the global stability callback to signal the main thread, which won't work any more. Since it also used the persistence and verified callbacks, which still work for cooked messages, I'm creating a test-state object that receives those callbacks as well as the "message delivered" callback, instead of putting some logic in the main method and some within the test state object. This also eliminates the need to create 3 counters and 3 condition variables, since each node should only be in one subgroup, and the test object will only contain the state for the local node's subgroup.
The TestState object mostly works, except that each node starts its copy of subgroup_updates_delivered at 0, even if some updates have already been sent in the group by the time the node joins. This can happen because the subgroup allocator is configured with flexible subgroups, so the Group will get an adequate configuration when the minimum number of nodes joins, and those nodes will start sending updates even if more are about to join. Other tests don't have this problem because they use fixed subgroup sizes, so no updates are sent until all the nodes join, but this test needs to include the possibility of View changes so it can make sure the signature chains work even when the version number jumps due to a View change. I really hope adding a non-persistent replicated variable to each of the test objects doesn't mess anything up, but it's the only way to allow late-joining nodes to initialize their counter to the correct value. Otherwise those nodes wait forever for the "last update" (even though they have received it).
The signed_log_test.cpp file was getting too long and unwieldy, and most of the methods couldn't be written inline in the class definitions anyway, so I split it into a header and source file. Now that I know how to use a shared test state object to get the persistent version of the last message in the experiment, I can apply the same setup to persistent_bw_test, which also relied on the old global stability callback. This test is much simpler, so I'll try a TestState object with no methods, and leave the state-updating logic in the RPC method or persistence callback.
Constructor bodies should go in the cpp file too. We can use a simple flag in the TestState object to determine whether notify_global_persistence needs to signal the subgroup is finished, instead of hard-coding subgroup ID 2 = UnsignedObject. The revisions to persistent_bw_test can be easily copied to signed_bw_test, which is almost the same except for using a signed persistent field instead of a regular persistent field.
This test was not relying on global stability callbacks, but it was using a fragile workaround to notify the main thread when the experiment was finished, by passing a pointer to an atomic bool into the ObjectStore/SignatureStore factory and then hoping the subgroups were only constructed once. Now that I know how to use the DeserializationContext to give replicated objects pointers to the main thread, I should do it here too, so the test can support view changes without breaking the link between the ObjectStore objects and the main thread.
songweijia
approved these changes
Nov 29, 2023
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
I discovered while running regression tests for the Derecho 2.4 release that many of our unit tests had been broken by commit eb5b281, which disabled global stability callbacks for RPC ("cooked") messages. This change made sense from an API standpoint, since the global stability callback is designed for "raw" messages; it receives a pointer to the delivered byte buffer, but it can't do anything useful with the buffer if it contains serialized RPC function arguments (unless it knows the types of objects to deserialize). However, several of our unit tests that used Replicated Objects with RPC messages were relying on the global stability callback to determine when the test had finished; they ignored the contents of the message in the callback and only looked at the message number.
I rewrote these tests to use mechanisms other than the global stability callback to determine when the last message in an experiment was delivered, and what its message/version number was. In most cases this involved using the DeserializationContext feature from the mutils-serialization library, which Derecho has supported for a while but has mostly gone unused (except internally for setting the PersistentRegistry in persistent objects). An object containing experiment data (e.g. the last message number, a flag indicating when the last message was received) is created in the main thread and passed to the Derecho group constructor as a DeserializationContext, which means Derecho will then provide a pointer to that object in the DeserializationManager that gets passed to the from_bytes function when deserializing a Replicated Object. The one downside is this means all the Replicated Object classes need custom from_bytes functions (they can't use DEFAULT_SERIALIZATION_SUPPORT) so they can retrieve the DeserializationContext pointer from the DeserializationManager and use it to initialize an instance variable.
Note that at this point there is still one test in the performance_tests folder that has not been fixed: persistent_latency_test. I haven't had time to rewrite and test it yet (it's a slightly more complicated test), but I wanted to create the pull request to ensure these fixes made it in to the 2.4 release.