Fix unit tests that relied on global stability callbacks #264

etremel · 2023-11-09T22:25:41Z

I discovered while running regression tests for the Derecho 2.4 release that many of our unit tests had been broken by commit eb5b281, which disabled global stability callbacks for RPC ("cooked") messages. This change made sense from an API standpoint, since the global stability callback is designed for "raw" messages; it receives a pointer to the delivered byte buffer, but it can't do anything useful with the buffer if it contains serialized RPC function arguments (unless it knows the types of objects to deserialize). However, several of our unit tests that used Replicated Objects with RPC messages were relying on the global stability callback to determine when the test had finished; they ignored the contents of the message in the callback and only looked at the message number.

I rewrote these tests to use mechanisms other than the global stability callback to determine when the last message in an experiment was delivered, and what its message/version number was. In most cases this involved using the DeserializationContext feature from the mutils-serialization library, which Derecho has supported for a while but has mostly gone unused (except internally for setting the PersistentRegistry in persistent objects). An object containing experiment data (e.g. the last message number, a flag indicating when the last message was received) is created in the main thread and passed to the Derecho group constructor as a DeserializationContext, which means Derecho will then provide a pointer to that object in the DeserializationManager that gets passed to the from_bytes function when deserializing a Replicated Object. The one downside is this means all the Replicated Object classes need custom from_bytes functions (they can't use DEFAULT_SERIALIZATION_SUPPORT) so they can retrieve the DeserializationContext pointer from the DeserializationManager and use it to initialize an instance variable.

Note that at this point there is still one test in the performance_tests folder that has not been fixed: persistent_latency_test. I haven't had time to rewrite and test it yet (it's a slightly more complicated test), but I wanted to create the pull request to ensure these fixes made it in to the 2.4 release.

Volatile variables are not meant to be used for synchronization between threads and are not guaranteed to work correctly in busy-waiting loops like we have in these tests. The correct way to ensure a variable in a busy-waiting loop is not cached, reordered, or optimized out is to use std::atomic, which we mostly do already in our newer test programs.

This has always been empty and we have never used it. The tests of scalability are in performance_tests anyway: running a test like bandwidth_test with larger and larger group sizes is how we test scalability.

Apparently, as of commit eb5b281 the global stability callback is disabled for "cooked" (RPC) messages, which means this test can't use a stability callback to detect when the last message has been received. I'm going to try this workaround, based on the shared-atomic-bool code I wrote for signed_store_test. I'm not sure if it will work but I have to commit this in order to test it on Fractus.

After talking with Weijia I think I understand how the vector of DeserializationContext pointers can be used to link replicated objects with state in the main thread. This is actually an important component of making Replicated Objects work in more complicated settings, so I added some more comments documenting what the DeserializationContexts are used for, and replaced the confusing name "rdv" with something more meaningful. I also updated the comments on global_stability_callback to note that it is not called unless you use raw messages, and removed the commented-out old code, since this design decision is not going to be changed.

Now that I know how to use DeserializationContexts to give deserialized objects pointers to local objects, this should be the proper way to make TestObject share state with the main method.

This test seems very similar to typed_subgroup_bw_test and was also relying on a global stability callback to determine when it was done. Tried a slight variation on the "TestState" object where the object has a "callback-like" method instead of just being a simple struct.

Tests are sometimes getting stuck on startup, during the construction of SST resources objects. I think the problem is within connect_endpoint but I can't tell exactly which LibFabric operation is blocking the "client" end from connecting to the "server" end.

This test also relied on the global stability callback to signal the main thread, which won't work any more. Since it also used the persistence and verified callbacks, which still work for cooked messages, I'm creating a test-state object that receives those callbacks as well as the "message delivered" callback, instead of putting some logic in the main method and some within the test state object. This also eliminates the need to create 3 counters and 3 condition variables, since each node should only be in one subgroup, and the test object will only contain the state for the local node's subgroup.

The TestState object mostly works, except that each node starts its copy of subgroup_updates_delivered at 0, even if some updates have already been sent in the group by the time the node joins. This can happen because the subgroup allocator is configured with flexible subgroups, so the Group will get an adequate configuration when the minimum number of nodes joins, and those nodes will start sending updates even if more are about to join. Other tests don't have this problem because they use fixed subgroup sizes, so no updates are sent until all the nodes join, but this test needs to include the possibility of View changes so it can make sure the signature chains work even when the version number jumps due to a View change. I really hope adding a non-persistent replicated variable to each of the test objects doesn't mess anything up, but it's the only way to allow late-joining nodes to initialize their counter to the correct value. Otherwise those nodes wait forever for the "last update" (even though they have received it).

The signed_log_test.cpp file was getting too long and unwieldy, and most of the methods couldn't be written inline in the class definitions anyway, so I split it into a header and source file. Now that I know how to use a shared test state object to get the persistent version of the last message in the experiment, I can apply the same setup to persistent_bw_test, which also relied on the old global stability callback. This test is much simpler, so I'll try a TestState object with no methods, and leave the state-updating logic in the RPC method or persistence callback.

Constructor bodies should go in the cpp file too. We can use a simple flag in the TestState object to determine whether notify_global_persistence needs to signal the subgroup is finished, instead of hard-coding subgroup ID 2 = UnsignedObject. The revisions to persistent_bw_test can be easily copied to signed_bw_test, which is almost the same except for using a signed persistent field instead of a regular persistent field.

This test was not relying on global stability callbacks, but it was using a fragile workaround to notify the main thread when the experiment was finished, by passing a pointer to an atomic bool into the ObjectStore/SignatureStore factory and then hoping the subgroups were only constructed once. Now that I know how to use the DeserializationContext to give replicated objects pointers to the main thread, I should do it here too, so the test can support view changes without breaking the link between the ObjectStore objects and the main thread.

include/derecho/core/detail/remote_invocable.hpp

include/derecho/core/detail/rpc_manager.hpp

etremel added 12 commits September 28, 2023 11:35

Removed the unused "scalability_tests" directory

f26b5cf

This has always been empty and we have never used it. The tests of scalability are in performance_tests anyway: running a test like bandwidth_test with larger and larger group sizes is how we test scalability.

Rewrote typed_subgroup_test to use DeserializationContext

eb7501e

Now that I know how to use DeserializationContexts to give deserialized objects pointers to local objects, this should be the proper way to make TestObject share state with the main method.

songweijia approved these changes Nov 29, 2023

View reviewed changes

include/derecho/core/detail/remote_invocable.hpp Show resolved Hide resolved

include/derecho/core/detail/rpc_manager.hpp Show resolved Hide resolved

songweijia merged commit 1601460 into master Nov 29, 2023
1 check passed

etremel deleted the test_fixes branch April 25, 2024 18:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix unit tests that relied on global stability callbacks #264

Fix unit tests that relied on global stability callbacks #264

etremel commented Nov 9, 2023

Fix unit tests that relied on global stability callbacks #264

Fix unit tests that relied on global stability callbacks #264

Conversation

etremel commented Nov 9, 2023