Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix unit tests that relied on global stability callbacks #264

Merged
merged 12 commits into from
Nov 29, 2023
Merged

Conversation

etremel
Copy link
Contributor

@etremel etremel commented Nov 9, 2023

I discovered while running regression tests for the Derecho 2.4 release that many of our unit tests had been broken by commit eb5b281, which disabled global stability callbacks for RPC ("cooked") messages. This change made sense from an API standpoint, since the global stability callback is designed for "raw" messages; it receives a pointer to the delivered byte buffer, but it can't do anything useful with the buffer if it contains serialized RPC function arguments (unless it knows the types of objects to deserialize). However, several of our unit tests that used Replicated Objects with RPC messages were relying on the global stability callback to determine when the test had finished; they ignored the contents of the message in the callback and only looked at the message number.

I rewrote these tests to use mechanisms other than the global stability callback to determine when the last message in an experiment was delivered, and what its message/version number was. In most cases this involved using the DeserializationContext feature from the mutils-serialization library, which Derecho has supported for a while but has mostly gone unused (except internally for setting the PersistentRegistry in persistent objects). An object containing experiment data (e.g. the last message number, a flag indicating when the last message was received) is created in the main thread and passed to the Derecho group constructor as a DeserializationContext, which means Derecho will then provide a pointer to that object in the DeserializationManager that gets passed to the from_bytes function when deserializing a Replicated Object. The one downside is this means all the Replicated Object classes need custom from_bytes functions (they can't use DEFAULT_SERIALIZATION_SUPPORT) so they can retrieve the DeserializationContext pointer from the DeserializationManager and use it to initialize an instance variable.

Note that at this point there is still one test in the performance_tests folder that has not been fixed: persistent_latency_test. I haven't had time to rewrite and test it yet (it's a slightly more complicated test), but I wanted to create the pull request to ensure these fixes made it in to the 2.4 release.

Volatile variables are not meant to be used for synchronization between
threads and are not guaranteed to work correctly in busy-waiting loops
like we have in these tests. The correct way to ensure a variable in a
busy-waiting loop is not cached, reordered, or optimized out is to use
std::atomic, which we mostly do already in our newer test programs.
This has always been empty and we have never used it. The tests of
scalability are in performance_tests anyway: running a test like
bandwidth_test with larger and larger group sizes is how we test
scalability.
Apparently, as of commit eb5b281 the
global stability callback is disabled for "cooked" (RPC) messages, which
means this test can't use a stability callback to detect when the last
message has been received. I'm going to try this workaround, based on
the shared-atomic-bool code I wrote for signed_store_test. I'm not sure
if it will work but I have to commit this in order to test it on
Fractus.
After talking with Weijia I think I understand how the vector of
DeserializationContext pointers can be used to link replicated objects
with state in the main thread. This is actually an important component
of making Replicated Objects work in more complicated settings, so I
added some more comments documenting what the DeserializationContexts
are used for, and replaced the confusing name "rdv" with something more
meaningful. I also updated the comments on global_stability_callback to
note that it is not called unless you use raw messages, and removed the
commented-out old code, since this design decision is not going to be
changed.
Now that I know how to use DeserializationContexts to give deserialized
objects pointers to local objects, this should be the proper way to make
TestObject share state with the main method.
This test seems very similar to typed_subgroup_bw_test and was also
relying on a global stability callback to determine when it was done.
Tried a slight variation on the "TestState" object where the object has
a "callback-like" method instead of just being a simple struct.
Tests are sometimes getting stuck on startup, during the construction of
SST resources objects. I think the problem is within connect_endpoint
but I can't tell exactly which LibFabric operation is blocking the
"client" end from connecting to the "server" end.
This test also relied on the global stability callback to signal the
main thread, which won't work any more. Since it also used the
persistence and verified callbacks, which still work for cooked
messages, I'm creating a test-state object that receives those callbacks
as well as the "message delivered" callback, instead of putting some
logic in the main method and some within the test state object. This
also eliminates the need to create 3 counters and 3 condition variables,
since each node should only be in one subgroup, and the test object will
only contain the state for the local node's subgroup.
The TestState object mostly works, except that each node starts its copy
of subgroup_updates_delivered at 0, even if some updates have already
been sent in the group by the time the node joins. This can happen
because the subgroup allocator is configured with flexible subgroups, so
the Group will get an adequate configuration when the minimum number of
nodes joins, and those nodes will start sending updates even if more are
about to join. Other tests don't have this problem because they use
fixed subgroup sizes, so no updates are sent until all the nodes join,
but this test needs to include the possibility of View changes so it can
make sure the signature chains work even when the version number jumps
due to a View change. I really hope adding a non-persistent replicated
variable to each of the test objects doesn't mess anything up, but it's
the only way to allow late-joining nodes to initialize their counter to
the correct value. Otherwise those nodes wait forever for the "last
update" (even though they have received it).
The signed_log_test.cpp file was getting too long and unwieldy, and most
of the methods couldn't be written inline in the class definitions
anyway, so I split it into a header and source file.

Now that I know how to use a shared test state object to get the
persistent version of the last message in the experiment, I can apply
the same setup to persistent_bw_test, which also relied on the old
global stability callback. This test is much simpler, so I'll try a
TestState object with no methods, and leave the state-updating logic in
the RPC method or persistence callback.
Constructor bodies should go in the cpp file too. We can use a simple
flag in the TestState object to determine whether
notify_global_persistence needs to signal the subgroup is finished,
instead of hard-coding subgroup ID 2 = UnsignedObject.

The revisions to persistent_bw_test can be easily copied to
signed_bw_test, which is almost the same except for using a signed
persistent field instead of a regular persistent field.
This test was not relying on global stability callbacks, but it was
using a fragile workaround to notify the main thread when the experiment
was finished, by passing a pointer to an atomic bool into the
ObjectStore/SignatureStore factory and then hoping the subgroups were
only constructed once. Now that I know how to use the
DeserializationContext to give replicated objects pointers to the main
thread, I should do it here too, so the test can support view changes
without breaking the link between the ObjectStore objects and the main
thread.
@songweijia songweijia merged commit 1601460 into master Nov 29, 2023
1 check passed
@etremel etremel deleted the test_fixes branch April 25, 2024 18:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants