
Thread-safety of disconnecting slots #23

Open
saierd opened this issue Aug 3, 2021 · 16 comments

@saierd

saierd commented Aug 3, 2021

Take the following example code:

sigslot::signal<> signal;

struct Observer {
    Observer(auto& signal) {
        connection = signal.connect([this]() {
            // Use state...
        });
    }

    int state;
    sigslot::scoped_connection connection;
};

auto observer = std::make_unique<Observer>(signal);

// Thread 1:
signal(); // (1)

// Thread 2:
observer.reset(); // (2)

I thought this should be thread-safe since Observer is built in such a way that the connection goes out of scope before the shared state (observer instance and its state member) of the slot gets destroyed.

It is not, though, since (1) creates a copy of the list of slots at the beginning of the signal emission. The disconnect triggered by (2) then happens asynchronously and the callback can potentially be executed after the observer got destroyed.

  • Is this behavior on purpose? I know that signal::operator() documents that thread-safety only applies to the signal itself, but it seems like this makes it very hard to manage the lifetime of the observer. The lifetime of the shared state cannot be bound to the lifetime of the scoped_connection, but must be coupled to the lifetime of the lambda (see the sketch after this list).
  • If it is on purpose I think it would make sense to provide an option that removes the copy from signal::operator() and keeps the mutex locked while the slots are running. I don't care about the additional locking in my application, but this would make managing lifetimes of the callbacks a lot easier.
  • If I am not mistaken, this also applies to the sigslot::observer class provided by this library. The example for that class uses global state, so it doesn't have this problem. But I think this should be documented since using the observer class like in my example seems very natural.
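
For illustration, coupling the shared state to the lifetime of the lambda could look roughly like this (a sketch only; the State type and the shared_ptr capture are made up for the example):

struct State {
    int value = 0;
};

sigslot::signal<> sig;

auto state = std::make_shared<State>();
sig.connect([state]() {
    // The lambda owns a copy of the shared_ptr, so the state stays alive
    // as long as any copy of the lambda exists, including the copy held
    // by an in-flight emission.
    state->value++;
});

// Resetting our own shared_ptr no longer races with a running slot, but
// the state can no longer live directly inside a non-shared observer object.
state.reset();
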
@palacaze
Owner

palacaze commented Aug 8, 2021

Thank you for the detailed analysis. I agree with your assessment that using scoped_connections in a multi-threaded context requires careful planning, and that the current documentation does not offer enough guidance or warnings.

I actually invoked the slots under the lock in the past, but it caused thread-safety concerns; more precisely, it made deadlocks possible.

One way of mitigating the problem may be to copy a list of weak references to the actual slot objects, but that does not fix it entirely: in your use case, Thread 1 may start invoking the lambda while reset() is being called from Thread 2.

The best way to protect against such behavior is to rely on the slot tracking feature offered by the library. Here is a worked example that shows how to do it:

#include <atomic>
#include <chrono>
#include <iostream>
#include <memory>
#include <thread>
#include <sigslot/signal.hpp>

void thread1(sigslot::signal<> &sig) {
    const auto deadline = std::chrono::steady_clock::now()
                          + std::chrono::milliseconds(200);

    while (std::chrono::steady_clock::now() < deadline) {
        for (int i = 0; i < 1000; ++i) {
            sig();
        }
    }
}

struct Observer {
    static auto create(sigslot::signal<> &signal) {
        auto obs = std::make_shared<Observer>();
        signal.connect(&Observer::slot, obs);
        return obs;
    }

    std::atomic<int> state{};

private:
    void slot() {
        state++;
    }
};

int main() {
    sigslot::signal<> sig;
    auto observer = Observer::create(sig);

    std::thread t(thread1, std::ref(sig));

    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    int v = observer->state;
    observer.reset();

    std::cout << "Got " << v << std::endl;
    t.join();
}

The main selling point of this approach is that the lifetime of the Observer gets extended by locking a weak_ptr while the slot is being invoked. This has the obvious advantage of avoiding heavy locking. Would that be an acceptable solution for you?
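
Conceptually, every invocation of a tracked slot does something along these lines (simplified sketch; weak_observer and pmf are placeholder names, not the actual implementation):

if (auto locked = weak_observer.lock()) {
    // 'locked' keeps the Observer alive for the duration of the call,
    // even if the last external shared_ptr is reset concurrently.
    (locked.get()->*pmf)();
} else {
    // The target is already gone: the slot is skipped and can be
    // cleaned up later.
}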

The design and thread safety of sigslot::observer were actually discussed in bug #19 and should be fine, provided the derived class calls disconnect_all() in its destructor. I will admit that I do not use observer inheritance and have not tested this feature much, though.

@saierd
Author

saierd commented Aug 9, 2021

Thank you for your reply.

The shared_ptr approach is a valid solution for the problem. We did think about using it, but decided against it, since it requires storing the "observer" in a shared_ptr even if that is otherwise not necessary. Keeping the lock seemed like an easier solution that solves this without requiring changes to the usage of the signals.

Could you elaborate on the possible deadlocks when calling the slots under the lock? We are currently running sigslot with the following patch:

@@ -686,10 +686,8 @@ public:
         if (m_block)
             return;
 
+        lock_type lock(m_mutex);
+
         // copy slots to execute them out of the lock
         cow_copy_type<list_type, Lockable> copy = slots_copy();
 
@@ -923,7 +921,7 @@ using signal_st = signal_base<detail::null_mutex, T...>;
  * Recursive signal emission and emission cycles are supported too.
  */
 template <typename... T>
-using signal = signal_base<std::mutex, T...>;
+using signal = signal_base<std::recursive_mutex, T...>;
 
 } // namespace sigslot

This fixed the race condition for us and works as expected so far.


Regarding sigslot::observer, I am quite sure that using it like in my first example is not safe. The automated slot disconnection basically works the same way as the scoped_connection member in my example Observer class. Disconnecting explicitly in the destructor does not change anything, as long as the connection is the last member in the class and there is no inheritance involved.

@palacaze
Owner

Sure, your patch triggers a deadlock right away in the test case dedicated to detecting this problem.

Do you have any hard constraints on memory or CPU usage for avoiding shared_ptr? It is not super cheap, but it could also be argued that both keeping the lock longer and using a recursive mutex offset the cost of the shared_ptr.

I will look for a way to improve lifetime management.
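
To illustrate the class of deadlock (a hypothetical sketch, not the actual test case): with slots invoked under the signal's mutex, a slot that takes an application mutex can deadlock against another thread that already holds that mutex and emits the same signal:

#include <mutex>
#include <sigslot/signal.hpp>

std::mutex app_mutex;
sigslot::signal<> sig;

void setup() {
    sig.connect([] {
        // Runs while the emitting thread holds the signal's internal mutex.
        std::lock_guard<std::mutex> lk(app_mutex);
        // ...
    });
}

// Thread 1: acquires the signal's mutex, then waits for app_mutex.
void thread1() { sig(); }

// Thread 2: acquires app_mutex, then waits for the signal's mutex.
// Both threads now wait on each other: a classic lock-order inversion.
void thread2() {
    std::lock_guard<std::mutex> lk(app_mutex);
    sig();
}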

@saierd
Author

saierd commented Aug 13, 2021

Thank you, I did not think about such complicated signal emission scenarios. In our use case, the signals are always emitted from the same thread, so this cannot happen.

We don't want to avoid shared_ptr for its runtime cost. We want to avoid having to design everything so that it can receive signals. We have existing classes which are not stored in shared_ptr. It would be very convenient if those could receive signals simply by adding a scoped_connection.

@palacaze
Owner

Yes, thread-safe signal emission is trickier than it appears.

One idea I can think of would be to add a lightweight spin mutex per slot and lock it before slot invocation and slot destruction. That would be cheap most of the time because contention would be quite small.

With a normal spin mutex, this would also make the slot callable thread-safe as a side effect; otherwise, a RW spin mutex could be designed to allow concurrent invocation of slots.
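
A minimal sketch of such a per-slot spin mutex (illustrative only, not library code):

#include <atomic>

class spin_mutex {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() noexcept {
        // Busy-wait until the flag is acquired.
        while (flag.test_and_set(std::memory_order_acquire)) {}
    }
    void unlock() noexcept {
        flag.clear(std::memory_order_release);
    }
};

// Each slot would own one of these and lock it both around invocation and
// around disconnection/destruction, so a slot could never be destroyed
// while another thread is still inside its callable.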

@palacaze
Owner

So, I have come up with a preliminary design for this that gives strong safety guarantees for connections and lambdas in multi-threaded contexts. It should basically allow you to write the code as you expected.

I pushed it on the safe-observer branch.

Would you mind testing it to see if it solves your problems?

@palacaze
Owner

My first implementation was not satisfactory, so I pushed a second proof of concept that should work as expected. I added a couple of tests to check the behaviour.

@saierd
Author

saierd commented Feb 8, 2022

Thank you for creating the branch and sorry for the delay. We were busy with other things, but tested your changes now.

There seems to be a problem with the test_threaded_disconnection test on the branch. It fails quite often. I am not sure if this is a problem with the implementation or the test.

The implementation actually seems to work for our problem, though. We also added a sigslot test with our exact situation and it passes on the branch (and does not pass on master):

#include <cassert>
#include <chrono>
#include <thread>

#include <sigslot/signal.hpp>

static void test_disconnect_while_slot_is_running() {
    sigslot::signal<> sig;

    int shared_state = 1;

    auto slot = [&shared_state]() {
        shared_state = 1;
        std::this_thread::sleep_for(std::chrono::milliseconds(200));
        assert(shared_state != 0);
    };

    sigslot::scoped_connection conn = sig.connect(slot);

    auto emit = [&sig]() {
        sig();
    };

    std::thread emitter(emit);

    std::this_thread::sleep_for(std::chrono::milliseconds(20));

    assert(shared_state != 0);

    conn.disconnect();
    shared_state = 0;

    emitter.join();
}

@oscarh

oscarh commented Sep 20, 2022

The design and thread safety of sigslot::observer were actually discussed in bug #19 and should be fine, provided the derived class calls disconnect_all() in its destructor. I will admit that I do not use observer inheritance and have not tested this feature much, though.

I think the issue also exists in sigslot::observer. At least I can't find where the observer's lock is taken before the slot is executed, which means the observer can go out of scope while the slot is executing. I think this (GoogleTest) test case proves it:

#include <atomic>
#include <chrono>
#include <future>
#include <thread>

#include <gtest/gtest.h>
#include <sigslot/signal.hpp>

namespace {

class SigslotObserver : public sigslot::observer {
private:
    std::promise<void> _called_prom;
    std::atomic<bool> _called{false};
    std::atomic<bool> &_executing;

public:
    SigslotObserver(std::atomic<bool> &executing) :
        _executing(executing)
    {}

    ~SigslotObserver() {
        disconnect_all();
    }

    void slot(const int &msg) {
        using namespace std::literals::chrono_literals;
        _executing = true;
        if (!_called) {
            _called_prom.set_value();
            _called = true;
        }
        std::this_thread::sleep_for(2ms);
        _executing = false;
    }

    std::future<void> fut() {
        return _called_prom.get_future();
    }
};

}

TEST(SigslotObserver, SlotNotCalledAfterObserverIsDestroyed) {
    using namespace std::literals::chrono_literals;

    sigslot::signal<int> signal;
    std::atomic<bool> keep_signalling = true;
    std::thread feeder{[&](){
            while (keep_signalling) {
                signal(1);
            }
            // make one last call to the signal
            signal(1);
    }};

    std::atomic<bool> executing = false;
    {
        SigslotObserver obs{executing};
        signal.connect(&SigslotObserver::slot, &obs);
        EXPECT_EQ(obs.fut().wait_for(100ms), std::future_status::ready);
    }
    // here, the observer is out of scope, and shall be destroyed and disconnected
    // if the slot is still executing, this will have bad consequences, since it could
    // access member variables which are now out of scope
    EXPECT_EQ(executing, false);
    // we can now stop signaling and join the background thread
    keep_signalling = false;
    feeder.join();
}

@oscarh

oscarh commented Sep 20, 2022

The test above does pass on the safe-observer branch.

@oscarh

oscarh commented Sep 20, 2022

Are there any plans of merging this branch at some point in the future?

@palacaze
Owner

Are there any plans of merging this branch at some point in the future?

Yes, I consider this feature to be essential for the usability of the library. I am, however, dissatisfied with the current implementation.

Deferred destruction is a kind of garbage collection, which is not easy to get both right and cheap. This attempt relies on atomic ref-counting and spin waiting on disconnection, which might be disastrous for some use cases and also incurs a non-negligible cost on every slot invocation.

I would like to implement a better mechanism to mitigate the cost of tracking slot usage and have read about deferred ref-counting and other strategies, but those are non-trivial to implement and I have not found the time to get to it yet.
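
Very roughly, ref-counting invocations and spin waiting on disconnection amounts to something like this (simplified sketch, not the branch's actual code):

#include <atomic>
#include <thread>

struct slot_state {
    std::atomic<int>  in_flight{0};
    std::atomic<bool> connected{true};

    // Called before invoking the callable; backs out if already disconnected.
    bool try_enter() {
        in_flight.fetch_add(1, std::memory_order_acquire);
        if (!connected.load(std::memory_order_acquire)) {
            in_flight.fetch_sub(1, std::memory_order_release);
            return false;
        }
        return true;
    }

    // Called after the callable returns.
    void leave() {
        in_flight.fetch_sub(1, std::memory_order_release);
    }

    // Disconnection spin waits until no invocation is in flight, which is
    // the part that can hurt badly when a slot blocks for a long time.
    void disconnect() {
        connected.store(false, std::memory_order_release);
        while (in_flight.load(std::memory_order_acquire) != 0)
            std::this_thread::yield();
    }
};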

@oscarh

oscarh commented Sep 21, 2022

Interesting. I did discuss the while (counter > 0) spin with some of my colleagues yesterday.

I assume that a solution with a (recursive) mutex and (maybe a) condition variable would impose too much of a performance penalty, if you consider the atomic reference counting too expensive as well.
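
For illustration, that variant could look roughly like this (sketch only, names made up), trading the spin wait for a blocking wait at the cost of a mutex acquisition on every invocation:

#include <condition_variable>
#include <mutex>

struct slot_usage {
    std::mutex m;
    std::condition_variable cv;
    int in_flight = 0;

    // Called before invoking the callable.
    void enter() {
        std::lock_guard<std::mutex> lk(m);
        ++in_flight;
    }

    // Called after the callable returns; wakes a pending disconnection.
    void leave() {
        {
            std::lock_guard<std::mutex> lk(m);
            --in_flight;
        }
        cv.notify_all();
    }

    // Disconnection blocks until no invocation is in flight.
    void wait_until_idle() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return in_flight == 0; });
    }
};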

@palacaze
Owner

Yes, mutexes are not an option. I have a few ideas on how this could be solved; I will try to get to it soonish.

@oscarh

oscarh commented Oct 5, 2023

Hi, any updates on this?

@palacaze
Owner

Sorry for the delay, I have been busy with other things. I absolutely want to fix this, but it is not as easy as it looks. The proof of concept on the safe-observer branch is not good enough and fails in some corner cases.

The way I tried to fix this is basically equivalent to implementing a poor man's rc-based garbage collector, which is completely out of the scope of the project. I will look at it from another angle.
