
High Availability: Watchdog and Heart Beat

Jeremy CHEN edited this page Nov 19, 2020 · 4 revisions

Introduction

Availability becomes more and more critical as more vehicle-related functions are integrated into a single domain. Watchdog and heart beat are effective measures for ensuring high availability. FDBus is equipped with a built-in watchdog that monitors the health of SW components by exchanging heart beats between endpoints as a keep-alive signal, on which a reliable and robust infrastructure can be built. Typically the heart beat is initiated by a server endpoint, and the connected client endpoints bounce it back as soon as possible to show that they are still alive. In addition to heart beats started by individual servers, the name server can act as a global watchdog that monitors all FDBus components, since they are all connected to the name server as clients.

Name Server as Global Watchdog

Each host (QNX/Linux/Windows...) is deployed with a name server responsible for service name resolution. Each process in the system is therefore connected to the name server as a client in order to request its service. If enabled, a watchdog can be started by the name server to maintain a heart beat with the clients and monitor their availability. If a client fails to respond within a certain time limit, it is regarded as 'abnormal' and the name server broadcasts a death signal. A 'super monitor' can subscribe to the death signal and take proper measures upon receiving it, such as restarting the non-responsive client.

                                                                              +-----------------------+
                                                            /---heart beat----| NS client (process 1) |
        /----------------monitor----------------\           |                 +-----------------------+
        |                                       |           |
+-------+-------+                        +------V------+    /                 +-----------------------+
| super monitor |<------death signal-----| name server |--------heart beat----| NS client (process 2) |
+---------------+                        +-------------+    \                 +-----------------------+
                                                            |
                                                            |                 +-----------------------+
                                                            \---heart beat----| NS client (process N) |
                                                                              +-----------------------+
NS client: Client of Name Server
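
The flow in the diagram can be modelled with a small piece of bookkeeping. The following is only a sketch, not FDBus code: WatchdogModel, ClientState, tick() and onResponse() are hypothetical names, and the rule that a client is dropped once it has missed more than max_retries consecutive heart beats is an assumption made for illustration.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical model of the name server's bookkeeping; not an FDBus type.
struct ClientState
{
    std::string name;   // name of the client endpoint
    int32_t missed;     // consecutive heart beats without a response
};

class WatchdogModel
{
public:
    explicit WatchdogModel(int32_t max_retries) : mMaxRetries(max_retries) {}

    void addClient(int32_t pid, const std::string &name)
    {
        mClients[pid] = ClientState{name, 0};
    }

    // A response to a heart beat clears the miss counter of that client.
    void onResponse(int32_t pid)
    {
        auto it = mClients.find(pid);
        if (it != mClients.end())
        {
            it->second.missed = 0;
        }
    }

    // Called once per heart beat interval. Every client is charged one miss;
    // clients exceeding the limit are removed and returned (the 'death signal').
    std::vector<int32_t> tick()
    {
        std::vector<int32_t> dropped;
        for (auto it = mClients.begin(); it != mClients.end();)
        {
            if (++it->second.missed > mMaxRetries)
            {
                dropped.push_back(it->first);
                it = mClients.erase(it);
            }
            else
            {
                ++it;
            }
        }
        return dropped;
    }

private:
    int32_t mMaxRetries;
    std::map<int32_t, ClientState> mClients;
};
```

In this model a client that answers every heart beat never accumulates more than one miss, while a silent client is reported on the tick after its retries are used up.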

The following code is implemented by the super monitor to subscribe to and handle 'heart failure' of NS clients:

int main(int argc, char **argv)
{
    FDB_CONTEXT->start();
    // pass a lambda which will be called once death of an NS client is detected by the name server
    FDB_CONTEXT->registerNsWatchdogListener([](const tNsWatchdogList &dropped_list)
        {
            for (auto it = dropped_list.begin(); it != dropped_list.end(); ++it)
            {
                // mPid represents pid of the client process
                // mClientName represents name of client endpoint
                FDB_LOG_F("Error!!! Endpoint drops - name: %s, pid: %d\n", it->mClientName.c_str(), it->mPid);
                // Do something upon the process represented by its pid in it->mPid.
            }
        });
    // Keep the process running from here on (omitted); otherwise main() returns
    // before any death signal can be delivered.
    return 0;
}

To enable the global watchdog, add the following option when starting the name server:


> name_server -d 'interval:retries'
  interval - time span between two consecutive heart beat signals
  retries - how many heart beats are sent without a response before death is declared to the 'super monitor'
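
To make the 'interval:retries' format concrete, here is a minimal sketch of how such an option string could be split and validated. It is not taken from the name server source: WatchdogOption and parseWatchdogOption() are hypothetical names, and the unit of interval is whatever name_server expects, which this page does not state.

```cpp
#include <cstddef>
#include <cstdint>
#include <exception>
#include <string>

// Hypothetical holder for the two values of the '-d' option; not an FDBus type.
struct WatchdogOption
{
    int32_t interval;   // span between two consecutive heart beats
    int32_t retries;    // unanswered heart beats tolerated before declaring death
};

// Parses "interval:retries". Returns false on malformed input and leaves
// 'out' untouched in that case.
bool parseWatchdogOption(const std::string &text, WatchdogOption &out)
{
    const std::string::size_type colon = text.find(':');
    if (colon == std::string::npos || colon == 0 || colon + 1 >= text.size())
    {
        return false;
    }
    const std::string left = text.substr(0, colon);
    const std::string right = text.substr(colon + 1);
    try
    {
        std::size_t used_left = 0;
        std::size_t used_right = 0;
        const int interval = std::stoi(left, &used_left);
        const int retries = std::stoi(right, &used_right);
        // Reject trailing garbage such as "10x" and values that make no sense.
        if (used_left != left.size() || used_right != right.size() ||
            interval <= 0 || retries < 0)
        {
            return false;
        }
        out.interval = interval;
        out.retries = retries;
        return true;
    }
    catch (const std::exception &)
    {
        // std::stoi throws on non-numeric or out-of-range input.
        return false;
    }
}
```

With this sketch, a string such as '1000:5' (example values only) yields interval 1000 and retries 5, while '1000', '1000:' and ':5' are rejected.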

Generic Watchdog in Server

The watchdog of the name server is just one application of the generic FDBus watchdog. A server can enable the built-in watchdog to monitor all connected clients by calling one of the following two methods:

CFdbBaseObject::enableWatchdog(true);

or

CFdbBaseObject::startWatchdog(int32_t interval, int32_t max_retries)

Once enabled, heart beats between the server and its clients are activated, so the server can detect any client that stops responding. On the client side, the default action upon receiving a heart beat is to send the response to the server immediately from the FDBus context thread. The client can override the default action to monitor more threads inside the client process. The following is an example:

class CMyClient : public CBaseClient
{
protected:
    // override onKickDog() to handle heart beat by yourself.
    void onKickDog(CBaseJob::Ptr &msg_ref)
    {
        // CFdbMessage::kickDog() sends the message represented by 'msg_ref' to
        // another worker to execute the specified lambda. In the lambda
        // CFdbMessage::feedDog(msg_ref) is called and response of the heart beat
        // is sent to the server indicating I'm still alive.
        CFdbMessage::kickDog(msg_ref, worker(), [](CBaseJob::Ptr &msg_ref)
            {
                CFdbMessage::feedDog(msg_ref);
                // or you can call CFdbMessage::kickDog() again if you would
                // like to check healthy of another worker.
            });
    }
};
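
The idea of passing one heart beat through several workers before answering it can be illustrated without FDBus. The sketch below uses only the C++ standard library; SketchWorker and kickChain() are hypothetical stand-ins for an FDBus worker and for the kickDog()/feedDog() chain, not the library's implementation. The feed callback runs only after every worker in the chain has executed its hop, so a single stuck worker prevents the response.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a worker thread with a job queue; not an FDBus class.
class SketchWorker
{
public:
    SketchWorker() : mStop(false), mThread(&SketchWorker::run, this) {}

    ~SketchWorker()
    {
        {
            std::lock_guard<std::mutex> lock(mMutex);
            mStop = true;
        }
        mCond.notify_one();
        mThread.join();
    }

    // Queue a job to be executed on this worker's thread.
    void post(std::function<void()> job)
    {
        {
            std::lock_guard<std::mutex> lock(mMutex);
            mJobs.push_back(std::move(job));
        }
        mCond.notify_one();
    }

private:
    void run()
    {
        for (;;)
        {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mMutex);
                mCond.wait(lock, [this] { return mStop || !mJobs.empty(); });
                if (mJobs.empty())
                {
                    return;   // stop requested and nothing left to do
                }
                job = std::move(mJobs.front());
                mJobs.pop_front();
            }
            job();   // run outside the lock so jobs may post to other workers
        }
    }

    std::mutex mMutex;
    std::condition_variable mCond;
    std::deque<std::function<void()>> mJobs;
    bool mStop;
    std::thread mThread;   // declared last: starts after the other members exist
};

// Hands the heart beat to workers[index], which forwards it to the next one.
// 'feed' is called on the last worker's thread once all hops have run.
void kickChain(const std::vector<SketchWorker *> &workers, std::size_t index,
               std::function<void()> feed)
{
    if (index >= workers.size())
    {
        feed();
        return;
    }
    workers[index]->post([workers, index, feed] { kickChain(workers, index + 1, feed); });
}
```

The workers must outlive the chain, since the sketch stores raw pointers to them.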

On the server side, once a non-responsive (dead) client is detected, you can get notified by overriding the onBark() method as below:

// override onBark() to take action upon death of a client represented by 'session'
void CFdbBaseObject::onBark(CFdbSession *session)
{
    // session->pid() represents pid of the client process
    // session->getEndpointName() represents name of client endpoint
    LOG_F("CFdbBaseObject: NAME %s, PID %d: NO RESPONSE!!!\n", session->getEndpointName().c_str(),
                                                               session->pid());
}

Conclusion

FDBus provides a built-in, easy-to-use watchdog that makes it possible to establish a reliable, highly available platform for modern vehicle architecture.