Mod_Gearman is an easy way of distributing active Nagios checks across your network and to increase nagios scalability. Mod-Gearman can even help to reduce the load on a single nagios host, because its much smaller and more efficient in executing checks. It consists of three parts:
-
There is a NEB module which resides in the Nagios core and adds servicechecks, hostchecks and eventhandler to a Gearman queue.
-
The counterpart is one or more worker clients executing the checks. Worker can be configured to only run checks for specific host- or servicegroups.
-
You need at least one Gearman Job Server running
-
Mod-Gearman has been succesfully tested with Nagios 3.2.3 and Icinga 1.2.0 running on Lib-Gearman 0.14. There are no known bugs at the moment. Let me know if you find one.
-
Debian users may be interested in the quickstart guide.
When the broker module is loaded, it captures all servicecheck, hostcheck and the eventhandler events. Eventhandler are sent to a generic eventhandler queue. Checks for hosts which are in one of the specified hostgroups, are sent into a seperate hostgroup queue. All non matching hosts are sent to a generic hosts queue. Checks for services are first checked against the list of servicegroups, then against the hostgroups and if none matches they will be sent into a generic service queue. The NEB module starts a single thread, which monitors the check_results where all results come in.
A simple example queue would look like:
+---------------+------------------+--------------+--------------+ | Queue Name | Worker Available | Jobs Waiting | Jobs Running | +---------------+------------------+--------------+--------------+ | check_results | 1 | 0 | 0 | | host | 50 | 0 | 1 | | service | 50 | 0 | 13 | | eventhandler | 50 | 0 | 0 | +---------------+------------------+--------------+--------------+
There is one queue for the results and two for the checks plus the eventhandler.
The workflow is simple:
-
Nagios wants to execute a service check.
-
The check is intercepted by the Mod-Gearman neb module.
-
Mod-Gearman puts the job into the service queue.
-
A worker grabs the job and puts back the result into the check_results queue
-
Mod-Gearman grabs the result job and puts back the result onto the check result list
-
The nagios reaper reads all checks from the result list and updates hosts and services
You can set some host or servicegroups for special worker. This example uses a seperate hostgroup for Japan and a seperate servicegroup for jmx4perl.
It would look like this:
+-----------------------+------------------+--------------+--------------+ | Queue Name | Worker Available | Jobs Waiting | Jobs Running | +-----------------------+------------------+--------------+--------------+ | check_results | 1 | 0 | 0 | | host | 50 | 0 | 1 | | service | 50 | 0 | 13 | | servicegroup_jmx4perl | 3 | 0 | 3 | | hostgroup_japan | 3 | 1 | 3 | | eventhandler | 50 | 0 | 0 | +-----------------------+------------------+--------------+--------------+
You still have the generic queues and in addition there are two queues for the specific groups.
The worker processes will take jobs from the queues and put the result back into the check_result queue which will then be taken back by the neb module and put back into the nagios core. A worker can work on one or more queues. So you could start a worker which only handles the hostgroup_japan group. One worker for the jmx4perl checks and one worker which covers the other queues. There can be more than one worker on each queue to share the load.
The easiest variant is a simple load balancing. For example if your single nagios box just cannot handle the load, you could just add a worker in the same network (or even on the same host) to reduce your load on the nagios box. Therefor we just enable hosts, services and eventhandler on the server and the worker.
If your checks have to be run from different network segments, then you can use the hostgroups (or servicegroups) to define a hostgroup for specific worker. The general hosts and services queue is disabled for this worker and just the hosts and services from the given hostgroup will be processed.
Your distributed setup could easily be extended to a load balanced setup with just adding more worker of the same config.
If you just want to replace a current NSCA solution, you could load the Mod-Gearman NEB module and disable all distribution features. You still can receive passive results by the core send via send_gearman / send_multi. Make sure you use the same encryption settings like the neb module or your core won’t be able to process the results.
Using OMD is propably the easiest way of installing and using Mod-Gearman. You just have to run omd config or set Mod-Gearman to on.
OMD[test]:~$ omd config set MOD_GEARMAN on
Note
|
Mod-Gearman is included in OMD since version 0.48. |
Pre Requirements:
-
gcc / g++
-
autoconf / automake / autoheader
-
libtool
-
libgearman (>= 0.14)
Download the tarball and perform the following steps:
#> ./configure #> make #> make install
Then add the mod_gearman.o to your nagios installation and add a broker line to your nagios.cfg:
broker_module=.../mod_gearman.o server=localhost:4730 eventhandler=yes services=yes hosts=yes
see Configuration for details on all parameters
The last step is to start one or more worker. You may use the same configuration file as for the neb module.
./mod_gearman_worker --server=localhost:4730 --services --hosts
or use the supplied init script.
Note
|
Make sure you have started your Gearmand job server. Usually it can be started with |
/usr/sbin/gearmand -t 10 -j 0
or a supplied init script (extras/gearmand-init).
Note
|
The needed patch is already included since Nagios 3.2.2. Use the patch if you use an older version. |
It is not possible to distribute eventhandler with Nagios versions prior 3.2.2. Just apply the patch from the patches directory to your Nagios sources and build Nagios again if you want to use an older version. You only need to replace the nagios binary. Nothing else has changed. If you plan to distribute only Host/Servicechecks, no patch is needed.
A sample broker in your nagios.cfg could look like:
broker_module=/usr/local/share/nagios/mod_gearman.o keyfile=/usr/local/share/nagios/secret.txt server=localhost eventhandler=yes hosts=yes services=yes
See the following list for a detailed explanation of available options:
Shared options for worker and the NEB module:
- config
-
include config from this file. Options are the same as described here.
config=/etc/nagios3/mod_gm_worker.conf
- debug
-
use debug to increase the verbosity of the module. Possible values are:
-
0
- only errors -
1
- debug messages -
2
- trace messages
Default is 0.
debug=1
-
- logmode
-
set way of logging. Possible values are:
-
automatic
- logfile when a logfile is specified. stdout when no logfile is given. stdout for tools. -
stdout
- just print all messages to stdout -
syslog
- use syslog for all log events -
file
- use logfile -
core
- use nagios internal loging (not thread safe! Use with care)
Default is automatic.
debug=1
-
- logfile
-
Path to the logfile.
logfile=/path/to/log.file
- server
-
sets the address of your gearman job server. Can be specified more than once to add more server. Mod-Gearman uses the first server available.
server=localhost:4730,remote_host:4730
- eventhandler
-
defines if the module should distribute execution of eventhandlers.
eventhandler=yes
- services
-
defines if the module should distribute execution of service checks.
services=yes
- hosts
-
defines if the module should distribute execution of host checks.
hosts=yes
- hostgroups
-
sets a list of hostgroups which will go into seperate queues.
hostgroups=name1,name2,name3
- servicegroups
-
sets a list of servicegroups which will go into seperate queues.
servicegroups=name1,name2,name3
- encryption
-
enables or disables encryption. It is strongly advised to not disable encryption. Anybody will be able to inject packages to your worker. Encryption is enabled by default and you have to explicitly disable it. When using encryption, you will either have to specify a shared password with
key=...
or a keyfile withkeyfile=...
Default is On.encryption=yes
- key
-
A shared password which will be used for encryption of data pakets. Should be at least 8 bytes long. Maximum length is 32 characters.
key=secret
- keyfile
-
The shared password will be read from this file. Use either key or keyfile. Only the first 32 characters from the first line will be used. Whitespace to the right will be trimmed.
keyfile=/path/to/secret.file
Additional options for the NEB module only:
- localhostgroups
-
sets a list of hostgroups which will not be executed by gearman. They are just passed through.
localhostgroups=name1,name2,name3
- localservicegroups
-
sets a list of servicegroups which will not be executed by gearman. They are just passed through.
localservicegroups=name1,name2,name3
- do_hostchecks
-
Set this to no if you want Mod-Gearman to only take care about servicechecks. No hostchecks will be processed by Mod-Gearman. Use this option to disable hostchecks and still have the possibility to use hostgroups for easy configuration of your services. If set to yes, you still have to define which hostchecks should be processed by either using hosts or the hostgroups option. Default:
yes
do_hostchecks=yes
- result_workers
-
Number of result worker threads. Usually one is enough. You may increase the value if your result queue is not processed fast enough.
result_workers=3
- perfdata
-
defines if the module should distribute perfdata to gearman. Note: processing of perfdata is not part of mod_gearman. You will need additional worker for handling performance data. For example: PNP4Nagios Performance data is just written to the gearman queue.
perfdata=yes
- perfdata_mode
-
There will be only a single job for each host or servier when putting performance data onto the perfdata_queue in overwrite mode. In append mode perfdata will be stored as long as there is memory left. Setting this to overwrite helps preventing the perf_data queue from getting to big. Monitor your perfdata carefully when using the append mode. Possible values are:
-
1
- overwrite -
2
- append
Default is 1.
perfdata_mode=1
-
- result_queue
-
sets the result queue. Necessary when putting jobs from several nagios instances onto the same gearman queues. Default:
check_results
result_queue=check_results_nagios1
Additional options for worker:
- identifier
-
Identifier for this worker. Will be used for the worker_identifier queue for status requests. You may want to change it if you are using more than one worker on a single host. Defaults to the current hostname.
identifier=hostname_test
- pidfile
-
Path to the pidfile.
pidfile=/path/to/pid.file
- min-worker
-
Minimum number of worker processes which should run at any time. Default: 1
min-worker=1
- max-worker
-
Maximum number of worker processes which should run at any time. You may set this equal to min-worker setting to disable dynamic starting of workers. When setting this to 1, all services from this worker will be executed one after another. Default: 20
max-worker=20
- spawn-rate
-
Defines the rate of spawed worker per second as long as there are jobs waiting. Default: 1
spawn-rate=1
- idle-timeout
-
Time in seconds after which an idling worker exits. This parameter controls how fast your waiting workers will exit if there are no jobs waiting. Set to 0 to disable the idle timeout. Default: 10
idle-timeout=30
- max-jobs
-
Controls the amount of jobs a worker will do before he exits. Use this to control how fast the amount of workers will go down after high load times. Disabled when set to 0. Default: 1000
max-jobs=500
- fork_on_exec
-
Use this option to disable an extra fork for each plugin execution. This option will reduce the load on the worker host. Default: yes
fork_on_exec=no
- dupserver
-
sets the address of gearman job server where duplicated result will be sent to. Can be specified more than once to add more server. Useful for duplicating results for a reporting installation or remote gui.
dupserver=logserver:4730,logserver2:4730
You may want to watch your gearman server job queue. The shipped tools/queue_top.pl does this. It polls the gearman server every second and displays the current queue statistics.
+-----------------------+--------+-------+-------+---------+ | Name | Worker | Avail | Queue | Running | +-----------------------+--------+-------+-------+---------+ | check_results | 1 | 1 | 0 | 0 | | host | 3 | 3 | 0 | 0 | | service | 3 | 3 | 0 | 0 | | eventhandler | 3 | 3 | 0 | 0 | | servicegroup_jmx4perl | 3 | 3 | 0 | 0 | | hostgroup_japan | 3 | 3 | 0 | 0 | +-----------------------+--------+-------+-------+---------+
- check_results
-
this queue is monitored by the neb module to fetch results from the worker. You don’t need an extra worker for this queue. The number of result workers can be set to a maximum of 256, but usually one is enough. One worker is capable of processing several thousand results per second.
- host
-
This is the queue for generic host checks. If you enable host checks with the hosts=yes switch. Before a host goes into this queue, it is checked if any of the local groups matches or a seperate hostgroup machtes. If nothing matches, then this queue is used.
- service
-
This is the queue for generic service checks. If you enable service checks with the
services=yes
switch. Before a service goes into this queue it is checked against the local host- and service-groups. Then the normal host- and servicegroups are checked and if none matches, this queue is used. - hostgroup_<name>
-
This queue is created for every hostgroup which has been defined by the hostgroups=… option. Make sure you have at least one worker for every hostgroup you specify. Start the worker with
--hostgroups=...
to work on hostgroup queues. Note that this queue may also contain service checks if the hostgroup of a service matches. - servicegroup_<name>
-
This queue is created for every servicegroup which has been defined by the
servicegroup=...
option. - eventhandler
-
This is the generic queue for all eventhandler. Make sure you have a worker for this queue if you have eventhandler enabled. Start the worker with
--events
to work on this queue. - perfdata
-
This is the generic queue for all performance data. It is created and used if you switch on
--perfdata=yes
. Performance data cannot be processed by the gearman worker itself. You will need PNP4Nagios therefor.
While the main motivation was to ease distributed configuration, this plugin also helps to spread the load on multiple worker. Throughput is mainly limited by the amount of jobs a single nagios instance can put onto the Gearman job server. Keep the Gearman job server close to the nagios box. Best practice is to put both on the same machine. Both processes will utilize one core. Some testing with my workstation (Dual Core 2.50GHz) and two worker boxes gave me these results. I used a sample Nagios installation with 20.000 Services at a 1 minute interval and a sample plugin which returns just a single line of output. I got over 300 Servicechecks per second, which means you could easily setup 100.000 services at a 5 minute interval with a single nagios box. The amount of worker boxes depends on your check types.
Use the supplied check_gearman to monitor your worker and job server. Worker have a own queue for status requests.
%> ./check_gearman -H <job server hostname> -q worker_<worker hostname> -t 10 -s check check_gearman OK - localhost has 10 worker and is working on 1 jobs|worker=10 running=1 total_jobs_done=1508
This will send a test job to the given job server and the worker will respond with some statistical data.
Job server can be monitored with:
%> ./check_gearman -H localhost -t 20 check_gearman OK - 6 jobs running and 0 jobs waiting.|check_results=0;0;1;10;100 host=0;0;9;10;100 service=0;6;9;10;100
You can use send_gearman to submit active and passive checks to a gearman job server where they will be processed just like a finished check would do.
%> ./send_gearman --server=<job server> --encryption=no --host="<hostname>" --service="<service>" --message="message"
check_multi is a plugin which executes multiple child checks. See more details about the feed_passive mode at: www.my-plugin.de
You can pass such child checks to Nagios via the mod_gearman neb module:
%> check_multi -f multi.cmd -r 256 | ./send_multi --server=<job server> --encryption=no --host="<hostname>" --service="<service>"
If you want to use only check_multi and no other workers, you can achieve this with the following neb module settings:
broker_module=/usr/local/share/nagios/mod_gearman.o server=localhost encryption=no eventhandler=no hosts=no services=no hostgroups=does_not_exist
Note
|
encryption is not necessary if you both run the check_multi checks and the nagios check_results queue on the same server. |
Notifications are very difficult to distribute. And i think its not very useful too. So this feature will not be implemented in the near future.
-
Make sure you have at least one worker for every queue. You should monitor that (check_gearman).
-
Add Logfile checks for your gearmand server and mod_gearman worker.
-
Make sure all gearman checks are in local groups. Gearman self checks should not be monitored through gearman.
-
Checks which write directly to the nagios command file (ex.: check_mk) have to run on a local worker or have to be excluded by the localservicegroups.
-
Keep the gearmand server close to Nagios for better performance.
-
If you have some checks which should not run parallel, just setup a single worker with --max-worker=1 and they will be executed one after another. For example for cpu intesive checks with selenium.
-
Make sure all your worker have the nagios-plugins available under the same path. Otherwise they could’nt be found by the worker.
-
Mod Gearman is available for download at: http://labs.consol.de/nagios/mod-gearman
-
The source is available at GitHub: http://github.com/sni/mod_gearman