This repository has been archived by the owner on Jun 30, 2018. It is now read-only.

Not auto acknowledging alerts? #5

Closed
eroji opened this issue Nov 11, 2015 · 13 comments

Comments

@eroji

eroji commented Nov 11, 2015

I am able to get the script to run with the API key I generated, and it seems to be able to pull current alerts in Nagios when I run it in debug, but the acknowledgement posted on PagerDuty never makes its way back to Nagios. It would be great if I could get some insight on how to make this work, as we'd prefer not to have to open up the Nagios portal on the WAN side.

@scottnotrobot
Contributor

i'm guessing you have a permissions issue... i should probably add to the documentation that you probably want to run this poller as the nagios user, and when debugging you probably want to run it with sudo -u nagios. you don't need to open any wan port: you would run this poller on the nagios server, and it will just contact pagerduty directly and then write to the command pipe already open on the server.
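
To illustrate what "write to the command pipe" involves, here is a minimal Python sketch that formats and writes a Nagios `ACKNOWLEDGE_SVC_PROBLEM` external command. It is not the poller's own code. The pipe path, the helper names, and the sticky/notify/persistent flag values are assumptions chosen for illustration; check `command_file` in your nagios.cfg and the external command documentation for your Nagios version.

```python
import time

# Assumed path (typical for a source install); the real one is the
# command_file setting in nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def format_ack_command(host, service, author, comment,
                       sticky=2, notify=1, persistent=0, now=None):
    """Build one ACKNOWLEDGE_SVC_PROBLEM external command line.

    The flag defaults are illustrative, not taken from the poller.
    """
    timestamp = int(time.time() if now is None else now)
    return "[{}] ACKNOWLEDGE_SVC_PROBLEM;{};{};{};{};{};{};{}\n".format(
        timestamp, host, service, sticky, notify, persistent, author, comment
    )


def send_ack(host, service, author, comment, command_file=COMMAND_FILE):
    """Write the command to the Nagios command pipe.

    Opening the pipe fails with a permission error unless the calling
    user (normally nagios) is allowed to write to it.
    """
    with open(command_file, "w") as pipe:
        pipe.write(format_ack_command(host, service, author, comment))
```

If `send_ack` raises a `PermissionError` when run as your own user but works under `sudo -u nagios`, that would point at the permissions issue described above.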

@eroji
Author

eroji commented Nov 12, 2015

I am running it as the nagios user, but it still doesn't seem to be updating. Running it in debug mode shows that it is able to pull up a list of current alerts for the Nagios instance, but no acknowledgement.

@eroji
Author

eroji commented Nov 12, 2015

How is it matching the acknowledgement back to the Nagios alert? We changed the message sent to PagerDuty slightly to include the incident error message into the title. I'm wondering if the script is not able to match the incident back to Nagios because of that.

@scottnotrobot
Contributor

oh yes, this mechanism does depend on a few environment macros / fields passed by pagerduty_nagios.pl, in particular HOSTDISPLAYNAME, SERVICEDISPLAYNAME, and SERVICEPROBLEMID. the alerts emitted by pagerduty probably use the former 2, and if you have mutated those by adding data like the status message, then this won't be able to use those to identify/key the correct service sent by this to the nagios command pipe. if this is your problem, then perhaps instead you could mutate the LASTSERVICESTATE macro, which would not interfere this way.
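
A rough Python sketch of why a mutated name would break the matching: if the incident's fields are used as a lookup key, any extra text appended to them no longer equals the name Nagios knows. This is only an illustration of the idea, not the poller's actual logic. The function names, the `details` dictionary, and the `known_services` set are hypothetical.

```python
def service_key(details):
    """Return (host, service) from an incident's macro fields.

    Returns None when either field is missing or empty.
    """
    host = details.get("HOSTDISPLAYNAME")
    service = details.get("SERVICEDISPLAYNAME")
    if not host or not service:
        return None
    return (host, service)


def find_service(details, known_services):
    """Return the key only if it exactly matches a service Nagios knows."""
    key = service_key(details)
    return key if key in known_services else None


# Hypothetical data: one service as Nagios would report it.
known_services = {("web01", "Postfix")}

stock = {"HOSTDISPLAYNAME": "web01", "SERVICEDISPLAYNAME": "Postfix"}
mutated = {"HOSTDISPLAYNAME": "web01",
           "SERVICEDISPLAYNAME": "Postfix: CRITICAL - connection refused"}
```

With this data, `find_service(stock, known_services)` returns the key, while `find_service(mutated, known_services)` returns `None` because the service name no longer matches exactly.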

@eroji
Author

eroji commented Nov 13, 2015

So I just put the command for the PagerDuty alert back to the stock way and it is still not acknowledging incidents back to Nagios. I don't think it's a problem of the incident title at this point. What should I expect to see in the script output in debug mode?

@scottnotrobot
Contributor

i think at this point i'd need to see some raw data. could you log into pagerduty and click on the incident number and save the page, and also send me your status.dat, and also the contents of /tmp/pd_ack_to_nagios_ack_poller.last_id or wherever you put that state file. could you share these with me via gist or google drive?

@scottnotrobot
Contributor

ok i got the incident page and last_id file, but not the status.dat

however, i already see a problem... under the second "Details" table (the one with the grey background), i only see 3 fields, CONTACTPAGER, pd_nagios_object, and SERVICEOUTPUT. there should be many many more "macro" fields, including the requisite HOSTDISPLAYNAME, SERVICEDISPLAYNAME, and SERVICEPROBLEMID.

i think this means you might have disabled some environment macros? some people do this for performance reasons:

https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/tuning.html

https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/configmain.html#enable_environment_macros

i would have thought this would have broken https://github.com/PagerDuty/pagerduty-nagios-pl too, as the setting is specifically mentioned in its README.md. can you make sure you have enable_environment_macros=1, and that the macros are not being filtered somehow?
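
One way to check the setting without opening the file by hand is sketched below in Python. It assumes the main config file uses plain `name=value` lines with `#` comments; the helper name and the example path are hypothetical, and how Nagios itself handles a directive that appears more than once is not verified here (the sketch simply reports the last assignment it finds).

```python
def read_directive(cfg_text, name):
    """Return the last value assigned to `name` in nagios.cfg-style text.

    Returns None when the directive does not appear at all, in which
    case Nagios falls back to its built-in default.
    """
    value = None
    for line in cfg_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, val = line.partition("=")
        if sep and key.strip() == name:
            value = val.strip()
    return value


def environment_macros_enabled(cfg_path):
    """Read the file at cfg_path and report whether the directive is '1'."""
    with open(cfg_path) as handle:
        return read_directive(handle.read(), "enable_environment_macros") == "1"
```

For example, `environment_macros_enabled("/usr/local/nagios/etc/nagios.cfg")` would be the call on a source install that uses that path. Nagios still has to be restarted for a changed value to take effect.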

@eroji
Author

eroji commented Nov 17, 2015

I turned it back on but it's still behaving the same. Here is a zip of all 3 requested items.

https://drive.google.com/file/d/0B_dX7tp_c7k0UTZjejRQOHFtREU/view?usp=sharing

@scottnotrobot
Contributor

i still don't see the necessary fields in your incident detail page... e.g. in my environment there are almost 200 fields. i assume you restarted nagios after setting enable_environment_macros? also, i'm using nagios 3.5, but i see you're using 4.1.1... i wonder if the macro behavior has changed. i do see in the changelog for nagios that there was a bug in macros that was supposedly fixed in 4.0.3, but i wonder. i think you'll need some way to verify that compatible environment vars are being created. you should be able to catch a plugin's environment with something like this:

cat /proc/*/environ | strings | grep NAGIOS_SERVICEDESC

(be careful with this though... might be risky depending on workload or what else is running on your nagios server... and you might need to try several times before you catch a plugin process running)

also, i wonder what happens if you're using the embedded perl interpreter...
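
The `/proc` check above can also be sketched in Python, which makes it easier to see every `NAGIOS_` variable a process carries rather than a single one. This is a hedged sketch for Linux only. The function names are hypothetical, and the same caveats apply: a plugin process has to be running at the moment of the scan, and only processes readable by the current user are visible.

```python
import glob


def parse_environ(raw, prefix="NAGIOS_"):
    """Decode a NUL-separated environ blob into a dict of matching variables."""
    result = {}
    for entry in raw.split(b"\0"):
        name, sep, value = entry.partition(b"=")
        if sep and name.startswith(prefix.encode()):
            result[name.decode(errors="replace")] = value.decode(errors="replace")
    return result


def scan_processes(prefix="NAGIOS_"):
    """Yield (path, variables) for each readable process with matching variables."""
    for path in glob.glob("/proc/[0-9]*/environ"):
        try:
            with open(path, "rb") as handle:
                found = parse_environ(handle.read(), prefix)
        except OSError:
            continue  # process already exited, or not readable by this user
        if found:
            yield path, found
```

Looping over `scan_processes()` a few times while Nagios is running checks and notifications should show whether variables such as `NAGIOS_SERVICEDESC` are being exported at all.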

@eroji
Author

eroji commented Nov 19, 2015

I ran a while loop using the command you gave and I'm seeing some data being created as Nagios is processing stuff. Enabling the embedded perl interpreter did not help with anything. For instance, the cat command outputs stuff like:

NAGIOS_SERVICEDESC=tungsten_latency
NAGIOS_SERVICEDESC=tungsten
NAGIOS_SERVICEDESC=Test Loan Process
NAGIOS_SERVICEDESC=Test Loan Process
NAGIOS_SERVICEDESC=Linux / dir
NAGIOS_SERVICEDESC=Postfix

@eroji
Author

eroji commented Nov 19, 2015

Ok, problem solved. It appears the pagerduty_nagios.pl script wasn't running properly. Once that was fixed, this script started working. Thanks for all your assistance!

@eroji eroji closed this as completed Nov 19, 2015
@scottnotrobot
Contributor

great! please comment here if you think the details of your solution are something anyone else could benefit from. best of luck!
