10 messages in net.sourceforge.lists.nagios-users

From                       Sent on
Mark...@teliasonera.com    Nov 7, 2008 1:49 am
Andreas Ericsson           Nov 7, 2008 6:07 am
Mark...@teliasonera.com    Nov 7, 2008 6:52 am
Mark...@teliasonera.com    Nov 14, 2008 4:50 am
Andreas Ericsson           Nov 14, 2008 5:11 am
Novak, Mark                Nov 14, 2008 8:40 am
Marc Powell                Nov 14, 2008 9:16 am
Fernando Rocha             Nov 14, 2008 9:46 am
Mark...@teliasonera.com    Nov 19, 2008 4:38 am
Andreas Ericsson           Nov 20, 2008 1:28 am

Subject: Re: [Nagios-users] Nagios retention problem.
Date: Nov 7, 2008 6:52:32 am

It seems only to happen with services. It might be due to the fact that hosts come first in the retention.dat file.

Linux 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17 18:00:32 EDT 2006 i686 athlon i386 GNU/Linux 32-bit.


Running on a VMware server, but only one CPU is emulated.

I've tried restarting with kill -HUP or by writing to the external command file, but it still happens occasionally. Now I've changed to a full restart; we'll see what happens.

My impression is that the problem occurs on startup rather than shutdown. I've set retention_update_interval=0. Next time it happens I will check the contents of the retention.dat file to make sure all services are there, but I have a vague recollection they were before. My memory might mislead me, though.

As far as I can see, Nagios does not use the retained data when doing a restart (kill -HUP); it keeps the status.dat file. (Line 839 in nagios.c.)

As for the reading of retention.dat, I don't really get the details of the mmap-file stuff. I will check how the reading is done next week. VMware servers are, as you know, known for unreliable I/O. Maybe the read function doesn't check for errors.

Yes, it happens to the service checks that sort last in alphabetical order.

What I meant by buffer overflow was that, as when you write commands to a named pipe faster than the reading process can handle them, you might lose data. But I've checked, and noticed that the retention data is written directly into the linked list inside Nagios.

I was considering exactly what you suggest, touching a file and doing restarts from cron, but it doesn't seem to matter whether there are many or few restarts. For a while I had the idea that the problem occurred when doing a reload before the startup routines were finished. But then it happened again when doing a "single" reload, so that doesn't seem to be the case...

I haven't been able to reproduce the problem with any reliability either. It just happens every 10 or 20 reloads or so, and of course never when I want it to. The most annoying part is that it has only happened on the production server.


-----Original Message-----
From: Andreas Ericsson []
Sent: 7 November 2008 15:08
To: Almroth, Markus M.
Cc:
Subject: Re: [Nagios-users] Nagios retention problem.

wrote:

I run a nagios installation with 522 servers and 4654 service checks.

When adding or removing clients, it sometimes happens that about half, or perhaps two thirds, of the service checks lose all status retention. What is more concerning is that they also go back to their initial state, e.g. notifications are turned off!! This is bad.

I'll clarify a bit here for history reasons, so that people reading the ML archives know what's going on. I've gotten the details from our support staff.

"Adding a client" in this case means the equivalent of running

/etc/init.d/nagios reload

or, in plaintext, sending SIGHUP to Nagios.

It doesn't happen every time, and it isn't the same servers every time.

Does it happen with services or with hosts? If it's random, does it more usually happen with hosts/services that alphabetically sort last?

Apart from that, I'll need some more info to properly determine what's going wrong here. What OS type/version are you using? 64 or 32-bit? Multi-processor or single? What version of glibc are you using (actually, what version of libpthread, but one can be inferred from the other)?

If you're running this on VMWare on a guest OS emulating multiple CPUs, I'm *guessing* you're running into an issue of Nagios not properly checking for received signals before starting to write the retention file, so the thread responsible for writing it gets killed by a signal delivered to the controller thread. If you're running Nagios in VMWare (a big no-no, as most know), this is more likely to happen.

You could try sending the RESTART_PROCESS command to Nagios' command-file instead, but you probably want to stagger it a bit so you don't spam the poor FIFO in case you get lots of reload requests in a short timeframe, like touching a file and then reloading once every five minutes (from a cron job) if the file exists (make sure to remove the file after restarting, or you'll be wasting cycles at a tremendous rate).

Needless to say, we don't have this problem and I haven't heard from anyone else that suffers from it either, which suggests to me that you're doing something that isn't quite normal. Having fired up our stress-test config (12000 hosts, 60000 services, running a plugin that emulates extremely skittish behaviour and submitting random commands every now and then) on one of our servers, I've failed to reproduce this problem.

Very strange. It seems to me like some kind of buffer overflow.

It's not a buffer overflow. A buffer overflow would have left your system riddled with core-dumps and nagios would not have continued running after receiving the SIGHUP.

It started when I upgraded from 2.9 to 3.0.4.

Strange. Given that there are no changes in the core between 3.0.4 and 3.0.5, I don't think it's worth upgrading to see if that solves the problem (although you probably want to use 3.0.5, or the even more fixed 3.0.5p1, anyway for the security fixes they add).

If you figure out what it is, or if you can give me enough information to reproduce it, I'll see what I can do to fix this. We're just about to ship a release right now though, so I won't have time to do anything about it until Monday at the earliest.

Good luck.