Nagios

DEPRECATED!

NOTE: This page is deprecated. Since 2011 we use Icinga for monitoring. See the current status on the page IcingaMonitor.

-- JoernAdamczewski - 09 Feb 2012

Short Info

Nagios is a system and network monitoring application. It watches hosts and services that you specify, alerting you when things go bad and when they get better.

  • Server: hadesdaq.gsi.de
  • Account: hadaq
  • Current version: 2.6
  • Base directory: /usr/local/nagios
  • Main config file: /usr/local/nagios/etc/nagios.cfg
  • Plugins: /usr/local/nagios/libexec

Check configuration: /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

Run as daemon: /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
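The two commands above can be combined so that the daemon is only started when the configuration check passes. This is a minimal sketch, not an existing script on hadesdaq: the function name start_nagios_checked and the NAGIOS_BIN / NAGIOS_CFG variables are hypothetical, and it assumes that nagios -v exits with a non-zero status when the check fails.

```shell
#!/bin/sh
# Sketch: validate the Nagios configuration before starting the daemon.
# Defaults are the paths listed above; the variable and function names
# are hypothetical and can be overridden from the environment.
NAGIOS_BIN="${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}"
NAGIOS_CFG="${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}"

start_nagios_checked() {
    # -v runs the configuration check (assumed to exit non-zero on errors)
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG" >/dev/null 2>&1; then
        # -d starts Nagios as a daemon
        "$NAGIOS_BIN" -d "$NAGIOS_CFG"
    else
        echo "Configuration check failed, daemon not started" >&2
        return 1
    fi
}
```

After sourcing the file, call start_nagios_checked as user hadaq. If the check fails, run the plain -v command to see the full error report.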

What do we monitor?

On the Nagios web server, go to "Hostgroup Summary":
  • EB Servers (hadeb-group). This group includes the important servers lxhadesdaq (Event Builder) and hadeb06a (HLD File Server).
  • VME CPUS (vmecpu-group). These are all VME CPUs in the HADES experimental area.
  • night-queue hosts (nightqueue-group). This group contains the PCs participating in the HADES batch night queue. The Lustre mount is monitored on each PC.
  • lxg hosts (lxg-group). These are GSI standard Linux boxes in and around the Counting House. Most of them are in the HADES batch night queue.
  • lxg desktop hosts (desktop-group). These are desktop PCs which participate in the HADES batch night queue.
  • hades hosts (hades-group). Special hosts for DAQ issues and development.

On the Nagios web server, go to "Servicegroup Summary":
  • DISK TEST (harddisk-group). 'Smart' tests of hard disks.
  • Lustre mount (lustre-group). Monitoring of Lustre mount points.
  • SOUND SERVER (soundserver-group). Monitoring of sound server processes.

Additionally, many individual services are monitored which are not combined in groups.
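As an illustration of what a service check such as the Lustre mount test does, here is a minimal plugin sketch. It is not the plugin actually installed in /usr/local/nagios/libexec: the function name check_mount and the mount point /lustre are assumptions. It only relies on the standard Nagios plugin convention of one status line plus an exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN).

```shell
#!/bin/sh
# Sketch of a Nagios-style mount check (hypothetical, not the installed plugin).
# Usage: check_mount MOUNT_POINT [MOUNTS_FILE]
check_mount() {
    mount_point="$1"
    mounts_file="${2:-/proc/mounts}"   # second argument is mainly for testing

    if [ ! -r "$mounts_file" ]; then
        echo "UNKNOWN - cannot read $mounts_file"
        return 3
    fi

    # The second field of each line in /proc/mounts is the mount point.
    if awk -v mp="$mount_point" '$2 == mp { found = 1 } END { exit !found }' "$mounts_file"; then
        echo "OK - $mount_point is mounted"
        return 0
    else
        echo "CRITICAL - $mount_point is not mounted"
        return 2
    fi
}
```

Used as a plugin, the script would end with a call such as check_mount /lustre followed by exit $?, so that Nagios sees the return code.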

How to act (for beam-time periods)

If you see CRITICAL in Nagios:

General remark:

If you have problems during the night shift, first try to find somebody on shift who knows more about the problem shown. Maybe she or he can help. Otherwise follow the instructions given below:

| host | service | status | action |
| hadcxx | ping | CRITICAL | Wait 5 min. If in the slow control VME crate the power is off, turn it on. Wait until hardware initialization is done and then restart DAQ and Event Builder. |
| hadc08 | sound_server | CRITICAL | CALL Ingo |
| hadeb06a | ping | CRITICAL | Wait 5 min. CALL Sergey, Michael |
| hadeb06a | ramdisk | CRITICAL | CALL Sergey, Michael |
| hadeb06a | connect_res_ram | CRITICAL | Log in to hadeb06a (as hadaq), cd /home/hadaq/res/daq_remote_access/ and type: nohup ./connect_res_ram -p /ramdisk/res --get_only & |
| hadeb06a | get_hld_ramdisk | CRITICAL | Log in to hadeb06a (as hadaq), cd /home/hadaq/res/daq_remote_access/ and type: nohup ./get_hld_ramdisk -p /ramdisk/copy --get_only & |
| hadesdaq | ping | CRITICAL | Wait 5 min. CALL Sergey, Michael |
| hadesdaq | CPU load | CRITICAL | Wait 20 min. CALL Sergey, Michael |
| hadesdaq | sound_server | CRITICAL | CALL Ingo |
| lxhadesdaq | ping | CRITICAL | Wait 5 min. CALL Sergey, Michael |
| lxhadesdaq | /data | CRITICAL | Go to the archivist web page (http://lxhadesdaq:1977), click on "results of last archiving effort", then click on "click here, to create a delete files shell script" and copy the line "/tmp/delete-archived-files...sh". Run this shell script from hadesdaq: ssh hadaq@lxhadesdaq "/tmp/delete-archived-files...sh". If it does not help, CALL Sergey |
| lxhadesdaq | archivist | CRITICAL | Status information: "Data filesystem almost full". Go to the archivist web page (http://lxhadesdaq:1977) and repeat the steps given for lxhadesdaq /data. If it does not help, CALL Sergey |
| lxhadesdaq | archivist | CRITICAL | Status information: "no response from archivist at". Still try to stop the archiver: ssh hadaq@lxhadesdaq pkill hades-archivist. Wait a minute and restart the archiver: ssh hadaq@lxhadesdaq "cd /home/hadaq/hades-archivist; ./hades-archivist apr07.cfg.pl". If it does not help, CALL Sergey |
| lxhadesdaq | /var | CRITICAL | CALL Sergey, Michael |
| lxhadesdaq | CPU load | CRITICAL | Wait 20 min. CALL Sergey, Michael |
| lxhadesdaq | runinfo2ora | CRITICAL | Log in to lxhadesdaq, go to /home/hadaq/apr07/oper/ and type: nohup ./runinfo2ora.pl & |
| lxhadesdaq | sound_server | CRITICAL | CALL Ingo |
| lxhadesdaq | lustre | CRITICAL | Lustre dropped out or the 30 TB limit is exceeded. CALL Sergey, Michael |
| lxg0434 | archiver | CRITICAL | Archiving of slow ctrl data to Oracle has a problem. CALL Ilse |
| lxg0447 | ping | CRITICAL | Vertex rec. PC has no network. Try to check the network connection yourself. Check the network cable. Wait some minutes. Reboot. CALL ... |
| lxg0447 | /data.local2 | CRITICAL | Log in to lxg0447 as hades-qa and run ./cron/clean_datalocal2.pl. Wait some minutes. However, the CRITICAL status may stay until the next check of the disk space by Nagios. |
| lxg0447 | connect_res | CRITICAL | Log in to lxg0447 as hades-qa, go to /data.local2/qa/hld-snapshot-archive and check whether new data files are coming in. Check whether connect_res is still running on the PC. To restart connect_res do: cd; nohup scripts/restart_connect_res & |
| lxg0451 | ping | CRITICAL | go4 PC has no network. Try to check the network connection yourself. Check the network cable. Wait some minutes. Reboot. CALL ... |
| lxg0451 | /data.local2 | CRITICAL | Log in to lxg0451 as hades-qa and run ./cron/clean_datalocal2.pl. Wait some minutes. Now Go4 should be able to analyze the most recent hld file. However, the CRITICAL status may stay until the next check of the disk space by Nagios. |
| lxg0451 | connect_res | CRITICAL | Log in to lxg0451 as hades-qa, go to /data.local2/qa/hld-snapshot-archive and check whether new data files are coming in. Check whether connect_res is still running on the PC. To restart connect_res do: cd; scripts/restart_connect_res |
| lxg0411 | ping | CRITICAL | The machine is unused! No call! |
| lxg04xx | ping/ssh | CRITICAL | NO CALL at night! Can wait till morning. |
| hadeb04 | ping/ssh | CRITICAL | NO CALL at night! Can wait till morning. |
| hadeb05 | * | CRITICAL | Not used for data taking! No call! |
| hadeb07 | * | CRITICAL | Not used for data taking! No call! |
| hades17 | * | CRITICAL | Not crucial! No call at night! |
| hades25 | * | CRITICAL | Not crucial! No call at night! |

If this does not help, CALL Sergey, Michael.
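The archivist recovery for the status "no response from archivist at" (stop, wait a minute, restart) can be wrapped into one helper so that no step is forgotten at night. This is only a sketch under assumptions: the function name restart_archivist and the variables ARCHIVIST_HOST, SSH_CMD and WAIT_SECONDS are hypothetical. The remote commands themselves are the ones given in the instructions for lxhadesdaq.

```shell
#!/bin/sh
# Sketch: stop and restart the archiver on lxhadesdaq (run from hadesdaq).
# ARCHIVIST_HOST, SSH_CMD and WAIT_SECONDS are hypothetical overrides,
# mainly useful for a dry run.
restart_archivist() {
    remote="${ARCHIVIST_HOST:-hadaq@lxhadesdaq}"
    ssh_cmd="${SSH_CMD:-ssh}"
    wait_seconds="${WAIT_SECONDS:-60}"

    # Stop a possibly hung archiver. pkill exits non-zero when no process
    # matched, which is not an error here.
    "$ssh_cmd" "$remote" pkill hades-archivist || true

    # Wait a minute before the restart.
    sleep "$wait_seconds"

    # Restart with the same command line as in the instructions.
    "$ssh_cmd" "$remote" "cd /home/hadaq/hades-archivist; ./hades-archivist apr07.cfg.pl"
}
```

For a dry run that only prints the remote commands, set SSH_CMD=echo and WAIT_SECONDS=0 before calling restart_archivist. If the archivist status stays CRITICAL afterwards, CALL Sergey.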

-- SergeyYurevich - 30 Jan 2009
Topic revision: r16 - 2012-02-09, JoernAdamczewski