Difference: Nagios (r14 vs. r13)

-- HadesDaq - 12 Apr 2007

Nagios

Nagios is a system and network monitoring application. It watches hosts and services that you specify, alerting you when things go bad and when they get better.

  • Server: hadesdaq.gsi.de
  • Account: hadaq
  • Current version: 2.6
  • Base directory: /usr/local/nagios
  • Main config file: /usr/local/nagios/etc/nagios.cfg
  • Plugins: /usr/local/nagios/libexec

Check configuration: /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

Run as daemon: /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
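
The two commands above can be combined so that the daemon is started only when the configuration check passes. This is a minimal sketch, not part of the existing setup: the function name start_nagios_checked and the overridable variables NAGIOS_BIN and NAGIOS_CFG are assumptions made for illustration. Only the paths and the -v and -d options are taken from this page.

```shell
# Paths as listed on this page; both can be overridden from the environment.
NAGIOS_BIN="${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}"
NAGIOS_CFG="${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}"

# Hypothetical helper: verify the configuration first, start the daemon only if it passes.
start_nagios_checked() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG" >/dev/null; then
        # Configuration check succeeded: run Nagios as daemon.
        "$NAGIOS_BIN" -d "$NAGIOS_CFG"
    else
        echo "Configuration check failed; Nagios not started" >&2
        return 1
    fi
}
```

After sourcing the definitions, type start_nagios_checked as user hadaq. If the check fails, rerun the check command by hand to see the full error report.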

How to act

If you see CRITICAL in Nagios:

General remark:


If a problem shows up during the night shift, first try to find somebody on shift who knows more about it; she or he may be able to help. Otherwise follow the instructions given below:

host service status action
hadcxx ping CRITICAL wait 5 min. If in the slow control VME crate the power is off, turn it on. Wait until hardware initialization is done, then restart DAQ and Event Builder
hadc08 sound_server CRITICAL CALL Ingo
hadeb06a ping CRITICAL wait 5 min. CALL Sergey, Michael
hadeb06a ramdisk CRITICAL CALL Sergey, Michael
hadeb06a connect_res_ram CRITICAL log in to hadeb06a (as hadaq), cd /home/hadaq/res/daq_remote_access/ and type: nohup ./connect_res_ram -p /ramdisk/res --get_only &
hadeb06a get_hld_ramdisk CRITICAL log in to hadeb06a (as hadaq), cd /home/hadaq/res/daq_remote_access/ and type: nohup ./get_hld_ramdisk -p /ramdisk/copy --get_only &
hadesdaq ping CRITICAL wait 5 min. CALL Sergey, Michael
hadesdaq CPU load CRITICAL wait 20 min. CALL Sergey, Michael
hadesdaq sound_server CRITICAL CALL Ingo
lxhadesdaq ping CRITICAL wait 5 min. CALL Sergey, Michael
lxhadesdaq /data CRITICAL go to the archivist web page (http://lxhadesdaq:1977), click on "results of last archiving effort", then click on "click here, to create a delete files shell script" and copy the line "/tmp/delete-archived-files...sh". Run this shell script from hadesdaq: ssh hadaq@lxhadesdaq "/tmp/delete-archived-files...sh". If it does not help CALL Sergey
lxhadesdaq archivist CRITICAL status information: "Data filesystem almost full". Go to the archivist web page (http://lxhadesdaq:1977) and repeat the steps described for /data above. If it does not help CALL Sergey
lxhadesdaq archivist CRITICAL status information: "no response from archivist at". Still try to stop the archiver: ssh hadaq@lxhadesdaq pkill hades-archivist. Wait a minute and restart the archiver: ssh hadaq@lxhadesdaq "cd /home/hadaq/hades-archivist; ./hades-archivist apr07.cfg.pl". If it does not help CALL Sergey
lxhadesdaq /var CRITICAL CALL Sergey, Michael
lxhadesdaq CPU load CRITICAL wait 20 min. CALL Sergey, Michael
lxhadesdaq runinfo2ora CRITICAL log in to lxhadesdaq, go to /home/hadaq/apr07/oper/ and type: nohup ./runinfo2ora.pl &
lxhadesdaq sound_server CRITICAL CALL Ingo
lxhadesdaq lustre CRITICAL lustre dropped out or 30 TB limit exceeded. Call Sergey, Michael
lxg0434 archiver CRITICAL archiving of slow ctrl data to Oracle has a problem. Call Ilse
lxg0447 ping CRITICAL Vertex rec. PC has no network. Try to check network connection yourself. Check network cable. Wait some minutes. Reboot. CALL ...
lxg0447 /data.local2 CRITICAL Log in to lxg0447 as hades-qa and run ./cron/clean_datalocal2.pl. Wait some minutes. However CRITICAL status may stay until the next check of the disk space by Nagios.
lxg0447 connect_res CRITICAL Log in to lxg0447 as hades-qa, go to /data.local2/qa/hld-snapshot-archive and check if there are new data files coming. Check if connect_res is still running on the PC. To restart connect_res do: cd; nohup scripts/restart_connect_res &
lxg0451 ping CRITICAL go4 PC has no network. Try to check network connection yourself. Check network cable. Wait some minutes. Reboot. CALL ...
lxg0451 /data.local2 CRITICAL Log in to lxg0451 as hades-qa and run ./cron/clean_datalocal2.pl. Wait some minutes. Now Go4 should be able to analyze most recent hld file. However CRITICAL status may stay until the next check of the disk space by Nagios.
lxg0451 connect_res CRITICAL Log in to lxg0451 as hades-qa, go to /data.local2/qa/hld-snapshot-archive and check if there are new data files coming. Check if connect_res is still running on the PC. To restart connect_res do: cd; scripts/restart_connect_res
lxg0411 ping CRITICAL the machine is unused! No call!
lxg04xx ping/ssh CRITICAL NO CALL at night! Can wait till morning.
hadeb04 ping/ssh CRITICAL NO CALL at night! Can wait till morning.
hadeb05 * CRITICAL not used for the data taking! No call!
hadeb07 * CRITICAL not used for the data taking! No call!
hades17 * CRITICAL not crucial! No call at night!
hades25 * CRITICAL not crucial! No call at night!

If it did not help CALL Sergey, Michael.
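
The archivist restart described above for the status "no response from archivist at" (stop, wait a minute, start again) can be sketched as a small shell function. This is only a sketch under assumptions: the names restart_archivist, RUN and WAIT_SECONDS are hypothetical and do not exist on the machines; the pkill and start commands are the ones given in the table. It is assumed that it is run from hadesdaq with ssh access to hadaq@lxhadesdaq.

```shell
# Command prefix used to reach the archiver host. Set RUN=echo to only
# print the commands instead of executing them (dry run).
RUN="${RUN:-ssh hadaq@lxhadesdaq}"
# Pause between stop and start; the table says "wait a minute".
WAIT_SECONDS="${WAIT_SECONDS:-60}"

# Hypothetical helper: stop the archiver, wait, then start it again.
restart_archivist() {
    # pkill exits non-zero when no process matched; that is not an error here.
    $RUN pkill hades-archivist || true
    sleep "$WAIT_SECONDS"
    # Start command exactly as given in the table (config file apr07.cfg.pl).
    $RUN "cd /home/hadaq/hades-archivist; ./hades-archivist apr07.cfg.pl"
}
```

To preview the commands without touching the archiver, type: RUN=echo WAIT_SECONDS=0 restart_archivist. If the archivist status stays CRITICAL after a real restart, CALL Sergey as stated in the table.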

 