Difference: Nagios (1 vs. 16)

Revision 16
09 Feb 2012 - Main.JoernAdamczewski
Line: 1 to 1
 
META TOPICPARENT name="HadesDaqDocumentation"

Nagios

Added:
>
>

DEPRECATED!

NOTE: This page is deprecated. Since 2011 we use Icinga for monitoring. See the new status on page IcingaMonitor.

-- JoernAdamczewski - 09 Feb 2012
 

Short Info

Nagios is a system and network monitoring application.
Revision 15
30 Jan 2009 - Main.SergeyYurevich
Line: 1 to 1
 
META TOPICPARENT name="HadesDaqDocumentation"
Changed:
<
<
-- HadesDaq - 12 Apr 2007
>
>

Nagios

 
Changed:
<
<

Nagios

>
>

Short Info

 

Nagios is a system and network monitoring application. It watches hosts and services that you specify, alerting you
Line: 21 to 23
  Run as daemon: /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
Changed:
<
<

How to act

>
>

What do we monitor?

On nagios web server go to "Hostgroup Summary":
  • EB Servers (hadeb-group). This group includes the important servers lxhadesdaq (Event Builder) and hadeb06a (HLD File Server).
  • VME CPUS (vmecpu-group). These are all VME CPUs in the HADES experimental area.
  • night-queue hosts (nightqueue-group). This group contains PCs participating in HADES batch night queue. Lustre mount is monitored on each PC.
  • lxg hosts (lxg-group). These are GSI standard Linux boxes in/around the Counting House. Most of them are in HADES batch night queue.
  • lxg desktop hosts (desktop-group). These are desktop PCs which participate in HADES batch night queue.
  • hades hosts (hades-group). Special hosts for DAQ issues and development.

On nagios web server go to "Servicegroup Summary":
  • DISK TEST (harddisk-group). 'Smart' tests of hard disks.
  • Lustre mount (lustre-group). Monitoring of Lustre mount points.
  • SOUND SERVER (soundserver-group). Monitoring of sound server processes.

Additionally there are many single services monitored which are not combined in groups.

How to act (for beam-time periods)

 

If you see CRITICAL in Nagios:
Line: 69 to 88
 

If it did not help CALL Sergey, Michael.
Changed:
<
<

>
>
-- SergeyYurevich - 30 Jan 2009
 
Revision 14
24 Sep 2008 - Main.SergeyYurevich
Line: 1 to 1
 
META TOPICPARENT name="HadesDaqDocumentation"
-- HadesDaq - 12 Apr 2007
Line: 38 to 38
 
hadc08 sound_server CRITICAL CALL Ingo
hadeb06a ping CRITICAL wait 5 min. CALL Sergey, Michael
hadeb06a ramdisk CRITICAL CALL Sergey, Michael
Changed:
<
<
hadeb06a connect_res_ram CRITICAL login to hadeb06a, cd /home/hadaq/res/daq_remote_access/ and type: nohup ./connect_res_ram -p /ramdisk/res --get_only &
hadeb06a get_hld_ramdisk CRITICAL login to hadeb06a, cd /home/hadaq/res/daq_remote_access/ and type: nohup ./get_hld_ramdisk -p /ramdisk/copy --get_only &
>
>
hadeb06a connect_res_ram CRITICAL login to hadeb06a (as hadaq), cd /home/hadaq/res/daq_remote_access/ and type: nohup ./connect_res_ram -p /ramdisk/res --get_only &
hadeb06a get_hld_ramdisk CRITICAL login to hadeb06a (as hadaq), cd /home/hadaq/res/daq_remote_access/ and type: nohup ./get_hld_ramdisk -p /ramdisk/copy --get_only &
 
hadesdaq ping CRITICAL wait 5 min. CALL Sergey, Michael
hadesdaq CPU load CRITICAL wait 20 min. CALL Sergey, Michael
hadesdaq sound_server CRITICAL CALL Ingo
Line: 49 to 49
 
lxhadesdaq archivist CRITICAL status information: "no response from archivist at". Try still to stop the archiver: ssh hadaq@lxhadesdaq pkill hades-archivist. Wait a minute and restart the archiver: ssh hadaq@lxhadesdaq "cd /home/hadaq/hades-archivist; ./hades-archivist apr07.cfg.pl". If it does not help CALL Sergey
lxhadesdaq /var CRITICAL CALL Sergey, Michael
lxhadesdaq CPU load CRITICAL wait 20 min. CALL Sergey, Michael
Changed:
<
<
lxhadesdaq runinfo2ora CRITICAL login to lxhadesdaq (should be in apr07/oper/) and type nohup ./runinfo2ora.pl &
>
>
lxhadesdaq runinfo2ora CRITICAL login to lxhadesdaq (go to /home/hadaq/apr07/oper/) and type nohup ./runinfo2ora.pl &
 
lxhadesdaq sound_server CRITICAL CALL Ingo
Added:
>
>
lxhadesdaq lustre CRITICAL lustre dropped out or 30 TB limit exceeded. Call Sergey, Michael
 
lxg0434 archiver CRITICAL archiving of slow ctrl data to Oracle has a problem. Call Ilse
Changed:
<
<
lxg0447 ping CRITICAL wait 10 min. Ask QA operator. Try to check network connection yourself. Check network cable. Reboot. CALL ...
>
>
lxg0447 ping CRITICAL Vertex rec. PC has no network. Try to check network connection yourself. Check network cable. Wait some minutes. Reboot. CALL ...
 
lxg0447 /data.local2 CRITICAL Login to lxg0447 as hades-qa and run ./cron/clean_datalocal2.pl. Wait some minutes. However CRITICAL status may stay until the next check of the disk space by Nagios.
Changed:
<
<
lxg0447 connect_res CRITICAL Login to lxg0447 as hades-qa, go to /data.local2/qa/hld-snapshot-archive and check if there are new data files coming. Check if connect_res still running on the PC. To restart connect_res do: cd; scripts/restart_connect_res
lxg0451 ping CRITICAL wait 10 min. Ask QA operator. Try to check network connection yourself. Check network cable. Reboot. CALL ...
>
>
lxg0447 connect_res CRITICAL Login to lxg0447 as hades-qa, go to /data.local2/qa/hld-snapshot-archive and check if there are new data files coming. Check if connect_res still running on the PC. To restart connect_res do: cd; nohup scripts/restart_connect_res &
lxg0451 ping CRITICAL go4 PC has no network. Try to check network connection yourself. Check network cable. Wait some minutes. Reboot. CALL ...
 
lxg0451 /data.local2 CRITICAL Login to lxg0451 as hades-qa and run ./cron/clean_datalocal2.pl. Wait some minutes. Now Go4 should be able to analyze most recent hld file. However CRITICAL status may stay until the next check of the disk space by Nagios.
lxg0451 connect_res CRITICAL Login to lxg0451 as hades-qa, go to /data.local2/qa/hld-snapshot-archive and check if there are new data files coming. Check if connect_res still running on the PC. To restart connect_res do: cd; scripts/restart_connect_res
Changed:
<
<
lxg0411 ping CRITICAL wait 10 min. Ask QA operator. CALL ...
>
>
lxg0411 ping CRITICAL the machine is unused! No call!
 
lxg04xx ping/ssh CRITICAL NO CALL at night! Can wait till morning.
hadeb04 ping/ssh CRITICAL NO CALL at night! Can wait till morning.
Added:
>
>
hadeb05 * CRITICAL not used for the data taking! No call!
hadeb07 * CRITICAL not used for the data taking! No call!
hades17 * CRITICAL not crucial! No call at night!
hades25 * CRITICAL not crucial! No call at night!
 

If it did not help CALL Sergey, Michael.
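
The restart instructions for connect_res_ram and get_hld_ramdisk in the table follow the same pattern: change to the working directory and start the program in the background with nohup. A minimal sketch of that pattern, which first checks whether the process is already running, is given here. The helper name start_if_missing is hypothetical, and the use of pgrep is an assumption about the tools installed on the host; the paths and arguments in the comments are the ones quoted in the table.

```shell
#!/bin/sh
# Hypothetical helper (not part of the DAQ software): start a program in the
# background with nohup unless a process with exactly that name already runs.
# usage: start_if_missing <working dir> <program name> [arguments...]
start_if_missing() {
    dir="$1"; prog="$2"; shift 2
    # pgrep -x matches the process name exactly (assumes pgrep is installed).
    if pgrep -x "$prog" >/dev/null 2>&1; then
        echo "$prog is already running"
        return 0
    fi
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    # Run in a subshell so the caller's working directory is unchanged;
    # output is appended to nohup.out in the working directory.
    ( cd "$dir" && nohup "./$prog" "$@" >>nohup.out 2>&1 & )
    echo "$prog started"
}

# Example calls, as user hadaq on hadeb06a (values taken from the table):
#   start_if_missing /home/hadaq/res/daq_remote_access connect_res_ram -p /ramdisk/res --get_only
#   start_if_missing /home/hadaq/res/daq_remote_access get_hld_ramdisk -p /ramdisk/copy --get_only
```

The check before the start avoids launching a second copy when Nagios still shows CRITICAL only because its next check has not run yet.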
Revision 13
16 Sep 2008 - Main.SergeyYurevich
Line: 1 to 1
 
META TOPICPARENT name="HadesDaqDocumentation"
-- HadesDaq - 12 Apr 2007
Line: 51 to 51
 
lxhadesdaq CPU load CRITICAL wait 20 min. CALL Sergey, Michael
lxhadesdaq runinfo2ora CRITICAL login to lxhadesdaq (should be in apr07/oper/) and type nohup ./runinfo2ora.pl &
lxhadesdaq sound_server CRITICAL CALL Ingo
Added:
>
>
lxg0434 archiver CRITICAL archiving of slow ctrl data to Oracle has a problem. Call Ilse
 
lxg0447 ping CRITICAL wait 10 min. Ask QA operator. Try to check network connection yourself. Check network cable. Reboot. CALL ...
lxg0447 /data.local2 CRITICAL Login to lxg0447 as hades-qa and run ./cron/clean_datalocal2.pl. Wait some minutes. However CRITICAL status may stay until the next check of the disk space by Nagios.
lxg0447 connect_res CRITICAL Login to lxg0447 as hades-qa, go to /data.local2/qa/hld-snapshot-archive and check if there are new data files coming. Check if connect_res still running on the PC. To restart connect_res do: cd; scripts/restart_connect_res
Revision 12
08 Sep 2008 - Main.SergeyYurevich
Line: 1 to 1
 
META TOPICPARENT name="HadesDaqDocumentation"
-- HadesDaq - 12 Apr 2007
Line: 51 to 51
 
lxhadesdaq CPU load CRITICAL wait 20 min. CALL Sergey, Michael
lxhadesdaq runinfo2ora CRITICAL login to lxhadesdaq (should be in apr07/oper/) and type nohup ./runinfo2ora.pl &
lxhadesdaq sound_server CRITICAL CALL Ingo
Changed:
<
<
lxg0447 ping CRITICAL wait 10 min. Ask QA operator. Check network cable. Reboot. CALL ...
>
>
lxg0447 ping CRITICAL wait 10 min. Ask QA operator. Try to check network connection yourself. Check network cable. Reboot. CALL ...
 
lxg0447 /data.local2 CRITICAL Login to lxg0447 as hades-qa and run ./cron/clean_datalocal2.pl. Wait some minutes. However CRITICAL status may stay until the next check of the disk space by Nagios.
Changed:
<
<
lxg0447 connect_res CRITICAL Login to lxg0447 as hades-qa, go to /data.local2/qa/hld-snapshot-archive and run nohup connect_res --get_only &
lxg0451 ping CRITICAL wait 10 min. Ask QA operator. Check network cable. Reboot. CALL ...
>
>
lxg0447 connect_res CRITICAL Login to lxg0447 as hades-qa, go to /data.local2/qa/hld-snapshot-archive and check if there are new data files coming. Check if connect_res still running on the PC. To restart connect_res do: cd; scripts/restart_connect_res
lxg0451 ping CRITICAL wait 10 min. Ask QA operator. Try to check network connection yourself. Check network cable. Reboot. CALL ...
 
lxg0451 /data.local2 CRITICAL Login to lxg0451 as hades-qa and run ./cron/clean_datalocal2.pl. Wait some minutes. Now Go4 should be able to analyze most recent hld file. However CRITICAL status may stay until the next check of the disk space by Nagios.
Changed:
<
<
lxg0451 connect_res CRITICAL Login to lxg0451 as hades-qa, go to /data.local2/qa/hld-snapshot-archive and run nohup connect_res --get_only &
>
>
lxg0451 connect_res CRITICAL Login to lxg0451 as hades-qa, go to /data.local2/qa/hld-snapshot-archive and check if there are new data files coming. Check if connect_res still running on the PC. To restart connect_res do: cd; scripts/restart_connect_res
 
lxg0411 ping CRITICAL wait 10 min. Ask QA operator. CALL ...
lxg04xx ping/ssh CRITICAL NO CALL at night! Can wait till morning.
hadeb04 ping/ssh CRITICAL NO CALL at night! Can wait till morning.
Revision 11
08 Sep 2008 - Main.SergeyYurevich
Line: 1 to 1
 
META TOPICPARENT name="HadesDaqDocumentation"
-- HadesDaq - 12 Apr 2007
Line: 44 to 44
 
hadesdaq CPU load CRITICAL wait 20 min. CALL Sergey, Michael
hadesdaq sound_server CRITICAL CALL Ingo
lxhadesdaq ping CRITICAL wait 5 min. CALL Sergey, Michael
Changed:
<
<
lxhadesdaq /data CRITICAL go to archivist web page (http://lxhadesdaq:1977), click on "results of last archiving effort", then click on "click here, to create a delete files shell script" and copy a line "/tmp/delete-archived-files...sh". Run this shell script from hadesdaq: ssh hadaq@lxhadesdaq "/tmp/delete-archived-files...sh". If it does not help CALL Simon
lxhadesdaq archivist CRITICAL status information: "Data filesystem almost full". Go to archivist web page (http://lxhadesdaq:1977) and repeat everything written above. If it does not help CALL Simon
lxhadesdaq archivist CRITICAL status information: "no response from archivist at". Try still to stop the archiver: ssh hadaq@lxhadesdaq pkill hades-archivist. Wait a minute and restart the archiver: ssh hadaq@lxhadesdaq "cd /home/hadaq/hades-archivist; ./hades-archivist apr07.cfg.pl". If it does not help CALL Simon
>
>
lxhadesdaq /data CRITICAL go to archivist web page (http://lxhadesdaq:1977), click on "results of last archiving effort", then click on "click here, to create a delete files shell script" and copy a line "/tmp/delete-archived-files...sh". Run this shell script from hadesdaq: ssh hadaq@lxhadesdaq "/tmp/delete-archived-files...sh". If it does not help CALL Sergey
lxhadesdaq archivist CRITICAL status information: "Data filesystem almost full". Go to archivist web page (http://lxhadesdaq:1977) and repeat everything written above. If it does not help CALL Sergey
lxhadesdaq archivist CRITICAL status information: "no response from archivist at". Try still to stop the archiver: ssh hadaq@lxhadesdaq pkill hades-archivist. Wait a minute and restart the archiver: ssh hadaq@lxhadesdaq "cd /home/hadaq/hades-archivist; ./hades-archivist apr07.cfg.pl". If it does not help CALL Sergey
 
lxhadesdaq /var CRITICAL CALL Sergey, Michael
lxhadesdaq CPU load CRITICAL wait 20 min. CALL Sergey, Michael
lxhadesdaq runinfo2ora CRITICAL login to lxhadesdaq (should be in apr07/oper/) and type nohup ./runinfo2ora.pl &
lxhadesdaq sound_server CRITICAL CALL Ingo
Changed:
<
<
lxg0447 ping CRITICAL wait 10 min. Ask QA operator. CALL ...
lxg0447 /data.local2 CRITICAL Login to lxg0447 as hades-qa and run ./cron/clean_datalocal2.pl. Wait some minutes. Now Go4 should be able to analyze most recent hld file. However CRITICAL status may stay until the next check of the disk space by Nagios.
>
>
lxg0447 ping CRITICAL wait 10 min. Ask QA operator. Check network cable. Reboot. CALL ...
lxg0447 /data.local2 CRITICAL Login to lxg0447 as hades-qa and run ./cron/clean_datalocal2.pl. Wait some minutes. However CRITICAL status may stay until the next check of the disk space by Nagios.
 
lxg0447 connect_res CRITICAL Login to lxg0447 as hades-qa, go to /data.local2/qa/hld-snapshot-archive and run nohup connect_res --get_only &
Added:
>
>
lxg0451 ping CRITICAL wait 10 min. Ask QA operator. Check network cable. Reboot. CALL ...
lxg0451 /data.local2 CRITICAL Login to lxg0451 as hades-qa and run ./cron/clean_datalocal2.pl. Wait some minutes. Now Go4 should be able to analyze most recent hld file. However CRITICAL status may stay until the next check of the disk space by Nagios.
lxg0451 connect_res CRITICAL Login to lxg0451 as hades-qa, go to /data.local2/qa/hld-snapshot-archive and run nohup connect_res --get_only &
 
lxg0411 ping CRITICAL wait 10 min. Ask QA operator. CALL ...
lxg04xx ping/ssh CRITICAL NO CALL at night! Can wait till morning.
hadeb04 ping/ssh CRITICAL NO CALL at night! Can wait till morning.
Deleted:
<
<
lxg04xx runPairDST WARNING to restart service look at online QA/DST page
lxg04xx updateDST WARNING to restart service look at online QA/DST page
lxg04xx runQA WARNING to restart service look at online QA/DST page
lxg04xx updateQA WARNING to restart service look at online QA/DST page
 

If it did not help CALL Sergey, Michael.
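
The archivist recovery steps in the table (stop the archiver, wait a minute, restart it) can be written as one sequence. A minimal sketch, assuming the commands quoted in the table; the function name restart_archivist and the variables SSH, ARCHIVIST_HOST and WAIT_SECONDS are hypothetical and exist only so that the sequence can be dry-run without touching lxhadesdaq.

```shell
#!/bin/sh
# Hypothetical variables; the defaults reproduce the commands from the table.
SSH="${SSH:-ssh}"
ARCHIVIST_HOST="${ARCHIVIST_HOST:-hadaq@lxhadesdaq}"
WAIT_SECONDS="${WAIT_SECONDS:-60}"

# Hypothetical helper: stop the archiver, wait, then restart it.
restart_archivist() {
    # pkill exits non-zero when no process matched; the sequence continues
    # anyway (note that this also ignores an ssh failure at this step).
    $SSH "$ARCHIVIST_HOST" pkill hades-archivist || true
    sleep "$WAIT_SECONDS"
    $SSH "$ARCHIVIST_HOST" "cd /home/hadaq/hades-archivist; ./hades-archivist apr07.cfg.pl"
}

# Dry run that only prints the two remote commands:
#   SSH=echo; WAIT_SECONDS=0; restart_archivist
```

If the archivist still does not respond afterwards, the CALL instruction in the table applies.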
Revision 10
16 Aug 2007 - Main.SergeyYurevich
Line: 1 to 1
 
META TOPICPARENT name="HadesDaqDocumentation"
-- HadesDaq - 12 Apr 2007

Nagios

Added:
>
>
Nagios is a system and network monitoring application. It watches hosts and services that you specify, alerting you when things go bad and when they get better.

  • Server: hadesdaq.gsi.de
  • Account: hadaq
  • Current version: 2.6
  • Base directory: /usr/local/nagios
  • Main config file: /usr/local/nagios/etc/nagios.cfg
  • Plugins: /usr/local/nagios/libexec

Check configuration: /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

Run as daemon: /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
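
The two commands above are usually combined so that the daemon is started only when the configuration check passes. A minimal sketch, assuming the paths listed above; the function name nagios_check_and_start and the variables NAGIOS_BIN and NAGIOS_CFG are hypothetical helpers, not part of Nagios.

```shell
#!/bin/sh
# Paths as listed on this page; override them if the installation differs.
NAGIOS_BIN="${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}"
NAGIOS_CFG="${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}"

# Hypothetical helper: verify the configuration (-v) and start the
# daemon (-d) only if the verification succeeds.
nagios_check_and_start() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG"; then
        "$NAGIOS_BIN" -d "$NAGIOS_CFG"
    else
        echo "configuration check failed, daemon not started" >&2
        return 1
    fi
}
```

Calling nagios_check_and_start under the hadaq account on the server then runs both steps in order.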
 

How to act

If you see CRITICAL in Nagios:
Revision 9
13 May 2007 - Main.MichaelTraxler
Line: 1 to 1
 
META TOPICPARENT name="HadesDaqDocumentation"
-- HadesDaq - 12 Apr 2007
Line: 32 to 32
 
lxhadesdaq archivist CRITICAL status information: "no response from archivist at". Try still to stop the archiver: ssh hadaq@lxhadesdaq pkill hades-archivist. Wait a minute and restart the archiver: ssh hadaq@lxhadesdaq "cd /home/hadaq/hades-archivist; ./hades-archivist apr07.cfg.pl". If it does not help CALL Simon
lxhadesdaq /var CRITICAL CALL Sergey, Michael
lxhadesdaq CPU load CRITICAL wait 20 min. CALL Sergey, Michael
Changed:
<
<
lxhadesdaq runinfo2ora CRITICAL login to lxhadesdaq (should be in apr07/oper/) and type nohup runinfo2ora.pl &
>
>
lxhadesdaq runinfo2ora CRITICAL login to lxhadesdaq (should be in apr07/oper/) and type nohup ./runinfo2ora.pl &
 
lxhadesdaq sound_server CRITICAL CALL Ingo
lxg0447 ping CRITICAL wait 10 min. Ask QA operator. CALL ...
lxg0447 /data.local2 CRITICAL Login to lxg0447 as hades-qa and run ./cron/clean_datalocal2.pl. Wait some minutes. Now Go4 should be able to analyze most recent hld file. However CRITICAL status may stay until the next check of the disk space by Nagios.
Revision 8
26 Apr 2007 - Main.SergeyYurevich
Line: 1 to 1
 
META TOPICPARENT name="HadesDaqDocumentation"
-- HadesDaq - 12 Apr 2007
Line: 21 to 21
 
hadc08 sound_server CRITICAL CALL Ingo
hadeb06a ping CRITICAL wait 5 min. CALL Sergey, Michael
hadeb06a ramdisk CRITICAL CALL Sergey, Michael
Added:
>
>
hadeb06a connect_res_ram CRITICAL login to hadeb06a, cd /home/hadaq/res/daq_remote_access/ and type: nohup ./connect_res_ram -p /ramdisk/res --get_only &
hadeb06a get_hld_ramdisk CRITICAL login to hadeb06a, cd /home/hadaq/res/daq_remote_access/ and type: nohup ./get_hld_ramdisk -p /ramdisk/copy --get_only &
 
hadesdaq ping CRITICAL wait 5 min. CALL Sergey, Michael
hadesdaq CPU load CRITICAL wait 20 min. CALL Sergey, Michael
hadesdaq sound_server CRITICAL CALL Ingo
Revision 7
23 Apr 2007 - Main.HadesDaq
Line: 1 to 1
 
META TOPICPARENT name="HadesDaqDocumentation"
-- HadesDaq - 12 Apr 2007
Line: 38 to 38
 
lxg0411 ping CRITICAL wait 10 min. Ask QA operator. CALL ...
lxg04xx ping/ssh CRITICAL NO CALL at night! Can wait till morning.
hadeb04 ping/ssh CRITICAL NO CALL at night! Can wait till morning.
Added:
>
>
lxg04xx runPairDST WARNING to restart service look at online QA/DST page
lxg04xx updateDST WARNING to restart service look at online QA/DST page
lxg04xx runQA WARNING to restart service look at online QA/DST page
lxg04xx updateQA WARNING to restart service look at online QA/DST page
 

If it did not help CALL Sergey, Michael.
Revision 6
22 Apr 2007 - Main.SergeyYurevich
Line: 1 to 1
 
META TOPICPARENT name="HadesDaqDocumentation"
-- HadesDaq - 12 Apr 2007
Line: 18 to 18
 

host service status action
hadcxx ping CRITICAL wait 5 min. If in the slow control vme crate the power is off => turn it on. Wait until hardware initialization is done and then restart DAQ and Event Builder
Added:
>
>
hadc08 sound_server CRITICAL CALL Ingo
 
hadeb06a ping CRITICAL wait 5 min. CALL Sergey, Michael
hadeb06a ramdisk CRITICAL CALL Sergey, Michael
hadesdaq ping CRITICAL wait 5 min. CALL Sergey, Michael
Revision 5
14 Apr 2007 - Main.HadesDaq
Line: 1 to 1
 
META TOPICPARENT name="HadesDaqDocumentation"
-- HadesDaq - 12 Apr 2007
Line: 22 to 22
 
hadeb06a ramdisk CRITICAL CALL Sergey, Michael
hadesdaq ping CRITICAL wait 5 min. CALL Sergey, Michael
hadesdaq CPU load CRITICAL wait 20 min. CALL Sergey, Michael
Added:
>
>
hadesdaq sound_server CRITICAL CALL Ingo
 
lxhadesdaq ping CRITICAL wait 5 min. CALL Sergey, Michael
lxhadesdaq /data CRITICAL go to archivist web page (http://lxhadesdaq:1977), click on "results of last archiving effort", then click on "click here, to create a delete files shell script" and copy a line "/tmp/delete-archived-files...sh". Run this shell script from hadesdaq: ssh hadaq@lxhadesdaq "/tmp/delete-archived-files...sh". If it does not help CALL Simon
lxhadesdaq archivist CRITICAL status information: "Data filesystem almost full". Go to archivist web page (http://lxhadesdaq:1977) and repeat everything written above. If it does not help CALL Simon
lxhadesdaq archivist CRITICAL status information: "no response from archivist at". Try still to stop the archiver: ssh hadaq@lxhadesdaq pkill hades-archivist. Wait a minute and restart the archiver: ssh hadaq@lxhadesdaq "cd /home/hadaq/hades-archivist; ./hades-archivist apr07.cfg.pl". If it does not help CALL Simon
lxhadesdaq /var CRITICAL CALL Sergey, Michael
lxhadesdaq CPU load CRITICAL wait 20 min. CALL Sergey, Michael
Added:
>
>
lxhadesdaq runinfo2ora CRITICAL login to lxhadesdaq (should be in apr07/oper/) and type nohup runinfo2ora.pl &
lxhadesdaq sound_server CRITICAL CALL Ingo
 
lxg0447 ping CRITICAL wait 10 min. Ask QA operator. CALL ...
lxg0447 /data.local2 CRITICAL Login to lxg0447 as hades-qa and run ./cron/clean_datalocal2.pl. Wait some minutes. Now Go4 should be able to analyze most recent hld file. However CRITICAL status may stay until the next check of the disk space by Nagios.
lxg0447 connect_res CRITICAL Login to lxg0447 as hades-qa, go to /data.local2/qa/hld-snapshot-archive and run nohup connect_res --get_only &
Revision 4
13 Apr 2007 - Main.SergeyYurevich
Line: 1 to 1
 
META TOPICPARENT name="HadesDaqDocumentation"
-- HadesDaq - 12 Apr 2007
Line: 28 to 28
 
lxhadesdaq archivist CRITICAL status information: "no response from archivist at". Try still to stop the archiver: ssh hadaq@lxhadesdaq pkill hades-archivist. Wait a minute and restart the archiver: ssh hadaq@lxhadesdaq "cd /home/hadaq/hades-archivist; ./hades-archivist apr07.cfg.pl". If it does not help CALL Simon
lxhadesdaq /var CRITICAL CALL Sergey, Michael
lxhadesdaq CPU load CRITICAL wait 20 min. CALL Sergey, Michael
Changed:
<
<
lxg0447 ping CRITICAL wait 5 min. Ask QA operator. CALL ...
>
>
lxg0447 ping CRITICAL wait 10 min. Ask QA operator. CALL ...
 
lxg0447 /data.local2 CRITICAL Login to lxg0447 as hades-qa and run ./cron/clean_datalocal2.pl. Wait some minutes. Now Go4 should be able to analyze most recent hld file. However CRITICAL status may stay until the next check of the disk space by Nagios.
lxg0447 connect_res CRITICAL Login to lxg0447 as hades-qa, go to /data.local2/qa/hld-snapshot-archive and run nohup connect_res --get_only &
Changed:
<
<
lxg0411 ping CRITICAL wait 10 min. CALL ...
>
>
lxg0411 ping CRITICAL wait 10 min. Ask QA operator. CALL ...
 
lxg04xx ping/ssh CRITICAL NO CALL at night! Can wait till morning.
hadeb04 ping/ssh CRITICAL NO CALL at night! Can wait till morning.
Revision 3
12 Apr 2007 - Main.SergeyYurevich
Line: 1 to 1
 
META TOPICPARENT name="HadesDaqDocumentation"
-- HadesDaq - 12 Apr 2007
Line: 8 to 8
 

If you see CRITICAL in Nagios:
Added:
>
>
General remark:

If you have problems during the night-shift: First try to find somebody on shift who knows more about the shown problem. Maybe she/he can help. Otherwise follow the instructions given below:
 
host service status action
hadcxx ping CRITICAL wait 5 min. If in the slow control vme crate the power is off => turn it on. Wait until hardware initialization is done and then restart DAQ and Event Builder
hadeb06a ping CRITICAL wait 5 min. CALL Sergey, Michael
Revision 2
12 Apr 2007 - Main.MichaelTraxler
Line: 1 to 1
 
META TOPICPARENT name="HadesDaqDocumentation"
-- HadesDaq - 12 Apr 2007
Line: 9 to 9
  If you see CRITICAL in Nagios:

host service status action
Changed:
<
<
hadcxx ping CRITICAL wait 5 min. If in the slow control vme crate the power is off => tern it on. Wait until hardware initialization is done and then restart DAQ and Event Builder
>
>
hadcxx ping CRITICAL wait 5 min. If in the slow control vme crate the power is off => turn it on. Wait until hardware initialization is done and then restart DAQ and Event Builder
 
hadeb06a ping CRITICAL wait 5 min. CALL Sergey, Michael
hadeb06a ramdisk CRITICAL CALL Sergey, Michael
hadesdaq ping CRITICAL wait 5 min. CALL Sergey, Michael
 