
ICINGA

General setup
- ICINGA configuration and restart
- Repository for scripts and plugins
What do we monitor?
How to act (for beam-time periods)

General setup

Icinga is a system and network monitoring application. It watches hosts and services that you specify, alerting you when things go bad and when they get better.

ICINGA configuration and restart

Icinga server runs on hadesdaq02.
Restart icinga services with /etc/init.d/icinga reload. Startup errors are logged to /var/log/icinga/config.err. The regular logfile is /var/log/icinga/icinga.log.
Configuration files are at /etc/icinga. This is under cvs version control.
Plugins are at /usr/lib/nagios/plugins. Note that on the remote nodes monitored, additional plugins are located at /home/hadaq/nagios/plugins; these scripts are under cvs version control.
Config files: The main file is /etc/icinga/icinga.cfg, which includes the other config files. The configurations for commands and servers are in the subfolder objects:
- commands.cfg: Definition of service checks and event handlers
- localhost.cfg: Definition of services to be tested on localhost, i.e. the icinga webserver hadesdaq02 itself. This is also the central slow control machine!
- hosts_eb_servers.cfg: Definition of nodes and services for eventbuilder and daq server
- hosts_power.cfg: Definition of nodes and services for power supplies
- hosts_etrax.cfg: Definition of nodes and services for etrax frontend cpus. NOTE: this file is auto-generated by the script daq2icinga.pl from the actual DAQ setup.
- contacts.cfg: Who will get a notification email from icinga
NOTE: to update hosts_etrax.cfg, run the script /etc/icinga/daq2icinga.pl on hadesdaq02. It evaluates the EPICS etrax configuration and the trb.db or detector components via the cross mount of /home/hadaq/trb.
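
The reload step described above can be guarded by a configuration check, so that a broken config file never reaches the running server. The following is a minimal sketch, not the procedure used on hadesdaq02: the function name `safe_reload` is hypothetical, and it assumes that the Icinga 1.x binary is available as `icinga` and accepts `-v` to verify a configuration file.

```shell
# Assumed defaults; override these variables if the installation differs.
ICINGA_BIN="${ICINGA_BIN:-icinga}"
ICINGA_CFG="${ICINGA_CFG:-/etc/icinga/icinga.cfg}"
ICINGA_INIT="${ICINGA_INIT:-/etc/init.d/icinga}"

# Hypothetical helper: verify the configuration, reload only if it is valid.
safe_reload() {
    # "-v" asks the daemon binary to check the config without starting up
    if out=$("$ICINGA_BIN" -v "$ICINGA_CFG" 2>&1); then
        "$ICINGA_INIT" reload
    else
        # show the verification output so the broken object can be located
        printf '%s\n' "$out" >&2
        return 1
    fi
}
```

Calling `safe_reload` as root after editing a file in /etc/icinga would then either reload the server or print the verification errors and leave the running instance untouched.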

Repository for scripts and plugins

All icinga configuration files and user defined plugin scripts are in the hadaq repository: hadaq@lxi001.gsi.de:/misc/hadesprojects/daq/cvsroot. The module nagios contains the subfolder plugins for old and new plugins, and the subfolder icinga with the config files.
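
A working copy of the module can be fetched with a plain cvs checkout. This is a sketch under assumptions: the function name `checkout_nagios` is hypothetical, and access over ssh (via `CVS_RSH`) is assumed because the repository is given in the `user@host:/path` form; the page itself does not state the transport.

```shell
# Repository location as given on this page.
HADAQ_CVSROOT="hadaq@lxi001.gsi.de:/misc/hadesprojects/daq/cvsroot"

# Hypothetical helper: check out the module "nagios" into the current directory.
checkout_nagios() {
    (
        # Assumption: the repository is reached over ssh
        CVS_RSH=ssh
        export CVS_RSH
        cvs -d "$HADAQ_CVSROOT" checkout nagios
    )
}
```

After the checkout, the config files would be expected under nagios/icinga and the plugin scripts under nagios/plugins.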

What do we monitor?

Hostgroups

On icinga web server go to "Hostgroup Overview". The most important groups are as follows:

DAQ Servers (daq-servers). Contains the central machine lxhadesdaq
active EB Servers (eb-servers-active). Eventbuilder servers actually used for data taking.
EB Servers (eb-servers). ALL Eventbuilder servers, also spares.
all etrax nodes (etrax). These are all etrax cpus in the HADES experimental area. These nodes are also divided up into hostgroups for the active components:
- etrax_rpc (rpc).
- etrax_tof (tof).
- etrax_start (start).
- etrax_cts (cts).
- etrax_scs (scs).
power supplies (power). Power supply nodes hadps* in the HADES experimental area.
caen crates (caen). Caen HV Power supply nodes hadhvp in the HADES experimental area.
HADES PCs (hades-pcs). Various interactive machines apart from the DAQ/EB Servers

-- JoernAdamczewski - 16 Feb 2012

Servicegroups

On icinga web server go to "Servicegroup Overview". HINT: Press "View Service Status Grid For All Service Groups" to view all service states at once.

The most important groups are as follows:

General services

Cpu Load (LOAD). check load (uptime) per number of CPUs
Check ssh connections (ssh). test of ssh login.
Linux raid checks (Raid-1). check of /proc/mdstat on machines without adaptec controllers
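
The load check normalizes the load average by the number of CPUs before comparing it with thresholds. A minimal sketch of that idea follows; the function name `check_load_per_cpu` and the threshold values are assumptions, not the plugin that is actually installed. The exit codes follow the usual plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
# Hypothetical plugin core: arguments are load, number of cpus,
# warning threshold and critical threshold (both per cpu).
check_load_per_cpu() {
    awk -v l="$1" -v n="$2" -v w="$3" -v c="$4" 'BEGIN {
        if (n <= 0) { print "UNKNOWN - invalid cpu count"; exit 3 }
        per = l / n
        if (per >= c) { printf "CRITICAL - load per cpu %.2f\n", per; exit 2 }
        if (per >= w) { printf "WARNING - load per cpu %.2f\n", per; exit 1 }
        printf "OK - load per cpu %.2f\n", per
        exit 0
    }'
}

# Possible use on a Linux node (thresholds 1.5 and 3.0 are made up):
#   check_load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" \
#       "$(grep -c '^processor' /proc/cpuinfo)" 1.5 3.0
```

Dividing by the CPU count lets the same thresholds be used for machines with different numbers of cores.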

Eventbuilder and DAQ related

Eventbuilder data disk status (EB-data-disks). Monitor validity of eventbuilder partitions /data01 .. /data22 with fine granularity.
Raid controllers (adaptecs). Active check of status of adaptec raid controllers on eventbuilders. NOTE: error may be indicated by state UNKNOWN
Eventbuilder disks balancing and cleanup (EB-disk-services). Processes daq_disks and cleanup.pl
Oracle import services EB and DAQ (oracle-clients). Daemons runinfo2oracle and daq2ora that insert ascii files from eventbuilders into the hades database.
Eventbuilder EPICS (EB-epics). Check the number of epics iocs for eventbuilder servers and the eventbuilder PV state. NOTE: if EBi-status is CRITICAL, this indicates that the eventbuilder process is not running
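
The per-partition disk check can be pictured as a loop over /data01 .. /data22 that reports every partition that is not mounted. The sketch below is an illustration only: the function name `missing_data_disks` is hypothetical, and it assumes that all 22 partitions are expected on the node and that a mounted partition appears in /proc/mounts.

```shell
# Hypothetical helper: print every /dataNN mount point that is missing.
# $1 = mounts file (default /proc/mounts), $2 = last partition number (default 22)
missing_data_disks() {
    mounts_file="${1:-/proc/mounts}"
    last="${2:-22}"
    i=1
    while [ "$i" -le "$last" ]; do
        mp=$(printf '/data%02d' "$i")
        # field 2 of each line in /proc/mounts is the mount point
        awk -v mp="$mp" '$2 == mp { found = 1 } END { exit !found }' "$mounts_file" \
            || echo "$mp"
        i=$((i + 1))
    done
}
```

A plugin built on this would report CRITICAL for each partition printed and OK when the output is empty.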

-- JoernAdamczewski - 16 Feb 2012

Standalone services

There may be single services monitored which are not combined in groups. To display all existing services, on icinga web server go to "Service Detail".

How to act (for beam-time periods)

host	service	status	action
CENTRAL
lxhadesdaq	ping	CRITICAL
lxhadesdaq	/var	CRITICAL	l
lxhadesdaq	CPU load	CRITICAL	wait 20 min. CALL
lxhadesdaq	RUN2ORA	CRITICAL	An icinga eventhandler should solve this problem automatically after 5 minutes. If this fails, log in with `ssh hadaq@lxhadesdaq` and run `nohup /home/hadaq/trbsoft/daq/oracle/runinfo2orastart_parallel.sh >/dev/null 2>&1 &`
lxhadesdaq	DAQ2ORA	CRITICAL	An icinga eventhandler should solve this problem automatically after 5 minutes. If this fails, log in with `ssh hadaq@lxhadesdaq` and run `nohup /home/hadaq/trbsoft/daq/oracle/daq2ora_client.pl -d -o &`
EVENTBUILDERS
lxhadeb0*	daq_disks	CRITICAL	An icinga eventhandler should solve this problem automatically after 5 minutes. If this fails, log in with `ssh hadaq@lxhadeb0` and run `nohup /home/hadaq/bin/daq_disks -a -s 10 >/dev/null 2>&1 &`
lxhadeb0*	disks cleanup	CRITICAL	An icinga eventhandler should solve this problem automatically after 5 minutes. If this fails, log in with `ssh hadaq@lxhadeb0` and run `nohup /home/hadaq/bin/cleanup.pl >/dev/null 2>&1 &`
lxhadeb0*	EB-EPICS procs	CRITICAL	Wait 5 minutes. Login to `hadaq@lxhadesdaq` and restart all eventbuilder iocs by this: cd /home/hadaq/trbsoft/daq/evtbuild/; ./start_eb_gbe.pl -i start -n 1-16
lxhadeb0*	EBnn-status	CRITICAL	This state means that the eventbuilder process nn itself is not running. This may happen by chance if icinga has just updated status when eventbuilders were being restarted. If this state remains for several minutes during data taking, eventbuilders should be restarted. Do this from operator gui ("StartEB" button), or login to `hadaq@lxhadesdaq` and restart all eventbuilders by this: cd /home/hadaq/trbsoft/daq/evtbuild/; ./start_eb_gbe.pl -e restart -n 1-16
lxhadeb0*	EBnn-status	UNKNOWN	Means that the eventbuilder ioc is not running. This should correspond to the CRITICAL state of service `EB-EPICS procs` on this node. Proceed as described for EB-EPICS procs.
lxhadeb0*	/dataNN	CRITICAL	This state means that partition `/dataNN` on eventbuilder `lxhadeb0` is not available anymore. This is likely due to xfs filesystem errors. Try to mount partition again: login to `root@lxhadeb0` and type umount /dataNN; mount /dataNN
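
Several rows above rely on an icinga eventhandler that restarts a process by itself. The general pattern is sketched here; it is not the handler defined in commands.cfg. The function names `handle_event` and `restart_service` are hypothetical, and the sketch assumes that the handler receives the macros $SERVICESTATE$ and $SERVICESTATETYPE$ as its two arguments and that the restart command runs on the affected node itself.

```shell
# Hypothetical restart action, using the manual command from the daq_disks row.
# Assumption: this runs on the eventbuilder where the process is missing.
restart_service() {
    nohup /home/hadaq/bin/daq_disks -a -s 10 >/dev/null 2>&1 &
}

# $1 = service state (OK, WARNING, CRITICAL, UNKNOWN)
# $2 = state type (SOFT while the check is being retried, HARD afterwards)
handle_event() {
    state="$1"
    state_type="$2"
    case "$state" in
        CRITICAL)
            # act only once the retries are exhausted and the state is HARD
            if [ "$state_type" = "HARD" ]; then
                restart_service
            fi
            ;;
        *)
            # OK, WARNING, UNKNOWN: nothing to do in this sketch
            ;;
    esac
    return 0
}
```

Waiting for the HARD state means the restart only happens after the check has failed repeatedly, which is presumably where the delay of about 5 minutes comes from.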

-- JoernAdamczewski - 16 Feb 2012

Topic revision: r5 - 2012-02-17, JoernAdamczewski
