ICINGA
General setup
Icinga is a system and network monitoring application.
It watches hosts and services that you specify, alerting you
when things go bad and when they get better.
ICINGA configuration and restart
- Icinga server runs on hadesdaq02.
- Restart (reload) the icinga service with /etc/init.d/icinga reload. The startup error logfile is /var/log/icinga/config.err. The regular logfile is /var/log/icinga/icinga.log.
- Configuration files are at /etc/icinga. This directory is under cvs version control.
- Plugins are at /usr/lib/nagios/plugins. Note that on the monitored remote nodes, additional plugins are located at /home/hadaq/nagios/plugins; these scripts are under cvs version control.
- Config files: The main file is /etc/icinga/icinga.cfg, which includes the other config files. The configuration for the commands and servers is in the subfolder objects:
  - commands.cfg: Definition of service checks and event handlers
  - localhost.cfg: Definition of services to be tested on localhost, i.e. the icinga webserver hadesdaq02 itself. This is also the central slow control machine!
  - hosts_eb_servers.cfg: Definition of nodes and services for eventbuilder and daq servers
  - hosts_power.cfg: Definition of nodes and services for power supplies
  - hosts_etrax.cfg: Definition of nodes and services for etrax frontend cpus. NOTE: this file is auto-generated by the script daq2icinga.pl from the actual DAQ setup.
  - contacts.cfg: Who will get a notification email from icinga
- NOTE: to update hosts_etrax.cfg, please run the script /etc/icinga/daq2icinga.pl on hadesdaq02. This will evaluate the epics etrax configuration, and the trb.db or detector components via cross mount of /home/hadaq/trb.
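Since a broken configuration only shows up in config.err after the reload, it can help to check the configuration first. The following is a minimal sketch, not part of the existing setup: it assumes the Icinga 1.x binary is installed at /usr/sbin/icinga (an assumption, adjust as needed) and uses its -v option to verify the main config file before calling the init script. The helper name verify_and_reload is hypothetical.

```python
import subprocess

# Paths for the config file and init script are taken from the list above.
# The binary location is an assumption and may differ on hadesdaq02.
ICINGA_BIN = "/usr/sbin/icinga"
MAIN_CFG = "/etc/icinga/icinga.cfg"
INIT_SCRIPT = "/etc/init.d/icinga"


def verify_and_reload(run=subprocess.call):
    """Reload Icinga only if the configuration check passes.

    `run` takes an argument list and returns the exit code. It is a
    parameter so that the logic can be tried without a real installation.
    """
    # "icinga -v <main cfg>" checks the configuration without starting the daemon.
    if run([ICINGA_BIN, "-v", MAIN_CFG]) != 0:
        return False
    # Only reached when the check succeeded.
    return run([INIT_SCRIPT, "reload"]) == 0
```

Calling verify_and_reload() as root on the icinga server returns True if both the check and the reload succeeded, and False otherwise.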
Repository for scripts and plugins
All icinga configuration files and user-defined plugin scripts are in the hadaq repository: hadaq@lxi001.gsi.de:/misc/hadesprojects/daq/cvsroot. The module nagios contains the subfolder plugins for old and new plugins, and the subfolder icinga with the config files.
What do we monitor?
Hostgroups
On icinga web server go to "Hostgroup Overview". The most important groups are as follows:
- DAQ Servers (daq-servers). Contains the central machine lxhadesdaq
- active EB Servers (eb-servers-active). Eventbuilder servers actually used for data taking.
- EB Servers (eb-servers). ALL Eventbuilder servers, also spares.
- all etrax nodes (etrax). These are all etrax cpus in the HADES experimental area. These nodes are also divided up into hostgroups for the active components:
  - etrax_rpc (rpc).
  - etrax_tof (tof).
  - etrax_start (start).
  - etrax_cts (cts).
  - etrax_scs (scs).
- power supplies (power). Power supply nodes hadps* in the HADES experimental area.
- caen crates (caen). Caen HV Power supply nodes hadhvp in the HADES experimental area.
- HADES PCs (hades-pcs). Various interactive machines apart from the DAQ/EB Servers
--
JoernAdamczewski - 16 Feb 2012
Servicegroups
On icinga web server go to "Servicegroup Overview".
HINT: Press "View Service Status Grid For All Service Groups" to view all service states at once.
The most important groups are as follows:
General services
- Cpu Load (LOAD). check load (uptime) per number of CPUs
- Check ssh connections (ssh). test of ssh login.
- Linux raid checks (Raid-1). Check of /proc/mdstat on machines without adaptec controllers.
- Eventbuilder data disk status (EB-data-disks). Monitors validity of eventbuilder partitions /data01 .. /data22 with fine granularity.
- Raid controllers (adaptecs). Active check of status of adaptec raid controllers on eventbuilders. NOTE: an error may be indicated by state UNKNOWN.
- Eventbuilder disks balancing and cleanup (EB-disk-services). Processes daq_disks and cleanup.pl.
- Oracle import services EB and DAQ (oracle-clients). Daemons runinfo2oracle and daq2ora that insert ascii files from eventbuilders into the hades database.
- Eventbuilder EPICS (EB-epics). Checks the number of epics iocs for eventbuilder servers and the eventbuilder PV state. NOTE: if EBi-status is CRITICAL, this indicates that the eventbuilder process is not running.
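
To illustrate what "load per number of CPUs" means for the LOAD service, here is a minimal sketch of such a check. It is not the plugin actually installed under /usr/lib/nagios/plugins. The function name and the warning/critical thresholds (1.5 and 3.0 per CPU) are hypothetical; only the exit codes 0/1/2 for OK/WARNING/CRITICAL follow the usual plugin convention.

```python
import os

# Usual plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_load_per_cpu(load1, ncpus, warn=1.5, crit=3.0):
    """Return (exit_code, message) for the 1-minute load divided by the CPU count.

    The thresholds are illustrative defaults, not the values used in commands.cfg.
    """
    per_cpu = load1 / max(ncpus, 1)
    if per_cpu >= crit:
        state, label = CRITICAL, "CRITICAL"
    elif per_cpu >= warn:
        state, label = WARNING, "WARNING"
    else:
        state, label = OK, "OK"
    message = "LOAD %s - load per cpu %.2f (load1=%.2f, cpus=%d)" % (
        label, per_cpu, load1, ncpus)
    return state, message


def main():
    """Plugin entry point: read the local load and CPU count, print the status line."""
    state, message = check_load_per_cpu(os.getloadavg()[0], os.cpu_count() or 1)
    print(message)
    return state
```

A plugin wrapper would end with sys.exit(main()), so that icinga sees the exit code as the service state. With these thresholds, a load of 8.0 on a 4-CPU machine gives 2.0 per CPU and therefore WARNING.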
--
JoernAdamczewski - 16 Feb 2012
Standalone services
There may be single services monitored which are not combined in groups. To display all existing services, on icinga web server go to "Service Detail".
How to act (for beam-time periods)
| host | service | status | action |
| CENTRAL |
| lxhadesdaq | ping | CRITICAL | |
| lxhadesdaq | /var | CRITICAL | l |
| lxhadesdaq | CPU load | CRITICAL | Wait 20 min. CALL |
| lxhadesdaq | RUN2ORA | CRITICAL | There should be an icinga eventhandler that automatically solves this problem after 5 minutes. If this fails, log in via ssh to hadaq@lxhadesdaq and type nohup /home/hadaq/trbsoft/daq/oracle/runinfo2orastart_parallel.sh >/dev/null 2>&1 & |
| lxhadesdaq | DAQ2ORA | CRITICAL | There should be an icinga eventhandler that automatically solves this problem after 5 minutes. If this fails, log in via ssh to hadaq@lxhadesdaq and type nohup /home/hadaq/trbsoft/daq/oracle/daq2ora_client.pl -d -o & |
| EVENTBUILDERS |
| lxhadeb0* | daq_disks | CRITICAL | There should be an icinga eventhandler that automatically solves this problem after 5 minutes. If this fails, log in via ssh to hadaq@lxhadeb0* and type nohup /home/hadaq/bin/daq_disks -a -s 10 >/dev/null 2>&1 & |
| lxhadeb0* | disks cleanup | CRITICAL | There should be an icinga eventhandler that automatically solves this problem after 5 minutes. If this fails, log in via ssh to hadaq@lxhadeb0* and type nohup /home/hadaq/bin/cleanup.pl >/dev/null 2>&1 & |
| lxhadeb0* | EB-EPICS procs | CRITICAL | Wait 5 minutes. Log in to hadaq@lxhadesdaq and restart all eventbuilder iocs with: cd /home/hadaq/trbsoft/daq/evtbuild/; ./start_eb_gbe.pl -i start -n 1-16 |
| lxhadeb0* | EBnn-status | CRITICAL | This state means that the eventbuilder process nn itself is not running. This may happen by chance if icinga updated the status just while the eventbuilders were being restarted. If this state remains for several minutes during data taking, the eventbuilders should be restarted. Do this from the operator gui ("StartEB" button), or log in to hadaq@lxhadesdaq and restart all eventbuilders with: cd /home/hadaq/trbsoft/daq/evtbuild/; ./start_eb_gbe.pl -e restart -n 1-16 |
| lxhadeb0* | EBnn-status | UNKNOWN | Means that the eventbuilder ioc is not running. Should correspond to the CRITICAL state of service EB-EPICS procs on this node. Do as described for EB-EPICS procs. |
| lxhadeb0* | /dataNN | CRITICAL | This state means that partition /dataNN on eventbuilder lxhadeb0* is not available anymore. This is likely due to xfs filesystem errors. Try to mount the partition again: log in to root@lxhadeb0* and type umount /dataNN; mount /dataNN |
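
Several rows above rely on an icinga event handler that restarts the failed process automatically. The following sketch shows how such a handler could decide when to act. It is an assumption about the general pattern, not a copy of the handlers defined in commands.cfg. Icinga can pass the service state, the state type (SOFT or HARD) and the current check attempt to the handler as arguments. The function names, the choice of the third soft attempt, and running the restart command locally on the affected node are all hypothetical.

```python
import subprocess

# Restart command taken from the daq_disks row of the table above.
RESTART_CMD = "nohup /home/hadaq/bin/daq_disks -a -s 10 >/dev/null 2>&1 &"


def should_restart(state, state_type, attempt, soft_attempt=3):
    """Decide whether the handler should restart the process.

    Acts when the service is CRITICAL and either the state is HARD or the
    given soft retry has been reached. Brief glitches (earlier soft
    attempts) are ignored.
    """
    if state != "CRITICAL":
        return False
    if state_type == "HARD":
        return True
    return state_type == "SOFT" and int(attempt) == soft_attempt


def handle_event(state, state_type, attempt, run=subprocess.call):
    """Run the restart command if needed. Returns True when a restart was triggered."""
    if should_restart(state, state_type, attempt):
        # shell=True is needed because the command uses redirection and "&".
        run(RESTART_CMD, shell=True)
        return True
    return False
```

A handler script would call handle_event(*sys.argv[1:4]) with the three values supplied by icinga. Waiting for a later soft attempt or the HARD state is what produces a delay of a few minutes before the automatic restart, depending on the configured retry interval.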
--
JoernAdamczewski - 16 Feb 2012