ICINGA

General setup

Icinga is a system and network monitoring application. It watches hosts and services that you specify, alerting you when things go bad and when they get better.

ICINGA configuration and restart

  • Icinga server runs on hadesdaq02.
  • Restart icinga services with /etc/init.d/icinga reload. Startup errors are logged to /var/log/icinga/config.err; the regular logfile is /var/log/icinga/icinga.log. (A config check and reload cycle is sketched after this list.)
  • Configuration files are at /etc/icinga. This directory is under CVS version control.
  • Plugins are at /usr/lib/nagios/plugins. Note that additional plugins are located on the monitored remote nodes at /home/hadaq/nagios/plugins; these scripts are also under CVS version control.
  • Config files: the main file is /etc/icinga/icinga.cfg, which includes the other config files. The configuration of commands and servers is in the subfolder objects:
    • commands.cfg: Definition of service checks and event handlers
    • localhost.cfg: Definition of services to be tested on localhost, i.e. the icinga webserver hadesdaq02 itself. This is also the central slow control machine!
    • hosts_eb_servers.cfg: Definition of nodes and services for eventbuilder and daq server
    • hosts_power.cfg: Definition of nodes and services for power supplies
    • hosts_etrax.cfg: Definition of nodes and services for etrax frontend cpus. NOTE: this file is auto-generated by the script daq2icinga.pl from the actual DAQ setup.
    • contacts.cfg: Who will receive notification emails from icinga
  • NOTE: to update hosts_etrax.cfg, run the script /etc/icinga/daq2icinga.pl on hadesdaq02. This will evaluate the epics etrax configuration and the trb.db of detector components via a cross mount of /home/hadaq/trb.
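
The whole update cycle can be done in a few commands. A minimal sketch, run as root on hadesdaq02 (the paths are taken from this page; that the icinga binary is on the PATH for the syntax check is an assumption):

   /etc/icinga/daq2icinga.pl           # regenerate objects/hosts_etrax.cfg from the actual DAQ setup
   icinga -v /etc/icinga/icinga.cfg    # pre-flight syntax check of the full configuration
   /etc/init.d/icinga reload           # apply the new config; on errors check /var/log/icinga/config.err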

Repository for scripts and plugins

All icinga configuration files and user-defined plugin scripts are in the hadaq repository: hadaq@lxi001.gsi.de:/misc/hadesprojects/daq/cvsroot. The module nagios contains the subfolder plugins for old and new plugins, and the subfolder icinga with the config files.
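
A minimal checkout sketch, assuming repository access via ssh (the :ext: method is an assumption; the repository path is taken from this page):

   export CVS_RSH=ssh
   export CVSROOT=:ext:hadaq@lxi001.gsi.de:/misc/hadesprojects/daq/cvsroot
   cvs checkout nagios     # yields the subfolders nagios/plugins and nagios/icinga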

What do we monitor?

Hostgroups

On icinga web server go to "Hostgroup Overview". The most important groups are as follows:

  • DAQ Servers (daq-servers). Contains the central machine lxhadesdaq
  • active EB Servers (eb-servers-active). Eventbuilder servers actually used for data taking.
  • EB Servers (eb-servers). ALL Eventbuilder servers, also spares.
  • all etrax nodes (etrax). These are all etrax cpus in the HADES experimental area. These nodes are also divided into hostgroups for the active components:
    • etrax_rpc (rpc).
    • etrax_tof (tof).
    • etrax_start (start).
    • etrax_cts (cts).
    • etrax_scs (scs).
  • power supplies (power). Power supply nodes hadps* in the HADES experimental area.
  • caen crates (caen). Caen HV Power supply nodes hadhvp in the HADES experimental area.
  • HADES PCs (hades-pcs). Various interactive machines apart from the DAQ/EB Servers
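
For orientation, hostgroup membership in the Icinga 1.x object files looks roughly as follows. This is a sketch only: the host name hadc01 and the generic-host template are hypothetical placeholders; the real etrax definitions are auto-generated into hosts_etrax.cfg by daq2icinga.pl:

   define hostgroup{
       hostgroup_name  etrax_rpc
       alias           etrax rpc frontends
       }

   define host{
       use             generic-host      ; assumed template name, not from this page
       host_name       hadc01            ; hypothetical etrax node
       address         hadc01
       hostgroups      etrax,etrax_rpc   ; puts the node into both overview groups
       }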

Servicegroups

On icinga web server go to "Servicegroup Overview". HINT: Press "View Service Status Grid For All Service Groups" to view all service states at once.

The most important groups are as follows:

General services

  • Cpu Load (LOAD). Checks the load average (from uptime) relative to the number of CPUs.
  • Check ssh connections (ssh). Test of an ssh login.
  • Linux raid checks (Raid-1). Check of /proc/mdstat on machines without Adaptec controllers.

  • Eventbuilder data disk status (EB-data-disks). Monitors the validity of the eventbuilder partitions /data01 .. /data22 with fine granularity.
  • Raid controllers (adaptecs). Active check of the status of the Adaptec raid controllers on the eventbuilders. NOTE: an error may be indicated by state UNKNOWN.
  • Eventbuilder disks balancing and cleanup (EB-disk-services). Processes daq_disks and cleanup.pl.
  • Oracle import services EB and DAQ (oracle-clients). Daemons runinfo2oracle and daq2ora that insert ASCII files from the eventbuilders into the hades database.
  • Eventbuilder EPICS (EB-epics). Checks the number of epics iocs for the eventbuilder servers and the eventbuilder PV state. NOTE: if EBi-status is CRITICAL, this indicates that the eventbuilder process is not running. (A service definition sketch follows after this list.)
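
For orientation, a service check in the Icinga 1.x object files is defined roughly as follows. A sketch only: the generic-service template and the host name lxhadeb01 are hypothetical placeholders; check_ssh is the standard plugin behind the ssh group:

   define service{
       use                  generic-service   ; assumed template name, not from this page
       host_name            lxhadeb01         ; hypothetical eventbuilder node
       service_description  ssh
       servicegroups        ssh               ; attaches the check to the servicegroup
       check_command        check_ssh
       }

Plugins can also be run by hand for debugging; the exit code 0/1/2/3 maps to OK/WARNING/CRITICAL/UNKNOWN:

   /usr/lib/nagios/plugins/check_ssh lxhadeb01 ; echo "exit code: $?"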

Standalone services

Some individual services are monitored which are not combined into groups. To display all existing services, go to "Service Detail" on the icinga web server.

How to act (for beam-time periods)

| host | service | status | action |
| CENTRAL | | | |
| lxhadesdaq | ping | CRITICAL | |
| lxhadesdaq | /var | CRITICAL | |
| lxhadesdaq | CPU load | CRITICAL | Wait 20 min. CALL |
| lxhadesdaq | RUN2ORA | CRITICAL | An icinga eventhandler should solve this problem automatically after 5 minutes. If this fails, login with ssh hadaq@lxhadesdaq and type nohup /home/hadaq/trbsoft/daq/oracle/runinfo2orastart_parallel.sh >/dev/null 2>&1 & |
| lxhadesdaq | DAQ2ORA | CRITICAL | An icinga eventhandler should solve this problem automatically after 5 minutes. If this fails, login with ssh hadaq@lxhadesdaq and type nohup /home/hadaq/trbsoft/daq/oracle/daq2ora_client.pl -d -o & |
| EVENTBUILDERS | | | |
| lxhadeb0* | daq_disks | CRITICAL | An icinga eventhandler should solve this problem automatically after 5 minutes. If this fails, login with ssh hadaq@lxhadeb0* and type nohup /home/hadaq/bin/daq_disks -a -s 10 >/dev/null 2>&1 & |
| lxhadeb0* | disks cleanup | CRITICAL | An icinga eventhandler should solve this problem automatically after 5 minutes. If this fails, login with ssh hadaq@lxhadeb0* and type nohup /home/hadaq/bin/cleanup.pl >/dev/null 2>&1 & |
| lxhadeb0* | EB-EPICS procs | CRITICAL | Wait 5 minutes. Login to hadaq@lxhadesdaq and restart all eventbuilder iocs: cd /home/hadaq/trbsoft/daq/evtbuild/; ./start_eb_gbe.pl -i start -n 1-16 |
| lxhadeb0* | EBnn-status | CRITICAL | The eventbuilder process nn itself is not running. This may happen by chance if icinga has just updated the status while the eventbuilders were being restarted. If this state remains for several minutes during data taking, restart the eventbuilders: use the operator gui ("StartEB" button), or login to hadaq@lxhadesdaq and run cd /home/hadaq/trbsoft/daq/evtbuild/; ./start_eb_gbe.pl -e restart -n 1-16 |
| lxhadeb0* | EBnn-status | UNKNOWN | The eventbuilder ioc is not running. This should correspond to a CRITICAL state of service EB-EPICS procs on this node. Act as described for EB-EPICS procs. |
| lxhadeb0* | /dataNN | CRITICAL | Partition /dataNN on eventbuilder lxhadeb0* is not available anymore, most likely due to xfs filesystem errors. Try to mount the partition again: login to root@lxhadeb0* and type umount /dataNN; mount /dataNN |
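
The eventhandlers referenced in the table follow the standard Icinga/Nagios pattern: a command defined in commands.cfg calls a script that reacts once a problem reaches a HARD CRITICAL state. A sketch only; the command name restart-daq-disks and the script path are hypothetical, not taken from this page:

   define command{
       command_name  restart-daq-disks
       command_line  /usr/lib/nagios/plugins/eventhandlers/restart-daq-disks.sh $SERVICESTATE$ $SERVICESTATETYPE$
       }

The script itself, attached to the daq_disks service via its event_handler directive, could look like:

   #!/bin/bash
   # restart-daq-disks.sh (hypothetical): restart daq_disks once the problem is a HARD CRITICAL
   if [ "$1" = "CRITICAL" ] && [ "$2" = "HARD" ]; then
       su - hadaq -c 'nohup /home/hadaq/bin/daq_disks -a -s 10 >/dev/null 2>&1 &'
   fi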

-- JoernAdamczewski - 16 Feb 2012