Eventbuilder status in 2014

This page describes the actual setup before the beamtimes of Jul14 and Aug14.

Eventbuilder servers

Hardware

lxhadesdaq

Central server machine, on the hardware of the previous lxhadeb01.

lxhadesdaqold

hardware of previous lxhadesdaq. Not enabled by default

lxhadeb02

Eventbuilders 2,6,10,14

EB 2 is monitoring server

lxhadeb03

lxhadeb04

lxhadeb05

Master event builder EB1

Eventbuilder software

Only changes are listed here.

Besides the standard production software daq_evtbuild/daq_netmem there is a dabc installation at /home/hadaq/soft/dabc. Which eventbuilders run with dabc can be switched in /home/hadaq/trbsoft/hadesdaq/evtbuild/eb.conf with the tag line:

    DABC:         0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0

Setup for beamtime: Only EB5 with dabc, streamserver output enabled.
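
As an illustration of how the DABC tag line can be read, here is a minimal Python sketch. It assumes that the flags are listed in order EB1 to EB16, which matches the line above together with the statement that only EB5 runs with dabc. The function name is hypothetical, and the real eb.conf may contain syntax that this sketch does not handle.

```python
def dabc_eventbuilders(conf_text):
    """Return the 1-based numbers of eventbuilders whose DABC flag is 1.

    Assumption: the flags after "DABC:" are ordered EB1, EB2, ... EB16.
    """
    for line in conf_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "DABC":
            flags = value.split()
            return [i for i, flag in enumerate(flags, start=1) if flag == "1"]
    return []  # no DABC tag found


example = "DABC:         0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0"
print(dabc_eventbuilders(example))  # prints [5]
```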

Configuration

Eventbuilder configuration is done by central files on lxhadesdaq:
  • /home/hadaq/trbsoft/hadesdaq/evtbuild/eb.conf
  • /home/hadaq/trbsoft/hadesdaq/hub/register_configgbe_ip.db
  • /home/hadaq/trbsoft/hadesdaq/main/data_sources.db

Special configuration templates for DABC are on each eventbuilder server at /home/hadaq/oper/EventBuilderHades.xml. The logfile for dabc is /home/hadaq/oper/EB_i.log, where i is the number of the eventbuilder process.

Performance measurements

RFIO and DABC on 25 June 2014

Goal: find out how disk IO, tape IO (RFIO) and DABC influence the eventbuilder performance and the loss of events.

Data

Setup: 31 subsystems. Trigger: 1 MHz pulser. One run has about 2e6 (2 million) events in total, with 400 Mb/s write bandwidth.

Note that the number of events per run changes with the number of eventbuilders used! A run is always terminated when EB1 exceeds 1.5 GByte, so the total amount of data written scales with the number of EBs.

Measurements were done by observing the rates on the eventbuilder EPICS GUI. File write rates were not evaluated; the numbers are manual estimates:

| CTS Trigger (kHz) | Disk IO | RFIO | EB mask | DABC | Eventbuilder total kEv/s | Discarded events / run | Discarded fraction | Comments |
| 70.1 | 1 | 0 | FF | EB 5 only | 69...73 | 537 | < 2.7e-4 | |
| 70.2 | 1 | 1 | FF | EB 5 only | 68...75 | 382...50000 | < 2.5e-2 | often large number of events lost due to queue flushing |
| 70.2 | 0 | 0 | FF | EB 5 only | ~70 | 526 | < 2.6e-4 | |
| 70.2 | 1 | 1 | FF | all except EB1 and EB2 | 51 | 500000 | < 2.5e-1 | cpu balancing of DABC is not yet tuned well! high load on single cpus |
| 70.2 | 1 | 0 | FF | all except EB1 and EB2 | 40 | 500000 | < 2.5e-1 | cpu balancing of DABC is not yet tuned well! high load on single cpus |
| 70.2 | 1 | 1 | FF | EB5,6,7,8 | 65 | 152000 | < 8e-2 | cpu balancing of DABC is not yet tuned well! high load on single cpus |
| 70.2 | 1 | 1 | FFFF | EB 5 only | 66 | 88000 | < 2.2e-2 | often queue flushing |
| 70.2 | 1 | 0 | FFFF | EB 5 only | 68 | 80000 | < 2.2e-2 | often large number of events lost due to queue flushing |
| 70.2 | 1 | 0 | FF | EB 5 only | 70 | 700 | < 3.5e-4 | reproduce previous measurement |
| 70.2 | 1 | 1 | FF | EB 5 only | 69...72 | 1072...60000 | < 3.0e-2 | reproduce previous measurement |
| 70.2 | 1 | 1 | FFF | EB 5 only | 71 | 1000 | < 3.3e-4 | |
| 70.2 | 1 | 0 | FFF | EB 5 only | 69...72 | 1400 | < 4.7e-4 | |
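
The discarded fractions in the table can be reproduced with a short calculation. The following Python sketch assumes that every active eventbuilder contributes about 250000 events per run, i.e. 2e6 events at mask FF (8 EBs). This per-EB number is not stated on this page; it is inferred from the table values and from the note that the data volume scales with the number of EBs. Function and constant names are hypothetical.

```python
# Assumption: about 2e6 events per run at mask FF (8 EBs), scaling with EB count.
EVENTS_PER_EB = 250_000


def active_ebs(mask_hex):
    """Number of eventbuilders enabled in a hexadecimal EB mask."""
    return bin(int(mask_hex, 16)).count("1")


def discarded_fraction(discarded_events, mask_hex):
    """Discarded events divided by the estimated total events of one run."""
    return discarded_events / (EVENTS_PER_EB * active_ebs(mask_hex))


for discarded, mask in [(537, "FF"), (1000, "FFF"), (88000, "FFFF")]:
    print(mask, active_ebs(mask), f"{discarded_fraction(discarded, mask):.1e}")
```

With these assumptions the three example rows give about 2.7e-4, 3.3e-4 and 2.2e-2, in agreement with the upper limits quoted in the table.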

Further observations and Conclusions:

  • RFIO does affect the lost event rate under extreme conditions (CTS permanently at 70 kHz). With the standard setup (EB mask FF, one DABC eventbuilder EB5) the loss is up to a factor of 100 higher than with plain disk IO. However, the event loss is still < 3%.
  • The DABC eventbuilders are not yet tuned well concerning CPU balancing. With 4 or 6 active DABC eventbuilders, the eventbuilder rates may drop to 40 kHz with 25% event loss! This may also be a side effect of the currently high DABC debug write output. Note that the old EB processes are pinned to well defined CPUs. Further improvements may be possible here, but not before the beamtime.
  • Using one DABC eventbuilder on lxhadeb05 only does not show lower performance than daq_netmem/daq_evtbuild.
  • Distributing the load to more eventbuilder processes (EB masks FFFF or FFF) does not necessarily improve performance:
    • With mask FFFF (all 16 EBs) the event loss rate gets even worse than for FF, even without RFIO. This may be because the EBs still share the same network interface.
    • Mask FFF (12 EBs) improves the situation for RFIO, though: the netmem queue flushes that lead to large numbers of lost events occur less often. The loss rate is below 1% both with RFIO and with plain disk IO.
  • Probably 12 EBs with RFIO and EB5 with DABC/monitoring streamserver is a good setup for the beamtime.
  • Adjusting the ethernet interrupt affinities with the set_eth_affinity_lxhadeb0i.pl scripts works initially, but after a while the system resets the affinities to some default CPUs! This behaviour is probably due to the Linux upgrade of the EB servers to Debian 7. TODO: find out whether the default affinity balancing over the first 8 cores is worse than setting eth3 and eth4 all to a single higher core that is not reserved for EB processes. At first glance, performance is not affected by this.
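
To check whether the system has reset an interrupt affinity, it helps to translate between core lists and the hexadecimal CPU bitmask that Linux exposes in /proc/irq/<irq>/smp_affinity. The Python helpers below are a minimal sketch of that conversion only. They are not taken from the set_eth_affinity_lxhadeb0i.pl scripts, the function names are hypothetical, and the IRQ numbers of eth3 and eth4 are not given on this page.

```python
def cores_to_affinity_mask(cores):
    """Hex bitmask (as written to smp_affinity) with one bit set per core."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return format(mask, "x")


def affinity_mask_to_cores(mask_hex):
    """List of core numbers set in an smp_affinity value.

    The kernel may print the mask as comma-separated 32-bit groups,
    so commas are removed before parsing.
    """
    mask = int(mask_hex.replace(",", ""), 16)
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


print(cores_to_affinity_mask(range(8)))             # first 8 cores
print(cores_to_affinity_mask([10]))                 # a single higher core
print(affinity_mask_to_cores("00000000,000000ff"))  # back to a core list
```

Comparing the core list read back from smp_affinity with the intended one would show when the affinities were reset.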

-- JoernAdamczewski - 25 Jun 2014

After tuning of cpu affinities on 26 June 2014

Changes:
  • Adjusted the list of free cores in start_eb_gbe.pl for machines lxhadeb02/03/04: cores 0...5 are reserved for eth interrupts, cores 6...12 are available for eventbuilder processes
  • Disabled set_eth_affinity_lxhadeb0i.pl in move_doublecpu_irq.sh for lxhadeb02/03/04

Result: Eventbuilders no longer share cores with the eth0 interrupts, and no core runs at 99% load any more!
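
The core separation described above can be sketched in Python as follows. This is an illustration only: the actual logic lives in the Perl script start_eb_gbe.pl and is not shown on this page, so the round-robin assignment and all names below are assumptions. Only the two core ranges are taken from the list of changes.

```python
from itertools import cycle

ETH_IRQ_CORES = range(0, 6)   # cores 0...5, reserved for eth interrupts
EB_CORES = range(6, 13)       # cores 6...12, free for eventbuilder processes


def assign_cores(eb_numbers, free_cores=EB_CORES):
    """Map each eventbuilder number to one free core, round robin (assumed policy)."""
    pool = cycle(free_cores)
    return {eb: next(pool) for eb in eb_numbers}


# Eventbuilders hosted on lxhadeb02 according to the hardware list
assignment = assign_cores([2, 6, 10, 14])
print(assignment)

# None of the assigned cores is one of the interrupt cores
assert not set(assignment.values()) & set(ETH_IRQ_CORES)
```

On Linux, a running process could then be pinned with os.sched_setaffinity(pid, {core}) from the Python standard library.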

Data

Setup: 31 subsystems. Trigger: 1 MHz pulser. One run has about 2e6 (2 million) events in total (for 16 EBs), with 400 Mb/s write bandwidth.

Note that the number of events per run changes with the number of eventbuilders used! A run is always terminated when EB1 exceeds 1.5 GByte, so the total amount of data written scales with the number of EBs.

The frontend setup was probably changed since yesterday, because the CTS trigger rate is lower at the same pulser setting.

| CTS Trigger (kHz) | Disk IO | RFIO | EB mask | DABC | Eventbuilder total kEv/s | Discarded events / run | Discarded fraction | Comments |
| 67 | 1 | 1 | FF | EB 5 only | 67 | 350 | < 1.8e-4 | |
| 67 | 1 | 1 | FFF | EB 5 only | 67 | 1300 | < 4.3e-4 | |
| 67 | 1 | 1 | FFFF | EB 5 only | 67 | 1900 | < 4.8e-4 | |

Further observations and Conclusions:

  • Tuning the cpu affinities mostly reduces the event loss rate with RFIO enabled
  • At an event rate of 67 kHz there is an average event loss of about 1 Hz in the complete system, i.e. it appears round robin on different eventbuilders depending on the EB mask. This was debugged and could be traced to a known event counter overflow error that is not treated by the eventbuilder software at the moment (working with 16 bit only would decrease the recovery stability of the system). Some debug output proving this is in the file log_lostevents.txt.
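
The observed loss rate of about 1 Hz is consistent with a counter that wraps once per 2^16 events. The following Python sketch shows the arithmetic under the assumption that the overflowing event counter is 16 bits wide. The page suggests this width but does not state it explicitly, and the function name is hypothetical.

```python
def wraps_per_second(event_rate_hz, counter_bits=16):
    """How often per second a counter of the given width overflows."""
    return event_rate_hz / (1 << counter_bits)


# 67 kHz event rate and an assumed 16-bit counter: roughly one overflow per second
print(round(wraps_per_second(67_000), 2))
```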


-- JoernAdamczewski - 26 Jun 2014
Topic attachments
| Attachment | Size | Date | Who | Comment |
| log_lostevents.txt | 11.1 K | 26 Jun 2014 - 10:47 | JoernAdamczewski | Collected output of hadeslog/messages and debug of EB2 for lost events due to overflow |
Topic revision: r2 - 26 Jun 2014, JoernAdamczewski
 