Eventbuilder status in 2014

This page describes the actual setup before the beamtimes in July 2014 and August 2014.

Eventbuilder servers

Hardware

lxhadesdaq

Central server machine. Uses the hardware of the previous lxhadeb01.

lxhadesdaqold

Uses the hardware of the previous lxhadesdaq. Not enabled by default.

lxhadeb02

Eventbuilders 2,6,10,14

EB 2 is the monitoring server.

lxhadeb03

lxhadeb04

lxhadeb05

Master event builder EB1

Eventbuilder software

Only changes are listed here.

Besides the standard production software daq_evtbuild/daq_netmem, there is a DABC installation at /home/hadaq/soft/dabc. Which eventbuilders run with DABC can be switched in /home/hadaq/trbsoft/hadesdaq/evtbuild/eb.conf with the DABC tag:

DABC:         0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0

Setup for beamtime: only EB5 runs with DABC, with streamserver output enabled.
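To check which eventbuilders are switched to DABC, the tag line can be read programmatically. The following is a minimal Python sketch, assuming that the n-th flag of the DABC line belongs to EBn (this matches the beamtime setup, where the fifth flag is set and EB5 runs DABC). The function name is illustrative, and the rest of the eb.conf format is not assumed here.

```python
def dabc_eventbuilders(conf_text):
    """Return the 1-based EB numbers whose DABC flag is set to 1.

    Assumption: the n-th value after 'DABC:' belongs to eventbuilder n.
    """
    for line in conf_text.splitlines():
        key, sep, values = line.partition(":")
        if sep and key.strip() == "DABC":
            flags = [int(v) for v in values.split()]
            return [i + 1 for i, flag in enumerate(flags) if flag == 1]
    return []  # no DABC line found


example = "DABC:         0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0"
print(dabc_eventbuilders(example))  # [5]
```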

Configuration

Eventbuilder configuration is done by central files on lxhadesdaq:
  • /home/hadaq/trbsoft/hadesdaq/evtbuild/eb.conf
  • /home/hadaq/trbsoft/hadesdaq/hub/register_configgbe_ip.db
  • /home/hadaq/trbsoft/hadesdaq/main/data_sources.db

Special configuration templates for DABC are on each eventbuilder server at /home/hadaq/oper/EventBuilderHades.xml. The DABC logfile is /home/hadaq/oper/EB_i.log, where i is the number of the eventbuilder process.

Performance measurements

RFIO and DABC on 25 June 2014

Goal: determine the influence of disk and tape IO and of DABC on eventbuilder performance and event loss.

Data

Setup: 31 subsystems. Trigger: 1 MHz pulser. One run has about 2e6 (2 million) events in total, with 400 Mb/s write bandwidth.

Note that the number of events per run changes with the number of eventbuilders used! The run is always terminated when EB1 exceeds 1.5 GByte, so the total amount of data written scales with the number of EBs.

The measurement was done by observing the rates on the eventbuilder EPICS GUI. The file write rate was not evaluated! The values are manual estimates only:

| CTS Trigger (kHz) | Disk IO | RFIO | EB mask | DABC | Eventbuilder total kEv/s | Discarded events / run | Discarded fraction | Comments |
| 70.1 | 1 | 0 | FF | EB 5 only | 69..73 | 537 | <2.7e-4 | |
| 70.2 | 1 | 1 | FF | EB 5 only | 68..75 | 382...50000 | < 2.5e-2 | often large number of events lost due to queue flushing |
| 70.2 | 0 | 0 | FF | EB 5 only | ~70 | 526 | < 2.6e-4 | |
| 70.2 | 1 | 1 | FF | all except EB1 and EB2 | 51 | 500000 | < 2.5e-1 | cpu balancing of DABC is not yet tuned well! high load on single cpus |
| 70.2 | 1 | 0 | FF | all except EB1 and EB2 | 40 | 500000 | < 2.5e-1 | cpu balancing of DABC is not yet tuned well! high load on single cpus |
| 70.2 | 1 | 1 | FF | EB5,6,7,8 | 65 | 152000 | < 8e-2 | cpu balancing of DABC is not yet tuned well! high load on single cpus |
| 70.2 | 1 | 1 | FFFF | EB 5 only | 66 | 88000 | < 2.2e-2 | often queue flushing |
| 70.2 | 1 | 0 | FFFF | EB 5 only | 68 | 80000 | < 2.2e-2 | often large number of events lost due to queue flushing |
| 70.2 | 1 | 0 | FF | EB 5 only | 70 | 700 | < 3.5e-4 | reproduce previous measurement |
| 70.2 | 1 | 1 | FF | EB 5 only | 69...72 | 1072...60000 | < 3.0e-2 | reproduce previous measurement |
| 70.2 | 1 | 1 | FFF | EB 5 only | 71 | 1000 | < 3.3e-4 | |
| 70.2 | 1 | 0 | FFF | EB 5 only | 69-72 | 1400 | < 4.7e-4 | |
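
The "Discarded fraction" values appear to be the discarded events divided by the events per run, with the events per run scaling with the number of EBs enabled in the EB mask. The Python sketch below reproduces that arithmetic under two assumptions that are inferred from the numbers and not stated explicitly: about 2e6 events per run for mask FF (8 EBs), and linear scaling with the number of EBs. All names are illustrative.

```python
EVENTS_PER_RUN_FF = 2e6  # assumption: about 2e6 events per run with mask FF
EBS_IN_MASK_FF = 8       # mask FF enables 8 eventbuilders


def active_ebs(mask_hex):
    """Count the eventbuilders enabled in a hexadecimal EB mask."""
    return bin(int(mask_hex, 16)).count("1")


def discarded_fraction(discarded_events, mask_hex):
    """Estimate the discarded fraction for one run.

    Assumption: events per run scale linearly with the number of EBs,
    because each run ends when EB1 exceeds its file size limit.
    """
    events_per_run = EVENTS_PER_RUN_FF * active_ebs(mask_hex) / EBS_IN_MASK_FF
    return discarded_events / events_per_run


for discarded, mask in [(537, "FF"), (1000, "FFF"), (88000, "FFFF")]:
    print(mask, active_ebs(mask), f"{discarded_fraction(discarded, mask):.1e}")
```

With these assumptions, 537 discarded events at mask FF, 1000 at mask FFF and 88000 at mask FFFF give fractions close to the tabulated upper bounds of 2.7e-4, 3.3e-4 and 2.2e-2.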
 

Further observations and Conclusions:

  • RFIO does affect the lost event rate under extreme conditions (CTS permanently at 70 kHz). With the standard setup (EB mask FF, one DABC eventbuilder EB5), the loss is up to a factor of 100 higher than with plain disk IO. However, the event loss is still < 3%.
  • The DABC eventbuilders are not yet well tuned concerning CPU balancing. With 4 or 6 active DABC eventbuilders, eventbuilder rates may drop to 40 kHz with 25% event loss! This may also be a side effect of the currently high DABC debug output. Note that the old EB processes are pinned to well-defined CPUs. Further improvements may be possible here, but not before the beamtime.
  • Using one DABC eventbuilder on lxhadeb05 only does not show lower performance than daq_netmem/daq_evtbuild.
  • Distributing the load over more eventbuilder processes (EB masks FFFF or FFF) does not necessarily improve performance:
    • With mask FFFF (all 16 EBs) the event loss rate gets even worse than for FF, even without RFIO. This may be because the EBs still share the same network interface.
    • Mask FFF (12 EBs) improves the situation for RFIO, though: the netmem queue flushes that lead to a large number of lost events happen less often. The loss rate is below 1% both for RFIO and for plain disk IO.
  • Probably 12 EBs with RFIO, and EB5 with DABC/monitoring streamserver, is a good setup for the beamtime.
  • Adjusting the ethernet interrupt affinities with the set_eth_affinity_lxhadeb0i.pl scripts works initially, but after a while the system resets the affinities to some default CPUs! This behaviour is probably due to the Linux upgrade of the EB servers to Debian 7. TODO: find out whether the default affinity balancing over the first 8 cores is worse than setting eth3 and eth4 both to a single higher core that is not reserved for EB processes. At first glance, performance is not affected by this.
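
For the affinity TODO above, it may help to recall that Linux expresses an interrupt affinity as a hexadecimal CPU bitmask (the value found in /proc/irq/<irq>/smp_affinity). The Python sketch below only computes such masks for a list of cores. It is not the content of the set_eth_affinity_lxhadeb0i.pl scripts, which is not shown here, and the core numbers in the example are purely illustrative.

```python
def affinity_mask(cores):
    """Return the hexadecimal CPU bitmask for the given core numbers."""
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core numbers must be non-negative")
        mask |= 1 << core  # bit n selects core n
    return format(mask, "x")


print(affinity_mask(range(8)))  # ff   -> balancing over the first 8 cores
print(affinity_mask([13]))      # 2000 -> one single higher core (example value)
```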

-- JoernAdamczewski - 25 Jun 2014

After tuning of cpu affinities on 26 June 2014

Changes:
  • Adjusted the list of free cores in start_eb_gbe.pl for machines lxhadeb02/03/04: cores 0...5 are reserved for eth interrupts, cores 6...12 are available for eventbuilder processes
  • Disabled set_eth_affinity_lxhadeb0i.pl in move_doublecpu_irq.sh for lxhadeb02/03/04

Result: the eventbuilders no longer share cores with the eth0 interrupts, and no core runs at 99% load any more!
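
The core partitioning described above can be illustrated with a short Python sketch. It assumes a simple round-robin assignment of eventbuilder processes to the free cores. How start_eb_gbe.pl actually picks a core is not documented here, so the assignment policy and all names are hypothetical.

```python
from itertools import cycle

RESERVED_FOR_ETH = range(0, 6)  # cores 0...5: eth interrupts
FREE_FOR_EB = range(6, 13)      # cores 6...12: eventbuilder processes


def assign_cores(eb_numbers, free_cores=FREE_FOR_EB):
    """Map each eventbuilder number to one free core, round robin (assumed policy)."""
    cores = cycle(free_cores)
    return {eb: next(cores) for eb in eb_numbers}


# Eventbuilders hosted on lxhadeb02 according to the hardware list.
print(assign_cores([2, 6, 10, 14]))  # {2: 6, 6: 7, 10: 8, 14: 9}
```

On Linux, a Python process could apply such a pinning with os.sched_setaffinity(pid, {core}). The production scripts may use a different mechanism.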

Data

Setup: 31 subsystems. Trigger: 1 MHz pulser. One run has about 2e6 (2 million) events in total (for 16 EBs), with 400 Mb/s write bandwidth.

Note that the number of events per run changes with the number of eventbuilders used! The run is always terminated when EB1 exceeds 1.5 GByte, so the total amount of data written scales with the number of EBs.

The CTS trigger rate is lower than yesterday at the same pulser setting, so the frontend setup was probably changed in the meantime?

| CTS Trigger (kHz) | Disk IO | RFIO | EB mask | DABC | Eventbuilder total kEv/s | Discarded events / run | Discarded fraction | Comments |
| 67 | 1 | 1 | FF | EB 5 only | 67 | 350 | <1.8e-4 | |
| 67 | 1 | 1 | FFF | EB 5 only | 67 | 1300 | <4.3e-4 | |
| 67 | 1 | 1 | FFFF | EB 5 only | 67 | 1900 | <4.8e-4 | |
 

Further observations and Conclusions:

  • Tuning the CPU affinities mostly reduces the event loss rate with RFIO enabled.
  • At 67 kHz event rate there is an average event loss of about 1 Hz in the complete system, i.e. it appears round robin on different eventbuilders, depending on the EB mask. This was debugged and can be explained by the known event counter overflow error, which is currently not handled by the eventbuilder software (working with 16 bit only would decrease the recovery stability of the system). Some debug output proving this is in the file log_lostevents.txt.
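
As a rough plausibility check of the overflow explanation, the short Python sketch below computes how often a counter wraps at a given event rate. It assumes that the relevant event counter is 16 bits wide, as the remark above suggests. Whether exactly one event is lost per wrap is not stated here and is not assumed by the code.

```python
def counter_wrap_rate_hz(event_rate_hz, counter_bits=16):
    """Return how many times per second a counter of the given width wraps."""
    return event_rate_hz / (1 << counter_bits)


rate = counter_wrap_rate_hz(67_000)
print(f"{rate:.2f} wraps per second")  # 1.02 wraps per second
```

At 67 kHz a 16-bit counter wraps roughly once per second, which is the same order of magnitude as the observed loss of about 1 Hz.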


-- JoernAdamczewski - 26 Jun 2014
Attachment: log_lostevents.txt (11 K, 2014-06-26 12:47, JoernAdamczewski). Collected output of hadeslog/messages and debug of EB2 for lost events due to overflow.
Topic revision: r2 - 2014-06-26, JoernAdamczewski