Eventbuilder status in 2014
This page describes the actual setup before the beamtimes Jul14 and Aug14.
Eventbuilder servers
Hardware
- lxhadesdaq: central server machine; uses the hardware of the previous lxhadeb01
- lxhadesdaqold: hardware of the previous lxhadesdaq. Not enabled by default
- lxhadeb02: eventbuilders 2, 6, 10, 14; EB 2 is the monitoring server
- lxhadeb03
- lxhadeb04
- lxhadeb05: master event builder EB1
Eventbuilder software
Only the changes are described here.
Besides the standard production software daq_evtbuild/daq_netmem, there is a DABC installation at /home/hadaq/soft/dabc. Which eventbuilders run with DABC can be switched in /home/hadaq/trbsoft/hadesdaq/evtbuild/eb.conf with the tag:
DABC: 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
Setup for beamtime: only EB5 runs with DABC, with streamserver output enabled.
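As an illustration of how the DABC tag can be read, here is a minimal Python sketch. It assumes that the 16 flags correspond to EB1 to EB16 in order, which matches the beamtime setup above (fifth flag set, EB5 on DABC). The function name dabc_eventbuilders is hypothetical and is not part of the DAQ scripts.

```python
def dabc_eventbuilders(line):
    """Return the 1-based numbers of the eventbuilders flagged for DABC.

    Assumption: the n-th flag after "DABC:" belongs to eventbuilder EBn.
    """
    tag, _, values = line.partition(":")
    if tag.strip() != "DABC":
        raise ValueError("not a DABC line: %r" % line)
    flags = [int(token) for token in values.split()]
    return [number for number, flag in enumerate(flags, start=1) if flag == 1]


line = "DABC: 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0"
print(dabc_eventbuilders(line))  # [5]
```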
Configuration
Eventbuilder configuration is done by central files on lxhadesdaq:
- /home/hadaq/trbsoft/hadesdaq/evtbuild/eb.conf
- /home/hadaq/trbsoft/hadesdaq/hub/register_configgbe_ip.db
- /home/hadaq/trbsoft/hadesdaq/main/data_sources.db
Special configuration templates for DABC are on each eventbuilder server at /home/hadaq/oper/EventBuilderHades.xml
The logfile for DABC is /home/hadaq/oper/EB_i.log, where i is the number of the eventbuilder process.
RFIO and DABC on 25 June 2014
Goal: find out how disk and tape I/O and DABC influence eventbuilder performance and event loss.
Data
Setup: 31 subsystems; trigger: 1 MHz pulser; one run has about 2e6 (2 million) events in total, with 400 Mb/s write bandwidth.
Note that the number of events per run changes with the number of eventbuilders used! The run is always terminated when EB1 exceeds 1.5 GByte, so the total amount of data written scales with the number of EBs.
The measurement was done by observing the rates on the eventbuilder EPICS GUI. The file write rate was not evaluated! These are manual estimates only:
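The scaling described in this note can be sketched in a few lines of Python. This is a rough estimate only: it assumes that every active eventbuilder has written about as much as EB1 when the 1.5 GByte limit is reached, which is not verified here. The function name is hypothetical.

```python
EB1_LIMIT_GBYTE = 1.5  # run stops when the EB1 file exceeds this size


def total_run_size_gbyte(n_eventbuilders, limit_gbyte=EB1_LIMIT_GBYTE):
    """Estimate the total data volume of one run.

    Assumption: all eventbuilders receive an equal share of the events,
    so each has written roughly limit_gbyte when EB1 hits the limit.
    """
    return n_eventbuilders * limit_gbyte


for n in (8, 12, 16):
    print(n, "EBs ->", total_run_size_gbyte(n), "GByte per run")
```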
Further observations and Conclusions:
- RFIO does affect the lost event rate under extreme conditions (70 kHz CTS rate permanently). With the standard setup (EB mask FF, one DABC eventbuilder EB5) the loss rate is a factor of 100 higher than with simple disk I/O. However, the event loss is still < 3%.
- The DABC eventbuilders are not yet well tuned concerning CPU balancing. With 4 or 6 active DABC eventbuilders, the eventbuilder rates may drop to 40 kHz with 25% event loss! This may also be a side effect of the currently high DABC debug write output. Note that the old EB processes are pinned to well-defined CPUs. Further improvements may be possible here, but not before beamtime.
- Using one DABC eventbuilder on lxhadeb05 only does not show lower performance than daq_netmem/daq_evtbuild.
- Distributing the load to more eventbuilder processes (EB masks FFFF or FFF) does not necessarily improve performance:
- With mask FFFF (all 16 EBs) the event loss rate gets even worse than for FF, even without RFIO. This may be because the EBs still share the same network interface.
- Mask FFF (12 EBs) improves the situation for RFIO though: the netmem queues are flushed less often, and these flushes are what lead to a large number of lost events. The loss rate is below 1% both for RFIO and for simple disk I/O.
- Probably 12 EBs with RFIO and EB5 with DABC/monitoring streamserver is a good setup for beamtime.
- Adjusting the ethernet interrupt affinities with the set_eth_affinity_lxhadeb0i.pl scripts works initially, but after a while the system resets the affinities to some default CPUs! This behaviour is probably due to the Linux upgrade of the EB servers to Debian 7. TODO: find out whether the default affinity balancing over the first 8 cores is worse than setting eth3 and eth4 all to a single higher core that is not reserved for EB processes. At first glance, performance is not affected by this.
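The EB masks quoted in the observations above (FF, FFF, FFFF) can be decoded with a short Python sketch. It assumes that bit n of the mask (counting from 0) enables eventbuilder EB(n+1); this agrees with the counts given above (FFF is 12 EBs, FFFF is 16 EBs), but the exact bit order is an assumption. The function name is hypothetical.

```python
def eventbuilders_from_mask(mask_hex):
    """Return the 1-based numbers of the eventbuilders enabled by a hex mask.

    Assumption: the least significant bit corresponds to EB1.
    """
    mask = int(mask_hex, 16)
    return [bit + 1 for bit in range(mask.bit_length()) if (mask >> bit) & 1]


for mask in ("FF", "FFF", "FFFF"):
    ebs = eventbuilders_from_mask(mask)
    print(mask, "->", len(ebs), "EBs:", ebs)
```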
--
JoernAdamczewski - 25 Jun 2014
After tuning of cpu affinities on 26 June 2014
Changes:
- Adjusted the list of free cores in start_eb_gbe.pl for machines lxhadeb02/03/04: cores 0...5 are reserved for eth interrupts, cores 6...12 are available for eventbuilder processes
- Disabled set_eth_affinity_lxhadeb0i.pl in move_doublecpu_irq.sh for lxhadeb02/03/04
Result: eventbuilders no longer share cores with the eth0 interrupts, and no core shows 99% load any more!
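The core partitioning described in the changes above can be sketched in Python as follows. This is not the logic of start_eb_gbe.pl, which is not shown here; it only illustrates the idea of keeping the interrupt cores and the eventbuilder cores disjoint. All names are hypothetical. The hexadecimal masks are in the bitmask form that Linux uses for CPU sets, for example in /proc/irq/<n>/smp_affinity.

```python
import os

ETH_IRQ_CORES = range(0, 6)   # cores 0...5, reserved for eth interrupts
EB_CORES = range(6, 13)       # cores 6...12, free for eventbuilder processes


def cpu_mask_hex(cores):
    """Return the hexadecimal CPU bitmask with one bit set per core."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return format(mask, "x")


def assign_cores(n_processes, cores=EB_CORES):
    """Spread process indices 0..n-1 round robin over the free cores."""
    cores = list(cores)
    return {proc: cores[proc % len(cores)] for proc in range(n_processes)}


def pin_process(pid, core):
    """Pin a running process to a single core (Linux only, not called here)."""
    os.sched_setaffinity(pid, {core})


print("irq mask:", cpu_mask_hex(ETH_IRQ_CORES))  # 3f
print("eb mask: ", cpu_mask_hex(EB_CORES))       # 1fc0
print(assign_cores(4))                           # {0: 6, 1: 7, 2: 8, 3: 9}
```

With four eventbuilder processes per server, as listed for lxhadeb02, each process gets its own core and none of them overlaps with cores 0...5.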
Data
Setup: 31 subsystems; trigger: 1 MHz pulser; one run has about 2e6 (2 million) events in total (for 16 EBs), with 400 Mb/s write bandwidth.
Note that the number of events per run changes with the number of eventbuilders used! The run is always terminated when EB1 exceeds 1.5 GByte, so the total amount of data written scales with the number of EBs.
Has the frontend setup been changed since yesterday? The CTS trigger rate is lower at the same pulser setting.
Further observations and Conclusions:
- Tuning the CPU affinities mainly reduces the event loss rate when RFIO is enabled.
- At an event rate of 67 kHz there is an average event loss of about 1 Hz in the complete system, i.e. it appears round robin on different eventbuilders depending on the EB mask. This was debugged and can be explained by a known event counter overflow error that the eventbuilder software does not handle at the moment (working with 16 bits only would decrease the recovery stability of the system). Some debug output demonstrating this is in the file log_lostevents.txt.
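A quick plausibility check of the 1 Hz figure, as a Python sketch. It assumes that the overflowing event counter is 16 bits wide and that one event is lost per wrap-around. Neither assumption is confirmed by the notes above; the sketch only shows that they would give a loss rate of the observed order.

```python
def wraps_per_second(event_rate_hz, counter_bits=16):
    """Return how often a counter of the given width overflows per second.

    Assumption: the counter increments once per event and wraps at 2**bits.
    """
    return event_rate_hz / float(2 ** counter_bits)


rate = wraps_per_second(67e3)
print(round(rate, 2), "overflows per second at 67 kHz")  # 1.02
```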
--
JoernAdamczewski - 26 Jun 2014