Event Builder Development
The aim of the Event Builder (EB) is to receive subevents from the subsystems and to build complete events out of them. The EB consists of two parts: daq_netmem (the receiving part) and daq_evtbuild (the building part). daq_netmem and daq_evtbuild communicate via shared memory. The number of buffers (queues) opened in shared memory is equal to the number of subsystems. A completed event can be written to different mass storage systems.
daq_evtbuild and daq_netmem can be configured in three ways:
- take default arguments,
- read arguments from the eb_p.tcl configuration file ($DAQ_SETUP should be set),
- read arguments from the command line.
daq_evtbuild must be started before daq_netmem, since daq_evtbuild is the one that opens the buffers in shared memory.
daq_evtbuild can be executed with the following options:
- [-x expId]
- [-m nrOfMsgs] should correspond to the number of subsystems
- [-f slowCtrlFile ...] not in use
- [-r runNumber]
- [-a (agent)] not in use
- [-p priority]
- [-I evtId] set event Id. It will be written to an event header
- [-v debug|info|notice|warning|err|alert|crit|emerg] levels of debug info
- [--norpc] not in use
- [-q queueSize] the size of the EB buffers
- [-d null|tape|file|stdout] where the data should be written. "null" = write to /dev/null, "file" = write to a file on the disk.
- [-o outPath] output path on the disk
- [--filesize max_file_size] maximum size of output file in MB
- [--rfio path_to_tape_archive] write to tape via RFIO. Path format: rfiodaq:gstore://...
- [--lustre path_to_lustre] write to the Lustre cluster (if mounted, of course)
- [--resdownscale down_factor] write events to resfile with this downscaling factor (for Remote Event Server)
- [--resnumevents evt_num] number of events in one resfile
- [--respath path] write resfile to this path
- [--ressizelimit file_num] maximum number of resfiles in the directory
- [--secsizelimit max_size] maximum size of a second directory in MB, where the mirrored data can be written
- [--write_data path] path to a second directory with mirrored data
- [--epicsctrl] enable synch and distribution of RUN Id by Epics for parallel event building (requires ioc running on the event builders)
- [--buffstat] show fill levels of buffers on standard output
- [-S shmname] an extension of the names of the shared memory segments. Allows starting many EBs on the same machine. Should be given for both daq_evtbuild and daq_netmem.
- [-O|--orapath path] path to eb_runinfo2ora.txt (provide this path if you do not want to write RUN info into the standard $DAQ_SETUP directory).
- [-i|--ignore] ignore trigger mismatch conditions. This will allow for running Event Builder ignoring all the trigger tag mismatches.
- [--online] switch on online service (default off)
- The options -m and -q can be read from the eb_p.tcl configuration file and need not be specified on the command line. The variable queue sizes (varqsize) can be read only from eb_p.tcl. To enable variable queue sizes, the standard queue size must be set to zero: -q 0.
- If the -S option is used to start several EBs, daq_sniff will sniff the data from the most recently started EB only. Workaround: write the data to disk and inspect it with daq_anal xxxxxxx.hld | less
daq_netmem can be executed with the following options:
- [-i inPath] format: UDP:0.0.0.0:port_num, where "0.0.0.0" is a dummy IP and "port_num" is the receiving port number.
- [-m nrOfMsgs] should correspond to the number of subsystems
- [-p priority]
- [-b] show fill levels of buffers on standard output
- [-S shmname] an extension of the names of the shared memory segments. Allows starting many EBs on the same machine. Should be given for both daq_netmem and daq_evtbuild.
Important: the options -m and -i can be read from the eb_p.tcl configuration file and need not be specified on the command line.
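The -i inPath format described above (UDP:0.0.0.0:port_num) can be illustrated with a small parser. This is only a sketch: parse_udp_inpath is a hypothetical helper and not the code used in netmem.c, and the validity checks (address bytes up to 255, port between 1 and 65535) are assumptions.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not the parser used in netmem.c.
 * Extracts the port from a string of the form "UDP:0.0.0.0:port_num".
 * Returns the port (1..65535) on success, -1 on malformed input. */
int parse_udp_inpath(const char *in_path)
{
    unsigned a, b, c, d, port;
    char tail;

    assert(in_path != NULL);
    /* The trailing %c catches garbage after the port: a well-formed
     * string yields exactly 5 conversions, never 6. */
    if (sscanf(in_path, "UDP:%u.%u.%u.%u:%u%c",
               &a, &b, &c, &d, &port, &tail) != 5)
        return -1;
    if (a > 255 || b > 255 || c > 255 || d > 255)
        return -1;
    if (port == 0 || port > 65535)
        return -1;
    return (int)port;
}
```

Called with "UDP:0.0.0.0:50534" the helper returns 50534; a string without a port, or with a port above 65535, gives -1.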
New features and some info:
The limitation of maximum 32 subsystems.
An Event Builder was able to accept only up to 32 subsystems. This was due to the length of the bit mask (of type unsigned long) used by a function in nettrans.c. The new function does not use a bit mask and should be able to handle any number of subsystems.
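Why a bit mask caps the number of subsystems can be sketched as follows. Both functions are illustrative and are not the functions from nettrans.c; uint32_t stands in for unsigned long as it is on a 32-bit platform, and the per-subsystem flag array is just one possible mask-free bookkeeping, assumed here for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mask-based bookkeeping (illustrative): one bit per subsystem in a
 * 32-bit word. Returns 1 if all n subsystems delivered their subevent.
 * Only valid for n <= 32, which is where the limit comes from. */
int all_received_mask(uint32_t received, unsigned n)
{
    assert(n >= 1 && n <= 32);
    /* Shifting a 32-bit value by 32 is undefined, so n == 32 is special. */
    uint32_t expected = (n == 32) ? UINT32_MAX : ((UINT32_C(1) << n) - 1);
    return (received & expected) == expected;
}

/* Mask-free bookkeeping (illustrative): one flag per subsystem in an
 * array, so n is limited only by available memory. */
int all_received_flags(const unsigned char *received, size_t n)
{
    assert(received != NULL);
    for (size_t i = 0; i < n; i++) {
        if (!received[i])
            return 0;
    }
    return 1;
}
```

With the array variant, a setup with 60 sources is handled the same way as one with 3.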
Direct connection to the Data Mover.
RFIO is not in the automake since this feature is rarely needed (during beamtime), thus configure will create Makefile without RFIO libs. To compile with RFIO do the following after running configure:
- In evtbuild.c uncomment: #define RFIO
- Check if rawapin.h, rawcommn.h, rawclin.h are in "include" dir
- Add two libs to Makefile: LIBS = ... -lrawapiclin -lrawservn (or -lrawapiclin64 -lrawservn64)
In the future DAQ we might want the option to send the data from the Event Builder directly to gStore. This is realized via RFIO.
The libraries and test programs prepared by Horst Goeringer are located here:
- mrawWriteLoopSampleFulln.c - this program sends local file in loop to gStore using RFIO
- /GSI/staging/adsm/v51/rfioclient/Linux/ - corresponding make files
- /GSI/staging/adsm/v51/Linux/ - libraries. We need two libraries: librawapiclin.a and librawservn.a
- /GSI/staging/adsm/v51/Linux64/ - libraries for 64-bit platform
- /GSI/staging/adsm/v51/inc/ - headers
An RFIO client implemented in the Event Builder (evtbuild.c) has the following functionality:
- rfio_openConnection - open connection to the Data Mover
- openFile, writeFile, closeFile
- rfio_closeConnection - close connection to the Data Mover
The general purpose functions (openFile, writeFile, closeFile) were extended to include RFIO functionality and to ensure the uniqueness of runNr, seqNr and file name.
The tests of the RFIO are still going on. The test path is: rfiodaq:gstore:/hadaqtest/path/file, where hadaqtest is an archive name for the tests.
New (future) Interface to RFIO server
HADES Event Builders will write the full data to tape via Data Movers. We also want to write some fraction of the data to Lustre. The latter feature can be integrated into the RFIO server on the Data Movers. To fulfil this, the RFIO interface should be extended. The following three parameters are to be passed to gstore:
- Lustre path. For example "/lustre_alpha/hades/beam/sep08/d", where "/lustre_alpha/hades/beam/sep08" is the existing path and "d" is the prefix of a directory that does not exist yet. The RFIO server should append to this prefix a time stamp of the form yydddhhmm (yy - year, ddd - day of the year, hh - hour, mm - minutes). The complete path then looks like "/lustre_alpha/hades/beam/sep08/d090231634". If the directory "d090231634" does not exist, the RFIO server should create it.
- Number of files per directory. This is needed to avoid a huge number of files in one directory.
- Fraction of files in the main stream to be written to Lustre.
- 0 - nothing
- 1 - each file
- 2 - every second file
- 3 - every third file and so on
Additionally, if the connection to tape breaks, the RFIO server will automatically start writing each file to Lustre, independently of the setting of the third parameter.
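The fraction rule and the tape fallback described above can be sketched as a single decision function. This is a hedged illustration, not RFIO server code: copy_to_lustre is a hypothetical name, and counting files from 1 so that "every second file" means files 2, 4, 6, ... is an assumption.

```c
#include <assert.h>

/* Hypothetical decision helper, not part of the RFIO server.
 * file_index : 1-based position of the file in the main stream (assumed)
 * fraction   : 0 = nothing, 1 = each file, 2 = every second file, ...
 * tape_ok    : nonzero while the connection to tape is alive
 * Returns 1 if the file should be written to Lustre, 0 otherwise. */
int copy_to_lustre(unsigned long file_index, unsigned fraction, int tape_ok)
{
    assert(file_index >= 1);
    if (!tape_ok)
        return 1;   /* tape connection broken: every file goes to Lustre */
    if (fraction == 0)
        return 0;   /* nothing is copied while tape works */
    return (file_index % fraction) == 0;
}
```

For fraction 3 and a working tape connection, files 3, 6, 9, ... are selected; once tape_ok is 0, every file is selected regardless of the fraction.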
We might need a couple more parameters for fine tuning.
This function implements the extended RFIO interface:
FILE* rfio_fopen_gsidaq( char *pcFile, char *pcOptions, int iCopyMode, char *pcCopyPath, int iCopyFraction, int iMaxFile, int iPathConvention)
- pcFile : base name ("rfiodaq:gstore:")
- pcOptions : some options ("wb")
- iCopyMode :
- 0 : standard RFIO, ignore following arguments.
- 1 : copy the data to pcCopyPath after the file is written to a write cache (this is for the high data rates).
- 2 : for lustre only
- pcCopyPath :
- lustre path ("/lustre/hades/daq/test"). If the path does not exist, it will be created according to the parameter iPathConvention.
- "RC" : read cache
- iCopyFraction :
- 0 : write only to a tape.
- i (>0) : copy each i-th file to lustre (pcCopyPath). If migration to a tape fails, ignore iCopyFraction and copy each file to lustre.
- iMaxFile :
- 0 : no file number limit.
- i (>0) : maximum number of files to be written to a directory (files already sitting in the directory are ignored). When iMaxFile is reached, a new directory at the same level is created according to the parameter iPathConvention.
- iPathConvention :
- 0 : default convention "/hadaqtest/test" => "/hadaqtest/test", next "/hadaqtest/test" => "/hadaqtest/test1", next "/hadaqtest/testi" => "/hadaqtest/test(i+1)"
- 1 : HADES convention "/hadaqtest/test" => "/hadaqtest/testyydddhhmm"
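The two iPathConvention schemes can be sketched as two small formatting helpers. These are illustrative only: the function names are hypothetical, the RFIO server's own implementation is not part of this page, and the time stamp is taken from a struct tm supplied by the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Convention 0 (illustrative): the first directory is the base path
 * itself, later ones get a running number ("test", "test1", "test2", ...).
 * Returns the number of characters written, or -1 if out is too small. */
int dir_name_default(const char *base, unsigned index, char *out, size_t size)
{
    int n;
    assert(base != NULL && out != NULL);
    if (index == 0)
        n = snprintf(out, size, "%s", base);
    else
        n = snprintf(out, size, "%s%u", base, index);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}

/* Convention 1 (HADES, illustrative): append a yydddhhmm time stamp.
 * tm_year counts from 1900 and tm_yday from 0, hence "% 100" and "+ 1". */
int dir_name_hades(const char *base, const struct tm *t, char *out, size_t size)
{
    int n;
    assert(base != NULL && t != NULL && out != NULL);
    n = snprintf(out, size, "%s%02d%03d%02d%02d", base,
                 t->tm_year % 100, t->tm_yday + 1, t->tm_hour, t->tm_min);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

With a struct tm describing day 23 of 2009 at 16:34 and the base "/lustre_alpha/hades/beam/sep08/d", dir_name_hades produces "/lustre_alpha/hades/beam/sep08/d090231634", which matches the Lustre path example given earlier.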
Writing to Lustre file system.
The Lustre cluster was mounted on lxhadesdaq. A "write to Lustre" option was added to the Event Builder. Tests showed no effect on the performance of the EB at a rate of 25 MB/s when writing to Lustre.
Port to 64-bit platform.
The hadaq module and the support modules (allParam, compat) were successfully compiled and tested on a 64-bit platform. All the features (RFIO etc.) are working. Apart from a couple of minor fixes, no modifications were required.
The Event Builder was tested with 60 sources (total data rate = 40 MB/s). The total size of the open buffers was 11.5 GB (12 GB is the maximum available on hadeb06). Performance was very good: no discarded events and plenty of room for playing with buffer sizes.
Communication with Oracle
- During each restart of the DAQ, the list of subsystems and the time stamp should be written to the Oracle database.
- RUN Start/Stop information: one can read the time stamp of the last RUN Start from Oracle. The script should then start inserting RUN Start time stamps that are equal to (or newer than) the time stamp of the last RUN Start in Oracle.
- The startup script should collect the following information from all the boards: subevtId, uniqueId. This MUST be done for the TRBs with TDCs, since this table is later used in the analysis to apply the corrections. The complete table with subevtId and uniqueId is then inserted into the Oracle database in one go.
- Each detector must have its own range of subevtIds (min subevtId, max subevtId). We should not shift the range of subevtIds for a given detector. HUBs can have their own subevt range. Subevents sent from HUBs contain subsubevents with proper subevtIds belonging to the particular detectors. The special unpacker function will recognize the subevtId of a HUB, scan the subsubevent headers of the HUB's data, extract the subevtIds from these headers and pass them, together with the corresponding subsubevent data, to the old unpackers.
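The fixed per-detector subevtId ranges lend themselves to a simple table lookup, sketched below. The struct, the function name and any range values used with it are placeholders chosen for illustration; the real ranges and the unpacker code are not given on this page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One fixed, non-overlapping subevtId range per detector (or HUB).
 * Names and numbers filled into this table are placeholders. */
typedef struct {
    const char *detector;
    uint32_t    min_id;
    uint32_t    max_id;
} SubevtRange;

/* Returns the name of the detector owning subevt_id,
 * or NULL if the id falls into no known range. */
const char *detector_for_subevt(const SubevtRange *table, size_t n,
                                uint32_t subevt_id)
{
    assert(table != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (subevt_id >= table[i].min_id && subevt_id <= table[i].max_id)
            return table[i].detector;
    }
    return NULL;
}
```

An unpacker could apply the same lookup twice: first to the subevtId of the subevent, to recognize a HUB, and then to each subevtId found in the subsubevent headers.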
- 24 Feb 2010
module needs allParam
- Configure allParam: CPPFLAGS="-I/home/hadaq/daqsoftware/include" LDFLAGS="-L/home/hadaq/daqsoftware/lib" ./configure --prefix=/home/hadaq/daqsoftware
- Build allParam and install: make; make install
- Configure compat: CPPFLAGS="-I/home/hadaq/daqsoftware/include" LDFLAGS="-L/home/hadaq/daqsoftware/lib" ./configure --prefix=/home/hadaq/daqsoftware
- Build compat and install: make; make install
- Configure hadaq: CPPFLAGS="-I/home/hadaq/daqsoftware/include" LDFLAGS="-L/home/hadaq/daqsoftware/lib" ./configure --prefix=/home/hadaq/daqsoftware
- Build hadaq and install: make; make install
The previous LDFLAGS setting has changed: it does not contain the i686 architecture anymore!
- 02 Dec 2011
Possible problems during configuration:
checking for library containing conParam... no
configure: error: Parameter library not found
- None of the tcl libs listed in configure.in (tcl, tcl8.3, tcl8.2, tcl8.0, tcl7.4) was found. Quick fix: ln -s /usr/lib/libtcl8.4.so /usr/lib/libtcl.so (put whatever version you have instead of libtcl8.4.so).
- Watch out where your tcl includes are. If necessary add: -I/usr/include/tcl8.4
- On SUSE install tcl-dev
evtbuild.c:47:38: fatal error: rawapin.h: No such file or directory
- Edit hadaq/evtbuild.c and comment out the line: #define RFIO
- Error message: lxhadeb04 EB 4 daq_evtbuild: evtbuild.c, 558: fopen: failed to open file /data04/data: Input/output error
- Reason: most likely the /data04 hard disk is broken
- Solution: restart ./daq_disks --exclude 4 on lxhadeb04 to exclude /data04, and set MULTIDISK: 5 in lxhadesdaq:/home/hadaq/trbsoft/daq/evtbuild/eb.conf for the corresponding Event Builder, so that the first hld file is written to /data05 and not to /data04 (instead of 5, any disk number other than 4 can be used)
- Error message: netmem.c, 645: NetTrans_create: failed to create UDP:0.0.0.0:50534: Address already in use
- Reason: port 50534 is already used by another application (most likely by an EPICS IOC)
- Debug: lsof -i | grep 50534 => ebctrl 25724 scs 3u IPv4 8662658 UDP *:50534
- Solution: Close all EBs, close all IOCs, start EBs, start IOCs, close EBs, start EBs. With this sequence the IOCs are started after the EBs, because the IOCs are able to dynamically pick up unused UDP ports. The EBs are then restarted once more, because they need running IOCs for a proper start.
- Error message: No space left on device
- Reason: This error occurs when the event builder application tries to open more than 128 sets of semaphores (when the standard setting is kernel.sem="250 32000 32 128"):
- 250 - SEMMSL - The maximum number of semaphores in a semaphore set
- 32000 - SEMMNS - The maximum number of semaphores in the system
- 32 - SEMOPM - The maximum number of operations in a single semop call
- 128 - SEMMNI - The maximum number of semaphore sets (128 sets mean 64 shared memory segments, since two semaphore sets are required per memory segment. In this case, daq_evtbuild -m 65 will lead to an error)
- Solution: sysctl -w kernel.sem="250 128000 32 512"
- Error message: File exists
- Reason: This error occurs when semaphores left over from a previous execution of daq_evtbuild were not properly cleaned up.
- Solution: Use ipcrm -s semid (or /home/hadaq/bin/ipcrm.pl).
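The semaphore arithmetic behind the "No space left on device" entry above (two semaphore sets per shared memory segment, SEMMNI sets in total) can be checked with a short sketch. Both helpers are hypothetical, and the calculation assumes that no other process on the machine holds semaphore sets. The input string has the same four-field layout as the kernel.sem setting.

```c
#include <assert.h>
#include <stdio.h>

/* Parses the four kernel.sem fields: SEMMSL SEMMNS SEMOPM SEMMNI.
 * Returns 0 on success, -1 if the text does not hold four integers. */
int parse_kernel_sem(const char *text, int out[4])
{
    assert(text != NULL && out != NULL);
    return sscanf(text, "%d %d %d %d",
                  &out[0], &out[1], &out[2], &out[3]) == 4 ? 0 : -1;
}

/* Largest -m value that fits, given two semaphore sets per shared
 * memory segment (assumes no other process uses semaphore sets). */
int max_queues(int semmni)
{
    return semmni / 2;
}
```

For "250 32000 32 128" the limit is 64 queues, so -m 65 fails; for "250 128000 32 512" it rises to 256.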
- 29 Nov 2010