Progress in virtually all areas of physics research relies on recording and analyzing enormous amounts of data. Recent improvements in detector instrumentation provide unprecedented detail to researchers, but data rates far outpace improvements in the performance of storage systems. Online data processing and reduction is therefore crucial for the next generation of detector systems. This is rarely a challenge for large international experiments, which can hire hundreds of engineers to develop custom data acquisition systems. High-bandwidth detectors, however, are becoming increasingly available to small-scale instruments that are developed and operated by a single group or a small collaboration. Such collaborations often lack both the expertise and the resources to develop the required online systems.
We see Infrastructure-as-a-Service (IaaS) and Platform-as-a-Service (PaaS) technologies as core building blocks for a common framework that pushes data from the detector directly into the local computing center and relies on HPC resources for online data processing and reduction. This requires close collaboration between the groups developing novel detectors and the computing centers operating HPC clusters.
We see several major advantages of the described approach:
- Reduce the cost and effort of designing, building, and maintaining data processing clusters.
- Allow users to focus on experiment-specific data processing, using common components for data-flow organization.
- Rely on the hardware and computing expertise available at supercomputing centers.
- Allow resource sharing. Each experiment is allocated a set of dedicated resources, but a much larger pool of shared resources can be requested during load spikes.
Technology
We at IPE aim to bridge the gap between detector development groups and the computing centers operating HPC clusters. We have started a project to develop the technologies required to organize a fast and reliable data flow between the detectors and the HPC infrastructure, allowing us to rely on HPC resources for data processing and reduction.
Ethernet is already becoming the interface of choice for high-speed detector systems. Rapid advances in Ethernet technology provide sufficient readout bandwidth, but efficient data distribution methods relying on RDMA technologies are required to utilize the network capacity fully. One of the major challenges is to develop mechanisms that prevent data loss due to unavoidable network and hardware failures. A good compromise between system reliability and resource overhead may be achieved by enabling cooperation between the detector firmware and the HPC middleware: additional operational information from the detectors helps the middleware to manage resources and steer the data flow efficiently. Furthermore, a distributed data processing framework is required to simplify the development of scalable data reduction modules. The major challenge is, on the one hand, to give users full flexibility in their choice of technologies and, on the other, to ensure that the developed software can easily be migrated between nodes and co-exist with software from other experiments that may rely on a very different set of technologies. In particular, the framework is expected to facilitate the deployment of extremely complex machine learning models which can be executed across multiple nodes and accelerated using FPGAs, GPUs, and/or custom neuro-computers.
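As a rough illustration of the intended data-flow organization, the following Python sketch fans detector frames out to a set of reduction workers. ZeroMQ is used here purely as an assumed stand-in transport (the project targets RDMA-based data distribution), and the threshold-based reduction is a placeholder for experiment-specific code; endpoint addresses and frame sizes are likewise illustrative.

```python
# Minimal sketch: one sender process pushes raw detector frames, any number of
# worker processes pull them and apply an experiment-specific reduction step.
import sys
import numpy as np
import zmq


def sender(endpoint="tcp://*:5555", n_frames=100):
    """Push raw detector frames to the reduction workers."""
    sock = zmq.Context.instance().socket(zmq.PUSH)
    sock.bind(endpoint)
    for _ in range(n_frames):
        # Placeholder for a frame arriving from the detector readout.
        frame = np.random.randint(0, 4096, size=(512, 512), dtype=np.uint16)
        sock.send(frame.tobytes(), copy=False)


def worker(endpoint="tcp://localhost:5555"):
    """Receive frames and reduce them before forwarding to storage."""
    sock = zmq.Context.instance().socket(zmq.PULL)
    sock.connect(endpoint)
    while True:
        frame = np.frombuffer(sock.recv(), dtype=np.uint16).reshape(512, 512)
        # Placeholder reduction: keep only pixels above a threshold.
        reduced = frame[frame > 2048]
        print(f"kept {reduced.size} of {frame.size} pixels")


if __name__ == "__main__":
    sender() if sys.argv[1:] == ["send"] else worker()
```

Because the workers share a single pull endpoint, additional reduction nodes can be attached or removed at runtime without changes to the sender, which is the kind of elasticity we expect from the shared HPC resources.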
We focus on data-intensive applications on cloud platforms and on possible extensions of the cloud middleware to enable cooperation with the detector electronics. The anticipated areas of research include:
- IaaS (VMware, oVirt, and KVM) and PaaS (OpenShift/Kubernetes) cloud infrastructure for data-intensive applications (see the deployment sketch after this list)
- Distributed file systems for data-intensive workloads (GlusterFS, Ceph, BeeGFS)
- Low-latency InfiniBand and Ethernet networking using RDMA and RoCE technologies
- Optimized communication of CRI-O containers within the Kubernetes infrastructure
- Scientific workflow engines for cloud environments
- HPC and database workloads in cloud environments
- Desktop image analysis applications in cloud environments
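To illustrate how an experiment-specific reduction module could be deployed on such a PaaS infrastructure, the sketch below creates a Kubernetes Deployment through the official Python client. The image name, namespace, and replica count are hypothetical placeholders; a real setup would additionally configure storage, networking, and resource limits.

```python
# Minimal sketch, assuming access to a Kubernetes/OpenShift cluster via a
# local kubeconfig. Image, namespace, and replica count are placeholders.
from kubernetes import client, config


def deploy_reduction_service(name="frame-reduction",
                             image="example.org/experiment/reduction:latest",
                             namespace="experiments", replicas=4):
    """Run several replicas of a containerized reduction module."""
    config.load_kube_config()  # use the credentials from the local kubeconfig
    container = client.V1Container(name=name, image=image)
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": name}),
        spec=client.V1PodSpec(containers=[container]),
    )
    spec = client.V1DeploymentSpec(
        replicas=replicas,
        selector=client.V1LabelSelector(match_labels={"app": name}),
        template=template,
    )
    deployment = client.V1Deployment(
        api_version="apps/v1",
        kind="Deployment",
        metadata=client.V1ObjectMeta(name=name),
        spec=spec,
    )
    client.AppsV1Api().create_namespaced_deployment(namespace=namespace,
                                                    body=deployment)


if __name__ == "__main__":
    deploy_reduction_service()
```

Packaging the reduction code as a container image in this way is what allows it to be migrated between nodes and to co-exist with software from other experiments that relies on a different technology stack.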
Projects
- KATRIN
- PANDA
- ROOF