High-Performance Data Processing for Large-Scale Scientific Experiments
-
job categories:
Master Thesis / Internship
-
job posting number:
IPE-PDV1-2025
-
institute:
IPE
-
starting date:
by appointment
-
contact person:
High-Performance Data Processing for Large-Scale Scientific Experiments
Modern scientific experiments, such as the KARA synchrotron radiation facility and the KATRIN experiment, generate massive, high-dimensional datasets that challenge current data processing frameworks. Efficient storage, retrieval, and analysis of this data are crucial for scientific discovery but remain a significant bottleneck. This project aims to evaluate and develop novel approaches to high-performance data processing by leveraging advanced compression techniques, optimized database architectures, and system-level enhancements. The focus is on improving scalability and efficiency for multi-dimensional time-series data and 3D volumetric datasets, with the goals of reducing storage overhead, shortening retrieval times, and improving performance in large-scale scientific workflows.
Depending on the student's background and interests, the project may focus on one or more of the following:
Database Optimization: Investigating high-performance storage solutions such as TileDB and ClickHouse to improve query execution and scalability.
High-Performance Computing (HPC): Exploring parallel processing techniques, pipeline-based workflows, and optimizations for Linux-based clusters to enhance computational efficiency.
Performance Benchmarking: Extending and refining our in-house tool (SciTS) to systematically assess ingestion throughput, query latency, and scalability on large-scale scientific datasets.
AI-Based Techniques: Investigating machine learning approaches for adaptive compression and intelligent query optimization.
Required Skills
Experience with modular software development in C, Python, or C#.
Familiarity with database architectures (relational systems, or newer designs such as columnar or array-based storage).
Experience with Linux-based environments and system-level optimization. Cloud (Kubernetes) experience is a plus.
Duration & Collaboration
The project is expected to last six months or longer and offers collaboration across multiple disciplines.