HPC resources for CMS offline computing: An integration and scalability challenge for the Submission Infrastructure

Read original: arXiv:2405.14631 - Published 5/24/2024 by Antonio Perez-Calero Yzquierdo, Marco Mascheroni, Edita Kizinevic, Farrukh Aftab Khan, Hyunwoo Kim, Maria Acosta Flechas, Nikos Tsipinakis, Saqib Haleem

Overview

The computing resource needs of Large Hadron Collider (LHC) experiments are expected to grow significantly in the coming years.
High-Performance Computing (HPC) and Cloud resources will play an increasingly important role, providing a significant fraction of the total compute capacity.
Integrating and scaling the computing infrastructure to meet these demands represents a major challenge for the experiments.

Plain English Explanation

The Large Hadron Collider (LHC) is a massive particle accelerator used by scientists to study the fundamental building blocks of the universe. As the LHC experiments continue to evolve and collect more data, the computing power required to process and analyze this data is expected to grow rapidly.

To meet these increasing demands, the LHC experiments will need to utilize a variety of computing resources, including traditional grid-based systems as well as HPC and cloud computing platforms. Integrating these diverse resources and scaling the computing infrastructure to handle the immense workloads of the LHC's High-Luminosity upgrade (HL-LHC) will be a significant challenge.

The CMS Submission Infrastructure (SI) is the system used by the CMS experiment to manage and distribute its computing tasks. As the computing needs grow, the SI will need to evolve to effectively harness these new types of resources while maintaining the flexibility and efficiency required for the CMS experiment's operations.

Technical Explanation

The CMS SI is built on a set of federated HTCondor pools that currently aggregate about 400,000 CPU cores distributed worldwide. This infrastructure supports the simultaneous execution of more than 200,000 computing tasks for the CMS experiment.
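As a rough illustration of what "federated pools" means for capacity accounting, the sketch below sums cores and running tasks over a few pools. The pool names and per-pool figures are hypothetical, chosen only so the totals land at the scale quoted above; none of this reflects the actual CMS pool layout or any HTCondor API.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str           # hypothetical pool label, not a real CMS pool name
    cpu_cores: int      # cores currently aggregated by this pool
    running_tasks: int  # tasks executing in this pool right now


def federation_totals(pools):
    """Return (total cores, total running tasks) across all pools."""
    total_cores = sum(p.cpu_cores for p in pools)
    total_tasks = sum(p.running_tasks for p in pools)
    return total_cores, total_tasks


# Illustrative numbers only (assumption): they are not taken from the paper.
pools = [
    Pool("pool_a", 300_000, 150_000),
    Pool("pool_b", 70_000, 35_000),
    Pool("pool_c", 30_000, 15_000),
]

cores, tasks = federation_totals(pools)
print(f"{cores} cores, {tasks} running tasks")
```

In a federated design each pool is managed on its own, so a global figure such as this one is a sum over pools rather than a property of any single scheduler.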

Incorporating HPC resources into the CMS computing landscape presents several challenges. HPC centers are much more diverse than traditional grid sites, requiring a more complex integration process. Additionally, evolving the current SI to scale up and effectively utilize the increased compute capacity needed for the HL-LHC phase will be a significant engineering challenge.

To proactively address potential scalability issues, the SI team regularly conducts stress tests to explore the limits of their infrastructure. These tests help the team identify and address any bottlenecks or other issues that could arise as the computing demands continue to grow.
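The idea of probing for a scalability limit can be sketched as a simple load ramp: increase the number of simultaneous jobs step by step and record the last level at which a health metric stays within bounds. This is a minimal sketch, not the SI team's procedure. The metric (a scheduling cycle time), the linear toy model, and all thresholds are assumptions made for illustration.

```python
def cycle_time_seconds(running_jobs):
    """Toy model (assumption): cycle time grows linearly with the job count."""
    return 60.0 + 0.001 * running_jobs


def find_max_scale(measure, start, step, limit, ceiling):
    """Ramp the load in fixed steps up to `ceiling`.

    Returns the last load level whose measured value stayed within `limit`,
    or None if even the starting level exceeds it.
    """
    last_ok = None
    load = start
    while load <= ceiling:
        if measure(load) > limit:
            break  # first level that violates the limit: stop ramping
        last_ok = load
        load += step
    return last_ok


# Hypothetical test parameters: start at 100k jobs, add 50k per step,
# and tolerate cycle times of up to 600 seconds.
max_jobs = find_max_scale(cycle_time_seconds, 100_000, 50_000, 600.0, 1_000_000)
print(max_jobs)
```

In a real test the `measure` callable would be replaced by observations from the running infrastructure, and the step size determines how precisely the limit is located.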

Critical Analysis

The paper highlights the significant computing challenges faced by the LHC experiments, particularly CMS, as they prepare for the HL-LHC era. While the integration of HPC and cloud resources is a necessary step to meet these growing demands, the authors acknowledge that this integration represents a substantial technical challenge.

One potential concern not explicitly addressed in the paper is the security implications of incorporating a more diverse set of computing resources, some of which may have different security practices and policies. Ensuring the overall security and integrity of the CMS computing infrastructure will be crucial as it becomes more heterogeneous.

Additionally, the paper does not discuss the potential trade-offs between the increased flexibility and scalability provided by HPC and cloud resources and the potential loss of control or visibility that the CMS experiment may experience. Maintaining a balance between these competing factors will be an ongoing challenge.

Conclusion

The LHC experiments, and CMS in particular, face significant computing challenges as they prepare for the HL-LHC era. Integrating and scaling the computing infrastructure to meet these growing demands, while maintaining flexibility and efficiency, will require substantial technical innovation and collaboration across the High Energy Physics community.

The CMS Submission Infrastructure is at the forefront of this effort, actively exploring ways to incorporate diverse computing resources, including HPC and cloud platforms, into the CMS computing ecosystem. The team's regular scalability testing helps prepare the infrastructure for the demands of the coming years.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

HPC resources for CMS offline computing: An integration and scalability challenge for the Submission Infrastructure

Antonio Perez-Calero Yzquierdo, Marco Mascheroni, Edita Kizinevic, Farrukh Aftab Khan, Hyunwoo Kim, Maria Acosta Flechas, Nikos Tsipinakis, Saqib Haleem

The computing resource needs of LHC experiments are expected to continue growing significantly during Run 3 and into the HL-LHC era. The landscape of available resources will also evolve, as High Performance Computing (HPC) and Cloud resources will provide a comparable, or even dominant, fraction of the total compute capacity. The future years present a challenge for the experiments' resource provisioning models, both in terms of scalability and increasing complexity. The CMS Submission Infrastructure (SI) provisions computing resources for CMS workflows. This infrastructure is built on a set of federated HTCondor pools, currently aggregating 400k CPU cores distributed worldwide and supporting the simultaneous execution of over 200k computing tasks. Incorporating HPC resources into CMS computing represents firstly an integration challenge, as HPC centers are much more diverse compared to Grid sites. Secondly, evolving the present SI, dimensioned to harness the current CMS computing capacity, to reach the resource scales required for the HL-LHC phase, while maintaining global flexibility and efficiency, will represent an additional challenge for the SI. To preventively address future potential scalability limits, the SI team regularly runs tests to explore the maximum reach of our infrastructure. In this note, the integration of HPC resources into CMS offline computing is summarized, the potential concerns for the SI derived from the increased scale of operations are described, and the most recent results of scalability tests on the CMS SI are reported.

5/24/2024

The integration of heterogeneous resources in the CMS Submission Infrastructure for the LHC Run 3 and beyond

Antonio Perez-Calero Yzquierdo, Marco Mascheroni, Edita Kizinevic, Farrukh Aftab Khan, Hyunwoo Kim, Maria Acosta Flechas, Nikos Tsipinakis, Saqib Haleem

While the computing landscape supporting LHC experiments is currently dominated by x86 processors at WLCG sites, this configuration will evolve in the coming years. LHC collaborations will be increasingly employing HPC and Cloud facilities to process the vast amounts of data expected during the LHC Run 3 and the future HL-LHC phase. These facilities often feature diverse compute resources, including alternative CPU architectures like ARM and IBM Power, as well as a variety of GPU specifications. Using these heterogeneous resources efficiently is thus essential for the LHC collaborations to reach their future scientific goals. The Submission Infrastructure (SI) is a central element in CMS Computing, enabling resource acquisition and exploitation by CMS data processing, simulation and analysis tasks. The SI must therefore be adapted to ensure access and optimal utilization of this heterogeneous compute capacity. Some steps in this evolution have already been taken, as CMS is currently making opportunistic use of a small pool of GPU slots provided mainly at the CMS WLCG sites. Additionally, Power9 processors have been validated for CMS production at the Marconi-100 cluster at CINECA. This note will describe the updated capabilities of the SI to continue ensuring the efficient allocation and use of computing resources by CMS, despite their increasing diversity. The next steps towards a full integration and support of heterogeneous resources according to CMS needs will also be reported.

5/24/2024

Repurposing of the Run 2 CMS High Level Trigger Infrastructure as a Cloud Resource for Offline Computing

Marco Mascheroni, Antonio Perez-Calero Yzquierdo, Edita Kizinevic, Farrukh Aftab Khan, Hyunwoo Kim, Maria Acosta Flechas, Nikos Tsipinakis, Saqib Haleem, Daniele Spiga, Christoph Wissing, Frank Wurthwein

The former CMS Run 2 High Level Trigger (HLT) farm is one of the largest contributors to CMS compute resources, providing about 25k job slots for offline computing. This CPU farm was initially employed as an opportunistic resource, exploited during inter-fill periods, in the LHC Run 2. Since then, it has become a nearly transparent extension of the CMS capacity at CERN, being located on-site at the LHC interaction point 5 (P5), where the CMS detector is installed. This resource has been configured to support the execution of critical CMS tasks, such as prompt detector data reconstruction. It can therefore be used in combination with the dedicated Tier 0 capacity at CERN, in order to process and absorb peaks in the stream of data coming from the CMS detector. The initial configuration for this resource, based on statically configured VMs, provided the required level of functionality. However, regular operations of this cluster revealed certain limitations compared to the resource provisioning and use model employed in the case of WLCG sites. A new configuration, based on a vacuum-like model, has been implemented for this resource in order to solve the detected shortcomings. This paper reports on this redeployment work on the permanent cloud for enhanced support of CMS offline computing, comparing the former and new models' respective functionalities, along with the commissioning effort for the new setup.

5/24/2024

Adoption of a token-based authentication model for the CMS Submission Infrastructure

Antonio Perez-Calero Yzquierdo, Marco Mascheroni, Edita Kizinevic, Farrukh Aftab Khan, Hyunwoo Kim, Maria Acosta Flechas, Nikos Tsipinakis, Saqib Haleem, Frank Wurthwein

The CMS Submission Infrastructure (SI) is the main computing resource provisioning system for CMS workloads. A number of HTCondor pools are employed to manage this infrastructure, which aggregates geographically distributed resources from the WLCG and other providers. Historically, the model of authentication among the diverse components of this infrastructure has relied on the Grid Security Infrastructure (GSI), based on identities and X509 certificates. In contrast, commonly used modern authentication standards are based on capabilities and tokens. The WLCG has identified this trend and aims at a transparent replacement of GSI for all its workload management, data transfer and storage access operations, to be completed during the current LHC Run 3. As part of this effort, and within the context of CMS computing, the Submission Infrastructure group is in the process of phasing out the GSI part of its authentication layers, in favor of IDTokens and SciTokens. The use of tokens is already well integrated into the HTCondor Software Suite, which has allowed us to fully migrate the authentication between internal components of SI. Additionally, recent versions of the HTCondor-CE support tokens as well, enabling CMS resource requests to Grid sites employing this CE technology to be granted by means of token exchange. After a rollout campaign to sites, successfully completed by the third quarter of 2022, all HTCondor CEs in use by CMS are already receiving SciToken-based pilot jobs. On the ARC CE side, a parallel campaign was launched to foster the adoption of the REST interface at CMS sites (required to enable token-based job submission via HTCondor-G), which is nearing completion as well. In this contribution, the newly adopted authentication model will be described. We will then report on the migration status and final steps towards complete GSI phase-out in the CMS SI.

5/24/2024