The integration of heterogeneous resources in the CMS Submission Infrastructure for the LHC Run 3 and beyond

Read original: arXiv:2405.14647 - Published 5/24/2024 by Antonio Perez-Calero Yzquierdo, Marco Mascheroni, Edita Kizinevic, Farrukh Aftab Khan, Hyunwoo Kim, Maria Acosta Flechas, Nikos Tsipinakis, Saqib Haleem

Overview

The Large Hadron Collider (LHC) experiments, such as the Compact Muon Solenoid (CMS) experiment, are generating vast amounts of data that need to be processed and analyzed.
The computing infrastructure supporting these experiments has been dominated by x86 processors, but this is expected to change in the coming years.
LHC collaborations will increasingly use High-Performance Computing (HPC) and Cloud facilities, which feature diverse compute resources, including alternative CPU architectures like ARM and IBM Power, as well as a variety of GPU specifications.
Efficiently using these heterogeneous resources is essential for the LHC collaborations to reach their future scientific goals.

Plain English Explanation

The Large Hadron Collider (LHC) is a massive particle accelerator used to study the fundamental building blocks of the universe. Its experiments, such as the Compact Muon Solenoid (CMS) experiment, generate huge amounts of data that scientists must process and analyze.

In the past, the computing infrastructure supporting these experiments has been dominated by a type of processor called x86. However, this is expected to change in the coming years. LHC collaborations will start using more High-Performance Computing (HPC) and Cloud facilities, which offer a variety of different types of processors, including ARM and IBM Power, as well as different types of graphics processors (GPUs).

Using these diverse, or heterogeneous, computing resources efficiently is crucial for the LHC collaborations to achieve their future scientific goals. The paper discusses how the CMS experiment is adapting its computing infrastructure to effectively utilize these new types of processors and GPUs.

Technical Explanation

The paper discusses the evolving computing landscape that supports the LHC experiments, particularly the CMS experiment. Currently, the computing infrastructure at the Worldwide LHC Computing Grid (WLCG) sites is dominated by x86 processors. However, this configuration is expected to change in the coming years as LHC collaborations increasingly use HPC and Cloud facilities to process the vast amounts of data expected during the LHC Run 3 and the future HL-LHC phase.

These HPC and Cloud facilities often feature diverse compute resources, including alternative CPU architectures like ARM and IBM Power, as well as a variety of GPU specifications. The paper emphasizes that using these heterogeneous resources efficiently is essential for the LHC collaborations to reach their future scientific goals.

The Submission Infrastructure (SI) is a central element in CMS Computing, enabling resource acquisition and exploitation by CMS data processing, simulation, and analysis tasks. The paper describes how the SI must be adapted to ensure access and optimal utilization of this heterogeneous compute capacity.

Some steps in this evolution have already been taken: CMS is currently making opportunistic use of a small pool of GPU slots provided mainly at the CMS WLCG sites, and Power9 processors have been validated for CMS production at the Marconi-100 cluster at CINECA. The paper outlines the updated capabilities of the SI to continue ensuring the efficient allocation and use of computing resources by CMS, despite their increasing diversity. It also reports on the next steps towards a full integration and support of heterogeneous resources according to CMS needs.
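
To make the allocation problem concrete, the following is a minimal Python sketch of matching jobs to heterogeneous slots by CPU architecture and GPU count. It is an illustration under stated assumptions, not the CMS SI implementation: the `Slot` and `Job` classes, the `assign` function, the slot and job names, and the GPU counts are all hypothetical, and the greedy policy is just one simple choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    """Hypothetical resource slot advertised by a site."""
    name: str
    arch: str      # e.g. "x86_64", "aarch64", "ppc64le" (Power9 on Linux)
    gpus: int = 0  # number of GPUs attached to the slot


@dataclass(frozen=True)
class Job:
    """Hypothetical job with the architectures its software is built for."""
    name: str
    archs: frozenset
    gpus: int = 0  # number of GPUs the job requires


def matches(job: Job, slot: Slot) -> bool:
    """A slot fits if its architecture is supported and it has enough GPUs."""
    return slot.arch in job.archs and slot.gpus >= job.gpus


def assign(jobs, slots):
    """Greedy one-job-per-slot assignment.

    Among fitting slots, pick the one with the fewest spare GPUs, so that
    CPU-only jobs do not occupy GPU slots when a plain CPU slot is free.
    Jobs with no fitting slot map to None.
    """
    free = list(slots)
    plan = {}
    for job in jobs:
        candidates = [s for s in free if matches(job, s)]
        if not candidates:
            plan[job.name] = None
            continue
        best = min(candidates, key=lambda s: s.gpus - job.gpus)
        plan[job.name] = best.name
        free.remove(best)
    return plan


# Illustrative (made-up) pool and workload.
slots = [
    Slot("grid-x86", "x86_64"),
    Slot("grid-gpu", "x86_64", gpus=1),
    Slot("hpc-power9", "ppc64le", gpus=4),
]
jobs = [
    Job("reco", frozenset({"x86_64"})),
    Job("ml-inference", frozenset({"x86_64", "ppc64le"}), gpus=1),
    Job("arm-only", frozenset({"aarch64"})),
]
plan = assign(jobs, slots)
print(plan)
```

In this toy pool the CPU-only job lands on the plain x86 slot, the GPU job takes the single-GPU slot, and the ARM-only job stays unmatched because no `aarch64` slot is offered. A production system would also have to weigh priorities, memory, and site policies, which this sketch ignores.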

Critical Analysis

The paper provides a comprehensive overview of the evolving computing landscape supporting the LHC experiments, particularly the CMS experiment. It highlights the growing importance of diversifying the computing resources beyond the traditional x86 processors, to include alternative CPU architectures and GPUs.

One potential limitation of the research is that it focuses primarily on the CMS experiment, and it's unclear how the findings might translate to other LHC collaborations or experimental domains. Additionally, the paper does not delve into the specific technical challenges or trade-offs involved in integrating and optimizing the use of these heterogeneous resources.

Further research could explore the performance characteristics, energy efficiency, and cost-effectiveness of the various compute architectures across a broader range of LHC experiments and use cases. Investigating the software and programming models required to effectively leverage these diverse resources would also be valuable.

Conclusion

The paper underscores the significant changes underway in the computing landscape supporting LHC experiments, such as the CMS experiment. As LHC collaborations increasingly turn to HPC and Cloud facilities, they must adapt their computing infrastructure to efficiently utilize a wide range of heterogeneous resources, including alternative CPU architectures and GPUs.

The CMS experiment's Submission Infrastructure is being updated to enable access and optimal utilization of these diverse computing capabilities. This evolution is essential for the LHC collaborations to achieve their future scientific goals in the face of the ever-growing data processing demands.

The research highlights the importance of keeping computing infrastructure in step with changing hardware, as the LHC community strives to push the boundaries of our understanding of the fundamental nature of the universe.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

The integration of heterogeneous resources in the CMS Submission Infrastructure for the LHC Run 3 and beyond

Antonio Perez-Calero Yzquierdo, Marco Mascheroni, Edita Kizinevic, Farrukh Aftab Khan, Hyunwoo Kim, Maria Acosta Flechas, Nikos Tsipinakis, Saqib Haleem

While the computing landscape supporting LHC experiments is currently dominated by x86 processors at WLCG sites, this configuration will evolve in the coming years. LHC collaborations will be increasingly employing HPC and Cloud facilities to process the vast amounts of data expected during the LHC Run 3 and the future HL-LHC phase. These facilities often feature diverse compute resources, including alternative CPU architectures like ARM and IBM Power, as well as a variety of GPU specifications. Using these heterogeneous resources efficiently is thus essential for the LHC collaborations to reach their future scientific goals. The Submission Infrastructure (SI) is a central element in CMS Computing, enabling resource acquisition and exploitation by CMS data processing, simulation and analysis tasks. The SI must therefore be adapted to ensure access and optimal utilization of this heterogeneous compute capacity. Some steps in this evolution have already been taken, as CMS is currently making opportunistic use of a small pool of GPU slots provided mainly at the CMS WLCG sites. Additionally, Power9 processors have been validated for CMS production at the Marconi-100 cluster at CINECA. This note will describe the updated capabilities of the SI to continue ensuring the efficient allocation and use of computing resources by CMS, despite their increasing diversity. The next steps towards a full integration and support of heterogeneous resources according to CMS needs will also be reported.

5/24/2024

HPC resources for CMS offline computing: An integration and scalability challenge for the Submission Infrastructure

Antonio Perez-Calero Yzquierdo, Marco Mascheroni, Edita Kizinevic, Farrukh Aftab Khan, Hyunwoo Kim, Maria Acosta Flechas, Nikos Tsipinakis, Saqib Haleem

The computing resource needs of LHC experiments are expected to continue growing significantly during the Run 3 and into the HL-LHC era. The landscape of available resources will also evolve, as High Performance Computing (HPC) and Cloud resources will provide a comparable, or even dominant, fraction of the total compute capacity. The future years present a challenge for the experiments' resource provisioning models, both in terms of scalability and increasing complexity. The CMS Submission Infrastructure (SI) provisions computing resources for CMS workflows. This infrastructure is built on a set of federated HTCondor pools, currently aggregating 400k CPU cores distributed worldwide and supporting the simultaneous execution of over 200k computing tasks. Incorporating HPC resources into CMS computing represents firstly an integration challenge, as HPC centers are much more diverse compared to Grid sites. Secondly, evolving the present SI, dimensioned to harness the current CMS computing capacity, to reach the resource scales required for the HL-LHC phase, while maintaining global flexibility and efficiency, will represent an additional challenge for the SI. To preventively address future potential scalability limits, the SI team regularly runs tests to explore the maximum reach of our infrastructure. In this note, the integration of HPC resources into CMS offline computing is summarized, the potential concerns for the SI derived from the increased scale of operations are described, and the most recent results of scalability tests on the CMS SI are reported.

5/24/2024

Adoption of a token-based authentication model for the CMS Submission Infrastructure

Antonio Perez-Calero Yzquierdo, Marco Mascheroni, Edita Kizinevic, Farrukh Aftab Khan, Hyunwoo Kim, Maria Acosta Flechas, Nikos Tsipinakis, Saqib Haleem, Frank Wurthwein

The CMS Submission Infrastructure (SI) is the main computing resource provisioning system for CMS workloads. A number of HTCondor pools are employed to manage this infrastructure, which aggregates geographically distributed resources from the WLCG and other providers. Historically, the model of authentication among the diverse components of this infrastructure has relied on the Grid Security Infrastructure (GSI), based on identities and X509 certificates. In contrast, commonly used modern authentication standards are based on capabilities and tokens. The WLCG has identified this trend and aims at a transparent replacement of GSI for all its workload management, data transfer and storage access operations, to be completed during the current LHC Run 3. As part of this effort, and within the context of CMS computing, the Submission Infrastructure group is in the process of phasing out the GSI part of its authentication layers, in favor of IDTokens and Scitokens. The use of tokens is already well integrated into the HTCondor Software Suite, which has allowed us to fully migrate the authentication between internal components of SI. Additionally, recent versions of the HTCondor-CE support tokens as well, enabling CMS resource requests to Grid sites employing this CE technology to be granted by means of token exchange. After a rollout campaign to sites, successfully completed by the third quarter of 2022, the totality of HTCondor CEs in use by CMS are already receiving Scitoken-based pilot jobs. On the ARC CE side, a parallel campaign was launched to foster the adoption of the REST interface at CMS sites (required to enable token-based job submission via HTCondor-G), which is nearing completion as well. In this contribution, the newly adopted authentication model will be described. We will then report on the migration status and final steps towards complete GSI phase out in the CMS SI.

5/24/2024

Repurposing of the Run 2 CMS High Level Trigger Infrastructure as a Cloud Resource for Offline Computing

Marco Mascheroni, Antonio Perez-Calero Yzquierdo, Edita Kizinevic, Farrukh Aftab Khan, Hyunwoo Kim, Maria Acosta Flechas, Nikos Tsipinakis, Saqib Haleem, Daniele Spiga, Christoph Wissing, Frank Wurthwein

The former CMS Run 2 High Level Trigger (HLT) farm is one of the largest contributors to CMS compute resources, providing about 25k job slots for offline computing. This CPU farm was initially employed as an opportunistic resource, exploited during inter-fill periods, in the LHC Run 2. Since then, it has become a nearly transparent extension of the CMS capacity at CERN, being located on-site at the LHC interaction point 5 (P5), where the CMS detector is installed. This resource has been configured to support the execution of critical CMS tasks, such as prompt detector data reconstruction. It can therefore be used in combination with the dedicated Tier 0 capacity at CERN, in order to process and absorb peaks in the stream of data coming from the CMS detector. The initial configuration for this resource, based on statically configured VMs, provided the required level of functionality. However, regular operations of this cluster revealed certain limitations compared to the resource provisioning and use model employed in the case of WLCG sites. A new configuration, based on a vacuum-like model, has been implemented for this resource in order to solve the detected shortcomings. This paper reports about this redeployment work on the permanent cloud for an enhanced support to CMS offline computing, comparing the former and new models' respective functionalities, along with the commissioning effort for the new setup.

5/24/2024