DCGPT: Transforming Future AIDC Operations

INTRODUCTION

Demand for artificial intelligence (AI) is expanding at an unprecedented rate worldwide, and this growth is putting increasing pressure on data centers (DCs). Unlike standard computational tasks, AI workloads involve complex computations for training and inference, and they rely on dedicated graphics processing units (GPUs), data processing units (DPUs), and networking devices designed for AI services. The surge in demand for high-performance computing (HPC) has been transforming DCs into AI data centers (AIDC) for years. AIDC have two distinguishing features, one on the operational technology (OT) side and one on the information technology (IT) side. On the OT side, the per-rack power density of AIDC can reach up to 125 kW, compared with 10 kW for a normal DC. On the IT side, approximately half of AI cloud service providers run their computing resources at utilization rates exceeding 85%.

Extreme power density and high resource utilization make AIDC more volatile than normal DCs. The prevailing best practice has been to hire professional operators to run the DC manually. Worldwide, DCs have experienced hundreds of outages caused by technical issues in their cooling systems. More than half (54 percent) of the respondents in the survey said severe outages cost more than $100,000, with 16 percent reporting that their most recent outage cost more than $1 million. Half of DC outages stem from human error, often because staff fail to follow procedures. Besides the IT equipment (ITE) found in normal DCs, AIDC contain additional advanced and heterogeneous devices, including GPUs, DPUs, and NVLink interconnects. High resource utilization may lower fault tolerance and diminish the resilience of AIDC. AIDC also require more operators with extensive domain knowledge, and such expertise is hard to scale. Human-based DC operations therefore cannot keep pace with emerging AI trends.

As we delve into the challenges we face, three primary concerns come to the fore.

  1. High Cost and Low Resilience: First, DC design requires the involvement of a professional design agency, a process that is both costly and time-consuming. The high CapEx and OpEx of AIDC, combined with increased GPU workloads, may result in low resilience and fault tolerance.
  2. Talent Shortage: Second, as our DCs grow, we face a distinct shortage of operational talent. This scarcity could hamper our ability to manage the heightened workload, leading to operational inefficiencies and degraded overall performance.
  3. Domain Complexity: Lastly, operational tasks demand a high degree of domain-specific knowledge. This inherent complexity makes the tasks more laborious and time-consuming and significantly increases the risk of operational outages.

DCGPT

AIDC is a highly integrated, complex, costly, and mission-critical cyber-physical system (CPS), while large language models (LLMs) have been shown to exhibit human-level abilities and to integrate comprehensive domain knowledge. We therefore introduce DCGPT, a DC domain-specific LLM, to enhance future AIDC operations.

DCGPT comprises multiple foundational agentic components for DC operations and management. With its ability to process vast amounts of data and generate actionable insights, DCGPT serves as the core engine for DC operations.

  1. Base LLM: DCGPT is built upon a family of open-source base LLMs (e.g., LLaMA, DeepSeek, Qwen). These models are being integrated into workflows for AIDC design, dynamics modeling, energy optimization, workload management, first response, and cybersecurity enhancement.
  2. Knowledge: We curate a massive DC domain-specific knowledge base, including abstract physics equations, multimodal content, and a model asset library. This data is used to build and enhance DCGPT for DC-related LLM tasks.
  3. Agents: We develop three agents to power DCGPT’s functionality: a modeling agent, a simulation agent, and a planning agent. As fundamental components, these agents cooperatively support the operation applications.
  4. Applications: Built upon DCGPT, we provide three major applications for DC operations, which assist operators with design, optimization, and maintenance. These applications interact with DC operators through a conversational interface.

DCGPT bridges the gap between abstract knowledge and real-world applications. The framework integrates abstract physics equations, textual knowledge, and model assets with the reasoning capabilities of LLMs. This combination enables LLM-empowered operations for future AIDC, including advanced design, optimization, and maintenance.

OPEN SOURCE

DCGPT is an open-source LLM project focusing on DCs. Our goal is to provide a user-friendly platform and an open-source community for DC practitioners and researchers to share trending DC information.

  • Hugging Face: We release our DCGPT fine-tuned model family on Hugging Face (https://huggingface.co/CAP-GDCR). These models are trained on different open-source base models.
  • GitHub: We also release our web application source code and our data design on GitHub (https://github.com/CAP-GDCR/DCGPT). We hope DCGPT can become an open-source project to which DC practitioners contribute collaboratively.

CASE STUDY

This section introduces our DCGPT platform. Currently, it provides comprehensive and accurate responses to questions in the DC domain. It serves as a valuable tool for DC practitioners seeking information or solutions related to daily operations, maintenance, and sustainability.

Case 1-Q&A: DCGPT provides DC-related question answering. We trained and augmented DCGPT with the latest DC textual data, including metrics, policies, reports, standard operating procedures (SOPs), specifications, papers, and textbooks. The DCGPT platform also supports chatting with uploaded documents. It functions as a consulting agent, providing insightful responses to queries related to sustainability.

Because DCGPT is continuously trained and augmented with the latest DC-related knowledge, it provides more reliable answers for DC practitioners. The following cases compare results with and without DCGPT.

Terminology Explanation: This case tests an LLM’s ability to comprehend domain-specific professional terminology. DCGPT gives the correct answer for AIDC, while GPT-4 misinterprets the abbreviation.

Domain Physics Formulation: This case tests an LLM’s ability to reason about and apply physical laws and principles. DCGPT gives the correct ordinary differential equation (ODE) form of data hall thermodynamics, while GPT-4 gives a more general one that is not tailored to the DC scenario.
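
To illustrate what an ODE form of data hall thermodynamics can look like, the sketch below uses a generic single-zone energy balance, C * dT/dt = P_it - m_dot * cp * (T - T_supply). This is a textbook-style simplification assumed here for illustration only; it is not the formulation produced by DCGPT, and the function names and all parameter values are hypothetical.

```python
# Single-zone (lumped) energy balance for a data hall:
#   C * dT/dt = P_it - m_dot * cp * (T - T_supply)
# Illustrative model and values only; not DCGPT's actual formulation.

def simulate_hall_temperature(t_init, p_it_w, m_dot, t_supply,
                              heat_capacity=2.0e6, cp_air=1005.0,
                              dt=1.0, steps=3600):
    """Integrate hall air temperature (deg C) with forward Euler.

    t_init        initial hall temperature, deg C
    p_it_w        IT heat load, W
    m_dot         cooling air mass flow, kg/s
    t_supply      supply air temperature, deg C
    heat_capacity assumed lumped thermal capacity, J/K
    cp_air        specific heat of air, J/(kg*K)
    """
    temp = t_init
    for _ in range(steps):
        removed_w = m_dot * cp_air * (temp - t_supply)  # heat carried away
        temp += dt * (p_it_w - removed_w) / heat_capacity
    return temp


def steady_state_temperature(p_it_w, m_dot, t_supply, cp_air=1005.0):
    """Temperature at which heat removed equals the IT heat load."""
    return t_supply + p_it_w / (m_dot * cp_air)


if __name__ == "__main__":
    final = simulate_hall_temperature(25.0, 100_000.0, 10.0, 18.0)
    print(f"after one hour: {final:.2f} deg C")
    print(f"steady state:   {steady_state_temperature(100_000.0, 10.0, 18.0):.2f} deg C")
```

With the assumed values, the time constant C / (m_dot * cp) is about 200 seconds, so a one-second Euler step is stable and the simulated temperature settles at the steady-state value well within the hour.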

In-house Code Generation: This case tests an LLM’s ability to comprehend input prompts and generate the corresponding code. DCGPT, trained with our in-house library, comprehends the DC hierarchy and produces executable Python code, while GPT-4 produces an entirely incorrect code snippet.

Case 2-Incident Response:

Incident response in a DC refers to the structured approach taken to address and manage the aftermath of an incident, such as an equipment failure, a security breach, or a cyberattack. The goal is to handle the situation in a way that limits damage and reduces recovery time and costs, ensuring the continuity of operations and services. Several scenarios may arise in daily DC operations:

  1. UPS Battery Redundancy Error: A DC relies on Uninterruptible Power Supply (UPS) systems with N+1 battery redundancy to protect against power outages. Operators are alerted to the redundancy error. They may perform manual checks on the battery string, such as voltage measurements and visual inspection. This process can be time-consuming, and the root cause of the error may not be immediately apparent.
  2. Cooling System Failure Due to Pump Malfunction: A critical cooling pump in the DC’s chilled water system malfunctions, causing a rapid rise in server temperatures. Operators are alerted to the rising temperatures. They manually attempt to diagnose the pump issue, potentially consulting maintenance manuals or contacting the vendor. Finding a replacement pump or repairing the existing one can take time, leading to potential server overheating and downtime. Manual intervention can also be prone to errors under pressure.
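
The N+1 rule in the first scenario can be expressed as a simple check: the number of healthy units must exceed the number N needed to carry the load by at least one. The sketch below is a minimal illustration; the function names, the notion of a per-unit capacity in kW, and the example values are assumptions and do not come from any specific DCIM product or from DCGPT.

```python
import math


def units_required(load_kw, unit_capacity_kw):
    """N: the minimum number of units needed to carry the load."""
    if load_kw < 0 or unit_capacity_kw <= 0:
        raise ValueError("load must be >= 0 and unit capacity must be > 0")
    return math.ceil(load_kw / unit_capacity_kw)


def redundancy_margin(load_kw, unit_capacity_kw, healthy_units):
    """Number of healthy units beyond N (negative means under capacity)."""
    return healthy_units - units_required(load_kw, unit_capacity_kw)


def has_n_plus_1(load_kw, unit_capacity_kw, healthy_units):
    """True when at least one spare unit remains on top of N."""
    return redundancy_margin(load_kw, unit_capacity_kw, healthy_units) >= 1


if __name__ == "__main__":
    # Hypothetical example: 400 kW load, 100 kW per unit, so N = 4.
    print(has_n_plus_1(400.0, 100.0, healthy_units=5))  # spare unit available
    print(has_n_plus_1(400.0, 100.0, healthy_units=4))  # redundancy lost
```

A check like this only detects that redundancy has been lost; identifying the root cause still requires the inspection steps described in the scenario.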

Our approach leverages DCGPT to provide an SOP for DC operations based on the emergency alerts raised in the data center infrastructure management (DCIM) system.

For UPS Battery Redundancy Error: DCGPT follows the SOP of a DC in operation to generate a detailed, step-by-step solution for a given battery alert.

GPT-4 fails to provide an SOP with detailed actions; it suggests only checking and inspection.

For Cooling System Failure Due to Pump Malfunction: DCGPT follows the SOP of a DC in operation to generate a detailed, step-by-step solution for a chiller shutdown alert.

GPT-4 provides an incorrect SOP that does not solve the problem.
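
One way to picture the alert-to-SOP step is as prompt assembly: the alert fields and any relevant site SOP excerpts are combined into a single request for the model. The article does not describe DCGPT's actual prompt format or the DCIM alert schema, so the `Alert` fields, the `build_sop_prompt` helper, and the wording below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical alert record; real DCIM schemas will differ."""
    source: str
    severity: str
    message: str


def build_sop_prompt(alert, site_notes=()):
    """Assemble a plain-text prompt asking for a step-by-step SOP."""
    lines = [
        "You are assisting a data center operator.",
        f"Alert source: {alert.source}",
        f"Severity: {alert.severity}",
        f"Message: {alert.message}",
    ]
    if site_notes:
        # Site-specific excerpts keep the answer tied to local procedures.
        lines.append("Relevant site SOP excerpts:")
        lines.extend(f"- {note}" for note in site_notes)
    lines.append("Respond with a numbered, step-by-step SOP.")
    return "\n".join(lines)


if __name__ == "__main__":
    alert = Alert("UPS-2", "critical", "Battery redundancy lost")
    print(build_sop_prompt(alert, ["Measure string voltage before isolation."]))
```

The resulting string would then be sent to whichever model endpoint is in use; that call is deliberately left out because it depends on the deployment.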

Case 3-Design Generation:

During the operation stage of a DC, it is often necessary to expand computing resources by constructing new data halls. Designing a DC, with its complex requirements for power, cooling, and network infrastructure, while also ensuring optimal server performance and data security, is a daunting and painstaking task. Necessary considerations include:

  1. Power and Cooling: The new data hall needs to support high-density servers, which require significant power and cooling. The company must carefully plan the power distribution and cooling systems to ensure efficient operation and prevent equipment failure.
  2. Server Performance and Data Security: The company needs to select and configure servers that meet its specific performance requirements while ensuring data security.

Our approach leverages DCGPT to generate DC design candidates from prompts. Only DCGPT generates a plausible data hall design and runs a computational fluid dynamics (CFD) simulation for what-if analysis.

Case 4-Energy Optimization:

DCs are the backbone of the digital world, but they consume vast amounts of energy. Improving energy efficiency in these facilities matters for both cost savings and environmental sustainability, and optimizing cooling control is a key lever. Optimized cooling control has several aspects:

  1. Variable-Speed Fans and Pumps: Fans and pumps adjust their speed based on cooling demand rather than operating at a constant speed. This optimizes energy usage by matching cooling output to actual requirements. A colocation provider that upgraded its cooling system with variable-speed fans and pumps saw a 15% reduction in energy consumption and improved cooling performance due to more precise control.
  2. AI-Powered Cooling Optimization: Artificial intelligence and machine learning analyze sensor data and optimize cooling system operation in real time. This enables dynamic adjustments to cooling strategies, maximizing efficiency and minimizing energy waste. A cloud service provider that implemented an AI-powered cooling optimization system in its DCs reduced cooling energy consumption by 25% while maintaining optimal server temperatures.
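
The reason variable-speed operation saves energy follows from the fan affinity laws: for an ideal fan, flow scales linearly with speed while shaft power scales with the cube of speed. The sketch below applies that idealized relationship. It ignores static pressure requirements and motor and drive losses, so real savings will be smaller, and it is not the basis for the 15% and 25% figures quoted above. The function names are illustrative.

```python
def fan_power_fraction(speed_fraction):
    """Ideal fan power relative to full speed (affinity law: P ~ speed^3)."""
    if not 0.0 <= speed_fraction <= 1.0:
        raise ValueError("speed_fraction must be between 0 and 1")
    return speed_fraction ** 3


def fan_flow_fraction(speed_fraction):
    """Ideal air flow relative to full speed (affinity law: Q ~ speed)."""
    if not 0.0 <= speed_fraction <= 1.0:
        raise ValueError("speed_fraction must be between 0 and 1")
    return speed_fraction


def fan_energy_saving(speed_fraction):
    """Idealized fractional power saving versus running at full speed."""
    return 1.0 - fan_power_fraction(speed_fraction)


if __name__ == "__main__":
    for speed in (1.0, 0.9, 0.8, 0.7):
        print(f"speed {speed:.0%}: flow {fan_flow_fraction(speed):.0%}, "
              f"power {fan_power_fraction(speed):.1%}")
```

Under this idealization, running a fan at 80% speed still delivers 80% of the flow but draws only 0.8^3 = 51.2% of full-speed power, which is why lowering fan flow when the load allows is an attractive control action.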

Our approach leverages DCGPT to comprehend the physical environment of a DC and provide corresponding recommendations for DC operations. DCGPT recommends a better action, a low ACU fan flow rate, which saves energy.

GPT-4, by contrast, recommends incorrect actions that may waste energy.

CONCLUSION & FUTURE WORK

This article introduces DCGPT, a domain-specific LLM designed for AIDC operation, along with an architectural vision for its implementation. We release our open-source models and projects. The case study demonstrates DCGPT’s capabilities in DC-related QA, construction design, facility maintenance, and energy optimization. Future work will focus on expanding the training data to further improve performance and robustness, exploring fine-tuning strategies for specific AIDC tasks, and developing more efficient inference methods for real-time applications. We also plan to investigate incorporating multimodal data, such as images and sensor readings, to enhance the model’s understanding of the AIDC environment.
