With the recent boom in AI, the footprint of AI workloads and AI-capable hardware servers deployed in cloud data centers has grown exponentially, spread across multiple regions and data centers worldwide. To support this growth and stay ahead of the competition, cloud providers such as Azure, AWS, and GCP have started building fleets of specialized high-performance computing servers. AI workloads, which involve heavy data processing, model training, and inference, require specialized hardware unlike traditional general-purpose compute servers. Hence, all cloud service providers are investing heavily in GPU-, TPU-, and NPU-based servers suited to hosting AI workloads. The majority of these servers follow a buy model, leaving cloud service providers dependent on the Original Equipment Manufacturer (OEM) for hardware diagnostics and maintenance. This dependency has caused a lot of pain for cloud service providers: repair SLAs are uncertain and expensive, which impacts fleet availability. Hence, cloud providers are shifting from buy to make, moving from OEM-designed server maintenance to in-house server maintenance. This shift in business model has driven a transition in the data center service model from OEM-reliant to self-maintainer. To support this self-reliance and the growth of the AI hardware fleet, every cloud service provider is aiming to reduce the cost of service and build swift, remote, accurate, automated, and economical hardware diagnostics.

Why Hardware Diagnostics Matter for AI

AI workloads are unique in nature: they require parallel processing and compute-intensive hardware that is reliable and stable. However, hardware components often fail, at times without notice. A single degraded GPU or memory failure can derail hours of training or crash real-time inference endpoints. Common hardware-related issues impacting AI workloads include GPU memory (ECC) errors, thermal throttling, performance degradation from baseline, and firmware drift.

So, to meet customer expectations of a highly available and uninterrupted service, cloud providers need accurate hardware diagnostics that pinpoint the faulty component.

The hardware diagnostics engine for AI hardware will be broken down into the following components:

1) Telemetry collection layer: This layer focuses on collecting real-time hardware telemetry from various components.

The platform will use cloud-based agents to collect and publish hardware telemetry to a centralized location.
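A minimal sketch of such an agent, assuming NVIDIA GPUs with nvidia-smi installed and a hypothetical central ingestion endpoint (TELEMETRY_ENDPOINT is a placeholder, not a real service):

```python
import json
import socket
import subprocess
import time
import urllib.request

# Hypothetical central ingestion endpoint; replace with the real collector URL.
TELEMETRY_ENDPOINT = "https://telemetry.example.internal/ingest"

def collect_gpu_telemetry():
    """Query per-GPU counters via nvidia-smi (assumes the NVIDIA driver is installed)."""
    query = "index,temperature.gpu,utilization.gpu,memory.used,ecc.errors.uncorrected.volatile.total"
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={query}", "--format=csv,noheader,nounits"],
        text=True,
    )
    samples = []
    for line in out.strip().splitlines():
        idx, temp, util, mem_used, ecc = [f.strip() for f in line.split(",")]
        samples.append({
            "gpu_index": int(idx),
            "temperature_c": float(temp),
            "utilization_pct": float(util),
            "memory_used_mib": float(mem_used),
            "ecc_uncorrected": int(ecc) if ecc.isdigit() else 0,  # ECC may report N/A
        })
    return samples

def publish(samples):
    """Publish one telemetry snapshot for this node to the centralized location."""
    payload = json.dumps({
        "node": socket.gethostname(),
        "timestamp": time.time(),
        "gpus": samples,
    }).encode()
    req = urllib.request.Request(
        TELEMETRY_ENDPOINT, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req, timeout=10)

if __name__ == "__main__":
    while True:
        publish(collect_gpu_telemetry())
        time.sleep(60)  # sampling interval; tune per fleet requirements
```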

2) Hardware risk scoring layer: This layer computes a hardware risk score for each node based on hardware failure patterns. The diagnostics engine will sample signals such as ECC error rates over time, thermal headroom across workloads, GPU performance degradation from baseline, firmware mismatches against the golden configuration, and hardware retry counts per VM allocation.

Sample logic: Node_Health_Score = weighted_sum(ECC_rate, Thermal_Throttle, Firmware_Drift, Allocation_Retry)
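A minimal sketch of this scoring logic in Python; the weights and normalization bounds below are illustrative assumptions, not tuned values:

```python
# Illustrative weights; in practice these would be tuned against historical failure data.
WEIGHTS = {
    "ecc_rate": 0.35,          # uncorrected ECC errors per hour
    "thermal_throttle": 0.25,  # fraction of samples spent thermally throttled
    "firmware_drift": 0.20,    # 1.0 if firmware differs from golden config, else 0.0
    "allocation_retry": 0.20,  # VM allocation retries per day
}

# Upper bounds used to normalize each raw signal into the 0..1 range (assumed values).
NORMALIZERS = {
    "ecc_rate": 10.0,
    "thermal_throttle": 1.0,
    "firmware_drift": 1.0,
    "allocation_retry": 5.0,
}

def node_health_score(signals: dict) -> float:
    """Weighted sum of normalized risk signals; a higher score means higher failure risk."""
    score = 0.0
    for name, weight in WEIGHTS.items():
        normalized = min(signals.get(name, 0.0) / NORMALIZERS[name], 1.0)
        score += weight * normalized
    return round(score, 3)

# Example: a node with elevated ECC errors and occasional throttling.
print(node_health_score({
    "ecc_rate": 4.0,
    "thermal_throttle": 0.1,
    "firmware_drift": 0.0,
    "allocation_retry": 1.0,
}))  # -> 0.205
```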

The risk score will be used by the diagnostics engine to predict and mitigate hardware failures.

3) Prediction, Mitigation and Remediation Layer: The diagnostics engine will use telemetry data from various hardware components, together with their risk scores, to take mitigation and remediation actions; a sketch of this decision flow follows the list below.

A. Prediction of Hardware Faults

B. Mitigation of Hardware Faults

C. Remediation of Hardware Faults
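A minimal sketch of how the node health score could drive these three stages; the thresholds and action descriptions are hypothetical placeholders, not an actual policy:

```python
from dataclasses import dataclass

# Illustrative thresholds on the node health score (0..1); real values would be
# derived from failure-prediction models and SLA targets.
PREDICT_THRESHOLD = 0.3    # flag node as at-risk
MITIGATE_THRESHOLD = 0.6   # stop placing new workloads on the node
REMEDIATE_THRESHOLD = 0.8  # drain workloads and open a repair ticket

@dataclass
class DiagnosticsDecision:
    node: str
    score: float
    actions: list

def decide(node: str, score: float) -> DiagnosticsDecision:
    """Map a node health score to prediction, mitigation, and remediation actions."""
    actions = []
    if score >= PREDICT_THRESHOLD:
        actions.append("predict: mark node at-risk and increase telemetry sampling")
    if score >= MITIGATE_THRESHOLD:
        actions.append("mitigate: cordon node so no new VMs/jobs are scheduled")
    if score >= REMEDIATE_THRESHOLD:
        actions.append("remediate: live-migrate workloads and create a repair ticket")
    return DiagnosticsDecision(node, score, actions)

# Example: a node whose score has crossed the mitigation threshold.
print(decide("gpu-node-017", 0.72).actions)
```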


4) Diagnostics metrics and AI hardware fleet-wide insights

Build a reporting dashboard that exposes GPU/node health metrics such as the node health score, ECC error rates, thermal throttle events, firmware drift, and allocation retries across the fleet.
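A minimal sketch of the fleet-wide aggregation behind such a dashboard, assuming per-node records that carry a health score, GPU SKU, and host model (the sample data is fabricated for illustration only):

```python
import pandas as pd

# Hypothetical per-node snapshots produced by the risk scoring layer.
records = pd.DataFrame([
    {"node": "n1", "gpu_sku": "H100", "host_model": "A", "health_score": 0.12},
    {"node": "n2", "gpu_sku": "H100", "host_model": "A", "health_score": 0.71},
    {"node": "n3", "gpu_sku": "A100", "host_model": "B", "health_score": 0.33},
])

# Fleet-wide rollup: node count, average health score, and share of at-risk nodes
# per GPU SKU / host model combination.
summary = (
    records.groupby(["gpu_sku", "host_model"])
    .agg(
        nodes=("node", "count"),
        avg_health_score=("health_score", "mean"),
        at_risk_pct=("health_score", lambda s: (s >= 0.6).mean() * 100),
    )
    .reset_index()
)
print(summary)
```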

Conclusion:

Building robust and reliable diagnostics will help baseline AI hardware health and show how that health varies across GPU SKUs and host models. It will also let us correlate hardware failure events with AI model degradation.