Zach Anderson
Aug 14, 2024 04:45
Discover the intricacies of testing and operating large GPU clusters for generative AI model training, ensuring high performance and reliability.
Training generative AI models requires clusters of expensive, cutting-edge hardware such as H100 GPUs and fast storage, interconnected through multi-network topologies involving InfiniBand links, switches, transceivers, and Ethernet connections. While high-performance computing (HPC) and AI cloud providers offer these specialized clusters, they come with substantial capital commitments. However, not all clusters are created equal, according to together.ai.
Introduction to GPU Cluster Testing
The reliability of GPU clusters varies significantly, with issues ranging from minor to critical. For instance, Meta reported that during its 54-day training run of the Llama 3.1 model, GPU issues accounted for 58.7% of all unexpected problems. Together AI, which serves many AI startups and Fortune 500 companies, has developed a robust validation framework to ensure hardware quality before deployment.
The Approach to Testing Clusters at Together AI
The goal of acceptance testing is to ensure that the hardware infrastructure meets specified requirements and delivers the reliability and performance necessary for demanding AI/ML workloads.
1. Preparation and Configuration
The initial phase involves configuring the new hardware in a GPU cluster environment that mimics end-use conditions. This includes installing NVIDIA drivers, OFED drivers for InfiniBand, CUDA, NCCL, and HPC-X, and configuring the SLURM cluster and PCIe settings for performance.
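As an illustrative sketch only (not Together AI's actual tooling), a short Python check such as the following can confirm that the driver, CUDA, and NCCL versions visible to the framework match what was installed. It assumes PyTorch is installed and nvidia-smi is on PATH.

    # stack_check.py -- minimal sanity check of the installed GPU software stack.
    # Assumes PyTorch is installed and nvidia-smi is on PATH.
    import subprocess
    import torch

    def main():
        # Driver version as reported by the NVIDIA driver itself.
        driver = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout.splitlines()[0].strip()

        print(f"driver version : {driver}")
        print(f"CUDA (torch)   : {torch.version.cuda}")
        print(f"NCCL (torch)   : {torch.cuda.nccl.version()}")
        print(f"GPUs visible   : {torch.cuda.device_count()}")
        assert torch.cuda.is_available(), "CUDA is not available on this node"

    if __name__ == "__main__":
        main()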
2. GPU Validation
Validation begins by confirming that the GPU type and count match expectations. Stress-testing tools such as DCGM Diagnostics and gpu-burn are then used to measure power consumption and temperature under load. These tests help identify issues such as NVML driver mismatches or “GPU fell off the bus” errors.
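A minimal sketch of that first inventory check is shown below, assuming the pynvml (nvidia-ml-py) bindings are installed; the expected model string and count are placeholders, and this does not replace DCGM Diagnostics or gpu-burn for stress testing.

    # gpu_inventory_check.py -- verify GPU type/count and read power/temperature.
    # Assumes pynvml (nvidia-ml-py); EXPECTED_* values are placeholders.
    import pynvml

    EXPECTED_NAME = "NVIDIA H100 80GB HBM3"   # placeholder expected model string
    EXPECTED_COUNT = 8                        # placeholder GPU count per node

    pynvml.nvmlInit()
    try:
        count = pynvml.nvmlDeviceGetCount()
        assert count == EXPECTED_COUNT, f"expected {EXPECTED_COUNT} GPUs, found {count}"
        for i in range(count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # reported in mW
            print(f"GPU {i}: {name}, {temp} C, {power_w:.0f} W")
            assert EXPECTED_NAME in str(name), f"GPU {i} is not the expected model"
    finally:
        pynvml.nvmlShutdown()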
3. NVLink and NVSwitch Validation
After individual GPU validation, tools such as NCCL tests and nvbandwidth measure GPU-to-GPU communication over NVLink. These tests help diagnose problems such as a faulty NVSwitch or downed NVLinks.
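As a rough, hedged illustration (not a substitute for nccl-tests or nvbandwidth), the following PyTorch snippet times a device-to-device copy between two GPUs on one node; the 1 GiB payload and iteration count are arbitrary choices.

    # p2p_copy_check.py -- rough GPU-to-GPU copy bandwidth between cuda:0 and cuda:1.
    # Assumes a node with at least two GPUs and PyTorch built with CUDA support.
    import time
    import torch

    assert torch.cuda.device_count() >= 2, "need at least two GPUs"

    n_bytes = 1 << 30  # 1 GiB payload (arbitrary size for illustration)
    src = torch.empty(n_bytes, dtype=torch.uint8, device="cuda:0")
    dst = torch.empty(n_bytes, dtype=torch.uint8, device="cuda:1")

    # Warm up so context creation and peer-access setup do not skew the timing.
    for _ in range(3):
        dst.copy_(src, non_blocking=True)
    torch.cuda.synchronize("cuda:0")
    torch.cuda.synchronize("cuda:1")

    iters = 20
    start = time.perf_counter()
    for _ in range(iters):
        dst.copy_(src, non_blocking=True)
    torch.cuda.synchronize("cuda:0")
    torch.cuda.synchronize("cuda:1")
    elapsed = time.perf_counter() - start

    print(f"GPU0 -> GPU1 copy bandwidth: {n_bytes * iters / elapsed / 1e9:.1f} GB/s")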
4. Network Validation
For distributed training, the network configuration is validated over InfiniBand or RoCE networking fabrics. Tools such as ibping, ib_read_bw, ib_write_bw, and NCCL tests are used to ensure optimal performance. Good results in these tests indicate that the cluster will perform well for distributed training workloads.
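Purely as an assumed example of how such a check might be scripted (the flags and output parsing follow common nccl-tests conventions and should be adapted to the actual environment), a single-node NCCL all-reduce run could be wrapped like this:

    # nccl_allreduce_check.py -- run the nccl-tests all_reduce benchmark on one node
    # and extract the average bus bandwidth line from its output.
    # Assumes the nccl-tests binaries are built and all_reduce_perf is on PATH.
    import subprocess

    NUM_GPUS = 8  # placeholder: GPUs per node

    result = subprocess.run(
        ["all_reduce_perf", "-b", "8", "-e", "8G", "-f", "2", "-g", str(NUM_GPUS)],
        capture_output=True, text=True, check=True,
    )

    for line in result.stdout.splitlines():
        if "Avg bus bandwidth" in line:
            print(line.strip())
            break
    else:
        raise RuntimeError("no bus bandwidth summary found in all_reduce_perf output")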
5. Storage Validation
Storage performance is crucial for machine learning workloads. Tools such as fio measure the performance characteristics of different storage configurations, including random reads, random writes, sustained reads, and sustained writes.
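As a hedged example, a short random-read fio job can be scripted and its JSON output parsed as below; the block size, queue depth, and target directory are placeholders rather than Together AI's actual test parameters.

    # fio_randread_check.py -- run a short random-read fio job and report IOPS/bandwidth.
    # Assumes fio is installed; job parameters and the target directory are placeholders.
    import json
    import subprocess

    cmd = [
        "fio", "--name=randread", "--directory=/mnt/shared-storage",  # placeholder mount
        "--rw=randread", "--bs=4k", "--size=1G", "--numjobs=4", "--iodepth=32",
        "--ioengine=libaio", "--direct=1", "--runtime=60", "--time_based",
        "--group_reporting", "--output-format=json",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    job = json.loads(out.stdout)["jobs"][0]

    print(f"random read IOPS      : {job['read']['iops']:.0f}")
    print(f"random read bandwidth : {job['read']['bw'] / 1024:.1f} MiB/s")  # fio reports KiB/s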
6. Model Build
The final phase involves running reference tasks tailored to customer use cases, ensuring the cluster can achieve the expected end-to-end performance. A common task is training a model with frameworks such as PyTorch's Fully Sharded Data Parallel (FSDP) to evaluate training throughput, model FLOPS utilization, GPU utilization, and network communication latencies.
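A minimal, hedged sketch of such a reference run is shown below, launched with torchrun; the toy MLP, batch shape, and step count are stand-ins for a real workload, not Together AI's reference configuration.

    # fsdp_throughput_check.py -- minimal FSDP training loop that reports throughput.
    # Launch with: torchrun --nproc_per_node=<gpus> fsdp_throughput_check.py
    # The model size, batch shape, and step count are placeholders.
    import os
    import time
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    def main():
        dist.init_process_group("nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Toy MLP standing in for a real model; FSDP shards its parameters across ranks.
        model = torch.nn.Sequential(
            torch.nn.Linear(4096, 4096), torch.nn.GELU(),
            torch.nn.Linear(4096, 4096), torch.nn.GELU(),
            torch.nn.Linear(4096, 4096),
        ).cuda()
        model = FSDP(model)
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

        batch = torch.randn(32, 4096, device="cuda")   # synthetic data
        target = torch.randn(32, 4096, device="cuda")

        steps, start = 50, time.perf_counter()
        for _ in range(steps):
            optimizer.zero_grad(set_to_none=True)
            loss = torch.nn.functional.mse_loss(model(batch), target)
            loss.backward()
            optimizer.step()
        torch.cuda.synchronize()

        if dist.get_rank() == 0:
            elapsed = time.perf_counter() - start
            samples = steps * batch.shape[0] * dist.get_world_size()
            print(f"throughput: {samples / elapsed:.1f} samples/sec on {dist.get_world_size()} ranks")
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()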
7. Observability
Continuous monitoring for hardware failures is essential. Together AI uses Telegraf to collect system metrics, ensuring maximum uptime and reliability. Monitoring covers cluster-level and host-level metrics, such as CPU/GPU utilization, available memory, disk space, and network connectivity.
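Together AI's production setup relies on Telegraf; purely to illustrate the kinds of host-level metrics involved, a small hedged collector sketch (assuming psutil and pynvml are installed) might look like this:

    # metrics_snapshot.py -- one snapshot of host- and GPU-level metrics of the kind a
    # monitoring agent such as Telegraf would ship. Assumes psutil and pynvml.
    import psutil
    import pynvml

    def snapshot():
        metrics = {
            "cpu_percent": psutil.cpu_percent(interval=1),
            "mem_available_gib": psutil.virtual_memory().available / 2**30,
            "disk_used_percent": psutil.disk_usage("/").percent,
        }
        pynvml.nvmlInit()
        try:
            for i in range(pynvml.nvmlDeviceGetCount()):
                handle = pynvml.nvmlDeviceGetHandleByIndex(i)
                util = pynvml.nvmlDeviceGetUtilizationRates(handle)
                metrics[f"gpu{i}_util_percent"] = util.gpu
                metrics[f"gpu{i}_mem_util_percent"] = util.memory
        finally:
            pynvml.nvmlShutdown()
        return metrics

    if __name__ == "__main__":
        for key, value in snapshot().items():
            print(f"{key}: {value}")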
Conclusion
Acceptance testing is indispensable for AI/ML startups delivering top-tier computational resources. A comprehensive and structured approach ensures stable and reliable infrastructure that supports the intended GPU workloads. Companies are encouraged to run acceptance testing on delivered GPU clusters and report any issues for troubleshooting.
Image source: Shutterstock