StreamLineCrypto.com
NVIDIA Run:ai GPU Fractioning Delivers 77% Throughput at Half Allocation

February 18, 2026 (Updated: February 19, 2026) | 3 Mins Read
Darius Baruo
Feb 18, 2026 18:31

NVIDIA and Nebius benchmarks show that GPU fractioning achieves 86% of full user capacity at a 0.5 GPU allocation, enabling 3x more concurrent users for mixed AI workloads.





NVIDIA’s Run:ai platform can deliver 77% of full GPU throughput using just half the hardware allocation, according to joint benchmarking with cloud provider Nebius released February 18. The results show that enterprises running large language model inference can dramatically expand capacity without proportional GPU investment.

The tests, conducted on clusters with 64 NVIDIA H100 NVL GPUs and 32 NVIDIA HGX B200 GPUs, showed fractional GPU scheduling achieving near-linear performance scaling across 0.5, 0.25, and 0.125 allocations.

Hard Numbers from Production Testing

At 0.5 GPU allocation, the system supported 8,768 concurrent users while keeping time-to-first-token under one second, or 86% of the 10,200 users supported at full allocation. Token generation hit 152,694 tokens per second, compared to 198,680 at full capacity.
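The efficiency claim is easy to check with quick arithmetic on the reported figures: half the hardware retaining 77% of throughput means each allocated GPU unit does roughly 1.5x the work.

```python
# Benchmark figures reported above: full vs. half GPU allocation.
full_users, half_users = 10_200, 8_768
full_tps, half_tps = 198_680, 152_694

# Fraction of full-allocation capacity retained at 0.5 GPU.
user_retention = half_users / full_users   # ~0.86
tps_retention = half_tps / full_tps        # ~0.77

# Throughput per unit of GPU actually allocated: retaining 77% of
# throughput on half the hardware is ~1.54x allocation efficiency.
per_gpu_gain = tps_retention / 0.5

print(f"user retention: {user_retention:.0%}")
print(f"throughput retention: {tps_retention:.0%}")
print(f"per-GPU-unit throughput gain: {per_gpu_gain:.2f}x")
```

This is the arithmetic behind the headline: the 77% and 86% figures both fall out of the raw user and token counts.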

Smaller models pushed these gains further. Phi-4-Mini running on 0.25 GPU fractions handled 72% more concurrent users than a full-GPU deployment, reaching roughly 450,000 tokens per second with P95 latency under 300 milliseconds on 32 GPUs.

The mixed workload scenario proved most striking. Running Llama 3.1 8B, Phi-4 Mini, and Qwen-Embeddings concurrently on fractional allocations tripled total concurrent system users compared to single-model deployment. Combined throughput exceeded 350,000 tokens per second at full scale with no cross-model interference.

Why This Matters for GPU Economics

Traditional Kubernetes schedulers allocate entire GPUs to individual models, leaving substantial capacity stranded. The benchmarks noted that even Qwen3-14B, the largest model tested at 14 billion parameters, occupies only 35% of an H100 NVL's 80GB capacity.
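The stranded-capacity point can be made concrete with a back-of-the-envelope sketch, taking the article's 35% occupancy figure at face value (headroom needs for KV cache and activations vary by workload, so the replica count is illustrative):

```python
# Illustrative only: memory stranded by whole-GPU allocation, using the
# article's figure of 35% occupancy for Qwen3-14B on an 80 GB card.
gpu_memory_gb = 80
occupancy = 0.35

used_gb = gpu_memory_gb * occupancy     # 28 GB actually in use
stranded_gb = gpu_memory_gb - used_gb   # 52 GB idle under whole-GPU allocation

# Under fractional scheduling the idle memory could host more replicas
# of the same footprint (ignoring KV-cache growth for simplicity).
extra_replicas = int(stranded_gb // used_gb)

print(f"stranded: {stranded_gb:.0f} GB, room for ~{extra_replicas} more replica(s)")
```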

Run:ai’s scheduler eliminates this waste through dynamic memory allocation. Users specify requirements directly; the system handles resource distribution without preconfiguration. Memory isolation happens at runtime while compute cycles are distributed fairly among active processes.
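In Kubernetes terms, a fractional request is typically expressed on the pod itself rather than through cluster preconfiguration. The sketch below shows what such a pod spec might look like; the annotation key (`gpu-fraction`), scheduler name (`runai-scheduler`), and image are illustrative assumptions, not a verified Run:ai API, so check the Run:ai documentation for the exact fields.

```python
# Hypothetical sketch of a pod spec requesting half a GPU.
# Annotation key, scheduler name, and image are assumptions for
# illustration only, not confirmed Run:ai identifiers.
pod_spec = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "llm-inference",
        # Fractional request: half of one GPU's memory.
        "annotations": {"gpu-fraction": "0.5"},
    },
    "spec": {
        # Hand the pod to the fractional-aware scheduler.
        "schedulerName": "runai-scheduler",
        "containers": [{
            "name": "server",
            "image": "example.com/llama-3.1-8b-server:latest",  # placeholder
        }],
    },
}

print(pod_spec["metadata"]["annotations"])
```

The design point the article makes is that this request lives entirely in the workload spec, so capacity planning moves into configuration rather than hardware partitioning done up front.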

The timing coincides with broader industry moves toward GPU partitioning. SoftBank and AMD announced validation testing on February 16 for similar fractioning capabilities on AMD Instinct GPUs, where a single GPU can be split into up to eight logical devices.

Autoscaling Without Latency Spikes

Nebius tested automatic scaling with Llama 3.1 8B configured to add GPUs when concurrent users exceeded 50. Replicas scaled from 1 to 16 with clean ramp-up, stable utilization during pod warm-up, and negligible HTTP errors.
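The scaling rule described above amounts to a simple ceiling division with a floor of one replica and a cap of sixteen. The function below is a minimal sketch of that policy, not Nebius's implementation; the thresholds mirror the benchmark configuration.

```python
def desired_replicas(concurrent_users: int,
                     users_per_replica: int = 50,
                     max_replicas: int = 16) -> int:
    """Scale out so no replica serves more than `users_per_replica` users.

    Illustrative policy matching the benchmark description: 50 users per
    replica, 1 to 16 replicas. Not an actual Run:ai/Nebius API.
    """
    # Ceiling division without math.ceil: -(-a // b).
    needed = -(-concurrent_users // users_per_replica)
    return max(1, min(needed, max_replicas))

print(desired_replicas(40))    # 1
print(desired_replicas(420))   # 9
print(desired_replicas(5000))  # capped at 16
```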

The practical implication: enterprises can run multiple inference models on existing GPU inventory, scale dynamically during peak demand, and reclaim idle capacity during off-hours for other workloads. For organizations facing fixed GPU budgets, fractioning transforms capacity planning from hardware procurement into software configuration.

Run:ai v2.24 is available now. NVIDIA plans to discuss the Nebius implementation at GTC 2026.

Image source: Shutterstock

