StreamLineCrypto.com
NVIDIA Run:ai GPU Fractioning Delivers 77% Throughput at Half Allocation

February 18, 2026 (Updated: February 19, 2026) | 3 Mins Read
Darius Baruo
Feb 18, 2026 18:31

NVIDIA and Nebius benchmarks show that GPU fractioning achieves 86% of full user capacity at a 0.5 GPU allocation, enabling 3x more concurrent users for mixed AI workloads.





NVIDIA’s Run:ai platform can deliver 77% of full GPU throughput using just half the hardware allocation, according to joint benchmarking with cloud provider Nebius released February 18. The results show that enterprises running large language model inference can dramatically expand capacity without proportional GPU investment.

The tests, conducted on clusters with 64 NVIDIA H100 NVL GPUs and 32 NVIDIA HGX B200 GPUs, showed fractional GPU scheduling achieving near-linear performance scaling across 0.5, 0.25, and 0.125 allocations.

Hard Numbers from Production Testing

At 0.5 GPU allocation, the system supported 8,768 concurrent users while keeping time-to-first-token under one second, or 86% of the 10,200 users supported at full allocation. Token generation hit 152,694 tokens per second, compared to 198,680 at full capacity.
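The efficiency claim is easy to check with quick arithmetic on the reported figures: half the hardware retaining 77% of throughput means each allocated GPU unit does roughly 1.5x the work.

```python
# Benchmark figures reported above: full vs. half GPU allocation.
full_users, half_users = 10_200, 8_768
full_tps, half_tps = 198_680, 152_694

# Fraction of full-allocation capacity retained at 0.5 GPU.
user_retention = half_users / full_users   # ~0.86
tps_retention = half_tps / full_tps        # ~0.77

# Throughput per unit of GPU actually allocated: retaining 77% of
# throughput on half the hardware is ~1.54x allocation efficiency.
per_gpu_gain = tps_retention / 0.5

print(f"user retention: {user_retention:.0%}")
print(f"throughput retention: {tps_retention:.0%}")
print(f"per-GPU-unit throughput gain: {per_gpu_gain:.2f}x")
```

This is the arithmetic behind the headline: the 77% and 86% figures both fall out of the raw user and token counts.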

Smaller models pushed these gains further. Phi-4-Mini running on 0.25 GPU fractions handled 72% more concurrent users than a full-GPU deployment, reaching roughly 450,000 tokens per second with P95 latency under 300 milliseconds on 32 GPUs.

The mixed workload scenario proved most striking. Running Llama 3.1 8B, Phi-4 Mini, and Qwen-Embeddings concurrently on fractional allocations tripled total concurrent system users compared to single-model deployment. Combined throughput exceeded 350,000 tokens per second at full scale with no cross-model interference.

Why This Matters for GPU Economics

Traditional Kubernetes schedulers allocate entire GPUs to individual models, leaving substantial capacity stranded. The benchmarks noted that even Qwen3-14B, the largest model tested at 14 billion parameters, occupies only 35% of an H100 NVL's 80GB capacity.
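The stranded-capacity point can be made concrete with a back-of-the-envelope sketch, taking the article's 35% occupancy figure at face value (headroom needs for KV cache and activations vary by workload, so the replica count is illustrative):

```python
# Illustrative only: memory stranded by whole-GPU allocation, using the
# article's figure of 35% occupancy for Qwen3-14B on an 80 GB card.
gpu_memory_gb = 80
occupancy = 0.35

used_gb = gpu_memory_gb * occupancy     # 28 GB actually in use
stranded_gb = gpu_memory_gb - used_gb   # 52 GB idle under whole-GPU allocation

# Under fractional scheduling the idle memory could host more replicas
# of the same footprint (ignoring KV-cache growth for simplicity).
extra_replicas = int(stranded_gb // used_gb)

print(f"stranded: {stranded_gb:.0f} GB, room for ~{extra_replicas} more replica(s)")
```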

Run:ai’s scheduler eliminates this waste through dynamic memory allocation. Users specify requirements directly; the system handles resource distribution without preconfiguration. Memory isolation happens at runtime while compute cycles are distributed fairly among active processes.
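In Kubernetes terms, a fractional request is typically expressed on the pod itself rather than through cluster preconfiguration. The sketch below shows what such a pod spec might look like; the annotation key (`gpu-fraction`), scheduler name (`runai-scheduler`), and image are illustrative assumptions, not a verified Run:ai API, so check the Run:ai documentation for the exact fields.

```python
# Hypothetical sketch of a pod spec requesting half a GPU.
# Annotation key, scheduler name, and image are assumptions for
# illustration only, not confirmed Run:ai identifiers.
pod_spec = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "llm-inference",
        # Fractional request: half of one GPU's memory.
        "annotations": {"gpu-fraction": "0.5"},
    },
    "spec": {
        # Hand the pod to the fractional-aware scheduler.
        "schedulerName": "runai-scheduler",
        "containers": [{
            "name": "server",
            "image": "example.com/llama-3.1-8b-server:latest",  # placeholder
        }],
    },
}

print(pod_spec["metadata"]["annotations"])
```

The design point the article makes is that this request lives entirely in the workload spec, so capacity planning moves into configuration rather than hardware partitioning done up front.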

The timing coincides with broader industry moves toward GPU partitioning. SoftBank and AMD announced validation testing on February 16 for similar fractioning capabilities on AMD Instinct GPUs, where a single GPU can be split into up to eight logical devices.

Autoscaling Without Latency Spikes

Nebius tested automatic scaling with Llama 3.1 8B configured to add GPUs when concurrent users exceeded 50. Replicas scaled from 1 to 16 with clean ramp-up, stable utilization during pod warm-up, and negligible HTTP errors.
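The scaling rule described above amounts to a simple ceiling division with a floor of one replica and a cap of sixteen. The function below is a minimal sketch of that policy, not Nebius's implementation; the thresholds mirror the benchmark configuration.

```python
def desired_replicas(concurrent_users: int,
                     users_per_replica: int = 50,
                     max_replicas: int = 16) -> int:
    """Scale out so no replica serves more than `users_per_replica` users.

    Illustrative policy matching the benchmark description: 50 users per
    replica, 1 to 16 replicas. Not an actual Run:ai/Nebius API.
    """
    # Ceiling division without math.ceil: -(-a // b).
    needed = -(-concurrent_users // users_per_replica)
    return max(1, min(needed, max_replicas))

print(desired_replicas(40))    # 1
print(desired_replicas(420))   # 9
print(desired_replicas(5000))  # capped at 16
```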

The practical implication: enterprises can run multiple inference models on existing GPU inventory, scale dynamically during peak demand, and reclaim idle capacity during off-hours for other workloads. For organizations facing fixed GPU budgets, fractioning transforms capacity planning from hardware procurement into software configuration.

Run:ai v2.24 is available now. NVIDIA plans to discuss the Nebius implementation at GTC 2026.

Image source: Shutterstock

