Zach Anderson
Apr 18, 2026 00:53
Elon Musk’s xAI releases Grok Speech to Text and Text to Speech APIs, with transcription starting at $0.10/hour, claiming the lowest error rates across enterprise transcription benchmarks.
Elon Musk’s xAI dropped two standalone audio APIs on April 17, positioning Grok’s speech technology as a direct competitor to ElevenLabs, Deepgram, and AssemblyAI at aggressive price points.
The Grok Speech to Text API runs $0.10 per hour for batch processing and $0.20 per hour for real-time streaming. Text to Speech comes in at $4.20 per million characters. Both leverage the same infrastructure powering Tesla vehicles and Starlink customer support.
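To put those rates in perspective, here is a rough cost sketch. Only the three prices come from the figures above; the function names and the sample workloads are illustrative assumptions, and the sketch ignores any minimums, rounding rules, or volume tiers that xAI's actual billing may apply.

```python
# Prices quoted in the article (USD). Everything else is a hypothetical helper.
STT_BATCH_PER_HOUR = 0.10       # Speech to Text, batch processing
STT_STREAM_PER_HOUR = 0.20      # Speech to Text, real-time streaming
TTS_PER_MILLION_CHARS = 4.20    # Text to Speech


def stt_cost(audio_hours: float, streaming: bool = False) -> float:
    """Estimated transcription cost for a given number of audio hours."""
    rate = STT_STREAM_PER_HOUR if streaming else STT_BATCH_PER_HOUR
    return audio_hours * rate


def tts_cost(characters: int) -> float:
    """Estimated synthesis cost for a given number of input characters."""
    return characters / 1_000_000 * TTS_PER_MILLION_CHARS


if __name__ == "__main__":
    # Hypothetical workloads, chosen only to illustrate the arithmetic.
    print(f"1,000 h batch:     ${stt_cost(1000):.2f}")
    print(f"1,000 h streaming: ${stt_cost(1000, streaming=True):.2f}")
    print(f"2.5M chars TTS:    ${tts_cost(2_500_000):.2f}")
```

At these rates, a call center archive of 1,000 audio hours would cost about $100 to transcribe in batch mode, or about $200 if streamed in real time.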
Benchmark Claims Worth Scrutinizing
xAI’s published word error rates tell an interesting story. On phone call entity recognition (think names, account numbers, dates), Grok STT claims a 5.0% error rate versus ElevenLabs at 12.0%, Deepgram at 13.5%, and AssemblyAI at 21.3%. That’s a significant gap if it holds up in production.
The company demonstrated this with a challenging test case: transcribing Welsh names like “Anghared Llewelyn Bowen” and “Oisin MacGiolla Phadraig” alongside mortgage details. Grok nailed it with zero errors. Competing models stumbled on pronunciations and formatted dates inconsistently.
Video and podcast transcription shows tighter competition: Grok and ElevenLabs tied at a 2.4% error rate, with Deepgram and AssemblyAI trailing slightly at 3.0% and 3.2% respectively.
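For readers who want to check such figures against their own audio, the sketch below computes word error rate in its textbook form: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. This is an assumption about the metric, not xAI's published methodology; the company's text normalization, scoring rules, and entity-level metric are not described here and may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Textbook WER: word-level Levenshtein distance / reference length.

    Lowercasing and whitespace splitting are simplifying assumptions;
    real benchmarks usually apply more careful text normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # consumed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    truth = "the loan closes on march third"
    guess = "the loan closes in march third"
    print(f"WER: {word_error_rate(truth, guess):.1%}")
```

Running the same reference transcripts through each vendor and comparing the resulting scores is one way to test whether the published gaps hold on a particular workload.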
Technical Features for Developers
Beyond raw transcription, xAI built in features that enterprise customers actually want: word-level timestamps, speaker diarization across multiple audio channels, and support for 25+ languages with seamless switching.
The Inverse Text Normalization feature automatically converts spoken numbers, dates, and currencies into proper formats. “Four one four five five five one two three four” becomes a phone number. “Six ninety-nine” becomes $6.99. Small detail, but it eliminates post-processing headaches.
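To illustrate what inverse text normalization involves, here is a toy rule-based sketch that handles only the two examples above. It is not how Grok implements the feature, which is not documented here; the function names and the "(414) 555-1234" output format are assumptions made for the example, and production systems handle far more patterns.

```python
from typing import Optional

# Spoken digit words mapped to characters ("oh" is a common spoken zero).
DIGITS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9",
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def normalize_phone(spoken: str) -> Optional[str]:
    """Turn exactly ten spoken digits into a US-style phone number."""
    words = spoken.lower().split()
    if len(words) != 10 or any(w not in DIGITS for w in words):
        return None
    d = "".join(DIGITS[w] for w in words)
    return f"({d[:3]}) {d[3:6]}-{d[6:]}"  # output format is an assumption


def normalize_price(spoken: str) -> Optional[str]:
    """Turn '<digit> <tens>[-<digit>]' such as 'six ninety-nine' into '$6.99'."""
    words = spoken.lower().replace("-", " ").split()
    if len(words) not in (2, 3):
        return None
    if words[0] not in DIGITS or words[1] not in TENS:
        return None
    cents = TENS[words[1]]
    if len(words) == 3:
        if words[2] not in DIGITS:
            return None
        cents += int(DIGITS[words[2]])
    return f"${DIGITS[words[0]]}.{cents:02d}"


if __name__ == "__main__":
    print(normalize_phone("four one four five five five one two three four"))
    print(normalize_price("six ninety-nine"))
```

Even this narrow version shows why the feature matters: without it, every downstream consumer of a transcript has to carry its own parsing rules like these.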
Text to Speech includes inline tags for prosody control: whispers, laughs, sighs, emphasis, pacing adjustments. Developers can inject emotional nuance without wrestling with complex audio markup.
Strategic Context
This launch follows xAI’s acquisition of X Corp in March 2025 and comes as the company expands its infrastructure partnerships. Just two days before the API announcement, reports emerged that xAI plans to supply computing power to Cursor, the AI-powered coding startup.
The Colossus supercomputer, operational since December 2024, provides the backend muscle. xAI appears to be monetizing that capacity across multiple verticals: enterprise AI, developer tools, and now voice APIs.
For developers building voice agents or transcription tools, the pricing undercuts established players significantly. Whether Grok’s accuracy claims survive real-world deployment at scale remains the open question. The documentation and rate limits are available through xAI’s API console for those ready to test it.


