Alibaba has released its latest multimodal AI systems, Qwen3 Omni 30B and Qwen3 Omni Realtime, designed to process text, speech, images, and video through a single unified architecture. According to benchmark data from Artificial Analysis, both models score well above Google's Gemini 2.0 Flash on speech reasoning tasks, though they still trail Google's Gemini 2.5 Flash variants and OpenAI's GPT-4o. The release signals Alibaba's increasingly competitive position in the global AI landscape.
Performance on Speech Reasoning Benchmarks
Artificial Analysis recently evaluated Alibaba's new models against competitors using the Big Bench Audio dataset, which tests multimodal reasoning capabilities across speech understanding tasks.

The benchmark results sort the evaluated systems into a clear order. The scores break down as follows:
- GPT-4o Speech-to-Speech Pipeline leading at 90%
- GPT Realtime following at 83%
- Gemini 2.5 Flash Live at 74%
- Gemini 2.5 Flash Native Audio Dialog at 72%
- Qwen3 Omni 30B Realtime at 59%
- Qwen3 Omni 30B at 58%
- Gemini 2.0 Flash at just 36%
This places Alibaba's models in the lower-middle of the field for speech reasoning: more than 20 points above Google's older Gemini 2.0 Flash, but behind both Gemini 2.5 Flash variants and OpenAI's GPT Realtime and GPT-4o pipeline.
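To make the spread between models easier to see, the reported scores can be ranked and compared against one of the Qwen models. The following is a minimal sketch in Python that uses only the percentages listed above; the variable and function names (`scores`, `gaps_to`) are illustrative and not part of any benchmark tooling.

```python
# Big Bench Audio scores as reported above (percent correct).
scores = {
    "GPT-4o Speech-to-Speech Pipeline": 90,
    "GPT Realtime": 83,
    "Gemini 2.5 Flash Live": 74,
    "Gemini 2.5 Flash Native Audio Dialog": 72,
    "Qwen3 Omni 30B Realtime": 59,
    "Qwen3 Omni 30B": 58,
    "Gemini 2.0 Flash": 36,
}


def gaps_to(reference: str, table: dict[str, int]) -> dict[str, int]:
    """Return every other model's score minus the reference score, in points."""
    base = table[reference]
    return {name: score - base for name, score in table.items() if name != reference}


# Highest score first.
ranking = sorted(scores, key=scores.get, reverse=True)

# Positive values mean the model scored above Qwen3 Omni 30B Realtime.
gaps = gaps_to("Qwen3 Omni 30B Realtime", scores)

for name in ranking:
    print(f"{scores[name]:>3}%  {name}")
```

Read this way, Qwen3 Omni 30B Realtime sits 23 points above Gemini 2.0 Flash and 15 points below Gemini 2.5 Flash Live, with a 31-point gap to the GPT-4o pipeline at the top.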
Architecture and Latency
Alibaba built the system around two Mixture-of-Experts modules: the Thinker MoE handles core reasoning, while the Talker MoE generates natural speech with independent control over style and voice characteristics. This architecture enables native multimodal reasoning across text, audio, and video. Response speed varies notably - Qwen3 Omni 30B averages 4.8 seconds to first audio, while Qwen3 Omni Realtime cuts that down to 0.9 seconds. OpenAI's GPT-4o Realtime remains faster at around 0.6 seconds, though all these systems still lag behind the 0.2–0.3 second human response time.
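The latency figures can be put on a common scale by dividing each time to first audio by the human response range. The sketch below does that arithmetic with the numbers quoted above; the names `latency_s`, `HUMAN_RANGE_S`, and `slowdown_vs_human` are illustrative, and the figures are the article's reported averages rather than new measurements.

```python
# Average time to first audio, in seconds, as quoted above.
latency_s = {
    "Qwen3 Omni 30B": 4.8,
    "Qwen3 Omni Realtime": 0.9,
    "GPT-4o Realtime": 0.6,
}

# Human conversational response time range quoted above, in seconds.
HUMAN_RANGE_S = (0.2, 0.3)


def slowdown_vs_human(seconds: float) -> tuple[float, float]:
    """Return (best case, worst case) slowdown factors relative to a human.

    The best case compares against the slow end of the human range,
    the worst case against the fast end.
    """
    fast, slow = HUMAN_RANGE_S
    return seconds / slow, seconds / fast


# How many times faster the Realtime variant is than the base model.
realtime_speedup = latency_s["Qwen3 Omni 30B"] / latency_s["Qwen3 Omni Realtime"]

for name, seconds in latency_s.items():
    best, worst = slowdown_vs_human(seconds)
    print(f"{name}: {seconds:.1f}s, about {best:.1f}x to {worst:.1f}x human response time")

print(f"Realtime variant speedup over base model: {realtime_speedup:.1f}x")
```

On these figures the Realtime variant responds roughly five times faster than the base model, yet it is still about three to four and a half times slower than a human reply.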
Features and Accessibility
The models support 119 text languages, 19 speech input languages, and 10 speech output languages, offering 17 distinct voices with 24kHz audio output. Developers can access the system through Alibaba Cloud's DashScope API, and the Qwen3 Omni model weights are openly available on Hugging Face and ModelScope under the Apache 2.0 license. This combination of broad language support and open access makes the technology widely available to developers worldwide.