9 months ago

Alibaba Launches Qwen3 Omni Models with Strong Multimodal AI Performance

Alibaba's new Qwen3 Omni models deliver competitive multimodal reasoning capabilities, outperforming Google's Gemini 2.0 Flash in speech benchmarks while working to close the performance gap with OpenAI's GPT-4o.


Contents

Performance on Speech Reasoning Benchmarks
Architecture and Latency
Features and Accessibility

Alibaba has released its latest multimodal AI systems, Qwen3 Omni 30B and Qwen3 Omni Realtime, which process text, speech, images, and video through a single unified architecture. According to benchmark data from Artificial Analysis, both models score well above Google's Gemini 2.0 Flash on speech reasoning tasks, though they still trail Google's Gemini 2.5 Flash variants and OpenAI's GPT-4o-based systems. The release signals Alibaba's increasingly competitive position in the global AI landscape.

Performance on Speech Reasoning Benchmarks

Artificial Analysis recently evaluated Alibaba's new models against competitors using the Big Bench Audio dataset, which tests multimodal reasoning capabilities across speech understanding tasks.

The benchmark results show a clear ordering among the evaluated systems. The scores break down as follows:

GPT-4o Speech-to-Speech Pipeline leading at 90%
GPT Realtime following at 83%
Gemini 2.5 Flash Live at 74%
Gemini 2.5 Flash Native Audio Dialog at 72%
Qwen3 Omni 30B Realtime at 59%
Qwen3 Omni 30B at 58%
Gemini 2.0 Flash at just 36%

This places Alibaba's models in the mid-tier for speech reasoning: well above Google's older Gemini 2.0 Flash, but behind both Gemini 2.5 Flash variants and OpenAI's GPT-4o-based systems.
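To make the ordering and the gaps concrete, the scores listed above can be ranked and compared with a few lines of Python. This is only an illustrative sketch: the numbers are the ones quoted in this article, and the helper names (`scores`, `ranking`, `gap`) are hypothetical, not part of any Artificial Analysis or Qwen tooling.

```python
# Big Bench Audio scores (percent) as quoted in this article.
scores = {
    "GPT-4o Speech-to-Speech Pipeline": 90,
    "GPT Realtime": 83,
    "Gemini 2.5 Flash Live": 74,
    "Gemini 2.5 Flash Native Audio Dialog": 72,
    "Qwen3 Omni 30B Realtime": 59,
    "Qwen3 Omni 30B": 58,
    "Gemini 2.0 Flash": 36,
}


def ranking():
    """Return model names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def gap(model, reference):
    """Percentage-point difference; positive means `model` scores higher."""
    return scores[model] - scores[reference]


if __name__ == "__main__":
    # Print the leaderboard, then two gaps for the Realtime model.
    for rank, name in enumerate(ranking(), start=1):
        print(f"{rank}. {name}: {scores[name]}%")
    print(gap("Qwen3 Omni 30B Realtime", "Gemini 2.0 Flash"))
    print(gap("Qwen3 Omni 30B Realtime", "Gemini 2.5 Flash Live"))
```

Read this way, the Qwen3 Omni models sit 22 to 23 points above Gemini 2.0 Flash and 13 to 16 points below the two Gemini 2.5 Flash variants.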

Architecture and Latency

Alibaba built the system around two Mixture-of-Experts modules: the Thinker MoE handles core reasoning, while the Talker MoE generates natural speech with independent control over style and voice characteristics. This architecture enables native multimodal reasoning across text, audio, and video.

Response speed differs sharply between the two variants: Qwen3 Omni 30B averages 4.8 seconds to first audio, while Qwen3 Omni Realtime cuts that to 0.9 seconds. OpenAI's GPT-4o Realtime remains faster at around 0.6 seconds, though all of these systems are still slower than the 0.2 to 0.3 second response time typical of human conversation.
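The latency figures above can be put side by side with simple arithmetic. The following Python sketch is illustrative only: the numbers are the ones quoted in this article, and the names (`time_to_first_audio`, `ratio_to_human`, `speedup`) are hypothetical helpers, not part of any Qwen or OpenAI API. It takes the upper end of the human range (0.3 seconds) as the reference point, which is an assumption made for the comparison.

```python
# Average time to first audio in seconds, as quoted in this article.
time_to_first_audio = {
    "Qwen3 Omni 30B": 4.8,
    "Qwen3 Omni Realtime": 0.9,
    "GPT-4o Realtime": 0.6,
}

# Assumption: use the slow end of the quoted 0.2 to 0.3 s human range.
HUMAN_REFERENCE_SECONDS = 0.3


def ratio_to_human(model, human=HUMAN_REFERENCE_SECONDS):
    """How many times longer than the human reference the model takes."""
    return time_to_first_audio[model] / human


def speedup(slow, fast):
    """How many times sooner `fast` produces first audio than `slow`."""
    return time_to_first_audio[slow] / time_to_first_audio[fast]


if __name__ == "__main__":
    for name in time_to_first_audio:
        print(f"{name}: {ratio_to_human(name):.1f}x the human reference")
    ratio = speedup("Qwen3 Omni 30B", "Qwen3 Omni Realtime")
    print(f"Realtime vs 30B: {ratio:.1f}x sooner to first audio")
```

By this arithmetic, Qwen3 Omni Realtime reaches first audio more than five times sooner than Qwen3 Omni 30B, yet still takes about three times as long as a 0.3 second human response.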

Features and Accessibility

The models support 119 text languages, 19 speech input languages, and 10 speech output languages, and offer 17 distinct voices with 24 kHz audio output. Developers can access the system through Alibaba Cloud's DashScope API, and the Qwen3 Omni model weights are openly available on Hugging Face and ModelScope under the Apache 2.0 license. Broad language support combined with openly licensed weights puts the models within reach of developers worldwide.


Usman Salis
