The AI landscape is shifting as open-weight large language models demonstrate increasingly competitive performance against proprietary alternatives. Recent Terminal-Bench Hard Benchmark results show that models like DeepSeek V3.2 Exp, GLM-4.6, and Kimi K2 0905 are not just catching up - in some cases, they're matching or exceeding industry leaders in complex coding workflows.
Benchmark Results: Open-Weights Challenge the Leaders
Artificial Analysis recently shared leaderboard data revealing this emerging competition.
The results paint a clear picture of the current state of AI performance:
- Grok 4 from xAI tops the chart at 37.6%
- GPT-5 Codex follows at 35.5%
- Claude 4.5 Sonnet sits at 33.3%
- DeepSeek V3.2 Exp achieved 29.1%, outpacing Gemini 2.5 Pro's 24.8%
- GLM-4.6 reached 23.4%
- Kimi K2 0905 scored 22.7%, demonstrating meaningful progress
- Qwen3 235B managed only 5.7%
The real story lies in the open-weight results: the top performers are rewriting expectations for non-proprietary systems, though not every open effort fared equally well.
Why Developers Should Pay Attention
This shift carries real implications. Developers now have viable alternatives to proprietary APIs, offering deployment flexibility. Open-weight models frequently deliver strong performance at reduced costs, making them attractive for resource-conscious teams and independent builders. Rising capabilities also intensify competition among major providers, potentially accelerating innovation across the ecosystem.
Focus on Agentic Workflows
The Terminal-Bench Hard Benchmark evaluates models on multi-step reasoning tasks involving coding within terminal environments. Performance here reflects a model's ability to handle structured, real-world workflows - capabilities essential for agent applications and automation.
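To make the idea of a multi-step terminal workflow concrete, here is a minimal sketch of the kind of task such a benchmark poses. This is not the actual Terminal-Bench harness; the task, file names, and pass criterion are all hypothetical, and a real evaluation would have the model itself choose each command and recover from failures.

```python
import subprocess
import tempfile

def run_step(command, cwd):
    """Execute one shell command in the task directory and capture its outcome."""
    result = subprocess.run(
        command, shell=True, cwd=cwd,
        capture_output=True, text=True, timeout=30,
    )
    return result.returncode, result.stdout.strip()

# A toy multi-step task: create a data file, process it, and verify the result.
workdir = tempfile.mkdtemp()
steps = [
    "printf 'alpha\\nbeta\\ngamma\\n' > data.txt",  # step 1: set up input
    "grep -c a data.txt > count.txt",               # step 2: count matching lines
    "cat count.txt",                                # step 3: inspect the result
]

outputs = []
for cmd in steps:
    code, out = run_step(cmd, workdir)
    if code != 0:  # a real harness would let the model observe the error and retry
        break
    outputs.append(out)

# All three lines contain "a", so the expected count is 3.
task_passed = outputs[-1] == "3"
print("task passed:", task_passed)
```

Each step depends on the state left behind by the previous one, which is exactly what makes these workflows a stronger test of agentic capability than single-turn code generation.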