Qwen 3.6 27B Struggles on DeepSWE Software Engineering Benchmark

AI-Analyzed: analysis generated by Gemini, reviewed editorially.

Why It Matters

The performance gap between open-weights models and proprietary leaders on complex software engineering tasks suggests local LLMs are falling behind. This highlights the massive compute and architectural advantages held by closed-source developers.

Key Points

  • Qwen 3.6 27B scored a 1.79% success rate on the DeepSWE benchmark, ranking 18th out of 20 models.
  • The model demonstrated surprisingly efficient token usage despite a community reputation for being overly verbose.
  • The test used a single rollout per task on an RTX 6000 Blackwell GPU, served through vLLM, to keep compute costs manageable.
  • Performance results indicate a widening gap between open-weights local models and leading-edge proprietary software agents.

Independent benchmarking of Alibaba's Qwen 3.6 27B model on the DeepSWE software engineering evaluation suite found a 1.79% success rate, placing it near the bottom of the current leaderboard. A community researcher ran the evaluation on an RTX 6000 Blackwell instance served through vLLM; the full run took 70 hours and ranked the model 18th out of 20 tested systems, ahead of Haiku 4.5 and Minimax M2.7. Despite the model's reputation for verbosity, its output token count (about 44k per task on average) remained competitive with peers, though its success rate lagged far behind stronger models such as Kimi-k2.6, which the researcher cited as the leading open-source model. The researcher used a 262k context window and a single rollout per task to limit costs. The results feed an ongoing debate over whether medium-sized open-weights models are viable for autonomous agentic workflows compared with high-parameter proprietary frontier models.
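Because each task was attempted only once, a pass rate this close to zero carries wide statistical uncertainty. The sketch below computes a Wilson score interval for a single-rollout pass rate. The task and solve counts are hypothetical: the source reports only the 1.79% rate, and 2 of 112 is used purely because it rounds to that figure.

```python
import math

def wilson_interval(solved: int, tasks: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate measured with one attempt per task."""
    if tasks <= 0 or not 0 <= solved <= tasks:
        raise ValueError("need tasks > 0 and 0 <= solved <= tasks")
    p = solved / tasks
    denom = 1 + z * z / tasks
    center = (p + z * z / (2 * tasks)) / denom
    half = z * math.sqrt(p * (1 - p) / tasks + z * z / (4 * tasks * tasks)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical counts (not from the source): chosen only because 2/112 rounds to 1.79%.
low, high = wilson_interval(solved=2, tasks=112)
print(f"pass rate {2 / 112:.2%}, 95% interval {low:.2%} to {high:.2%}")
```

Under these assumed counts the interval spans several percentage points, so small differences between neighboring leaderboard positions may not be meaningful.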

A recent independent test of the Qwen 3.6 27B model on a tough software engineering test called DeepSWE showed it only solved about 2% of the problems. Think of it like a local athlete trying to compete in the Olympics; while it beat a couple of other models, it was nowhere near the top performers. The person who ran the test noticed that even though this model is famous for talking too much, it actually stayed on point during the test. The big takeaway is that 'local' AI models you can run yourself are starting to look like the 'poor man's version' of the massive, secret models owned by big tech companies.

Sides

Critics

No critics identified

Defenders

Alibaba Qwen Team

Developers of the Qwen 3.6 27B model, providing open-weights models for the community.

Neutral

u/SteppenAxolotl

Independent researcher who conducted the benchmark and expressed skepticism about the future of local AI models.

Kimi/Moonshot AI

Developer of Kimi-k2.6, cited as the leading open-source model that remains difficult to run locally due to its size.

Noise Level

Buzz: 46
Noise Score (0–100): how loud a controversy is. Composite of reach, engagement, star power, cross-platform spread, polarity, duration, and industry impact, with 7-day decay.
Decay: 99%
Reach: 51
Engagement: 54
Star Power: 15
Duration: 100
Cross-Platform: 20
Polarity: 45
Industry Impact: 60
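The page does not say how the seven components are weighted or how the decay is applied. As an illustration only, the sketch below combines the listed values with an equal-weight mean; the function name and the equal weighting are assumptions, not the site's formula.

```python
def composite_score(components, weights=None):
    """Weighted mean of 0-100 component scores; equal weights when none are given."""
    if weights is None:
        weights = {name: 1.0 for name in components}  # assumption: equal weighting
    total_weight = sum(weights[name] for name in components)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(components[name] * weights[name] for name in components) / total_weight

# Component values as listed on the page.
components = {
    "reach": 51,
    "engagement": 54,
    "star_power": 15,
    "duration": 100,
    "cross_platform": 20,
    "polarity": 45,
    "industry_impact": 60,
}
score = composite_score(components)
print(round(score, 1))
```

The equal-weight mean comes out near 49 rather than the displayed 46, so the actual weights and decay handling presumably differ from this sketch.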

Forecast

AI Analysis: Possible Scenarios

Open-source developers will likely pivot toward specialized fine-tuning or 'MoE' architectures for coding to close the gap. However, proprietary models will likely maintain their lead as software engineering benchmarks demand higher reasoning compute than local hardware currently supports.

Based on current signals. Events may develop differently.

Timeline

Today

u/SteppenAxolotl

Qwen 3.6 27B on DeepSWE

Overview:
  • It scored 2% (1.79% rounded up)
  • It is 18/20th place, scoring above Haiku 4.5 and Minimax M2.7
  • Full benchmark took 70 hours
  • Average time per task: 32m
  • Average output tokens per task: 44k

Perspectives: It scored suspiciously similar to 3.6 Plus and i…
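The reported figures allow a rough back-of-envelope check. This sketch assumes tasks ran one at a time and treats the rounded numbers (70 hours, 32 minutes, 44k tokens) as exact; neither assumption is confirmed by the post, and the function name is illustrative.

```python
def implied_task_runs(total_hours: float, avg_minutes_per_task: float) -> float:
    """Number of task runs implied by total wall time and average time per task."""
    return total_hours * 60 / avg_minutes_per_task

AVG_OUTPUT_TOKENS = 44_000  # "44k" from the post, taken as exact

runs = implied_task_runs(total_hours=70, avg_minutes_per_task=32)
total_output_tokens = runs * AVG_OUTPUT_TOKENS
# Output tokens per wall-clock second, which includes any time spent running tools.
tokens_per_second = AVG_OUTPUT_TOKENS / (32 * 60)

print(f"implied task runs: {runs:.0f}")
print(f"implied total output tokens: {total_output_tokens:,.0f}")
print(f"average output rate: {tokens_per_second:.1f} tokens/s")
```

If tasks ran concurrently, or if the 70 hours includes setup overhead, the implied run count would not equal the benchmark's real task count.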

  1. Benchmark results published

    User u/SteppenAxolotl shares the 70-hour benchmark results of Qwen 3.6 27B on the DeepSWE suite.