Qwen 3.6 27B Struggles on DeepSWE Software Engineering Benchmark
Why It Matters
The performance gap between open-weights models and proprietary leaders on complex software engineering tasks suggests that locally runnable LLMs are falling behind. It also underscores the compute and architectural advantages held by closed-source developers.
Key Points
- Qwen 3.6 27B scored a 1.79% success rate on the DeepSWE benchmark, ranking 18th out of 20 models.
- The model demonstrated surprisingly efficient token usage despite a community reputation for being overly verbose.
- The test ran on an RTX 6000 Blackwell GPU via vLLM and used a single rollout per task to keep compute costs manageable.
- The results suggest a widening gap between open-weights local models and leading-edge proprietary software agents.
Independent benchmarking of Alibaba's Qwen 3.6 27B model on the DeepSWE software engineering evaluation suite produced a 1.79% success rate, placing it near the bottom of the current leaderboard. The evaluation, run by a community researcher on an RTX 6000 Blackwell instance, took 70 hours and ranked the model 18th out of 20 tested systems, slightly ahead of Claude 4.5 Haiku. Despite the model's reputation for verbosity, its output token count remained competitive with peers, though its overall success rate lagged far behind leading models such as Kimi-k2.6, an open-source model that is difficult to run locally because of its size. The researcher used a 262k context window and a single rollout per task to limit costs. The results feed an ongoing debate over whether medium-sized open-weights models are viable for autonomous agentic workflows compared with large proprietary frontier models.
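To make the single-rollout figure concrete, here is a minimal sketch of how a success rate like the one above could be computed when each task is attempted exactly once. The `TaskResult` type, the `success_rate` helper, and the task counts are hypothetical and chosen only for illustration; the source does not state how many tasks DeepSWE contains or how the researcher's harness records outcomes.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one benchmark task (hypothetical structure)."""
    task_id: str
    resolved: bool  # True if the single rollout solved the task


def success_rate(results: list[TaskResult]) -> float:
    """Percentage of tasks resolved, with one attempt per task."""
    if not results:
        raise ValueError("no results to score")
    solved = sum(r.resolved for r in results)
    return 100.0 * solved / len(results)


# Illustrative numbers only: 150 tasks, 3 of them resolved.
results = [TaskResult(f"task-{i}", resolved=(i < 3)) for i in range(150)]
rate = success_rate(results)
print(f"{rate:.2f}%")  # prints "2.00%"
```

With only one rollout per task, each task contributes a single pass/fail outcome, so a low score rests on a handful of solved tasks and one task flipping can shift the percentage noticeably.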
A recent independent test of the Qwen 3.6 27B model on a tough software engineering benchmark called DeepSWE showed it solved only about 2% of the problems. Think of it like a local athlete trying to compete in the Olympics: it beat a couple of other models, but it was nowhere near the top performers. The person who ran the test noticed that even though this model is known for talking too much, it actually stayed on point during the test. The big takeaway is that 'local' AI models you can run yourself are starting to look like the 'poor man's version' of the massive, secret models owned by big tech companies.
Sides
Critics
No critics identified
Defenders
Developers of the Qwen 3.6 27B model, providing open-weights models for the community.
Neutral
Independent researcher who conducted the benchmark and expressed skepticism about the future of local AI models.
Developer of Kimi-k2.6, cited as the leading open-source model that remains difficult to run locally due to its size.
Forecast
Open-source developers will likely pivot toward specialized fine-tuning or 'MoE' architectures for coding to close the gap. However, proprietary models will likely maintain their lead as software engineering benchmarks demand higher reasoning compute than local hardware currently supports.
Based on current signals. Events may develop differently.
Timeline
Benchmark results published
User u/SteppenAxolotl shares the 70-hour benchmark results of Qwen 3.6 27B on the DeepSWE suite.