Qwen 3.5 Chat Template Bug Causes Massive Cache Inefficiency
Why It Matters
This technical oversight leads to significant latency and computational waste in local AI deployments, undermining the efficiency of prefix caching in agentic workflows.
Key Points
- A bug in the Qwen 3.5 chat template causes empty reasoning tags to be emitted, breaking prefix cache reuse.
- The issue affects multiple inference backends, including MLX and llama.cpp, particularly during tool-heavy or agentic workflows.
- Affected users experience unexpected latency spikes where tens of thousands of tokens are reprocessed unnecessarily during follow-up turns.
- The proposed fix involves adding a conditional check to the template to ensure historical blocks only render if reasoning content exists.
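The key points above can be illustrated with a toy serializer. This is a minimal sketch, not the actual Qwen template: the `render` function, the tag names, and the message fields are hypothetical stand-ins for the real Jinja logic, and cache reuse is approximated by a string-prefix check rather than token-level KV-cache matching.

```python
def render(messages, emit_empty_think):
    """Toy stand-in for a Jinja chat template serializer.

    emit_empty_think=True mimics the reported bug: a reasoning
    block is rendered for historical assistant turns even when
    no reasoning content exists.
    """
    out = []
    for m in messages:
        out.append(f"<|im_start|>{m['role']}\n")
        if m["role"] == "assistant":
            think = m.get("reasoning", "")
            # Buggy path: renders <think></think> even when empty.
            if think or emit_empty_think:
                out.append(f"<think>{think}</think>\n")
        out.append(m["content"] + "<|im_end|>\n")
    return "".join(out)

# Turn 1: what the engine actually processed (and cached).
turn1 = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "content": "Let me call the weather tool."},
]
cached = render(turn1, emit_empty_think=False)

# Turn 2: same history plus a follow-up question.
turn2 = turn1 + [{"role": "user", "content": "And tomorrow?"}]
buggy_prompt = render(turn2, emit_empty_think=True)
fixed_prompt = render(turn2, emit_empty_think=False)

# If the new prompt starts with the cached text, the engine can
# reuse the prefix; otherwise everything is reprocessed.
print(buggy_prompt.startswith(cached))  # False -> full reprocess
print(fixed_prompt.startswith(cached))  # True  -> prefix cache hit
```

With the buggy serialization, an empty reasoning block appears inside the replayed history, so the new prompt no longer begins with what the engine already processed; with the guarded serialization, the old prompt is an exact prefix of the new one.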
A significant technical flaw has been identified in the official chat template for the Qwen 3.5 model series, leading to massive cache misses during inference. The issue, discovered by developer onil_gova, stems from the template emitting empty historical reasoning blocks even when no reasoning content is present. This causes 'prompt drift': identical conversation histories are serialized differently across requests, preventing inference engines like llama.cpp and MLX from reusing previously processed tokens. Consequently, follow-up turns after tool-heavy interactions often trigger the reprocessing of tens of thousands of tokens. The developer has proposed a simple one-line logic fix to the Jinja template, ensuring historical blocks are only rendered when they contain actual content. The discovery highlights the critical role of template consistency in maintaining performance for large language model applications.
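The one-line fix described above amounts to wrapping the historical reasoning block in a content check. A hedged sketch of what such a Jinja conditional might look like follows; the variable names (`message.reasoning_content`) and the `<think>` tags are illustrative assumptions, not the template's actual identifiers:

```jinja
{#- Before the fix, the reasoning block rendered unconditionally,
    emitting empty <think></think> tags into replayed history. -#}
{%- if message.reasoning_content -%}
<think>{{ message.reasoning_content }}</think>
{%- endif -%}
{{ message.content }}
```

Guarding the block this way keeps the serialized history byte-identical across requests, which is what prefix caching depends on.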
A developer recently found a 'leak' in how Qwen 3.5 talks to computers that makes it much slower and more expensive to run. Think of it like a chef who insists on re-reading the entire recipe book every time you ask for a second napkin, just because they formatted the bookmark slightly differently. The model's template was adding empty, invisible spaces to its memory, which confused the system into thinking it was seeing new information. By fixing one line of code, the developer stopped this 'prompt drift,' allowing the computer to remember previous parts of the chat instantly instead of wasting time recalculating them.
Sides
Critics
Developer onil_gova, who identified the technical flaw and proposed a one-line code fix to prevent prompt drift and cache misses.
Defenders
No defenders identified
Neutral
Alibaba's Qwen team, creators of the Qwen 3.5 model and the chat template currently under scrutiny for efficiency issues.
Forecast
Alibaba's Qwen team is likely to integrate this template fix into their official Hugging Face repositories and model configs within days to maintain their lead in open-source efficiency benchmarks. Developers of local inference engines will likely add temporary overrides or warnings for Qwen 3.5 users until the official templates are updated.
Based on current signals. Events may develop differently.
Timeline
Investigation begins
Developer onil_gova begins investigating unexplained cache misses on an M5 Max system while using Qwen 3.5.
Root cause identified
The issue is traced back to unnecessary empty blocks in the Jinja chat template, and a fix is shared publicly on Reddit.