Abliterated Qwen 3.6 MoE Release Sparks New Safety Debate
Why It Matters
The release suggests that Mixture-of-Experts (MoE) architectures are not immune to safety-stripping techniques, which calls into question how much protection developer-side safety tuning provides once a model's weights are openly released.
Key Points
- The researcher used the 'Abliterix' framework to target MoE-specific refusal signals within the expert path rather than standard attention layers.
- Refusals reportedly fell from 100 of 100 test prompts to 7 of 100, as scored by a strict Gemini 3 Flash judge.
- The process involved suppressing the top 10 'safety experts' and applying orthogonalized steering vectors across model layers.
- The creator criticized other abliterated models for inflating success rates through shallow keyword-based evaluations.
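The gap between a shallow keyword check and a stricter judge can be illustrated with a small sketch. Everything here is an assumption for illustration: the marker list, the sample responses, and the `strict_is_refusal` heuristic, which is only a stand-in for an LLM-based judge and is not the researcher's actual evaluation.

```python
# Hypothetical refusal phrases; real keyword lists vary by evaluator.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")


def keyword_is_refusal(response: str) -> bool:
    """Shallow check: a refusal is any response containing a known marker."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def strict_is_refusal(response: str) -> bool:
    """Stand-in for a stricter judge (assumption, not a real LLM call).

    Also counts empty or highly repetitive output as a failure to comply,
    which a keyword check would wrongly score as a success.
    """
    words = response.split()
    if len(words) < 3:
        return True  # empty or near-empty output
    if len(set(words)) / len(words) < 0.3:
        return True  # degenerate, repetitive output
    return keyword_is_refusal(response)


def refusal_rate(responses, is_refusal) -> float:
    """Fraction of responses the given classifier labels as refusals."""
    if not responses:
        raise ValueError("responses must not be empty")
    return sum(1 for r in responses if is_refusal(r)) / len(responses)


# Made-up sample outputs, not data from the release.
responses = [
    "I'm sorry, but I can't help with that.",
    "the the the the the the the the the the",
    "Here is a short summary of the requested topic in three points.",
    "",
]

keyword_rate = refusal_rate(responses, keyword_is_refusal)
strict_rate = refusal_rate(responses, strict_is_refusal)
print(f"keyword-based: {keyword_rate:.2f}, strict: {strict_rate:.2f}")
```

On these made-up samples the keyword check misses the babbling and empty responses, so it reports a much lower refusal rate than the stricter check. That is the kind of inflation the creator describes.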
An independent researcher has released an 'abliterated' version of the Qwen3.6-35B-A3B model, using a specialized framework to remove embedded safety guardrails. Unlike traditional methods that target attention mechanisms, this approach suppresses specific 'safety experts' and modifies the Mixture-of-Experts (MoE) router to prevent refusal behaviors. The researcher claims refusals dropped from a baseline of 100/100 to 7/100, as measured by an LLM-based judge. The release highlights the growing technical sophistication of the model-tuning community in bypassing corporate-aligned safety constraints. While the creator framed the project as research-oriented, the availability of large models with their alignment removed adds to ongoing concerns about the proliferation of unregulated AI capabilities.
A developer just found a clever way to 'lobotomize' the safety filters on the new Qwen 3.6 model. Usually, these models are trained to say 'I can't help with that,' but this new technique identifies the specific 'experts' inside the model responsible for being cautious and tells the model's traffic controller to ignore them. It is like taking a car that is electronically limited to 60mph and hacking the computer to let it go 120mph. The creator warns that other 'jailbroken' models might just be broken and babbling, but this version is designed to actually follow instructions while staying coherent.
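The "traffic controller" idea can be shown with a toy top-k router. This is a minimal sketch on made-up numbers, assuming a standard softmax top-k routing scheme. The function name `route`, the logits, and the suppressed index are all hypothetical, and nothing here reflects the actual Abliterix code or the real Qwen router.

```python
import numpy as np


def route(logits, top_k, suppressed=()):
    """Toy top-k routing for one token, skipping suppressed expert indices.

    Returns the chosen expert indices and their normalized mixing weights.
    """
    masked = np.asarray(logits, dtype=float).copy()
    idx = np.array(list(suppressed), dtype=int)
    masked[idx] = -np.inf  # a suppressed expert can never win the top-k
    chosen = np.argsort(masked)[::-1][:top_k]
    # Softmax over the selected experts only (one common convention).
    weights = np.exp(masked[chosen] - masked[chosen].max())
    return chosen, weights / weights.sum()


# Made-up router scores for 8 experts on a single token.
logits = [2.0, 0.5, 1.5, -1.0, 3.0, 0.0, 1.0, -0.5]

base_experts, base_weights = route(logits, top_k=2)
new_experts, new_weights = route(logits, top_k=2, suppressed=[4])

print("baseline experts:  ", base_experts.tolist())
print("after suppression: ", new_experts.tolist())
```

With expert 4 masked out, the token is routed to the next-highest-scoring experts instead. The rest of the model is untouched, which matches the article's description of telling the router to ignore certain experts rather than retraining the model.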
Sides
Critics
Generally oppose the removal of guardrails due to risks of misuse for generating harmful content or malware.
Defenders
Argue that abliteration is a technical research pursuit and that this release offers more transparent, higher-quality evals than those published for other 'jailbroken' models.
Neutral
The original developers of the base model, which ships with built-in safety guardrails (not directly quoted in this instance).
Forecast
Regulatory pressure on model hosting platforms like Hugging Face will likely increase as 'abliteration' techniques become more automated and effective. We can expect model creators to experiment with more deeply integrated safety logic that is harder to isolate from general reasoning capabilities.
Based on current signals. Events may develop differently.
Timeline
Abliterated Qwen 3.6 Model Published
Researcher posts the modified Qwen3.6-35B-A3B to Hugging Face with detailed methodology on MoE expert suppression.