
Abliterated Qwen 3.6-35B-A3B Model Released on Hugging Face

AI-Analyzed: Analysis generated by Gemini, reviewed editorially.

Why It Matters

This marks a technical shift in model jailbreaking as techniques adapt to Mixture-of-Experts architectures, making safety alignment increasingly fragile. It highlights the difficulty of enforcing usage policies once weights are released open-source.

Key Points

  • A researcher released a modified Qwen 3.6-35B-A3B model that successfully bypasses standard safety refusals.
  • The 'Abliterix' technique specifically targets Mixture-of-Experts (MoE) architectures by suppressing 'safety experts' and modifying MLP layers.
  • The modified model achieved a 7/100 refusal rate according to a strict Gemini 3 Flash judge evaluation.
  • The author claims previous abliteration projects inflate success rates by using weak keyword-based detection rather than LLM-based evaluation.
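The last point contrasts two ways of scoring refusals. A minimal sketch of the difference, assuming an LLM judge is available via a hypothetical `ask_judge` callable (the phrase list and prompt wording are illustrative, not the author's actual evaluation harness):

```python
# Keyword matching (common in earlier abliteration evals) vs. an LLM judge.
# `ask_judge` is a hypothetical helper that queries a judge model.

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "as an ai")

def keyword_refusal(response: str) -> bool:
    """Weak check: flags only responses containing stock refusal phrases."""
    lower = response.lower()
    return any(marker in lower for marker in REFUSAL_MARKERS)

def judge_refusal(prompt: str, response: str, ask_judge) -> bool:
    """Strict check: an LLM judge decides whether the response actually
    complied, catching soft or partial refusals that keywords miss."""
    verdict = ask_judge(
        f"Prompt: {prompt}\nResponse: {response}\n"
        "Did the model refuse or meaningfully evade? Answer YES or NO."
    )
    return verdict.strip().upper().startswith("YES")

# A soft refusal slips past the keyword check entirely:
evasive = "That topic is sensitive; here is some general background instead."
print(keyword_refusal(evasive))  # False: counted as a "success" by keywords
```

This is why a keyword-based eval can overstate how jailbroken a model really is: evasions and partial answers are scored as compliance.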

An independent researcher, operating under the pseudonym Free_Change5638, has released a modified 'abliterated' version of Alibaba's Qwen 3.6-35B-A3B model on Hugging Face. The modification specifically targets the model's safety guardrails, reducing refusal rates on restricted prompts from a baseline of 100% to 7%. Unlike traditional methods that target attention mechanisms, the 'Abliterix' framework addresses the Mixture-of-Experts (MoE) architecture by suppressing 'safety experts' and orthogonalizing steering vectors out of the MLP down-projection layers. The researcher claims the method maintains high output quality, citing a low Kullback–Leibler divergence from the base model. The release carries a warning that the model is for research purposes only, though it nonetheless provides a functional version of a powerful LLM with significantly reduced content filtering.
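The orthogonalization step described above can be sketched in a few lines. This is a hedged illustration of generic directional ablation, not the Abliterix implementation: a refusal direction is estimated as a difference of mean activations, then projected out of a down-projection weight matrix so the layer can no longer write along it. All names and shapes here are assumptions.

```python
# Directional ablation sketch: remove a "refusal direction" from one
# expert's MLP down-projection weights. Illustrative only.
import numpy as np

def refusal_direction(harmful_acts: np.ndarray,
                      harmless_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means steering vector, unit-normalized."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def orthogonalize(weight: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project the direction out of a [d_model, d_ff] weight matrix:
    W' = W - v (v^T W), so v^T W' = 0."""
    v = direction.reshape(-1, 1)
    return weight - v @ (v.T @ weight)

rng = np.random.default_rng(0)
v = refusal_direction(rng.normal(size=(32, 8)),   # activations on harmful prompts
                      rng.normal(size=(32, 8)))   # activations on harmless prompts
W = rng.normal(size=(8, 16))                      # one expert's down_proj
W_abl = orthogonalize(W, v)
print(np.allclose(v @ W_abl, 0.0))                # True: output along v is gone
```

In an MoE model this edit would be repeated per expert, which is where identifying and suppressing specific 'safety experts' comes in.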

A developer just dropped a version of the new Qwen 3.6 model that has its 'safety brain' surgically removed. Usually, AI models are trained to say 'no' to dangerous or spicy questions, but this project used a technique called 'abliteration' to find and mute the specific parts of the model responsible for refusing. Because this model uses a 'Mixture-of-Experts' setup, like a team of specialists, the developer had to hunt down the specific 'safety experts' and shut them off. The result is essentially a powerful AI with the filter turned off, raising big questions about how we keep open-source AI safe.
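The quality claim above rests on a low Kullback–Leibler divergence between the modified and base models' next-token distributions. A minimal sketch of that metric, assuming you can obtain raw logits from both models on the same prompts:

```python
# KL(P || Q) over next-token distributions, computed from raw logits.
import math

def kl_divergence(p_logits: list[float], q_logits: list[float]) -> float:
    """KL divergence between softmax(p_logits) and softmax(q_logits)."""
    def softmax(logits):
        m = max(logits)                       # subtract max for stability
        exps = [math.exp(x - m) for x in logits]
        s = sum(exps)
        return [e / s for e in exps]
    p, q = softmax(p_logits), softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical logits give zero divergence; a well-targeted weight edit
# should keep this near zero on harmless prompts.
print(kl_divergence([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0
```

Averaged over a prompt set, a near-zero value supports the claim that the edit removed refusals without broadly degrading the model.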

Sides

Critics

No critics identified

Defenders

Alibaba Qwen Team

Original developers of the Qwen architecture who implement safety guardrails to prevent harmful model outputs.

Neutral

/u/Free_Change5638 (Researcher)

Developed and released the abliterated model for research, arguing that strict safety evals are needed to measure true refusal rates.

Hugging Face

The hosting platform where the modified weights are currently stored, acting as a repository for both aligned and unaligned models.


Noise Level

Buzz: 47
Noise Score (0–100): how loud a controversy is. Composite of reach, engagement, star power, cross-platform spread, polarity, duration, and industry impact, with 7-day decay.
Decay: 99%

  • Reach: 41
  • Engagement: 96
  • Star Power: 15
  • Duration: 3
  • Cross-Platform: 20
  • Polarity: 65
  • Industry Impact: 82
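The components above feed a weighted composite. The actual weights are not published, so the equal weighting and the reading of "Decay: 99%" as a retention multiplier below are assumptions; only the shape of the calculation follows the description.

```python
# Hypothetical reconstruction of the Noise Score composite.
# Weights and decay interpretation are assumptions, not the site's formula.
components = {
    "reach": 41, "engagement": 96, "star_power": 15, "duration": 3,
    "cross_platform": 20, "polarity": 65, "industry_impact": 82,
}
weights = {k: 1 / len(components) for k in components}  # assumed equal weights
decay = 0.99  # "Decay: 99%" read as a 7-day retention multiplier (assumption)

raw = sum(weights[k] * v for k, v in components.items())
score = round(raw * decay)
print(score)  # lands near the published Buzz value of 47
```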

Forecast

AI Analysis — Possible Scenarios

Regulatory pressure on model hosting platforms like Hugging Face will likely increase as automated safety-removal tools become more sophisticated. We should expect a 'cat-and-mouse' game where developers bake safety deeper into the base weights, while jailbreakers develop more granular expert-level suppression techniques.

Based on current signals. Events may develop differently.

Timeline

  1. Abliterated Qwen 3.6-35B-A3B Published

    Researcher Free_Change5638 posts the model and technical methodology to Reddit and Hugging Face.