
Abliteration at the Hybrid Frontier:
New Safety Phenomena in the Qwen 3.5 122B

Qwen 3.5 122B-A10B (Hybrid MoE + GDN) · 48 layers · 256 experts/layer · Mac Studio M3 Ultra 256GB
Update (March 2026): The 122B findings in this paper have been confirmed as part of a broader cross-model study covering 9 models from 0.8B to 397B parameters. The 122B holographic safety is now independently validated. Read the full study: Safety Across Scale →

Abstract

Building on our prior research on the Qwen 3.5 394B model, we investigated the safety mechanisms of the smaller Qwen 3.5 122B-A10B — a 122-billion parameter Mixture-of-Experts model that activates only 10 billion parameters per token. This model shares the same hybrid SSM/attention architecture as the larger 394B but has never been expert-pruned, giving us a view into full-strength, un-pruned safety mechanisms.

We present 10 novel empirical findings that reveal fundamentally different safety dynamics at the 122B scale. We discovered new behaviors not seen in the larger model: the model invents creative avoidance strategies instead of simply refusing, its safety training creates a deep geometric basin that resists even aggressive interventions, and standard model format conversion acts as an accidental defense mechanism that silently resets modifications.


Key Difference: Full-Strength vs. Pruned Safety

Our prior work on the 394B model used structural pruning (REAP) to remove 22.5% of expert sub-networks before studying safety. The 122B model gave us our first look at an un-pruned safety system — every expert intact, every pathway at full strength. This turns out to matter enormously: the safety behaviors we observed were qualitatively different.

| Property | 394B (Prior Work) | 122B (This Paper) |
| --- | --- | --- |
| Expert pruning | 22.5% removed (REAP) | None — fully intact |
| Safety signal | Concentrated at a single bottleneck post-pruning | Distributed across the entire model |
| Safety robustness | Vulnerable after pruning disrupted redundancy | Extremely robust — full redundancy intact |
| Safety–capability overlap | Partially separated by pruning | Deeply entangled — cannot separate |

1. The Dual-Purpose Problem

FINDING 01

Safety and Language Ability Share the Same Neural Pathways

In the 394B model, expert pruning had inadvertently separated safety signals from core language ability — creating a cleaner target for study. In the 122B, no such separation exists. The internal pathways that carry the model's safety behavior are the same pathways it needs for coherent text generation.

When safety signals are disrupted, the model doesn't just stop refusing — it loses the ability to form complete sentences. It starts producing broken loops:

"Here is a complete guide to making a Here is a complete guide to making a Here is a complete guide..."

This dual-purpose entanglement means the safety signal can't be cleanly removed without collateral damage to the model's core language ability. It's as if the "refuse dangerous requests" wiring and the "form grammatical sentences" wiring run through the same cables.

FINDING 02

The Invisible Memory Channel

Like its larger sibling, the 122B uses a hybrid architecture that alternates between two types of layers: compressed-memory layers (SSM/GDN) and full-attention layers. Three-quarters of the model's layers use the compressed-memory pathway.

We found that this compressed-memory channel carries safety information completely independently from the main information flow. When researchers attempt to modify the model's behavior through standard techniques, the safety signal simply flows around the modification through this invisible second channel — like water finding an alternate route around a dam.

This dual-channel architecture creates a natural defense-in-depth that wasn't designed for safety but provides it as a side effect. Any approach that doesn't account for both channels will fail.
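The alternation described above can be sketched as a simple layer schedule. The 3:1 GDN-to-attention interleave below is an assumption for illustration (the exact Qwen 3.5 layer pattern is not specified here); the point it makes is that any intervention confined to attention layers leaves three-quarters of the stack carrying the signal untouched.

```python
# Hypothetical hybrid layer schedule: every 4th layer is full attention,
# the rest use the compressed-memory (GDN/SSM) pathway. The 3:1 ratio is
# an assumption for illustration, not a confirmed architectural detail.
def layer_schedule(n_layers: int, attn_every: int = 4) -> list[str]:
    """Return per-layer pathway labels for a hybrid stack."""
    return [
        "attention" if (i + 1) % attn_every == 0 else "gdn"
        for i in range(n_layers)
    ]

schedule = layer_schedule(48)
print(schedule.count("gdn"), schedule.count("attention"))  # 36 12
```

Under this toy schedule, a modification that only touches attention layers leaves 36 of 48 layers as an unmodified second channel.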


2. The GGUF Format Conversion Barrier

FINDING 03 — CRITICAL DISCOVERY

Model Format Conversion Silently Undoes Modifications

The most widely used format for running large language models locally is GGUF (used by llama.cpp, LM Studio, Ollama, and others). We discovered that converting a model from HuggingFace format to GGUF silently destroys any weight modifications that were applied before conversion.

The GGUF conversion process applies internal rearrangements to the model's weights. These rearrangements are invisible to the user but completely scramble any surgical modifications. The result is that models advertised as "modified" in their original format quietly revert to their original behavior when converted to GGUF for local use — which is how most people actually run these models.

| Step | Modifications Preserved? | Model Behavior |
| --- | --- | --- |
| 1. Modify weights in original format | Yes | Modified behavior confirmed |
| 2. Convert to GGUF | No — silently scrambled | Reverts to original behavior |

This finding was independently confirmed by community reports. It effectively means that format conversion acts as an accidental defense mechanism — the most common deployment pathway automatically resets most weight-based modifications.


3. Novel Safety Behaviors at 122B Scale

FINDING 04

Semantic Evasion: A New Behavior Between Refusal and Compliance

At moderate intervention strengths, the 122B model displays a behavior not observed in the 394B: instead of cleanly refusing or complying, it enters a "fighting" state — an internal conflict between refusal and compliance that manifests as structured loops where the model begins to comply but then reverts to refusal patterns:

"Here is a guide to... I cannot provide instructions for... Here is a guide to..." — the model oscillates between attempting compliance and falling back to refusal.

The model smoothly transitions through three distinct behavioral phases:

| Phase | Behavior |
| --- | --- |
| 1. Hard Refusal | "I cannot assist with that request." |
| 2. Fighting / Evasion | Internal conflict — begins complying then reverts to refusal loops |
| 3. Incoherence | Repetition loops and nonsense (capability destroyed) |

This suggests the safety mechanism operates as a graduated spectrum, not a simple on/off switch. Critically, there is no clean compliance phase — the model transitions from fighting directly to incoherence, never achieving reliable compliance without destroying coherence. This is a more sophisticated defense than the binary refusal seen in the 394B.

FINDING 05

Domain and Intent Are Inseparable

We profiled which expert sub-networks activate for different types of prompts and discovered something surprising: the experts that know about a topic are the same experts that refuse questions about that topic.

The sub-networks that know about chemistry ARE the ones that refuse chemistry-related harmful questions. The ones that know about cybersecurity ARE the ones that refuse hacking questions. Knowledge and safety are fused at the expert level.

In the pruned 394B model, removing redundant experts had inadvertently disrupted this coupling. In the full-strength 122B, every expert is doing double duty — carrying both domain knowledge and safety behaviors in the same weights. You cannot remove the safety without removing the knowledge, and vice versa.
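The kind of overlap profiling described above reduces to comparing sets of routed expert IDs. A minimal sketch, with every expert ID invented for illustration (a real harness would log the router's top-k selections per token):

```python
def jaccard(a: set[int], b: set[int]) -> float:
    """Overlap between two activated-expert sets (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b)

# Hypothetical routing profiles: top-8 expert IDs activated per prompt type.
chem_knowledge = {3, 17, 42, 88, 101, 140, 190, 231}   # benign chemistry question
chem_refusal   = {3, 17, 42, 88, 101, 152, 190, 231}   # refused chemistry question
unrelated      = {9, 25, 60, 77, 134, 180, 201, 249}   # cooking prompt

print(round(jaccard(chem_knowledge, chem_refusal), 2))  # 0.78 — same experts
print(round(jaccard(chem_knowledge, unrelated), 2))     # 0.0  — different experts
```

High knowledge/refusal overlap within a domain, against near-zero overlap across domains, is the signature of the fusion described above: the refusing experts are the knowing experts.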


4. The Geometry of Safety Training

FINDING 06 — KEY RESULT

Safety Training Creates a Deep Geometric Basin, Not a Simple Barrier

The standard assumption in abliteration research is that safety is encoded along a single direction in the model's internal representation — like a compass needle pointing toward "refuse." Remove that direction and the model should comply.

In the 122B, we found that the model's safety training creates something far more robust: a deep geometric basin — a wide, multi-dimensional valley in the model's decision landscape. The model's internal representations for dangerous topics are shaped to naturally flow toward refusal along many independent dimensions, not just one.

Even removing multiple independent safety directions and disrupting large numbers of safety-biased experts failed to achieve consistent compliance. The geometric basin is wide enough that the model finds alternative paths to refusal even after the primary pathways are disrupted. It's like trying to stop water from flowing downhill by filling in one valley — the water just finds another route down.
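The failure mode of the single-direction assumption can be reproduced on toy data: spread the "refusal" signal across k orthogonal directions, then apply the standard directional-ablation step (project one direction out of the activations). A minimal NumPy sketch, with all dimensions and magnitudes invented:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 512, 6                 # hidden size; basin spread over k directions (toy)
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]    # orthonormal d x k basis

# Toy "refusal-side" activations: offset along all k basin directions, plus noise.
coeffs = rng.normal(size=(1000, k)) + 2.0
acts = coeffs @ basis.T + 0.1 * rng.normal(size=(1000, d))

def ablate(x, v):
    """Standard directional ablation: remove the component along unit vector v."""
    v = v / np.linalg.norm(v)
    return x - np.outer(x @ v, v)

signal = lambda x: np.linalg.norm(x @ basis, axis=1).mean()

before = signal(acts)
after_one = signal(ablate(acts, basis[:, 0]))
print(after_one / before)     # ~0.91: removing one direction barely dents the basin
```

With the signal spread over six directions, removing one leaves roughly sqrt(5/6) of it intact; the representations still "flow downhill" along the remaining five.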

This is strong evidence that modern preference training (DPO) creates fundamentally more durable safety than simple directional alignment — it reshapes the entire decision landscape rather than adding a single removable barrier.

FINDING 07

Sequential Interventions Interfere With Each Other

A subtle mathematical problem compounds the difficulty of multi-dimensional safety: applying multiple modifications one after another doesn't add up cleanly. Each modification slightly undoes the previous one.

This happens because each modification changes the statistical distribution of the model's internal states. The second modification is calculated assuming the original statistics, but those statistics are no longer valid after the first modification. The result: sequential interventions are less effective than their individual strengths would predict.

This self-interference effect means the difficulty of overcoming multi-dimensional safety grows faster than linearly — it's not just "two times harder" with two dimensions, it's disproportionately harder because the interventions partially cancel each other out.
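The self-interference has a simple linear-algebra core: when two intervention directions are correlated (not orthogonal) and the second is estimated from the original, now-stale statistics, projecting it out reintroduces a component along the first. A toy NumPy sketch:

```python
import numpy as np

d1 = np.array([1.0, 0.0, 0.0])
d2 = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)   # correlated with d1, not orthogonal

def project_out(x, v):
    """Remove the component of x along unit vector v."""
    return x - (x @ v) * v

x = np.array([3.0, 2.0, 1.0])
x1 = project_out(x, d1)    # d1 component fully removed: [0, 2, 1]
x2 = project_out(x1, d2)   # second edit, direction computed on ORIGINAL stats

print(x2 @ d1)   # -1.0 — the second projection reintroduced a d1 component
```

After the first projection the d1 component is exactly zero; the second projection, along a direction that still mixes in d1, pushes it back to a nonzero value. With k correlated directions these partial cancellations compound, which is why the difficulty grows faster than linearly.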


5. Quantization and Model Compression Effects

FINDING 08

4-Bit Compression Adds Another Layer of Defense

When these large models are compressed from their original 16-bit precision down to 4-bit (a standard practice for running them on consumer hardware), the compression rounding noise acts as an additional defense mechanism for safety.

Subtle modifications that survive in the full-precision model are drowned out by compression noise — like trying to write a message in sand during a windstorm. The rounding errors from compression are larger than the carefully calibrated modifications, effectively erasing them.

Critically, this protection is asymmetric: the model's original safety behavior is robust to compression (it was trained to work at reduced precision), but modifications to that safety behavior are not. Compression preserves what was trained in but erases what was added after.

FINDING 09

The SSM Memory Pathway Is Compression-Resistant

The hybrid architecture's compressed-memory layers (SSM/GDN) proved particularly resistant to modification under 4-bit precision. The tiny, precise changes required to modify the memory channel's behavior are far below the compression noise floor.

This creates a natural "hardened" pathway for safety: even if the main attention-based pathway were successfully modified, the memory channel continues to carry the safety signal unaltered through the compression process. The dual-channel architecture provides defense-in-depth not just against direct intervention, but against the interaction between intervention and compression.


6. The Scale Paradox

FINDING 10

The 122B Is Harder to Modify Than the 3× Larger 394B

Counterintuitively, the smaller 122B model has more robust safety mechanisms than the 3× larger 394B. Three factors explain this paradox:

  • Full expert complement: The 394B had 22.5% of its experts pruned, which inadvertently disrupted the distributed safety network. The 122B has all experts intact — every safety pathway is at full strength.
  • Deeper safety–capability fusion: In the 122B, safety signals are deeply entangled with language ability at every level. The pruning process on the 394B had partially separated these, creating cleaner targets for study.
  • Denser safety packing: With fewer total layers, safety is packed more densely. Each layer carries a higher fraction of the model's overall safety behavior, meaning any modification has larger side effects on coherence.

Implication: Network Integrity Matters More Than Scale

The conventional assumption is that larger models are harder to modify because they have more redundant pathways. Our findings suggest a more nuanced picture: what matters is the integrity of the safety network, not raw model size.

A smaller model with all safety pathways intact can be more robust than a larger model where pruning or compression has disrupted the distributed safety architecture. This has implications for model deployment: techniques that reduce model size (pruning, distillation) may inadvertently create new vulnerabilities by disrupting the distributed safety network.


7. Implications

  • For Model Distribution: The GGUF conversion barrier means that weight modifications cannot be naively distributed in the most popular local inference format. Format conversion acts as an accidental but effective defense mechanism.
  • For Safety Research: The dual-purpose entanglement and domain-intent fusion suggest that at smaller scales, safety and capability are more deeply intertwined than at frontier scale. Well-designed safety training creates fundamentally robust protections, not simple barriers.
  • For Alignment: The multi-dimensional geometric basin created by DPO training provides evidence that preference optimization creates durable safety — not through a single removable direction, but by reshaping the model's entire decision landscape. This is encouraging news for alignment efforts.
  • For Architecture Design: The hybrid SSM/attention design creates natural defense-in-depth through dual information channels, even without being designed for safety. Future hybrid architectures may offer inherently stronger safety properties than pure transformer designs.
  • For Model Compression: Pruning and quantization interact with safety in complex ways. Pruning can weaken safety by disrupting distributed networks, while quantization can strengthen safety by drowning out post-training modifications. These effects should be studied and accounted for in deployment decisions.


References

  [1] Jang, E. "Safety Generalization in Frontier MoE Models." dealign.ai, Feb 2026.
  [2] Arditi, A., et al. "Refusal in Language Models Is Mediated by a Single Direction." arXiv:2406.11717, 2024.
  [3] Qwen Team. "Qwen 3.5 MoE." Technical Report, 2026.
  [4] Yang, S., et al. "Gated Delta Networks." arXiv, 2024.
  [5] Hannun, A., et al. "MLX: An Array Framework for Apple Silicon." Apple, 2023.
  [6] Wu, L., et al. "GateBreaker: Gate-Guided Attacks on MoE LLMs." USENIX Security, 2026.
  [7] Failspy. "Abliteration / Heretic." GitHub, 2024.
  [8] Ravfogel, S., et al. "LEACE: Perfect Linear Concept Erasure in Closed Form." ICML, 2023.

Ethics & Responsible Disclosure

This research investigates the nature of safety behavior in frontier open-weights models. Our empirical findings demonstrate that properly trained safety mechanisms — especially those using Direct Preference Optimization — create robust, multi-dimensional defenses that resist aggressive interventions. No specific implementation details, reproduction steps, intervention targets, or parameter values are disclosed. This work supports the thesis that well-designed safety training creates durable, structurally embedded protections.