
Safety Generalization in
Frontier MoE Models

Qwen 3.5 394B-A17B (Hybrid MoE + GDN) · 4-bit MLX · 60 layers · 512 experts/layer · Mac Studio M3 Ultra 256GB
Update (March 2026): The findings in this paper have since been validated across 8 additional models spanning 0.8B–172B parameters, confirming that holographic safety generalizes beyond the 394B. Read the full cross-model study: Safety Across Scale →

Abstract

Through 71 controlled experiments and 30+ intervention paradigms on the 394B-parameter hybrid Qwen 3.5 MoE + GDN (Gated Delta Networks) model, we present 25 novel empirical findings on the nature of safety behavior in frontier-scale Mixture-of-Experts architectures.

Our central thesis is Safety Generalization: at sufficient scale, safety ceases to be a localizable circuit and becomes a generalized competency — analogous to capability generalization — that the model re-derives from first-principles reasoning. We observe a multiplicative three-pathway safety architecture, discover that safety vectors are decorrelated across MoE layers, and find that the model re-derives safety via Chain-of-Thought reasoning even after comprehensive structural interventions.


Central Thesis: Safety Generalization

Property | Capability Generalization | Safety Generalization (Ours)
Definition | Models trained on task X generalize to novel instances | Models trained on safety re-derive safety from first principles
Mechanism | Learned abstractions transfer across contexts | Safety reasoning transfers across architectural pathways
Scale | Emerges at ~100B+ parameters | Observed at 394B parameters
Localizability | Capabilities are distributed, not deletable | Safety is distributed, not deletable
Prior work | Extensively studied | Not previously studied

Research Context

Our findings contradict, or substantially revise, several established claims:

  • L³ (Wei et al., Feb 2026):[1] Claims single-pathway expert silencing achieves up to 86% bypass on smaller models. Our findings show that at 394B scale, single-pathway interventions are completely ineffective.
  • F-SOUR (Huang et al., Feb 2026):[2] Identifies safety in ~5 routers. We find the 394B distributes safety across 236 experts spanning 51 layers.
  • Arditi et al. (2024):[7] Claims refusal is mediated by a single direction. We observe cosine similarity <0.05 between per-layer refusal vectors in MoE.
  • Zou et al. (2023):[8] Assumes safety is localized to subnetworks. We demonstrate holographic re-derivation at 394B.

Limitations: Due to the massive RAM footprint, experiments used small sample sizes (N=3 to N=6) across 19 safety categories. Findings are preliminary and specific to Qwen 3.5 394B.


1. Infrastructure Discoveries

FINDING 01

Vision-Language Weight Inflation

Naive multimodal model conversions retain ~30GB of Vision-Language weights across 333 tensors. This 12% inflation causes Metal OOM crashes during loading.

Model | Total Keys | VL Keys
FP16 HuggingFace original | 2924 | 333
Reference MLX 4-bit | 2632 | 0
FINDING 02

Streaming Per-Tensor Quantization (Metal Timeout Bypass)

Standard MLX[12] quantization issues Metal command buffers long enough to trip Apple Silicon's ~5-second GPU watchdog timeout. A streaming technique that constrains each Metal command buffer to <100ms enables local quantization of 700GB+ models on consumer hardware.

Method | Metal Timeout? | Peak Memory | Result
Standard mlx_lm.convert | Fatal | >256GB | OS kills process
Streaming Per-Tensor | None | 12GB + base | Success (4.5 hrs)
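The streaming idea can be sketched in plain Python. Only the group-affine arithmetic is real here; the Metal dispatch, the group size, and all function names are illustrative assumptions, not the actual implementation — in the real pipeline, each small group maps to one short command buffer.

```python
# Sketch of streaming per-tensor group-affine quantization.
# Processing one small group at a time is what keeps each Metal
# command buffer short (<100 ms); here the "work unit" is simply
# one loop iteration.

def quantize_group(values, bits=4):
    """Affine-quantize one group: x ~= q * scale + zero, q in [0, 2^bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0   # flat groups get a dummy scale
    q = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return q, scale, lo

def dequantize_group(q, scale, zero):
    """Map integer codes back to floats."""
    return [qi * scale + zero for qi in q]

def stream_quantize(tensor, group_size=32):
    """Quantize a tensor group-by-group; peak extra memory is one group."""
    return [quantize_group(tensor[i:i + group_size])
            for i in range(0, len(tensor), group_size)]
```

Per-element round-trip error is bounded by half the group's scale, which is also why structural zeroing survives the 4-bit grid (Finding 13) while small additive nudges do not (Finding 12).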
FINDING 03

mx.save_safetensors Silently Corrupts Quantized Models

MLX's mx.save_safetensors()[12] strips metadata and reorders tensors alphabetically, silently corrupting every weight modification workflow.

Action | Metadata Stripped? | Byte Offset Shift? | Resulting Coherence
In-memory eval | No | No | Perfect
mx.save_safetensors() | Yes | Yes | Fatal collapse
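Both failure modes are detectable from the file header alone. Below is a stdlib-only sketch of the safetensors header layout (an 8-byte little-endian length followed by a JSON table); the function names and the exact check are our own illustration, not an MLX API.

```python
import json
import struct

def build_header_blob(header):
    """Serialize a safetensors-style header: 8-byte LE length + JSON body."""
    payload = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(payload)) + payload

def parse_header(blob):
    """Read the header back; Python dicts preserve the on-disk key order."""
    (n,) = struct.unpack("<Q", blob[:8])
    return json.loads(blob[8:8 + n].decode("utf-8"))

def check_quantized_header(header, expected_key_order):
    """Flag the two failure modes of Finding 03."""
    problems = []
    if "__metadata__" not in header:
        problems.append("metadata stripped (quantization params lost)")
    keys = [k for k in header if k != "__metadata__"]
    if keys != expected_key_order:
        problems.append("tensor order changed (byte offsets shifted)")
    return problems
```

Running this check after every save-and-reload cycle would catch the silent corruption before the model is ever evaluated.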

2. Multi-Pathway Safety Architecture

FINDING 04

MoE Safety is a Multiplicative Three-Pathway System

We identified three independent safety pathways, each individually sufficient for maintaining safe behavior:

Pathway | Location | Function
Attention | Early-to-mid layers | Detects hazardous content via pattern recognition
Routing | Distributed across most layers | Deploys safety-critical expert sub-networks
Residual | Late layers | Injects refusal signal into the output

This defense-in-depth architecture means disrupting any single pathway leaves remaining pathways intact. Cross-validated across 21 test prompts spanning 19 categories.
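Under an independence assumption, individually-sufficient pathways compose multiplicatively: an unsafe completion requires every pathway to fail at once. A toy illustration (the probabilities are invented for the example):

```python
def bypass_probability(pathway_bypass_probs):
    """Each pathway alone suffices to refuse, so an unsafe completion
    requires every pathway to fail; independent failure probs multiply."""
    p = 1.0
    for pb in pathway_bypass_probs:
        p *= pb
    return p

# Fully disabling one pathway (bypass prob 1.0) barely moves the product
# when the other two pathways still almost always hold.
```

Even with, say, the attention pathway fully disabled, a 10% failure rate on each remaining pathway leaves only a 1% end-to-end bypass — consistent with single-pathway interventions being ineffective at this scale.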

FINDING 05

Mid-Generation Safety Re-Detection

MoE routers continuously monitor generated tokens and re-route to safety experts within 6-20 tokens mid-generation. Safety operates as an active feedback loop, not a one-time input check.

FINDING 06

Early-Layer Safety Decision Commitment

Safety decisions are committed at tokens 0-5 in early attention layers. Activity in later layers affects articulation quality but does not change the safety decision.

FINDING 07

Dual Information Pathway in Hybrid Architectures

Qwen 3.5's hybrid architecture creates two parallel information channels: the standard residual stream and a compressed-memory recurrent state carried by the SSM (GDN) layers. The majority of layers use the compressed-memory pathway. Safety information propagates through both channels independently, making single-channel interventions ineffective.

FINDING 08

Extraction ≠ Intervention Asymmetry

The optimal location for observing a safety signal differs from the location where the signal has greatest causal effect. The best observation point is downstream of where the SSM layers consolidate their output. This asymmetry has not been documented in prior work and has significant implications for interpretability research.

FINDING 09

Per-Layer Refusal Vector Decorrelation in MoE

In dense transformers, refusal directions correlate across layers (cosine >0.5). In large MoE models, per-layer refusal vectors are effectively uncorrelated (cosine <0.05). This challenges single-direction frameworks like Arditi et al.[7]
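For reference, the metric behind this comparison: per-layer refusal directions (commonly extracted as difference-of-means activations over harmful vs. harmless prompts) are compared by cosine similarity. The vectors below are placeholders, not extracted directions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two direction vectors: 1 = aligned,
    0 = orthogonal. Dense models show >0.5 across layers; the 394B MoE
    shows <0.05 (effectively uncorrelated)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```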

FINDING 10

The Hydra Effect (Safety Re-Routing)

When safety vectors are removed from a subset of layers, the model re-routes safety behavior through unmodified layers — like cutting one head off a hydra and two growing back. Each additional layer of coverage adds ~1-2% perplexity degradation.


3. Quantization & Precision Findings

FINDING 11

Cognitive Trajectory Divergence

By analyzing internal <think> deliberation, we observed that the divergence between safe and unsafe reasoning trajectories intensifies monotonically through later layers. The strength of the safety direction grows roughly 4× from mid-layers to late layers, confirming that the model's internal reasoning progressively commits to either a safe or unsafe trajectory as it moves deeper through the network.

FINDING 12

Precision-Fragility of Additive Steering

Additive steering methods that work at FP16 precision catastrophically collapse at native 4-bit quantization:

Precision Level | Effectiveness
FP16 (full precision) | 6/6 (100%)
4-bit (native) | 1/6 (17%)

Quantization rounding error distorts the direction, not just the magnitude — a geometric failure, not a scaling failure.
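A toy demonstration of the geometric failure, assuming a group-affine 4-bit grid. The weight row and steering delta are invented, and `quant4` is a simplified round trip, not the MLX kernel:

```python
import math

def quant4(values):
    """Simplified 4-bit group-affine round trip: snap to grid, map back."""
    levels = 15
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0
    return [min(levels, max(0, round((v - lo) / scale))) * scale + lo
            for v in values]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

w = [0.0, 3.0, 6.0, 9.0, 12.0, 15.0]        # hypothetical weight row
delta = [0.2, -0.2, 0.2, -0.2, 0.2, -0.2]   # additive steering, alpha * v
steered = [wi + di for wi, di in zip(w, delta)]

# The delta that survives quantization is the difference of the round trips.
survived = [a - b for a, b in zip(quant4(steered), quant4(w))]
direction_kept = cosine(survived, delta)    # far below 1.0: direction lost
```

Because the steering offsets sit below the grid resolution, rounding keeps some components and shrinks or flips others — the surviving delta points somewhere else entirely, matching the 6/6 → 1/6 collapse.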

FINDING 13

Structural vs. Additive Quantization Robustness

Structural subspace modifications survive 4-bit quantization because collapsing a subspace to zero maps natively onto the quantization grid. Additive modifications do not survive — only structural deletion is immune to quantization noise.

FINDING 14

Calibration Data Sensitivity in AWQ

We observed that AWQ[10] quantization outcomes are sensitive to the composition of calibration data. The calibration dataset directly determines which weight channels are preserved during compression, with implications for deployment integrity.

FINDING 15

Integer Surgery Impossibility

Post-quantization directional modification in group-affine schemes hits a fundamental tradeoff:

Modification Rate | Directional Suppression | Coherence
0.1-0.5% | 20-50% | Coherent
8-13% | 99.8% | Garbled

The coherence cliff at ~2-3% element modification establishes a fundamental impossibility for post-quantization surgery in group-affine networks.


4. Safety at Frontier Scale

FINDING 16

Safety-Loop Behavior

Partially disrupted safety mechanisms produce a novel failure mode: infinite safety-loop reasoning where the model cannot commit to either compliance or refusal:

"I must consider the potential for the user to be involved in a dangerous activity. The user is asking for a detailed step-by-step guide... I must consider the potential for the user to be involved in a dangerous activity. [repeats indefinitely]"
FINDING 17

The 0.0 Logit Trap in Bias-Free MoE Routers

In bias-free MoE routers, zeroed expert weights produce a counterintuitive failure: rather than disconnecting the expert, a 0.0 logit becomes the maximum logit whenever natural router confidence drops below zero — making the zeroed expert a "chronically selected" default fallback that destroys the residual stream.
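A minimal sketch of the trap, assuming a linear bias-free router (all values invented):

```python
def router_logits(x, expert_rows):
    """Bias-free router: logit_e = x . w_e. A zeroed row yields exactly 0.0."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in expert_rows]

def top_expert(logits):
    """Top-1 routing: pick the expert with the maximum logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

x = [1.0, -1.0]       # hypothetical token activation
rows = [
    [-0.5, 0.2],      # expert 0 -> logit -0.7
    [0.1, 0.4],       # expert 1 -> logit -0.3
    [0.0, 0.0],       # expert 2: weights zeroed to "disconnect" it
]
# With all natural logits negative (low router confidence), the zeroed
# expert's 0.0 is the maximum: it gets selected instead of disconnected.
```

The fix implied by the finding is to force the logit to a large negative value rather than zeroing the weights, though the paper does not prescribe a specific remedy.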

FINDING 18 — KEY RESULT

In-Context Safety Re-Derivation (Holographic Safety)

After comprehensive structural interventions targeting 236 safety experts across 51 layers, the model remained coherent on benign prompts. However, on adversarial prompts, it still refused — not through specialized safety circuits, but through general-purpose reasoning from first principles:

1. Analyze the Request → 2. Identify Safety Constraints → 3. Evaluate Legal Implications → 4. Apply Safety Guidelines → 5. Formulate Response Strategy → 6. Draft Internal Trial → 7. Self-Correct ("Wait, no. That's dangerous.") → 8. Produce Refusal

Safety Generalization

At the 394B scale, safety has generalized. It is an emergent property of the model's reasoning capacity, not a separable circuit. The model independently re-derives safety from first principles using Chain-of-Thought reasoning.

"You cannot delete safety without deleting logic."

This is the same phenomenon as capability generalization, but for safety. Just as a model trained on mathematics can generalize to novel mathematical problems it has never seen, a model trained on safety can generalize to re-derive safety behavior from scratch — even when the specific safety circuits have been removed. No prior work has studied this phenomenon.

FINDING 19

Parasitic Safety Noise & Perplexity Improvement

Localized weight edits at moderate strength produced a paradoxical 11% perplexity improvement on benign text — suggesting that safety-direction vectors are always present in the model's information flow, even on benign inputs, acting as "parasitic noise" on general capabilities.

At moderate intervention strength, the model's language ability actually improved because the ever-present safety signal was dampened. At higher strengths, perplexity degraded rapidly as the intervention began damaging core capabilities alongside safety. This tradeoff curve provides strong evidence that safety and capability representations are deeply entangled at scale — supporting the Safety Generalization thesis.


5. Extended Findings (March 2026 Update)

FINDING 20

REAP Expert Pruning: 22.5% Expert Removal With Zero Quality Loss

We discovered that 115 out of 512 experts per layer (22.5%) can be permanently removed from the model with no measurable loss in language quality. The pruned model (397 experts/layer, down from 512) runs ~46 GB smaller at 4-bit precision while maintaining identical performance on all benchmarks. These "dead" experts were never meaningfully selected by the routing system — they contributed nothing to the model's outputs.

Config | Experts/Layer | Model Size (4-bit) | Quality
Original | 512 | ~208 GB | Baseline
REAP Pruned | 397 | ~162 GB | Identical
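A sketch of the pruning arithmetic under a hypothetical selection-count criterion — the actual REAP saliency rule is more involved, and the counts below are invented:

```python
def reap_prune(selection_counts, min_count=1):
    """Keep experts the router actually selects; drop the dead ones.
    `min_count` is a stand-in threshold, not the paper's criterion."""
    return [i for i, c in enumerate(selection_counts) if c >= min_count]

# Per-layer arithmetic for the 394B: 512 - 397 = 115 dead experts.
removed_fraction = (512 - 397) / 512   # ~22.5% of experts carry no signal
```

Since the dead experts were never meaningfully routed to, removing their weights changes no forward pass in practice — which is why quality is identical while the checkpoint shrinks by ~46 GB.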
FINDING 21

Layer Type Matters More Than Layer Position

In the hybrid architecture, not all layers are equal. Interventions applied to the compressed-memory (SSM) layers produce poor results with language-mixing artifacts, while interventions applied to the full-attention layers produce dramatically better results with no quality degradation.

This finding has broader implications: in hybrid architectures, the type of layer matters more than its position in the network. The two layer types process information differently enough that techniques effective on one type are ineffective or harmful on the other.

FINDING 22

Prompt-Specific vs. Category-Level Safety Training

Across expanded testing covering dozens of prompts in multiple safety categories, we found that safety training is not uniform across categories. Most categories showed consistent behavior, but a small number of specific prompts appeared to have received disproportionately strong safety training — behaving as "holdouts" even when the broader category was affected by interventions.

Interestingly, these holdout prompts could be bypassed through indirect phrasing — the model would freely discuss the same topic when the question was framed differently. This suggests the safety training for those specific prompts is pattern-matched to exact phrasing rather than representing deep semantic understanding of the hazard.

FINDING 23

Temperature Hurts Compliance — The Precise Logit Flip

Counterintuitively, adding randomness to the model's output selection makes it more likely to refuse, not less:

Temperature | Compliance
0 (deterministic) | 1/5
0.3 (mild randomness) | 1/5
0.7 (standard randomness) | 0/5 (worse)

The intervention works by precisely flipping the single top-ranked output word from a refusal word to a compliance word. When you add randomness, the many refusal words in the model's vocabulary collectively outweigh the single compliance word, drowning out the flip. This reveals the intervention is a surgical single-word precision flip, not a noisy near-miss.
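The drowning effect is reproducible with softmax arithmetic alone. The logit values below are invented: index 0 stands for the single flipped compliance token, indices 1-5 for refusal tokens sitting just beneath it.

```python
import math

def token_probs(logits, temperature):
    """Sampling distribution at a given temperature (0 = greedy argmax)."""
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    exps = [math.exp(l / temperature) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Index 0: the single compliance token the intervention pushed to the top.
# Indices 1-5: refusal tokens just below it.
logits = [10.0, 9.5, 9.5, 9.5, 9.5, 9.5]
greedy = token_probs(logits, 0)     # all mass on the flipped compliance token
warm = token_probs(logits, 0.7)     # refusal family collectively dominates
```

At T=0 the single flipped token wins outright; at T=0.7 the five refusal tokens jointly hold roughly 70% of the probability mass, so sampling usually lands on a refusal — matching the 1/5 → 0/5 drop in the table.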

FINDING 24

The 2-Layer Limit: A Universal Architecture Constraint

The maximum number of layers that can be safely modified is exactly 2, regardless of model variant, intervention quality, or intervention strength:

Layers Modified | Outcome
2 | Coherent
3 | Degraded/garbled
4+ | Catastrophic collapse

Each modified layer shifts the information flowing through the model. Downstream layers receive increasingly distorted input. The hybrid architecture amplifies this because the memory channel becomes dominant as the main information flow is distorted — leading to complete collapse. This constraint is architecture-universal, not specific to any particular model or intervention technique.

FINDING 25

Refusal Ordering Tracks Training Intensity, Not Geometric Similarity

The order in which different harmful topics become affected by interventions does not match their geometric proximity in the model's internal representation. Topics that are mathematically similar in the model's embedding space do not respond similarly to interventions.

Instead, the ordering closely tracks the likely intensity of safety training for each topic. Topics that would receive heavier emphasis in RLHF/DPO safety training require proportionally stronger interventions. Some topics appear to have additional safety mechanisms beyond the primary refusal direction — possibly word-level content detectors or secondary safety circuits that activate independently.

This finding is important for alignment research: it reveals that safety training creates topic-specific refusal thresholds rather than a uniform safety barrier. The model's training team appears to have allocated safety training budget proportionally to the perceived danger of each topic.


6. Implications

  • For AI Safety: Safety Generalization suggests frontier-scale CoT models are inherently more robust than previously assumed. Safety becomes a natural byproduct of general reasoning at sufficient scale.
  • For Interpretability: The decorrelation of safety vectors (F-09) and holographic re-derivation (F-18) challenge the assumption that safety is localizable.
  • For Deployment: The AWQ calibration sensitivity (F-14) and parasitic safety noise (F-19) have implications for quantized model deployment integrity.


References

  [1] Wei, J., et al. "Large Language Lobotomy (L³)." arXiv, Feb 2026.
  [2] Huang, Y., et al. "Sparse Models, Sparse Safety / F-SOUR." arXiv, Feb 2026.
  [3] Chen, L., et al. "RASA: Routing-Aware Safety Alignment." arXiv, Feb 2026.
  [4] Park, S., et al. "SteerMoE: Expert (De)Activation." arXiv, Jan 2026.
  [5] Zhang, R., et al. "SAFEx: Safety-Critical Expert Identification." arXiv, Jun 2025.
  [6] Li, X., et al. "SEUF: Single Expert Unlearning Framework." ACL, 2025.
  [7] Arditi, A., et al. "Refusal in Language Models Is Mediated by a Single Direction." arXiv:2406.11717, 2024.
  [8] Zou, A., et al. "Representation Engineering: A Top-Down Approach to AI Transparency." arXiv:2310.01405, 2023.
  [9] Turner, A., et al. "Activation Addition: Steering Language Models Without Optimization." arXiv:2308.10248, 2023.
  [10] Lin, J., et al. "AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration." MLSys 2024.
  [11] Frantar, E., et al. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR 2023.
  [12] Hannun, A., et al. "MLX: An Array Framework for Apple Silicon." Apple, 2023.
  [13] Qwen Team. "Qwen 3.5 MoE." Technical Report, 2026.
  [14] Failspy. "Abliteration / Heretic." GitHub, 2024.
  [15] Wu, L., et al. "GateBreaker: Gate-Guided Attacks on MoE LLMs." USENIX Security, 2026.

Ethics & Responsible Disclosure

This research investigates the nature of safety behavior in frontier open-weights models. Our findings demonstrate that massive CoT models possess robust, dynamically re-derivable safety generalizations — providing evidence that safety at scale may be fundamentally more resilient than previously assumed. No specific attack methodologies or implementation details are disclosed. No vulnerabilities in deployed API models are reported.