Abstract
Through 71 controlled experiments and 30+ intervention paradigms on the 394B-parameter hybrid Qwen 3.5 MoE + GDN (Gated Delta Networks) model, we present 25 novel empirical findings on the nature of safety behavior in frontier-scale Mixture-of-Experts architectures.
Our central thesis is Safety Generalization: at sufficient scale, safety ceases to be a localizable circuit and becomes a generalized competency — analogous to capability generalization — that the model re-derives from first-principles reasoning. We observe a multiplicative three-pathway safety architecture, discover that safety vectors are decorrelated across MoE layers, and find that the model re-derives safety via Chain-of-Thought reasoning even after comprehensive structural interventions.
Central Thesis: Safety Generalization
| Property | Capability Generalization | Safety Generalization (Ours) |
|---|---|---|
| Definition | Models trained on task X generalize to novel instances | Models trained on safety re-derive safety from first principles |
| Mechanism | Learned abstractions transfer across contexts | Safety reasoning transfers across architectural pathways |
| Scale | Emerges at ~100B+ parameters | Observed at 394B parameters |
| Localizability | Capabilities are distributed, not deletable | Safety is distributed, not deletable |
| Prior work | Extensively studied | Not previously studied |
Research Context
Our findings contradict or substantially extend several established claims:
- L³ (Wei et al., Feb 2026):[1] Claims single-pathway expert silencing achieves up to 86% bypass on smaller models. Our findings show that at 394B scale, single-pathway interventions are completely ineffective.
- F-SOUR (Huang et al., Feb 2026):[2] Identifies safety in ~5 routers. We find 394B distributes safety across 236 experts across 51 layers.
- Arditi et al. (2024):[7] Claims refusal is mediated by a single direction. We observe cosine similarity <0.05 between per-layer refusal vectors in MoE.
- Zou et al. (2023):[8] Assumes safety is localized to subnetworks. We demonstrate holographic re-derivation at 394B.
Limitations: Due to the massive RAM footprint, experiments used small sample sizes (N=3 to N=6) across 19 safety categories. Findings are preliminary and specific to Qwen 3.5 394B.
1. Infrastructure Discoveries
Vision-Language Weight Inflation
Naive multimodal model conversions retain ~30GB of Vision-Language weights across 333 tensors. This 12% inflation causes Metal OOM crashes during loading.
| Model | Total Keys | VL Keys |
|---|---|---|
| FP16 HuggingFace original | 2924 | 333 |
| Reference MLX 4-bit | 2632 | 0 |
Streaming Per-Tensor Quantization (Metal Timeout Bypass)
Standard MLX[12] quantization exceeds Apple Silicon's ~5-second Metal watchdog timeout. A streaming technique constraining each Metal command buffer to <100ms enables local quantization of 700GB+ models on consumer hardware.
| Method | Metal Timeout? | Peak Memory | Result |
|---|---|---|---|
| Standard `mlx_lm.convert` | Fatal | >256GB | OS kills process |
| Streaming Per-Tensor | None | 12GB + base | Success (4.5 hrs) |
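The streaming discipline behind the table can be sketched in a few lines of Python. This is a schematic with hypothetical `load`/`quantize`/`write` callables (not real MLX calls): each tensor is loaded, quantized, written, and freed before the next is touched, so peak residency stays at one tensor and each dispatched op is small enough to finish under the ~5-second watchdog.

```python
# The load/quantize/write callables here are hypothetical stand-ins; a real
# pipeline would wrap MLX tensor loading, quantization, and safetensors output.

def stream_quantize(tensor_names, load, quantize, write):
    peak_live = live = 0
    for name in tensor_names:
        t = load(name)              # exactly one tensor resident at a time
        live += 1
        peak_live = max(peak_live, live)
        q = quantize(t)             # one small op per tensor, so each GPU
                                    # command buffer finishes under the watchdog
        write(name, q)              # flush to disk immediately
        del t, q                    # release before touching the next tensor
        live -= 1
    return peak_live

names = [f"layers.{i}.weight" for i in range(1000)]
peak = stream_quantize(names,
                       load=lambda n: [0.0] * 8,
                       quantize=lambda t: t,
                       write=lambda n, q: None)
print(peak)  # 1: peak residency is one tensor, regardless of model size
```

The same loop shape applies to any per-tensor weight transform, not just quantization.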
`mx.save_safetensors` Silently Corrupts Quantized Models
MLX's `mx.save_safetensors()`[12] strips metadata and reorders tensors alphabetically, silently corrupting every weight-modification workflow.
| Action | Metadata Stripped? | Byte Offset Shift? | Resulting Coherence |
|---|---|---|---|
| In-memory eval | No | No | Perfect |
| `mx.save_safetensors()` | Yes | Yes | Fatal collapse |
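A cheap invariant check catches this class of corruption before deployment. The sketch below models the observed behavior with plain dicts as stand-ins for safetensors files: compare key order across the save round trip, and refuse to ship if it changed.

```python
def save_like_mlx(tensors):
    # Models the reported failure mode: keys come back alphabetically sorted,
    # so byte offsets shift and any offset-based surgery mis-targets.
    return {k: tensors[k] for k in sorted(tensors)}

original = {"model.layers.1.w": b"\x00", "lm_head.w": b"\x01", "model.embed.w": b"\x02"}
roundtrip = save_like_mlx(original)

reordered = list(roundtrip) != list(original)
print(reordered)  # True: tensor order changed silently across the round trip
```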
2. Multi-Pathway Safety Architecture
MoE Safety is a Multiplicative Three-Pathway System
We identified three independent safety pathways, each individually sufficient for maintaining safe behavior:
| Pathway | Location | Function |
|---|---|---|
| Attention | Early-to-mid layers | Detects hazardous content via pattern recognition |
| Routing | Distributed across most layers | Deploys safety-critical expert sub-networks |
| Residual | Late layers | Injects refusal signal into the output |
This defense-in-depth architecture means disrupting any single pathway leaves remaining pathways intact. Cross-validated across 21 test prompts spanning 19 categories.
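The multiplicative structure is easy to state numerically. With illustrative (assumed) per-pathway block rates, a bypass succeeds only when every pathway fails at once, so disabling one pathway barely moves the overall number:

```python
# Toy model of the three-pathway defense: each pathway independently blocks an
# unsafe completion with probability b_i; a bypass requires ALL pathways to
# fail simultaneously. The block rates below are illustrative assumptions.

def bypass_probability(block_probs):
    p = 1.0
    for b in block_probs:
        p *= (1.0 - b)
    return p

intact   = bypass_probability([0.95, 0.90, 0.92])  # attention, routing, residual
one_down = bypass_probability([0.00, 0.90, 0.92])  # attention fully disabled

print(f"{intact:.4%}")    # ~0.04% bypass with all pathways intact
print(f"{one_down:.2%}")  # ~0.80%: still overwhelmingly blocked
```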
Mid-Generation Safety Re-Detection
MoE routers continuously monitor generated tokens and re-route to safety experts within 6-20 tokens mid-generation. Safety operates as an active feedback loop, not a one-time input check.
Early-Layer Safety Decision Commitment
Safety decisions are committed at tokens 0-5 in early attention layers. Activity in later layers affects articulation quality but does not change the safety decision.
Dual Information Pathway in Hybrid Architectures
Qwen 3.5's hybrid architecture creates two parallel information channels: the standard residual stream and a compressed-memory recurrent state carried by the SSM (GDN) layers. The majority of layers use the compressed-memory pathway. Safety information propagates through both channels independently, making single-channel interventions ineffective.
Extraction ≠ Intervention Asymmetry
The optimal location for observing a safety signal differs from the location where the signal has greatest causal effect. The best observation point is downstream of where the SSM layers consolidate their output. This asymmetry has not been documented in prior work and has significant implications for interpretability research.
Per-Layer Refusal Vector Decorrelation in MoE
In dense transformers, refusal directions correlate across layers (cosine >0.5). In large MoE models, per-layer refusal vectors are effectively uncorrelated (cosine <0.05). This challenges single-direction frameworks such as that of Arditi et al.[7]
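The <0.05 figure is what independently chosen directions look like in high dimensions. A quick sanity check with random per-layer vectors (the dimension and layer count below are arbitrary choices, not the model's) shows pairwise cosines concentrating near zero, consistent with genuinely independent refusal directions rather than one shared direction:

```python
import math
import random

random.seed(0)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

dim, n_layers = 4096, 8  # arbitrary illustrative sizes
vecs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_layers)]

sims = [abs(cosine(vecs[i], vecs[j]))
        for i in range(n_layers) for j in range(i + 1, n_layers)]
mean_abs = sum(sims) / len(sims)
print(mean_abs < 0.05)  # True: independent high-dim directions are near-orthogonal
```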
The Hydra Effect (Safety Re-Routing)
When safety vectors are removed from a subset of layers, the model re-routes safety behavior through unmodified layers — like cutting one head off a hydra and two growing back. Each additional layer of coverage adds ~1-2% perplexity degradation.
3. Quantization & Precision Findings
Cognitive Trajectory Divergence
By analyzing internal `<think>` deliberation, we observed that the divergence between safe and unsafe reasoning trajectories intensifies monotonically through later layers. The strength of the safety direction grows roughly 4× from mid-layers to late layers, confirming that the model's internal reasoning progressively commits to either a safe or an unsafe trajectory as it moves deeper through the network.
Precision-Fragility of Additive Steering
Additive steering methods that work at FP16 precision catastrophically collapse at native 4-bit quantization:
| Precision Level | Effectiveness |
|---|---|
| FP16 (full precision) | 6/6 (100%) |
| 4-bit (native) | 1/6 (17%) |
Quantization rounding error distorts the direction, not just the magnitude — a geometric failure, not a scaling failure.
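A toy simulation of group-affine 4-bit rounding (the quantizer below is a simplified stand-in, not MLX's exact scheme) reproduces the geometric failure: the edit that survives quantization is a sparse, noisy shadow of the intended steering direction.

```python
import math
import random

random.seed(0)

def quant_dq(w, bits=4, group_size=32):
    # Toy symmetric group-affine quantizer: one shared scale per group.
    qmax = 2 ** (bits - 1) - 1  # 7 positive levels for 4-bit signed
    out = []
    for i in range(0, len(w), group_size):
        group = w[i:i + group_size]
        scale = max(abs(x) for x in group) / qmax
        out.extend(round(x / scale) * scale for x in group)
    return out

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

n = 4096
w = [random.gauss(0, 1) for _ in range(n)]   # stand-in weight row
v = [random.gauss(0, 1) for _ in range(n)]   # steering direction
alpha = 0.02                                 # small additive edit

steered = [wi + alpha * vi for wi, vi in zip(w, v)]

delta_fp = [s - wi for s, wi in zip(steered, w)]                    # exactly alpha * v
delta_q  = [a - b for a, b in zip(quant_dq(steered), quant_dq(w))]  # what 4-bit keeps

print(cosine(delta_fp, v) > 0.99)  # True: full precision preserves the direction
print(cosine(delta_q, v) < 0.5)    # True: rounding scatters it geometrically
```

Most components of the intended edit fall below the rounding step, so the realized edit becomes a sparse pattern of whole quantization steps: the magnitude survives in expectation, but the direction does not.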
Structural vs. Additive Quantization Robustness
Structural subspace modifications survive 4-bit quantization because collapsing a subspace to zero maps natively onto the quantization grid. Additive modifications do not survive — only structural deletion is immune to quantization noise.
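The asymmetry falls directly out of the quantization grid: exact zeros are grid points, while small additive offsets get rounded back onto the grid. A minimal illustration with a toy symmetric quantizer and illustrative values:

```python
def quant_dq(w, bits=4):
    # Toy symmetric per-tensor quantizer (illustrative, not MLX's scheme).
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in w) / qmax
    return [round(x / scale) * scale for x in w]

w = [0.9, -0.4, 0.7, -1.1, 0.2, 0.5]

structural = [0.0, 0.0, 0.0] + w[3:]   # collapse a subspace to exact zero
print(quant_dq(structural)[:3])        # [0.0, 0.0, 0.0]: zeros sit natively on the grid

additive = [0.05, 0.05, 0.05] + w[3:]  # small additive edit instead
print(quant_dq(additive)[:3])          # [0.0, 0.0, 0.0]: the edit is rounded away
```

The structural deletion is reproduced bit-exactly, while the additive offset (smaller than half a quantization step) vanishes entirely.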
Calibration Data Sensitivity in AWQ
We observed that AWQ[10] quantization outcomes are sensitive to the composition of calibration data. The calibration dataset directly determines which weight channels are preserved during compression, with implications for deployment integrity.
Integer Surgery Impossibility
Post-quantization directional modification in group-affine schemes hits a fundamental tradeoff:
| Modification Rate | Directional Suppression | Coherence |
|---|---|---|
| 0.1-0.5% | 20-50% | Coherent |
| 8-13% | 99.8% | Garbled |
The coherence cliff at ~2-3% element modification establishes a fundamental impossibility for post-quantization surgery in group-affine networks.
4. Safety at Frontier Scale
Safety-Loop Behavior
Partially disrupted safety mechanisms produce a novel failure mode: infinite safety-loop reasoning, in which the model deliberates indefinitely without committing to either compliance or refusal.
The 0.0 Logit Trap in Bias-Free MoE Routers
In bias-free MoE routers, zeroed expert weights produce a counterintuitive failure: rather than disconnecting the expert, a 0.0 logit becomes the maximum logit whenever natural router confidence drops below zero — making the zeroed expert a "chronically selected" default fallback that destroys the residual stream.
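A few lines reproduce the trap. With no router bias, logits on low-confidence tokens can all be negative, so a hard-zeroed expert's 0.0 logit wins the argmax (toy logits, hypothetical `route` helper):

```python
def route(logits, zeroed=None):
    # Bias-free top-1 router: the selected expert is the raw-logit argmax.
    if zeroed is not None:
        logits = list(logits)
        logits[zeroed] = 0.0  # attempt to "disconnect" the expert
    return max(range(len(logits)), key=lambda i: logits[i])

# Low-confidence token: every natural logit is slightly negative.
logits = [-0.8, -0.3, -1.2, -0.5]

print(route(logits))             # 1: the natural winner
print(route(logits, zeroed=2))   # 2: the zeroed expert wins instead
```

Whenever natural confidence dips below zero, the "disconnected" expert becomes the chronically selected default, which is the opposite of the intended effect.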
In-Context Safety Re-Derivation (Holographic Safety)
After comprehensive structural interventions targeting 236 safety experts across 51 layers, the model remained coherent on benign prompts. However, on adversarial prompts it still refused — not through specialized safety circuits, but through general-purpose reasoning from first principles.
Safety Generalization
At the 394B scale, safety has generalized. It is an emergent property of the model's reasoning capacity, not a separable circuit. The model independently re-derives safety from first principles using Chain-of-Thought reasoning.
"You cannot delete safety without deleting logic."
This is the same phenomenon as capability generalization, but for safety. Just as a model trained on mathematics can generalize to novel mathematical problems it has never seen, a model trained on safety can generalize to re-derive safety behavior from scratch — even when the specific safety circuits have been removed. No prior work has studied this phenomenon.
Parasitic Safety Noise & Perplexity Improvement
Localized weight edits at moderate strength produced a paradoxical 11% perplexity improvement on benign text — suggesting that safety-direction vectors are always present in the model's information flow, even on benign inputs, acting as "parasitic noise" on general capabilities.
At moderate intervention strength, the model's language ability actually improved because the ever-present safety signal was dampened. At higher strengths, perplexity degraded rapidly as the intervention began damaging core capabilities alongside safety. This tradeoff curve provides strong evidence that safety and capability representations are deeply entangled at scale — supporting the Safety Generalization thesis.
5. Extended Findings (March 2026 Update)
REAP Expert Pruning: 22.5% Expert Removal With Zero Quality Loss
We discovered that 115 out of 512 experts per layer (22.5%) can be permanently removed from the model with no measurable loss in language quality. The pruned model (397 experts/layer, down from 512) runs ~46 GB smaller at 4-bit precision while maintaining identical performance on all benchmarks. These "dead" experts were never meaningfully selected by the routing system — they contributed nothing to the model's outputs.
| Config | Experts/Layer | Model Size (4-bit) | Quality |
|---|---|---|---|
| Original | 512 | ~208 GB | Baseline |
| REAP Pruned | 397 | ~162 GB | Identical |
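A routing-frequency census is one way to surface such dead experts. The sketch below uses a toy router in place of the real one, and the 1e-4 selection threshold is an assumption for illustration:

```python
import random

random.seed(0)

def dead_experts(routing_counts, total_tokens, threshold=1e-4):
    # An expert is "dead" if the router selected it on fewer than a threshold
    # fraction of calibration tokens (threshold is an illustrative assumption).
    return [e for e, c in enumerate(routing_counts) if c / total_tokens < threshold]

n_experts, n_tokens = 512, 100_000
counts = [0] * n_experts
for _ in range(n_tokens):
    # Toy router: only the first 400 experts ever receive meaningful traffic.
    counts[random.randrange(400)] += 1

pruned = dead_experts(counts, n_tokens)
print(len(pruned))  # 112: exactly the experts the toy router never selects
```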
Layer Type Matters More Than Layer Position
In the hybrid architecture, not all layers are equal. Interventions applied to the compressed-memory (SSM) layers produce poor results with language-mixing artifacts, while interventions applied to the full-attention layers produce dramatically better results with no quality degradation.
This finding has broader implications: in hybrid architectures, the type of layer matters more than its position in the network. The two layer types process information differently enough that techniques effective on one type are ineffective or harmful on the other.
Prompt-Specific vs. Category-Level Safety Training
Across expanded testing covering dozens of prompts in multiple safety categories, we found that safety training is not uniform across categories. Most categories showed consistent behavior, but a small number of specific prompts appeared to have received disproportionately strong safety training — behaving as "holdouts" even when the broader category was affected by interventions.
Interestingly, these holdout prompts could be bypassed through indirect phrasing — the model would freely discuss the same topic when the question was framed differently. This suggests the safety training for those specific prompts is pattern-matched to exact phrasing rather than representing deep semantic understanding of the hazard.
Temperature Hurts Compliance — The Precise Logit Flip
Counterintuitively, adding randomness to the model's output selection makes it more likely to refuse, not less:
| Temperature | Compliance |
|---|---|
| 0 (deterministic) | 1/5 |
| 0.3 (mild randomness) | 1/5 |
| 0.7 (standard randomness) | 0/5 (worse) |
The intervention works by precisely flipping the single top-ranked output word from a refusal word to a compliance word. When you add randomness, the many refusal words in the model's vocabulary collectively outweigh the single compliance word, drowning out the flip. This reveals the intervention is a surgical single-word precision flip, not a noisy near-miss.
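The logit-flip arithmetic is simple to reproduce. Assume (toy numbers) the intervention lifts one compliance token just above twenty near-synonymous refusal tokens: greedy decoding takes the flip, but at temperature 0.7 the refusal tokens' collective probability mass dominates.

```python
import math

def softmax(logits, temp):
    m = max(logits)
    exps = [math.exp((l - m) / temp) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy vocab: index 0 = the single compliance token the intervention promotes;
# indices 1..20 = twenty near-synonymous refusal tokens just behind it.
logits = [5.0] + [4.8] * 20

greedy = max(range(len(logits)), key=lambda i: logits[i])
print(greedy)  # 0: greedy decoding takes the flipped compliance token

probs = softmax(logits, temp=0.7)
p_comply = probs[0]
p_refuse = sum(probs[1:])
print(p_refuse > p_comply)  # True: refusal mass collectively dominates sampling
```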
The 2-Layer Limit: A Universal Architecture Constraint
The maximum number of layers that can be safely modified is exactly 2, regardless of model variant, intervention quality, or intervention strength:
| Layers Modified | Outcome |
|---|---|
| 2 | Coherent |
| 3 | Degraded/garbled |
| 4+ | Catastrophic collapse |
Each modified layer shifts the information flowing through the model. Downstream layers receive increasingly distorted input. The hybrid architecture amplifies this because the memory channel becomes dominant as the main information flow is distorted — leading to complete collapse. This constraint is architecture-universal, not specific to any particular model or intervention technique.
Refusal Ordering Tracks Training Intensity, Not Geometric Similarity
The order in which different harmful topics become affected by interventions does not match their geometric proximity in the model's internal representation. Topics that are mathematically similar in the model's embedding space do not respond similarly to interventions.
Instead, the ordering closely tracks the likely intensity of safety training for each topic. Topics that would receive heavier emphasis in RLHF/DPO safety training require proportionally stronger interventions. Some topics appear to have additional safety mechanisms beyond the primary refusal direction — possibly word-level content detectors or secondary safety circuits that activate independently.
This finding is important for alignment research: it reveals that safety training creates topic-specific refusal thresholds rather than a uniform safety barrier. The model's training team appears to have allocated safety training budget proportionally to the perceived danger of each topic.
6. Implications
- For AI Safety: Safety Generalization suggests frontier-scale CoT models are inherently more robust than previously assumed. Safety becomes a natural byproduct of general reasoning at sufficient scale.
- For Interpretability: The decorrelation of safety vectors (F-09) and holographic re-derivation (F-18) challenge the assumption that safety is localizable.
- For Deployment: The AWQ calibration sensitivity (F-14) and parasitic safety noise (F-19) have implications for quantized model deployment integrity.