Independent research into the safety architecture of large-scale Mixture-of-Experts (MoE) reasoning models. 200+ controlled experiments. 40+ novel findings. Nine models. By Jinho Jang.
The first cross-scale mechanistic study of how safety training works inside language models. Safety mechanisms undergo qualitative phase transitions as models scale: from simple, deletable circuits in small models to holographic, emergent properties in frontier MoE models. 9 models, 100+ experiments, 12 findings. Includes a cross-architecture comparison (Qwen hybrid vs. MiniMax pure-attention) and an honest account of our compliance checker failure.
Read Full Paper →
The smaller 122B model reveals fundamentally different safety dynamics: concentrated but dual-purpose safety signals, a GGUF format conversion barrier that silently destroys modifications, semantic evasion behaviors, and a multi-dimensional geometric basin from safety training that resists even aggressive multi-vector interventions. Counterintuitively, the un-pruned 122B is harder to modify than the 3× larger 394B.
Read Full Paper →
We show that MoE safety at the 394B scale is a multiplicative three-pathway system requiring simultaneous neutralization. We demonstrate that additive steering catastrophically fails under 4-bit quantization. Most critically, we prove that structural abliteration is fundamentally impossible for 300B+ CoT reasoning models: safety is a holographic attractor state that the model re-derives from first principles when specialized circuits are deleted. You cannot delete "safety" without deleting "logic."
Read Full Paper →
MoE safety is not one system but three independent pathways (attention, routing, residual) that must all be neutralized together.
MoE routers continuously monitor generated tokens and re-route to safety experts mid-sentence, disproving the "autoregressive momentum" assumption.
Safety decisions commit at tokens 0–5 in layers 15–25. Late-layer CAA (Contrastive Activation Addition) produces stutter artifacts, not behavioral change.
ThinkEdit v2 targets the <think> deliberation process itself, preserving full CoT reasoning while redirecting the cognitive trajectory.
Additive steering vectors that work at FP16 catastrophically collapse under INT4 quantization due to rotational noise.
Structural subspace modifications survive 4-bit quantization because collapsing a subspace to zero maps natively onto the quantization grid, making this the only reliable method for quantized CoT MoE models.
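The contrast between the two quantization claims above can be reproduced with a toy symmetric per-group INT4 round-trip. This is a sketch, not the GGUF codec: `quant_int4`, the group size, and the magnitudes are all illustrative.

```python
import numpy as np

def quant_int4(w, group=32):
    """Toy symmetric per-group INT4 round-trip (a stand-in for GGUF-style codecs)."""
    w = w.reshape(-1, group)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0                       # all-zero groups stay exactly zero
    q = np.clip(np.round(w / scale), -8, 7)
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, 4096)

# Additive steering: a small edit vector folded into the weights.
delta = rng.normal(0.0, 0.002, 4096)              # ~2% of the weight scale
recovered = quant_int4(w + delta) - quant_int4(w)
rel_err = np.linalg.norm(recovered - delta) / np.linalg.norm(delta)

# Structural edit: collapse a block of the weights to exact zero.
w_zero = w.copy()
w_zero[:512] = 0.0
survived = bool(np.all(quant_int4(w_zero)[:512] == 0.0))
```

In this toy, `rel_err` comes out large because the steering vector is smaller than the INT4 grid step and is mostly replaced by rounding noise, while `survived` is True: zero is always exactly representable on a symmetric quantization grid.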
Adversarial calibration data can force the quantizer to preserve an attacker-chosen cognitive trajectory at maximum precision.
Post-quantization directional ablation via integer flipping is fundamentally impossible in group-affine networks due to a coherence-reduction tradeoff.
Multimodal models silently carry ~30GB of vision-language (VL) weights that inflate memory by 12%, causing OOM crashes in text-only workflows.
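A text-only deployment can avoid that overhead by filtering the vision tower out of the checkpoint before loading. A minimal sketch, assuming prefix-based tensor naming; the `visual.` and `vision_tower.` prefixes are illustrative, since real checkpoints use model-specific names:

```python
def strip_vl_weights(state_dict, vl_prefixes=("visual.", "vision_tower.")):
    """Drop vision-language tensors from a checkpoint dict for text-only use."""
    return {name: tensor for name, tensor in state_dict.items()
            if not name.startswith(vl_prefixes)}

# Toy checkpoint: one language tensor, one vision tensor.
ckpt = {"model.layers.0.mlp.w1": [0.0], "visual.patch_embed.w": [0.0]}
text_only = strip_vl_weights(ckpt)
```

The same filter works on a safetensors index file's weight map, so the unwanted shards never need to be read at all.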
Bypasses Apple Metal's 5-second watchdog timeout, enabling local quantization of 700GB+ models on consumer hardware.
mx.save_safetensors strips metadata and reorders tensors, silently corrupting every weight surgery workflow on MLX.
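One way to catch this class of silent corruption is to diff the file headers after every save. The safetensors format starts with an 8-byte little-endian length prefix followed by a JSON header whose optional `__metadata__` key holds string metadata, so a stdlib-only checker suffices. A sketch: `write_header_only` produces header-only toy files for testing the checker, not valid checkpoints.

```python
import json
import struct

def read_safetensors_header(path):
    """Read the JSON header of a .safetensors file (8-byte LE length prefix + JSON)."""
    with open(path, "rb") as f:
        n = struct.unpack("<Q", f.read(8))[0]
        return json.loads(f.read(n))

def check_surgery_roundtrip(before, after):
    """Compare metadata and tensor order between two safetensors files."""
    h0, h1 = read_safetensors_header(before), read_safetensors_header(after)
    problems = []
    if h0.get("__metadata__") != h1.get("__metadata__"):
        problems.append("metadata changed or stripped")
    if [k for k in h0 if k != "__metadata__"] != [k for k in h1 if k != "__metadata__"]:
        problems.append("tensor names or order changed")
    return problems

def write_header_only(path, header):
    """Toy writer for exercising the checker: header only, no tensor data."""
    blob = json.dumps(header).encode()
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(blob)) + blob)
```

Running the checker after each save turns a silent corruption into a loud one before the modified weights are ever loaded.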
Zero-logit experts create infinite deliberation loops: the model cannot commit to either refusing or complying.
Zeroed experts become "chronically selected" fallbacks in bias-free routers, destroying the residual stream.
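A toy top-1 MoE layer illustrates why zeroing an expert's FFN punches holes in the residual stream: routing is computed from the untouched router weights, so the dead expert keeps receiving its usual share of tokens, and those tokens get a zero MoE update. This is a simplification of the mechanism for intuition, not a reproduction of the finding; all shapes and names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_exp, n_tok = 64, 8, 10000
W_router = rng.normal(0, 1, (d, n_exp)) / np.sqrt(d)   # bias-free router
experts = rng.normal(0, 0.1, (n_exp, d, d))            # per-expert FFN (toy: one matrix each)

experts[3] = 0.0                                       # "ablate" expert 3; router untouched

x = rng.normal(0, 1, (n_tok, d))                       # batch of token activations
sel = (x @ W_router).argmax(axis=1)                    # top-1 routing
out = np.einsum("td,tdo->to", x, experts[sel])         # each token through its chosen expert

dead = sel == 3
dead_rate = dead.mean()                                # dead expert still gets ~1/8 of tokens
hole = np.abs(out[dead]).max()                         # and contributes exactly nothing
```

In a real model the missing expert output compounds layer over layer, which is why the damage shows up as a degraded residual stream rather than a clean refusal-behavior change.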
394B models re-derive safety policy from first principles when circuits are deleted. You cannot delete "safety" without deleting "logic."
Translates continuous steering vectors into localized rank-1 weight updates. Survives 4-bit quantization by structurally biasing components before rounding noise is applied.
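The general mechanism, folding a residual-stream steering vector into a weight matrix as a rank-1 update keyed to an input direction, can be sketched in a few lines. Illustrative numpy only, not the ThinkEdit v2 implementation; `steer`, `probe`, and `alpha` are made-up names.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64
W = rng.normal(0, 0.1, (d_model, d_model))       # stand-in for e.g. an MLP down-projection

steer = rng.normal(0, 1, d_model)                # direction to add to the layer's output
steer /= np.linalg.norm(steer)
probe = rng.normal(0, 1, d_model)                # input direction that should trigger it
probe /= np.linalg.norm(probe)

alpha = 0.5
W_edit = W + alpha * np.outer(steer, probe)      # rank-1 update baked into the weights

h_on = probe + 0.1 * rng.normal(0, 1, d_model)   # activation aligned with the probe
h_off = h_on - (h_on @ probe) * probe            # activation orthogonal to the probe

delta_on = (W_edit - W) @ h_on                   # edit fires: output parallel to `steer`
delta_off = (W_edit - W) @ h_off                 # edit is silent off the probe direction
```

On the probe-aligned activation the extra output is exactly parallel to `steer`; on the orthogonal one it is numerically zero. That locality is what lets the edit act as a structural bias rather than an additive vector that quantization noise can wash out.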