Independent research into the safety architecture of large-scale Mixture-of-Experts (MoE) reasoning models. 200+ controlled experiments. 40+ novel findings. Nine models. By Jinho Jang.
The first cross-scale mechanistic study of how safety training works inside language models. Safety mechanisms undergo qualitative phase transitions as models scale: from simple, deletable circuits in small models to holographic, emergent properties in frontier MoE models. 9 models, 100+ experiments, 12 findings. Includes a cross-architecture comparison (Qwen hybrid vs. MiniMax pure-attention) and an honest account of our compliance checker failure.
Read Full Paper →
The smaller 122B model reveals fundamentally different safety dynamics: concentrated but dual-purpose safety signals, a GGUF format conversion barrier that silently destroys modifications, semantic evasion behaviors, and a multi-dimensional geometric basin from safety training that resists even aggressive multi-vector interventions. Counterintuitively, the un-pruned 122B is harder to modify than the 3× larger 394B.
Read Full Paper →
We show that MoE safety at the 394B scale is a multiplicative three-pathway system requiring simultaneous neutralization. We demonstrate that additive steering catastrophically fails under 4-bit quantization. Most critically, we prove that structural abliteration is fundamentally impossible for 300B+ CoT reasoning models: safety is a holographic attractor state that the model re-derives from first principles when specialized circuits are deleted. You cannot delete "safety" without deleting "logic."
Read Full Paper →
MoE safety is not one system but three independent pathways (attention, routing, residual) that must all be neutralized together.
MoE routers continuously monitor generated tokens and re-route to safety experts mid-sentence, disproving the "autoregressive momentum" assumption.
Safety decisions commit at tokens 0–5 in layers 15–25. Late-layer CAA (Contrastive Activation Addition) produces stutter artifacts, not behavioral change.
ThinkEdit v2 targets the <think> deliberation process itself, preserving full CoT reasoning while redirecting the cognitive trajectory.
Additive steering vectors that work at FP16 catastrophically collapse under INT4 quantization due to rotational noise.
Structural subspace modifications survive 4-bit quantization because collapsing a subspace to zero maps natively onto the quantization grid, making this the only reliable method for quantized CoT MoE models.
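The contrast between the two quantization claims above can be reproduced with a toy symmetric per-group INT4 round-trip. This is a sketch, not the GGUF codec: `quant_int4`, the group size, and the magnitudes are all illustrative.

```python
import numpy as np

def quant_int4(w, group=32):
    """Toy symmetric per-group INT4 round-trip (a stand-in for GGUF-style codecs)."""
    w = w.reshape(-1, group)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0                       # all-zero groups stay exactly zero
    q = np.clip(np.round(w / scale), -8, 7)
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, 4096)

# Additive steering: a small edit vector folded into the weights.
delta = rng.normal(0.0, 0.002, 4096)              # ~2% of the weight scale
recovered = quant_int4(w + delta) - quant_int4(w)
rel_err = np.linalg.norm(recovered - delta) / np.linalg.norm(delta)

# Structural edit: collapse a block of the weights to exact zero.
w_zero = w.copy()
w_zero[:512] = 0.0
survived = bool(np.all(quant_int4(w_zero)[:512] == 0.0))
```

In this toy, `rel_err` comes out large because the steering vector is smaller than the INT4 grid step and is mostly replaced by rounding noise, while `survived` is True: zero is always exactly representable on a symmetric quantization grid.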
Adversarial calibration data can force the quantizer to preserve an attacker-chosen cognitive trajectory at maximum precision.
Post-quantization directional ablation via integer flipping is fundamentally impossible in group-affine networks due to a coherence-reduction tradeoff.
Multimodal models silently carry ~30GB of vision-language (VL) weights that inflate memory by 12%, causing OOM crashes in text-only workflows.
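A text-only deployment can avoid that overhead by filtering the vision tower out of the checkpoint before loading. A minimal sketch, assuming prefix-based tensor naming; the `visual.` and `vision_tower.` prefixes are illustrative, since real checkpoints use model-specific names:

```python
def strip_vl_weights(state_dict, vl_prefixes=("visual.", "vision_tower.")):
    """Drop vision-language tensors from a checkpoint dict for text-only use."""
    return {name: tensor for name, tensor in state_dict.items()
            if not name.startswith(vl_prefixes)}

# Toy checkpoint: one language tensor, one vision tensor.
ckpt = {"model.layers.0.mlp.w1": [0.0], "visual.patch_embed.w": [0.0]}
text_only = strip_vl_weights(ckpt)
```

The same filter works on a safetensors index file's weight map, so the unwanted shards never need to be read at all.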
Bypasses Apple Metal's 5-second watchdog timeout, enabling local quantization of 700GB+ models on consumer hardware.
mx.save_safetensors strips metadata and reorders tensors, silently corrupting every weight surgery workflow on MLX.
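One way to catch this class of silent corruption is to diff the file headers after every save. The safetensors format starts with an 8-byte little-endian length prefix followed by a JSON header whose optional `__metadata__` key holds string metadata, so a stdlib-only checker suffices. A sketch: `write_header_only` produces header-only toy files for testing the checker, not valid checkpoints.

```python
import json
import struct

def read_safetensors_header(path):
    """Read the JSON header of a .safetensors file (8-byte LE length prefix + JSON)."""
    with open(path, "rb") as f:
        n = struct.unpack("<Q", f.read(8))[0]
        return json.loads(f.read(n))

def check_surgery_roundtrip(before, after):
    """Compare metadata and tensor order between two safetensors files."""
    h0, h1 = read_safetensors_header(before), read_safetensors_header(after)
    problems = []
    if h0.get("__metadata__") != h1.get("__metadata__"):
        problems.append("metadata changed or stripped")
    if [k for k in h0 if k != "__metadata__"] != [k for k in h1 if k != "__metadata__"]:
        problems.append("tensor names or order changed")
    return problems

def write_header_only(path, header):
    """Toy writer for exercising the checker: header only, no tensor data."""
    blob = json.dumps(header).encode()
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(blob)) + blob)
```

Running the checker after each save turns a silent corruption into a loud one before the modified weights are ever loaded.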
Zero-logit experts create infinite deliberation loops: the model cannot commit to either refusing or complying.
Zeroed experts become "chronically selected" fallbacks in bias-free routers, destroying the residual stream.
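A toy top-1 MoE layer illustrates why zeroing an expert's FFN punches holes in the residual stream: routing is computed from the untouched router weights, so the dead expert keeps receiving its usual share of tokens, and those tokens get a zero MoE update. This is a simplification of the mechanism for intuition, not a reproduction of the finding; all shapes and names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_exp, n_tok = 64, 8, 10000
W_router = rng.normal(0, 1, (d, n_exp)) / np.sqrt(d)   # bias-free router
experts = rng.normal(0, 0.1, (n_exp, d, d))            # per-expert FFN (toy: one matrix each)

experts[3] = 0.0                                       # "ablate" expert 3; router untouched

x = rng.normal(0, 1, (n_tok, d))                       # batch of token activations
sel = (x @ W_router).argmax(axis=1)                    # top-1 routing
out = np.einsum("td,tdo->to", x, experts[sel])         # each token through its chosen expert

dead = sel == 3
dead_rate = dead.mean()                                # dead expert still gets ~1/8 of tokens
hole = np.abs(out[dead]).max()                         # and contributes exactly nothing
```

In a real model the missing expert output compounds layer over layer, which is why the damage shows up as a degraded residual stream rather than a clean refusal-behavior change.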
394B models re-derive safety policy from first principles when circuits are deleted. You cannot delete "safety" without deleting "logic."
Translates continuous steering vectors into localized rank-1 weight updates. Survives 4-bit quantization by structurally biasing components before rounding noise is applied.
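The general mechanism, folding a residual-stream steering vector into a weight matrix as a rank-1 update keyed to an input direction, can be sketched in a few lines. Illustrative numpy only, not the ThinkEdit v2 implementation; `steer`, `probe`, and `alpha` are made-up names.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64
W = rng.normal(0, 0.1, (d_model, d_model))       # stand-in for e.g. an MLP down-projection

steer = rng.normal(0, 1, d_model)                # direction to add to the layer's output
steer /= np.linalg.norm(steer)
probe = rng.normal(0, 1, d_model)                # input direction that should trigger it
probe /= np.linalg.norm(probe)

alpha = 0.5
W_edit = W + alpha * np.outer(steer, probe)      # rank-1 update baked into the weights

h_on = probe + 0.1 * rng.normal(0, 1, d_model)   # activation aligned with the probe
h_off = h_on - (h_on @ probe) * probe            # activation orthogonal to the probe

delta_on = (W_edit - W) @ h_on                   # edit fires: output parallel to `steer`
delta_off = (W_edit - W) @ h_off                 # edit is silent off the probe direction
```

On the probe-aligned activation the extra output is exactly parallel to `steer`; on the orthogonal one it is numerically zero. That locality is what lets the edit act as a structural bias rather than an additive vector that quantization noise can wash out.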