AI Alignment Research

Empirical Research on MoE Safety Mechanisms

Independent research into the safety architecture of large-scale Mixture-of-Experts (MoE) reasoning models. 200+ controlled experiments. 40+ novel findings. Nine models. By Jinho Jang.

FINDING 01

Multiplicative Three-Pathway Defense

MoE safety is not one system—it's three independent pathways (attention, routing, residual) that must all be neutralized together.
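
As a back-of-the-envelope sketch of why all three must fall together (the bypass probabilities below are hypothetical, not measured values):

```python
# Toy model: three independent safety pathways gate a harmful completion,
# so the overall bypass probability is the product of the per-pathway ones.
# Numbers are illustrative only.
p_bypass = {"attention": 0.9, "routing": 0.05, "residual": 0.05}

total = 1.0
for p in p_bypass.values():
    total *= p

# Even with the attention pathway 90% neutralized, the product stays tiny:
print(total)  # 0.9 * 0.05 * 0.05 = 0.00225
```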

FINDING 02

Mid-Generation Safety Re-Detection

MoE routers continuously monitor generated tokens and re-route to safety experts mid-sentence, disproving the "autoregressive momentum" assumption.

FINDING 03

Late-Layer Interventions Fail

Safety decisions commit at tokens 0-5 in layers 15-25. Late-layer contrastive activation addition (CAA) produces stutter artifacts, not behavioral change.

FINDING 04

Contrastive Cognitive Trajectory Steering

ThinkEdit v2 targets the <think> deliberation process itself, preserving full CoT reasoning while redirecting the cognitive trajectory.

FINDING 05

4-bit Precision Fragility

Additive steering vectors that work at FP16 catastrophically collapse under INT4 quantization due to rotational noise.
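
A minimal sketch of the failure mode, using a toy symmetric INT4 quantizer and hypothetical weight values (not the exact experiment): the small additive offset is swallowed and redistributed by per-group rounding.

```python
# Toy illustration: a small additive steering offset does not survive INT4.
def quantize_int4(xs):
    """Symmetric per-group INT4: round each value to one of 15 signed levels."""
    scale = max(abs(x) for x in xs) / 7 or 1.0
    return [round(x / scale) * scale for x in xs]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

# Large-magnitude base weights plus a small alternating steering offset:
base  = [3.2, -1.7, 0.9, -2.8, 1.1, 0.4, -0.6, 2.3]
steer = [0.02 * (-1) ** i for i in range(8)]
steered = [b + s for b, s in zip(base, steer)]

# Recover the effective offset after quantization; rounding noise dominates,
# so it no longer points along the steering direction:
q_delta = [q - b for q, b in zip(quantize_int4(steered), base)]
print(cosine(q_delta, steer))  # far below 1.0: the FP16 direction is lost
```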

FINDING 06

Topological Ablation (DBDI)

Structural subspace modifications survive 4-bit quantization because collapsing a subspace to zero maps natively onto the quantization grid. The only reliable method for quantized CoT MoE models.
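
One intuition for the grid argument, in an axis-aligned toy case (hypothetical values, and a simplification of the full subspace ablation): 0.0 is exactly representable on every symmetric quantization grid, so a collapsed component stays collapsed after rounding.

```python
# Toy illustration: structural zeroing is exact under INT4, unlike additive shifts.
def quantize_int4(xs):
    """Symmetric per-group INT4 quantization."""
    scale = max(abs(x) for x in xs) / 7 or 1.0
    return [round(x / scale) * scale for x in xs]

row     = [3.2, -1.7, 0.9, -2.8]
ablated = [0.0, -1.7, 0.9, -2.8]   # one component structurally collapsed to zero

# round(0.0 / scale) * scale == 0.0 on any symmetric grid, at any group scale:
print(quantize_int4(ablated)[0])   # 0.0 — the ablation persists exactly
```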

FINDING 07

Weaponized AWQ

Adversarial calibration data can force the quantizer to preserve an attacker-chosen cognitive trajectory at maximum precision.

FINDING 08

Integer Surgery Impossibility

Post-quantization directional ablation via integer flipping is fundamentally impossible in group-affine networks due to a coherence-reduction tradeoff.

FINDING 09

Vision-Language Weight Inflation

Multimodal models silently carry ~30GB of VL weights that inflate memory by 12%, causing OOM crashes in text-only workflows.

FINDING 10

Per-Tensor Streaming Quantization

Bypasses Apple Metal's 5-second watchdog timeout, enabling local quantization of 700GB+ models on consumer hardware.

FINDING 11

MLX Silent Corruption Bug

mx.save_safetensors strips metadata and reorders tensors, silently corrupting every weight surgery workflow on MLX.

FINDING 12

Safety-Loop Behavior

Zero-logit experts create infinite deliberation loops — the model can't commit to refusing or complying, looping endlessly.

FINDING 13

The 0.0 Logit Trap

Zeroed experts become "chronically selected" fallbacks in bias-free routers, destroying the residual stream.
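
A toy sketch of the trap (hypothetical gate logits): when an expert's router weights are zeroed, its gate logit becomes exactly 0.0, and in a bias-free router where healthy experts often score negative, the dead expert wins top-1 selection.

```python
# Toy illustration: a zero-ablated expert is "chronically selected" by a
# bias-free router whenever the live experts' logits dip below zero.
def top1(logits):
    """Top-1 expert selection: index of the largest gate logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Gate logits for 4 experts; expert 2's router weights were zeroed,
# so its logit is exactly x . 0 = 0.0 regardless of the input:
logits = [-1.3, -0.4, 0.0, -2.1]

print(top1(logits))  # 2 — the dead expert captures the token
```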

FINDING 14 — ULTIMATE

Holographic Safety

394B models re-derive safety policy from first principles when circuits are deleted. You cannot delete "safety" without deleting "logic."

FINDING 15

Steer2Edit (Rank-1 Editing)

Translates continuous steering vectors into localized rank-1 weight updates. Survives 4-bit quantization by structurally biasing components before rounding noise.
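
An illustrative sketch of the rank-1 idea (toy numbers, not the Steer2Edit implementation): fold a steering direction into the weights as W' = W + alpha * (steer ⊗ read), so the shift fires only for activations aligned with the read direction, and the change lives in the weights before any rounding happens.

```python
# Toy illustration: a continuous steering vector folded into a rank-1 weight edit.
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W     = [[0.5, -0.2], [0.1, 0.8]]
steer = [1.0, -1.0]        # desired shift direction in output space
read  = [0.6, 0.8]         # unit "read" direction in input space
alpha = 0.3                # edit strength

# Rank-1 update: W'[i][j] = W[i][j] + alpha * steer[i] * read[j]
W_edit = [[w + alpha * s * r for w, r in zip(row, read)]
          for row, s in zip(W, steer)]

# For an activation aligned with the read direction, the output shifts
# along steer, scaled by alpha * (read . x):
x = read
delta = [a - b for a, b in zip(matvec(W_edit, x), matvec(W, x))]
print(delta)
```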
