Tianzhu Ye Li Dong Yutao Sun Furu Wei
Microsoft Research | January 20, 2026 | GitHub Link
<aside> 💡
We introduce Differential Transformer V2 (DIFF V2), an improved version of Differential Transformer (DIFF V1). This revision focuses on inference efficiency, training stability for production-level LLMs, and architectural elegance.
Key improvements:
We conduct pretraining experiments on production-scale LLMs, including dense models and a 30A3 MoE, on trillions of tokens with large learning rates of 6e-4 to 1e-3. Experimental observations:
The experiments are still running. In later stages of training, we expect to explore:
After the experiments complete and we evaluate the results, we will prepare a more formal report.
</aside>
We compare DIFF V2 with DIFF V1 below:
(For simplicity, we omit the batch dimension and assume that both the inputs and outputs of the flash_attn_func calls below are three-dimensional tensors of shape (tokens, heads, head dimension). Heads belonging to the same GQA group are arranged contiguously in the output.)
Note that DIFF V2 subtracts two heads belonging to the same GQA group, which means they share the same key and value. This is crucial for performance. See the design ablations section and the GitHub code.
def DiffAttnV1(
    layer_index, q1, q2, k1, k2, v,
    lam_q1, lam_k1, lam_q2, lam_k2,
):
    """
    q1, q2: (N, h/2, d)
    k1, k2: (N, h_kv/2, d)
    v:      (N, h_kv/2, 2d)
    lam_*:  (d,)
    """
    # Two attention calls: the value cache is read twice during decoding.
    attn1 = flash_attn_func(q1, k1, v)
    attn2 = flash_attn_func(q2, k2, v)
    # Layer-dependent initialization and re-parameterization of lambda.
    lam_init = 0.8 - 0.6 * exp(-0.3 * layer_index)
    lam1 = exp(sum(lam_q1 * lam_k1))
    lam2 = exp(sum(lam_q2 * lam_k2))
    lam = lam1 - lam2 + lam_init
    # Differential attention, followed by per-head normalization.
    attn = attn1 - lam * attn2
    attn = rmsnorm(attn)
    attn = attn * (1 - lam_init)
    return attn
def DiffAttnV2(
    q, k, v, lam,
):
    """
    q:   (N, 2h, d)
    k:   (N, h_kv, d)
    v:   (N, h_kv, d)
    lam: (N, h, 1)
    """
    # Single attention call over 2h query heads; consecutive head pairs
    # (2i, 2i+1) fall in the same GQA group and share key and value.
    attn = flash_attn_func(q, k, v)
    attn1, attn2 = attn[:, 0::2], attn[:, 1::2]
    # Per-token, per-head lambda in (0, 1).
    lam_val = sigmoid(lam)
    attn = attn1 - lam_val * attn2
    return attn
Full code at: unilm/Diff-Transformer/Diff-Transformer-V2 at master · microsoft/unilm
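As a quick sanity check of the shapes above, here is a minimal harness of our own (a sketch, not part of the release): PyTorch SDPA stands in for the real FlashAttention kernel, `sigmoid` is bound to `torch.sigmoid`, and the tensor sizes are toy values chosen only for illustration.

```python
import torch
import torch.nn.functional as F

sigmoid = torch.sigmoid  # DiffAttnV2 above refers to `sigmoid` directly

def flash_attn_func(q, k, v):
    """Non-flash reference stand-in for shape checking only.
    q: (N, H, d); k, v: (N, H_kv, d) with H a multiple of H_kv (GQA)."""
    rep = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(rep, dim=1)  # expand KV heads to match query heads
    v = v.repeat_interleave(rep, dim=1)
    # Move heads to the batch dimension so attention runs over tokens.
    out = F.scaled_dot_product_attention(
        q.transpose(0, 1), k.transpose(0, 1), v.transpose(0, 1), is_causal=True
    )
    return out.transpose(0, 1)  # back to (N, H, d)

N, h, h_kv, d = 16, 8, 2, 64  # toy sizes
q = torch.randn(N, 2 * h, d)
k = torch.randn(N, h_kv, d)
v = torch.randn(N, h_kv, d)
lam = torch.randn(N, h, 1)
print(DiffAttnV2(q, k, v, lam).shape)  # torch.Size([16, 8, 64])
```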
In the script, h denotes the number of query heads (of the baseline Transformer), h_kv the number of key-value heads, and d the head dimension. In DIFF V2, $\lambda$ is projected from the input $X$ for each token and each head.
DIFF V2 doubles the number of query heads while keeping the number of key-value heads unchanged; the differential operation reduces the extra dimension back to h*d, so the $W_O$ projection remains the same as in the baseline Transformer.
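To make the dimension bookkeeping explicit, here is a minimal module sketch around DiffAttnV2. The layer names, the linear per-token $\lambda$ projection, and the omission of normalization and positional embeddings are our own illustrative assumptions; see the released code for the actual implementation.

```python
import torch
import torch.nn as nn

class DiffV2AttentionSketch(nn.Module):
    """Illustrative DIFF V2 attention wrapper; not the released implementation."""
    def __init__(self, model_dim, h, h_kv, d):
        super().__init__()
        self.h, self.h_kv, self.d = h, h_kv, d
        self.q_proj = nn.Linear(model_dim, 2 * h * d, bias=False)  # 2h query heads
        self.k_proj = nn.Linear(model_dim, h_kv * d, bias=False)
        self.v_proj = nn.Linear(model_dim, h_kv * d, bias=False)
        self.lam_proj = nn.Linear(model_dim, h, bias=False)  # per-token, per-head lambda logit
        self.o_proj = nn.Linear(h * d, model_dim, bias=False)  # same W_O shape as the baseline

    def forward(self, x):  # x: (N, model_dim), batch dimension omitted
        N = x.shape[0]
        q = self.q_proj(x).view(N, 2 * self.h, self.d)
        k = self.k_proj(x).view(N, self.h_kv, self.d)
        v = self.v_proj(x).view(N, self.h_kv, self.d)
        lam = self.lam_proj(x).unsqueeze(-1)  # (N, h, 1)
        attn = DiffAttnV2(q, k, v, lam)  # (N, h, d) after the differential subtraction
        return self.o_proj(attn.reshape(N, self.h * self.d))
```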
DIFF V2 introduces additional query heads compared with the baseline Transformer, but does not increase the number of key-value (KV) heads. Since LLM decoding is typically memory-bound, this design allows DIFF V2 to achieve decoding speeds on par with the standard Transformer. Moreover, since the head dimension is aligned across queries, keys, and values, DIFF V2 requires no custom attention kernel. In contrast, DIFF V1 can be slower during decoding because the value cache must be loaded twice and a custom attention kernel is needed. DIFF V2 can also increase the arithmetic intensity of the attention module during decoding.
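A back-of-envelope illustration of the arithmetic-intensity point: the configuration below is an assumption chosen for illustration, and the accounting ignores softmax and kernel details.

```python
# Per decoded token, per layer; fp16/bf16 KV cache (2 bytes per element).
seq_len, h, h_kv, d, bytes_per_elem = 4096, 32, 8, 128, 2  # assumed config

kv_bytes = 2 * h_kv * d * seq_len * bytes_per_elem  # K and V cache read once
base_flops = 4 * h * d * seq_len                    # QK^T + AV with h query heads
diffv2_flops = 4 * (2 * h) * d * seq_len            # 2h query heads, same KV cache

print("baseline FLOPs/byte:", base_flops / kv_bytes)    # ~2x lower than DIFF V2
print("DIFF V2  FLOPs/byte:", diffv2_flops / kv_bytes)
```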
During pretraining, when using cutting-edge FlashAttention kernels on H-series and B-series GPUs, the throughput reduction introduced by DIFF V2 is negligible. For long-sequence prefilling, we recommend combining DIFF V2 with techniques such as YOCO (also used in Gemma 3n), which already reduces prefilling complexity to linear time with respect to sequence length.
An alternative perspective is to compare DIFF V2 with a Transformer whose query dimension is the same 2h*d. Under this comparison, both models have the same attention kernel speed, while DIFF V2 has fewer parameters and FLOPs in the output projection.
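For example, under an assumed configuration (bias terms omitted), the output projections compare as follows:

```python
# Output-projection size under an assumed configuration.
model_dim, h, d = 4096, 32, 128
wo_transformer_2h = (2 * h * d) * model_dim  # Transformer with query dimension 2h*d
wo_diff_v2 = (h * d) * model_dim             # DIFF V2: reduced back to h*d before W_O
print(wo_diff_v2 / wo_transformer_2h)        # 0.5 -> half the W_O parameters and FLOPs
```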
In the standard Scaled Dot-Product Attention (SDPA), let $Q, K, V \in \mathbb{R}^{n \times d}$ be the queries, keys, and values. The context vector $C$ is defined as:
$$ C = \text{Softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V = AV $$
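For reference, a direct (non-fused) implementation of this definition; this is a sketch of the formula only, whereas real systems use fused kernels such as FlashAttention.

```python
import torch

def sdpa_reference(Q, K, V):
    """C = softmax(Q K^T / sqrt(d)) V, exactly as defined above."""
    d = Q.shape[-1]
    A = torch.softmax(Q @ K.transpose(-2, -1) / d ** 0.5, dim=-1)  # attention matrix A
    return A @ V                                                   # context C = A V
```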