Tianzhu Ye Li Dong Yutao Sun Furu Wei
Microsoft Research | January 20, 2026 | GitHub Link
<aside> 💡
We introduce Differential Transformer V2 (DIFF V2), an improved version of Differential Transformer (DIFF V1). This revision focuses on inference efficiency, training stability for production-level LLMs, and architectural elegance.
Key improvements:
We conduct pretraining experiments on production-scale LLMs, including dense models and a 30A3 MoE, on trillions of tokens with large learning rates of 6e-4 to 1e-3. Experimental observations:
The experiments are still running. In later stages of training, we expect to explore:
After the experiments complete and we evaluate the results, we will prepare a more formal report.
</aside>
We compare DIFF V2 with DIFF V1 below:
(For simplicity, we omit the batch dimension and assume that both the inputs and outputs of the flash_attn_func calls below are three-dimensional tensors of shape (tokens, heads, head dimension). Heads belonging to the same GQA group are arranged contiguously in the output.)
Note that DIFF V2 subtracts two heads belonging to the same GQA group, which means they share the same key and value. This is crucial for performance. See the design ablations section and the GitHub code.
def DiffAttnV1(
    layer_index, q1, q2, k1, k2, v,
    lam_q1, lam_k1, lam_q2, lam_k2,
):
    """
    q1, q2: (N, h/2, d)
    k1, k2: (N, h_kv/2, d)
    v:      (N, h_kv/2, 2d)
    lam_*:  (d,)
    """
    # Two attention calls: the value cache is read twice during decoding.
    attn1 = flash_attn_func(q1, k1, v)
    attn2 = flash_attn_func(q2, k2, v)
    # Layer-dependent initialization and re-parameterization of lambda.
    lam_init = 0.8 - 0.6 * exp(-0.3 * layer_index)
    lam1 = exp(sum(lam_q1 * lam_k1))
    lam2 = exp(sum(lam_q2 * lam_k2))
    lam = lam1 - lam2 + lam_init
    # Differential attention, followed by per-head normalization.
    attn = attn1 - lam * attn2
    attn = rmsnorm(attn)
    attn = attn * (1 - lam_init)
    return attn
def DiffAttnV2(
    q, k, v, lam,
):
    """
    q:   (N, 2h, d)
    k:   (N, h_kv, d)
    v:   (N, h_kv, d)
    lam: (N, h, 1)
    """
    # Single attention call over 2h query heads; consecutive head pairs
    # (2i, 2i+1) fall in the same GQA group and share key and value.
    attn = flash_attn_func(q, k, v)
    attn1, attn2 = attn[:, 0::2], attn[:, 1::2]
    # Per-token, per-head lambda in (0, 1).
    lam_val = sigmoid(lam)
    attn = attn1 - lam_val * attn2
    return attn
Full code at: unilm/Diff-Transformer/Diff-Transformer-V2 at master · microsoft/unilm
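As a quick sanity check of the shapes above, here is a minimal harness of our own (a sketch, not part of the release): PyTorch SDPA stands in for the real FlashAttention kernel, `sigmoid` is bound to `torch.sigmoid`, and the tensor sizes are toy values chosen only for illustration.

```python
import torch
import torch.nn.functional as F

sigmoid = torch.sigmoid  # DiffAttnV2 above refers to `sigmoid` directly

def flash_attn_func(q, k, v):
    """Non-flash reference stand-in for shape checking only.
    q: (N, H, d); k, v: (N, H_kv, d) with H a multiple of H_kv (GQA)."""
    rep = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(rep, dim=1)  # expand KV heads to match query heads
    v = v.repeat_interleave(rep, dim=1)
    # Move heads to the batch dimension so attention runs over tokens.
    out = F.scaled_dot_product_attention(
        q.transpose(0, 1), k.transpose(0, 1), v.transpose(0, 1), is_causal=True
    )
    return out.transpose(0, 1)  # back to (N, H, d)

N, h, h_kv, d = 16, 8, 2, 64  # toy sizes
q = torch.randn(N, 2 * h, d)
k = torch.randn(N, h_kv, d)
v = torch.randn(N, h_kv, d)
lam = torch.randn(N, h, 1)
print(DiffAttnV2(q, k, v, lam).shape)  # torch.Size([16, 8, 64])
```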
In the script, h denotes the number of query heads (of the baseline Transformer), h_kv the number of key-value heads, and d the head dimension. In DIFF V2, $\lambda$ is projected from the input $X$ for each token and each head.
DIFF V2 doubles the number of query heads while keeping the number of key-value heads unchanged; the differential operation reduces the extra dimension back to h*d, so the $W_O$ projection remains the same as in the baseline Transformer.
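To make the dimension bookkeeping explicit, here is a minimal module sketch around DiffAttnV2. The layer names, the linear per-token $\lambda$ projection, and the omission of normalization and positional embeddings are our own illustrative assumptions; see the released code for the actual implementation.

```python
import torch
import torch.nn as nn

class DiffV2AttentionSketch(nn.Module):
    """Illustrative DIFF V2 attention wrapper; not the released implementation."""
    def __init__(self, model_dim, h, h_kv, d):
        super().__init__()
        self.h, self.h_kv, self.d = h, h_kv, d
        self.q_proj = nn.Linear(model_dim, 2 * h * d, bias=False)  # 2h query heads
        self.k_proj = nn.Linear(model_dim, h_kv * d, bias=False)
        self.v_proj = nn.Linear(model_dim, h_kv * d, bias=False)
        self.lam_proj = nn.Linear(model_dim, h, bias=False)  # per-token, per-head lambda logit
        self.o_proj = nn.Linear(h * d, model_dim, bias=False)  # same W_O shape as the baseline

    def forward(self, x):  # x: (N, model_dim), batch dimension omitted
        N = x.shape[0]
        q = self.q_proj(x).view(N, 2 * self.h, self.d)
        k = self.k_proj(x).view(N, self.h_kv, self.d)
        v = self.v_proj(x).view(N, self.h_kv, self.d)
        lam = self.lam_proj(x).unsqueeze(-1)  # (N, h, 1)
        attn = DiffAttnV2(q, k, v, lam)  # (N, h, d) after the differential subtraction
        return self.o_proj(attn.reshape(N, self.h * self.d))
```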
DIFF V2 introduces additional query heads compared with the baseline Transformer, but does not increase the number of key-value (KV) heads. Since LLM decoding is typically memory-bound, this design allows DIFF V2 to achieve decoding speeds on par with the standard Transformer. Moreover, since the head dimension is aligned across queries, keys, and values, DIFF V2 requires no custom attention kernel. In contrast, DIFF V1 can be slower during decoding because the value cache must be loaded twice and a custom attention kernel is needed. DIFF V2 can also increase the arithmetic intensity of the attention module during decoding.
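A back-of-envelope illustration of the arithmetic-intensity point: the configuration below is an assumption chosen for illustration, and the accounting ignores softmax and kernel details.

```python
# Per decoded token, per layer; fp16/bf16 KV cache (2 bytes per element).
seq_len, h, h_kv, d, bytes_per_elem = 4096, 32, 8, 128, 2  # assumed config

kv_bytes = 2 * h_kv * d * seq_len * bytes_per_elem  # K and V cache read once
base_flops = 4 * h * d * seq_len                    # QK^T + AV with h query heads
diffv2_flops = 4 * (2 * h) * d * seq_len            # 2h query heads, same KV cache

print("baseline FLOPs/byte:", base_flops / kv_bytes)    # ~2x lower than DIFF V2
print("DIFF V2  FLOPs/byte:", diffv2_flops / kv_bytes)
```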
During pretraining, when using cutting-edge FlashAttention kernels on H-series and B-series GPUs, the throughput reduction introduced by DIFF V2 is negligible. For long-sequence prefilling, we recommend combining DIFF V2 with techniques such as YOCO (also used in Gemma 3n), which already reduces prefilling complexity to linear time with respect to sequence length.
An alternative perspective is to compare DIFF V2 with a Transformer whose query dimension is the same 2h*d. Under this comparison, both models have the same attention kernel speed, while DIFF V2 has fewer parameters and FLOPs in the output projection.
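For example, under an assumed configuration (bias terms omitted), the output projections compare as follows:

```python
# Output-projection size under an assumed configuration.
model_dim, h, d = 4096, 32, 128
wo_transformer_2h = (2 * h * d) * model_dim  # Transformer with query dimension 2h*d
wo_diff_v2 = (h * d) * model_dim             # DIFF V2: reduced back to h*d before W_O
print(wo_diff_v2 / wo_transformer_2h)        # 0.5 -> half the W_O parameters and FLOPs
```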
In the standard Scaled Dot-Product Attention (SDPA), let $Q, K, V \in \mathbb{R}^{n \times d}$ be the queries, keys, and values. The context vector $C$ is defined as:
$$ C = \text{Softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V = AV $$
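For reference, a direct (non-fused) implementation of this definition; this is a sketch of the formula only, whereas real systems use fused kernels such as FlashAttention.

```python
import torch

def sdpa_reference(Q, K, V):
    """C = softmax(Q K^T / sqrt(d)) V, exactly as defined above."""
    d = Q.shape[-1]
    A = torch.softmax(Q @ K.transpose(-2, -1) / d ** 0.5, dim=-1)  # attention matrix A
    return A @ V                                                   # context C = A V
```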