I'm building a Transformer decoder in Flax using `nn.scan` to improve compilation times. With `decode=True` in `MultiHeadDotProductAttention`, the `cache` mutable variable is not initialized on its own, causing a pytree structure mismatch during execution.
The issue is basically equivalent to #2754, except with a `cache` twist on it.
Error message:

```
TypeError: scan body function carry input and carry output must have the same pytree structure, but they differ:
The input carry component c[0][0] is a <class 'dict'> with 0 child but the corresponding component of the carry output is a <class 'dict'> with 1 child, so the numbers of children do not match, with the symmetric difference of key sets: {'cache'}.
```
The issue is that I don't know how to initialize the `cache` variable for a `scan`. I've done this with generic submodules before, but not with scanned ones.
Currently `TransformersBlock` accounts for all of the compilation time.
How would I do this? I see `nnx` has `MultiHeadAttention.init_cache`, but Linen does not.