vjp3 and sharding #31958
Unanswered
PhilipVinc asked this question in Q&A

@yashk2810 pointed out to me that you are working on a new variant of vjp that would allow easier accumulation over scan loops, dubbed vjp3. I saw this test case, which is quite clear:
https://cs.opensource.google/jax/jax/+/main:tests/mutable_array_test.py;l=781-810?q=mutable_array_test&ss=jax%2Fjax

This is a very common use case for us in sci-ml and would remove the need for our batched_vjp implementation, which is very messy. However, one common requirement we have is to shard the input XS across multiple devices. The 'smart' implementation of this would compute a vjp on every device and do a single global reduction at the end.

I started with a naive implementation (see the sketch below), which does a global reduction at every scan iteration and is therefore suboptimal. Then I tried to hide everything under a shard_map (see https://gist.github.com/PhilipVinc/e34a5b5c5e2cc46565c0f74ba0c8491f), but I suspect that I cannot create an array_ref inside of a shard_map, because I get an undefined memory error...

Would be happy to have some insights!
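The code snippet originally attached to this question is not reproduced in this page. As a stand-in, here is a minimal sketch of what such a naive scan-accumulated VJP could look like; the `loss` function, `batched_grad`, and all shapes are hypothetical, not taken from the original post:

```python
import jax
import jax.numpy as jnp

def loss(params, x):
    # Hypothetical per-minibatch loss, standing in for the real model.
    return jnp.sum((params * x) ** 2)

def batched_grad(params, xs):
    def step(grad_acc, x):
        # One VJP per minibatch; with xs sharded, the accumulation below
        # implies a cross-device reduction at every scan iteration.
        _, pullback = jax.vjp(lambda p: loss(p, x), params)
        (g,) = pullback(1.0)
        return jax.tree.map(jnp.add, grad_acc, g), None

    grad0 = jax.tree.map(jnp.zeros_like, params)
    grads, _ = jax.lax.scan(step, grad0, xs)
    return grads

params = jnp.ones(4)
xs = jnp.arange(32.0).reshape(8, 4)  # 8 minibatches of size 4
print(batched_grad(params, xs))
```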
Replies: 2 comments 4 replies
-
Maybe unreduced can help here? I know it's a bit cryptic, but I am working on a way to return unreduced arrays out of minibatches and then do a global reduction outside the scan loop. I think I am pretty close to getting it to work, which might help with the problem you pointed out?
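The unreduced mechanism mentioned above is still in development, so no real API for it is shown here. As an illustration of the same idea written out by hand, here is a sketch using shard_map: each device accumulates a local, unreduced gradient over its own shard of xs, and a single psum outside the scan loop performs the global reduction. The mesh layout, `loss`, and `sharded_grad` are assumptions made for the example, and shard_map is assumed to be available as `jax.shard_map` (recent JAX versions):

```python
import jax
import jax.numpy as jnp
from functools import partial
from jax.sharding import PartitionSpec as P

# Assumed setup: a one-axis mesh named "data" over all available devices.
mesh = jax.make_mesh((jax.device_count(),), ("data",))

def loss(params, x):
    # Hypothetical per-minibatch loss.
    return jnp.sum((params * x) ** 2)

# params replicated, xs sharded along its leading (scan) axis.
@partial(jax.shard_map, mesh=mesh, in_specs=(P(), P("data")), out_specs=P())
def sharded_grad(params, xs):
    def step(acc, x):
        _, pullback = jax.vjp(lambda p: loss(p, x), params)
        (g,) = pullback(1.0)
        return acc + g, None  # purely local accumulation, no communication

    local_grad, _ = jax.lax.scan(step, jnp.zeros_like(params), xs)
    # One global reduction, outside the scan loop.
    return jax.lax.psum(local_grad, "data")
```

With this layout each device scans only over its own slice of the minibatches (the leading axis of xs must be divisible by the device count), and the only collective is the final psum.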
-
Small question: jax.sharding.set_mesh is not documented. Is it a "decorator" for setting a general sharding option (in addition to auto sharding mode)? NB: I'm trying to gather more information on auto sharding, cf. #32494
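Not an authoritative answer, but since jax.sharding.set_mesh is undocumented, here is a minimal sketch of how it appears to be used, under the assumption (from its name) that it installs a mesh globally rather than decorating individual functions; the "data" axis name is arbitrary:

```python
import jax
import numpy as np
from jax.sharding import NamedSharding, PartitionSpec as P

mesh = jax.make_mesh((jax.device_count(),), ("data",))

# Assumption: set_mesh makes this mesh the ambient/global one, so that
# subsequent sharding-aware APIs can resolve its axis names without the
# mesh being threaded through every call explicitly.
jax.sharding.set_mesh(mesh)

x = jax.device_put(np.arange(16.0), NamedSharding(mesh, P("data")))
print(x.sharding)
```

If a scoped version is wanted instead, jax.sharding.use_mesh appears to offer the same thing as a context manager, which may be closer to the "decorator" framing.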