Summary
- Re-quantizing Gemma-3n-E2B-it to W4A16 with the same recipe that previously worked now yields models with random / very poor accuracy on the OpenLLM benchmarks (a sketch of such a quantization run is shown after this list). Suspected to affect both E2B and E4B; only E2B has been reproduced so far.
- The regression localizes to `self_attn.o_proj.weight_packed` for layers ≥ 20: in the older (good) run those packed weights collapse to a constant 0x88 pattern, while the newer run produces a non-constant pattern there; other packed tensors (q/k/v, MLP) are statistically similar across the two runs.
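
For context, here is a minimal sketch of the kind of W4A16 one-shot GPTQ run referred to above, assuming llm-compressor's `GPTQModifier`/`oneshot` API; the model class, calibration dataset, sample count, and ignore list are illustrative assumptions, not the exact recipe used for models A/B.

```python
# Minimal W4A16 GPTQ sketch (llm-compressor style); dataset, sample count,
# ignore list, and model class are illustrative assumptions, not the
# original recipe used for models A/B.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "google/gemma-3n-E2B-it"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 4-bit weights / 16-bit activations on all Linear layers; lm_head left unquantized.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="open_platypus",        # illustrative calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained("gemma-3n-E2B-it-W4A16", save_compressed=True)
tokenizer.save_pretrained("gemma-3n-E2B-it-W4A16")
```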
Suspected cause
- Recent changes in the Transformers Gemma-3n model definition / autowrap (module mapping).
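
One way to test this suspicion is to dump the layer-20 attention submodule tree (and the raw o_proj weight statistics) under each Transformers version and diff the output. A rough sketch follows; the model class and the `model.model.language_model.layers` attribute path are assumptions inferred from the tensor names below and may differ between versions.

```python
# Rough diagnostic sketch: print the attention submodules and raw o_proj weight
# stats for layer 20 so the output can be diffed across transformers versions.
# The attribute path is an assumption based on the tensor names in this report.
import torch
import transformers
from transformers import AutoModelForCausalLM

print("transformers", transformers.__version__)

model = AutoModelForCausalLM.from_pretrained("google/gemma-3n-E2B-it", torch_dtype="auto")

attn = model.model.language_model.layers[20].self_attn
for name, module in attn.named_modules():
    print(name or "(self_attn)", type(module).__name__)

w = attn.o_proj.weight.float()
print("o_proj weight:", tuple(w.shape),
      "mean=%.4g std=%.4g absmax=%.4g" % (w.mean(), w.std(), w.abs().max()))
```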
Diff of the two models obtained with the same recipe:
A - older run, good model
B - newer run with the same recipe; model produces random accuracies
model.language_model.layers.20.self_attn.o_proj.weight_packed
A: model-00001-of-00003.safetensors
shape=(2048, 256) dtype=torch.int32 bytes=2097152
u8_mean=136.000 u8_std=0.000 sum32=285212672 l2=1.969e+05
hex[:16]=88888888888888888888888888888888
B: model-00001-of-00003.safetensors
shape=(2048, 256) dtype=torch.int32 bytes=2097152
u8_mean=135.991 u8_std=43.678 sum32=285193004 l2=2.068e+05
hex[:16]=46885797b66863cca9685cb3637b3a77
(Other tensors like q_proj, mlp.up/gate/down_proj are statistically similar across A/B.)
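
For reference, the per-tensor statistics above can be reproduced with a short script along these lines (a sketch; the shard filename and tensor name are taken from the dump above and need adjusting for model B's local path):

```python
# Sketch for reproducing the u8_mean / u8_std / sum32 / l2 / hex stats shown
# above for one packed tensor; adjust PATH/NAME for the checkpoint being probed.
import torch
from safetensors import safe_open

PATH = "model-00001-of-00003.safetensors"
NAME = "model.language_model.layers.20.self_attn.o_proj.weight_packed"

with safe_open(PATH, framework="pt") as f:
    t = f.get_tensor(NAME)                          # int32-packed 4-bit weights

u8 = t.view(torch.uint8)                            # reinterpret raw bytes
print(f"shape={tuple(t.shape)} dtype={t.dtype} bytes={u8.numel()}")
print(f"u8_mean={u8.float().mean().item():.3f} u8_std={u8.float().std().item():.3f} "
      f"sum32={int(u8.sum(dtype=torch.int64))} l2={u8.float().norm().item():.3e}")
print("hex[:16]=" + bytes(u8.flatten()[:16].tolist()).hex())
```

Running this over every weight_packed tensor in both checkpoints is how a per-tensor comparison like the one above can be generated.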
Model A: https://huggingface.co/RedHatAI/gemma-3n-E2B-it-quantized.w4a16
Model B: local model, can share if needed.