Can a model quantized to dynamic FP8 with the llmcompressor library be served with SGLang? I encountered an error during inference. The quantization code is as follows:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils import dispatch_for_generation
MODEL_ID = "/open_source/Qwen3-30B-A3B"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# Configure the quantization algorithm and scheme.
# In this case, we:
# * quantize the weights to FP8 with per-channel scales via PTQ
# * quantize the activations to FP8 with dynamic per-token scales
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
)
output_dir = "./Qwen3-30B-A3B-FP8-DYNAMIC-0819"
# Apply quantization.
oneshot(
    model=model,
    recipe=recipe,
    save_compressed=True,
    trust_remote_code_model=True,
    output_dir=output_dir,
)
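Before launching the server, a short generation on the quantized model can confirm the checkpoint itself is sound. A minimal sketch using the dispatch_for_generation helper already imported above (the prompt and token count are arbitrary):

# Optional sanity check: dispatch the quantized model across the
# available GPUs and run a short generation before serving it.
dispatch_for_generation(model)
sample = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
output = model.generate(**sample, max_new_tokens=32)
print(tokenizer.decode(output[0]))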
Execution command:
CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server --model ./Qwen3-30B-A3B-FP8-DYNAMIC-0819 --mem-fraction-static 0.8 --host 0.0.0.0 --port 8801 --context-length 4200 --enable-ep-moe
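Once the server is up, the failing inference path can be exercised with a plain request against SGLang's OpenAI-compatible endpoint. A minimal sketch (host, port, and model path match the command above; the prompt is arbitrary):

import requests

# Smoke test against the OpenAI-compatible chat endpoint exposed by
# the launch command above (port 8801).
resp = requests.post(
    "http://localhost:8801/v1/chat/completions",
    json={
        "model": "./Qwen3-30B-A3B-FP8-DYNAMIC-0819",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32,
    },
    timeout=60,
)
print(resp.status_code, resp.json())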