
Error Running Qwen3-30B-A3B-FP8-DYNAMIC Model on SGLang #1763

@wangwenmingaa

Description


Can models quantized to dynamic FP8 with the llmcompressor library be served with SGLang? I encountered an error during inference. The quantization code is as follows:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "/open_source/Qwen3-30B-A3B"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the weights to fp8 with per-channel scales via PTQ
#   * quantize the activations to fp8 with dynamic per-token scales
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
)

output_dir = "./Qwen3-30B-A3B-FP8-DYNAMIC-0819"
# Apply quantization.
oneshot(
    model=model,
    recipe=recipe,
    save_compressed=True,
    trust_remote_code_model=True,
    output_dir=output_dir,
)
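
For reference, the llmcompressor examples typically save the tokenizer next to the compressed weights so the serving stack can load it from the same directory. Whether oneshot already copies it into output_dir may depend on the library version, so this is only a hedged sketch using the variables defined above:

# Assumption: the tokenizer files are not already written by oneshot in this
# llmcompressor version; save them alongside the compressed model.
tokenizer.save_pretrained(output_dir)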

Execution command:
CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server --model ./Qwen3-30B-A3B-FP8-DYNAMIC-0819 --mem-fraction-static 0.8 --host 0.0.0.0 --port 8801 --context-length 4200 --enable-ep-moe
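
Once the server comes up, a quick way to reproduce the failure (or confirm a fix) is to send a request to SGLang's OpenAI-compatible endpoint. This is a minimal sketch, not part of the original report; the port matches the launch command above, and the model field is an assumption (SGLang generally ignores it or expects the served model path):

import requests

# Hypothetical smoke test against the server started above (port 8801).
resp = requests.post(
    "http://localhost:8801/v1/chat/completions",
    json={
        "model": "Qwen3-30B-A3B-FP8-DYNAMIC-0819",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32,
    },
    timeout=120,
)
print(resp.status_code, resp.json())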
