support FULL_AND_PIECEWISE graph mode. #3659
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to the Contributing and Testing guides.
Code Review
This pull request adds support for a new FULL_AND_PIECEWISE graph mode. My review focuses on a few critical issues, including a duplicated test function that would prevent one test from running, typos in the new graph mode's name, and incorrect tensor conversions. I've provided suggestions to fix these issues to ensure the new functionality is correctly implemented and tested.
    def test_models_distributed_Qwen3_MOE_TP2_WITH_FULLGRAPH():
        if 'HCCL_OP_EXPANSION_MODE' in os.environ:
            del os.environ['HCCL_OP_EXPANSION_MODE']
        prompts = [
            "Hello, my name is", "The president of the United States is",
            "The capital of France is", "The future of AI is"
        ]
        model = "Qwen/Qwen3-30B-A3B"
        sampling_params = SamplingParams(max_tokens=32, temperature=0.0)
        with VllmRunner(model,
                        max_model_len=1024,
                        tensor_parallel_size=2,
                        enforce_eager=False,
                        compilation_config={"cudagraph_mode":
                                            "FULL_AND_PIECEIWSE"}) as runner:
This test function has the same name as the one above it, test_models_distributed_Qwen3_MOE_TP2_WITH_FULLGRAPH, which will cause the first test to be overwritten and not run. Please rename this function to be unique, for example, test_models_distributed_Qwen3_MOE_TP2_WITH_FULL_AND_PIECEWISE_GRAPH.
Additionally, there is a typo in the cudagraph_mode value on line 88: FULL_AND_PIECEIWSE should be FULL_AND_PIECEWISE.
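To illustrate the shadowing, here is a minimal, hypothetical sketch (test_feature is a made-up name, not from this PR): Python rebinds a function name on every def, so pytest only collects the last definition in the module.

    def test_feature():
        assert 1 + 1 == 2   # shadowed by the redefinition below; never collected

    def test_feature():     # rebinds the name at import time
        assert 2 * 2 == 4   # only this test runs

Linters such as flake8 report this pattern as F811 (redefinition of unused name).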
Suggested change:

    -def test_models_distributed_Qwen3_MOE_TP2_WITH_FULLGRAPH():
    +def test_models_distributed_Qwen3_MOE_TP2_WITH_FULL_AND_PIECEWISE_GRAPH():
         if 'HCCL_OP_EXPANSION_MODE' in os.environ:
             del os.environ['HCCL_OP_EXPANSION_MODE']
         prompts = [
             "Hello, my name is", "The president of the United States is",
             "The capital of France is", "The future of AI is"
         ]
         model = "Qwen/Qwen3-30B-A3B"
         sampling_params = SamplingParams(max_tokens=32, temperature=0.0)
         with VllmRunner(model,
                         max_model_len=1024,
                         tensor_parallel_size=2,
                         enforce_eager=False,
    -                    compilation_config={"cudagraph_mode":
    -                                        "FULL_AND_PIECEIWSE"}) as runner:
    +                    compilation_config={"cudagraph_mode":
    +                                        "FULL_AND_PIECEWISE"}) as runner:
     elif compilation_config.cudagraph_mode == CUDAGraphMode.PIECEWISE or \
             compilation_config.cudagraph_mode == CUDAGraphMode.FULL_AND_PIECEIWSE:
         logger.info(
    -        "PIECEWISE compilation enabled on NPU. use_inductor not supported - "
    -        "using only ACL Graph mode")
    +        f"{compilation_config.cudagraph_mode} compilation enabled on NPU. use_inductor not supported - "
    +        f"using only ACL Graph mode"
    +        f"{compilation_config.cudagraph_mode}")
         assert compilation_config.level == CompilationLevel.PIECEWISE, \
             "When enabling piecewise aclgraph, please make sure compilation_config.level == CompilationLevel.PIECEWISE and compilation_config.cudagraph_mode == CUDAGraphMode.PIECEWISE"
There are a few issues in this block:
- There's a typo in CUDAGraphMode.FULL_AND_PIECEIWSE; it should likely be CUDAGraphMode.FULL_AND_PIECEWISE.
- The f-string for logging is malformed and prints cudagraph_mode twice, resulting in a confusing log message.
- The assertion message on line 239 is now outdated, as it only mentions CUDAGraphMode.PIECEWISE, but the condition now also accepts FULL_AND_PIECEWISE.
I've provided a suggestion to fix these issues and improve readability by using an `in` clause.
    elif compilation_config.cudagraph_mode in (
            CUDAGraphMode.PIECEWISE, CUDAGraphMode.FULL_AND_PIECEWISE):
        logger.info(
            f"{compilation_config.cudagraph_mode.name} compilation enabled on NPU. "
            "use_inductor not supported - using only ACL Graph mode.")
        assert compilation_config.level == CompilationLevel.PIECEWISE, \
            f"When enabling {compilation_config.cudagraph_mode.name} aclgraph, please make sure compilation_config.level is PIECEWISE."
    self.seq_lens_np[num_reqs:] = 0

    cu_num_tokens, arange = self._get_cumsum_and_arange(num_scheduled_tokens)
    self.query_start_loc_cpu[1:num_reqs + 1] = torch.Tensor(cu_num_tokens)
Using torch.Tensor() to convert a NumPy array is discouraged. It copies the data and defaults to creating a torch.float32 tensor. Since self.query_start_loc_cpu is an IntTensor, this will cause an implicit float-to-int conversion. It's better to use torch.from_numpy() to avoid an unnecessary copy and ensure correct data types.
Suggested change:

    -self.query_start_loc_cpu[1:num_reqs + 1] = torch.Tensor(cu_num_tokens)
    +self.query_start_loc_cpu[1:num_reqs + 1] = torch.from_numpy(cu_num_tokens)
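A quick standalone sketch of the pitfall (assumes NumPy and PyTorch; the array contents here are illustrative):

    import numpy as np
    import torch

    cu_num_tokens = np.array([0, 4, 8, 12], dtype=np.int32)  # e.g. cumulative token counts

    a = torch.Tensor(cu_num_tokens)      # copies and silently promotes to float32
    b = torch.from_numpy(cu_num_tokens)  # zero-copy view that keeps int32

    print(a.dtype)  # torch.float32
    print(b.dtype)  # torch.int32

    # Assigning the float tensor into an int buffer still works,
    # but only via an implicit float -> int cast:
    buf = torch.zeros(4, dtype=torch.int32)
    buf[:] = a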
Force-pushed from b48a815 to 2f2d1dc
Force-pushed from 49b7875 to 9aad4ba
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: wangxiaoxin-sherie <[email protected]>
Force-pushed from 2cca829 to bd6e38d
What this PR does / why we need it?
support FULL_AND_PIECEWISE graph mode.
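For illustration, a hedged usage sketch of the new mode through vLLM's offline API — the model name and config keys mirror the test added in this PR, but whether LLM() accepts compilation_config this way on your version is an assumption:

    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen3-30B-A3B",
        tensor_parallel_size=2,
        enforce_eager=False,
        compilation_config={"cudagraph_mode": "FULL_AND_PIECEWISE"},
    )
    outputs = llm.generate(
        ["Hello, my name is"],
        SamplingParams(max_tokens=32, temperature=0.0),
    )
    print(outputs[0].outputs[0].text)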
Does this PR introduce any user-facing change?
How was this patch tested?