support FULL_AND_PIECEWISE graph mode. #3659
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to the Contributing and Testing guides.
Code Review
This pull request adds support for a new FULL_AND_PIECEWISE graph mode. My review focuses on a few critical issues, including a duplicated test function that would prevent one test from running, typos in the new graph mode's name, and incorrect tensor conversions. I've provided suggestions to fix these issues to ensure the new functionality is correctly implemented and tested.
    def test_models_distributed_Qwen3_MOE_TP2_WITH_FULLGRAPH():
        if 'HCCL_OP_EXPANSION_MODE' in os.environ:
            del os.environ['HCCL_OP_EXPANSION_MODE']
        prompts = [
            "Hello, my name is", "The president of the United States is",
            "The capital of France is", "The future of AI is"
        ]
        model = "Qwen/Qwen3-30B-A3B"
        sampling_params = SamplingParams(max_tokens=32, temperature=0.0)
        with VllmRunner(model,
                        max_model_len=1024,
                        tensor_parallel_size=2,
                        enforce_eager=False,
                        compilation_config={"cudagraph_mode":
                                            "FULL_AND_PIECEIWSE"}) as runner:
This test function has the same name as the one above it, test_models_distributed_Qwen3_MOE_TP2_WITH_FULLGRAPH, which will cause the first test to be overwritten and not run. Please rename this function to be unique, for example, test_models_distributed_Qwen3_MOE_TP2_WITH_FULL_AND_PIECEWISE_GRAPH.
Additionally, there is a typo in the cudagraph_mode value on line 88: FULL_AND_PIECEIWSE should be FULL_AND_PIECEWISE.
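To illustrate the shadowing, here is a minimal, hypothetical sketch (test_feature is a made-up name, not from this PR): Python rebinds a function name on every def, so pytest only collects the last definition in the module.

    def test_feature():
        assert 1 + 1 == 2   # shadowed by the redefinition below; never collected

    def test_feature():     # rebinds the name at import time
        assert 2 * 2 == 4   # only this test runs

Linters such as flake8 report this pattern as F811 (redefinition of unused name).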
Suggested change:

    -def test_models_distributed_Qwen3_MOE_TP2_WITH_FULLGRAPH():
    +def test_models_distributed_Qwen3_MOE_TP2_WITH_FULL_AND_PIECEWISE_GRAPH():
         if 'HCCL_OP_EXPANSION_MODE' in os.environ:
             del os.environ['HCCL_OP_EXPANSION_MODE']
         prompts = [
             "Hello, my name is", "The president of the United States is",
             "The capital of France is", "The future of AI is"
         ]
         model = "Qwen/Qwen3-30B-A3B"
         sampling_params = SamplingParams(max_tokens=32, temperature=0.0)
         with VllmRunner(model,
                         max_model_len=1024,
                         tensor_parallel_size=2,
                         enforce_eager=False,
    -                    compilation_config={"cudagraph_mode":
    -                                        "FULL_AND_PIECEIWSE"}) as runner:
    +                    compilation_config={"cudagraph_mode":
    +                                        "FULL_AND_PIECEWISE"}) as runner:
     elif compilation_config.cudagraph_mode == CUDAGraphMode.PIECEWISE or \
             compilation_config.cudagraph_mode == CUDAGraphMode.FULL_AND_PIECEIWSE:
         logger.info(
    -        "PIECEWISE compilation enabled on NPU. use_inductor not supported - "
    -        "using only ACL Graph mode")
    +        f"{compilation_config.cudagraph_mode} compilation enabled on NPU. use_inductor not supported - "
    +        f"using only ACL Graph mode"
    +        f"{compilation_config.cudagraph_mode}")
         assert compilation_config.level == CompilationLevel.PIECEWISE, \
             "When enabling piecewise aclgraph, please make sure compilation_config.level == CompilationLevel.PIECEWISE and compilation_config.cudagraph_mode == CUDAGraphMode.PIECEWISE"
There are a few issues in this block:
- There's a typo in CUDAGraphMode.FULL_AND_PIECEIWSE; it should likely be CUDAGraphMode.FULL_AND_PIECEWISE.
- The f-string for logging is malformed and prints cudagraph_mode twice, resulting in a confusing log message.
- The assertion message on line 239 is now outdated, as it only mentions CUDAGraphMode.PIECEWISE, but the condition now also accepts FULL_AND_PIECEWISE.
I've provided a suggestion to fix these issues and improve readability by using an `in` clause.
    elif compilation_config.cudagraph_mode in (
            CUDAGraphMode.PIECEWISE, CUDAGraphMode.FULL_AND_PIECEWISE):
        logger.info(
            f"{compilation_config.cudagraph_mode.name} compilation enabled on NPU. "
            "use_inductor not supported - using only ACL Graph mode.")
        assert compilation_config.level == CompilationLevel.PIECEWISE, \
            f"When enabling {compilation_config.cudagraph_mode.name} aclgraph, please make sure compilation_config.level is PIECEWISE."
    self.seq_lens_np[num_reqs:] = 0

    cu_num_tokens, arange = self._get_cumsum_and_arange(num_scheduled_tokens)
    self.query_start_loc_cpu[1:num_reqs + 1] = torch.Tensor(cu_num_tokens)
Using torch.Tensor() to convert a NumPy array is discouraged. It copies the data and defaults to creating a torch.float32 tensor. Since self.query_start_loc_cpu is an IntTensor, this will cause an implicit float-to-int conversion. It's better to use torch.from_numpy() to avoid an unnecessary copy and ensure correct data types.
Suggested change:

    -self.query_start_loc_cpu[1:num_reqs + 1] = torch.Tensor(cu_num_tokens)
    +self.query_start_loc_cpu[1:num_reqs + 1] = torch.from_numpy(cu_num_tokens)
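A quick standalone sketch of the pitfall (assumes NumPy and PyTorch; the array contents here are illustrative):

    import numpy as np
    import torch

    cu_num_tokens = np.array([0, 4, 8, 12], dtype=np.int32)  # e.g. cumulative token counts

    a = torch.Tensor(cu_num_tokens)      # copies and silently promotes to float32
    b = torch.from_numpy(cu_num_tokens)  # zero-copy view that keeps int32

    print(a.dtype)  # torch.float32
    print(b.dtype)  # torch.int32

    # Assigning the float tensor into an int buffer still works,
    # but only via an implicit float -> int cast:
    buf = torch.zeros(4, dtype=torch.int32)
    buf[:] = a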
Force-pushed from b48a815 to 2f2d1dc
Force-pushed from 49b7875 to 9aad4ba
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: wangxiaoxin-sherie <[email protected]>
Force-pushed from 2cca829 to bd6e38d
What this PR does / why we need it?
support FULL_AND_PIECEWISE graph mode.
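For illustration, a hedged usage sketch of the new mode through vLLM's offline API — the model name and config keys mirror the test added in this PR, but whether LLM() accepts compilation_config this way on your version is an assumption:

    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen3-30B-A3B",
        tensor_parallel_size=2,
        enforce_eager=False,
        compilation_config={"cudagraph_mode": "FULL_AND_PIECEWISE"},
    )
    outputs = llm.generate(
        ["Hello, my name is"],
        SamplingParams(max_tokens=32, temperature=0.0),
    )
    print(outputs[0].outputs[0].text)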
Does this PR introduce any user-facing change?
How was this patch tested?