
Conversation

@momo609 (Collaborator) commented Oct 23, 2025

What this PR does / why we need it?

Support the FULL_AND_PIECEWISE graph mode.


Does this PR introduce any user-facing change?

How was this patch tested?

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write a clear commit message and fill out the PR description to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist (Contributor, bot) left a comment

Code Review

This pull request adds support for a new FULL_AND_PIECEWISE graph mode. My review focuses on a few critical issues, including a duplicated test function that would prevent one test from running, typos in the new graph mode's name, and incorrect tensor conversions. I've provided suggestions to fix these issues to ensure the new functionality is correctly implemented and tested.

Comment on lines 74 to 88
    def test_models_distributed_Qwen3_MOE_TP2_WITH_FULLGRAPH():
        if 'HCCL_OP_EXPANSION_MODE' in os.environ:
            del os.environ['HCCL_OP_EXPANSION_MODE']
        prompts = [
            "Hello, my name is", "The president of the United States is",
            "The capital of France is", "The future of AI is"
        ]
        model = "Qwen/Qwen3-30B-A3B"
        sampling_params = SamplingParams(max_tokens=32, temperature=0.0)
        with VllmRunner(model,
                        max_model_len=1024,
                        tensor_parallel_size=2,
                        enforce_eager=False,
                        compilation_config={"cudagraph_mode":
                                            "FULL_AND_PIECEIWSE"}) as runner:

critical

This test function has the same name as the one above it, test_models_distributed_Qwen3_MOE_TP2_WITH_FULLGRAPH, which will cause the first test to be overwritten and not run. Please rename this function to be unique, for example, test_models_distributed_Qwen3_MOE_TP2_WITH_FULL_AND_PIECEWISE_GRAPH.

Additionally, there is a typo in the cudagraph_mode value on line 88: FULL_AND_PIECEIWSE should be FULL_AND_PIECEWISE.

Suggested change

    -def test_models_distributed_Qwen3_MOE_TP2_WITH_FULLGRAPH():
    -    if 'HCCL_OP_EXPANSION_MODE' in os.environ:
    -        del os.environ['HCCL_OP_EXPANSION_MODE']
    -    prompts = [
    -        "Hello, my name is", "The president of the United States is",
    -        "The capital of France is", "The future of AI is"
    -    ]
    -    model = "Qwen/Qwen3-30B-A3B"
    -    sampling_params = SamplingParams(max_tokens=32, temperature=0.0)
    -    with VllmRunner(model,
    -                    max_model_len=1024,
    -                    tensor_parallel_size=2,
    -                    enforce_eager=False,
    -                    compilation_config={"cudagraph_mode":
    -                                        "FULL_AND_PIECEIWSE"}) as runner:
    +def test_models_distributed_Qwen3_MOE_TP2_WITH_FULL_AND_PIECEWISE_GRAPH():
    +    if 'HCCL_OP_EXPANSION_MODE' in os.environ:
    +        del os.environ['HCCL_OP_EXPANSION_MODE']
    +    prompts = [
    +        "Hello, my name is", "The president of the United States is",
    +        "The capital of France is", "The future of AI is"
    +    ]
    +    model = "Qwen/Qwen3-30B-A3B"
    +    sampling_params = SamplingParams(max_tokens=32, temperature=0.0)
    +    with VllmRunner(model,
    +                    max_model_len=1024,
    +                    tensor_parallel_size=2,
    +                    enforce_eager=False,
    +                    compilation_config={"cudagraph_mode":
    +                                        "FULL_AND_PIECEWISE"}) as runner:

Comment on lines 232 to 235
    elif compilation_config.cudagraph_mode == CUDAGraphMode.PIECEWISE or \
            compilation_config.cudagraph_mode == CUDAGraphMode.FULL_AND_PIECEIWSE:
        logger.info(
    -        "PIECEWISE compilation enabled on NPU. use_inductor not supported - "
    -        "using only ACL Graph mode")
    +        f"{compilation_config.cudagraph_mode} compilation enabled on NPU. use_inductor not supported - "
    +        f"using only ACL Graph mode"
    +        f"{compilation_config.cudagraph_mode}")
        assert compilation_config.level == CompilationLevel.PIECEWISE, \
            "When enabling piecewise aclgraph, please make sure compilation_config.level == CompilationLevel.PIECEWISE and compilation_config.cudagraph_mode == CUDAGraphMode.PIECEWISE"

critical

There are a few issues in this block:

  1. There's a typo in CUDAGraphMode.FULL_AND_PIECEIWSE; it should likely be CUDAGraphMode.FULL_AND_PIECEWISE.
  2. The f-string for logging is malformed and prints cudagraph_mode twice, resulting in a confusing log message.
  3. The assertion message on line 239 is now outdated as it only mentions CUDAGraphMode.PIECEWISE, but the condition now also accepts FULL_AND_PIECEWISE.

I've provided a suggestion to fix these issues and improve readability by using an in clause.

        elif compilation_config.cudagraph_mode in (
            CUDAGraphMode.PIECEWISE, CUDAGraphMode.FULL_AND_PIECEWISE):
            logger.info(
                f"{compilation_config.cudagraph_mode.name} compilation enabled on NPU. "
                "use_inductor not supported - using only ACL Graph mode.")
            assert compilation_config.level == CompilationLevel.PIECEWISE, \
                f"When enabling {compilation_config.cudagraph_mode.name} aclgraph, please make sure compilation_config.level is PIECEWISE."
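The membership-test pattern suggested above can be sketched in isolation; `CUDAGraphMode` below is a hypothetical stand-in for vLLM's enum, defined here only to make the example self-contained:

```python
from enum import Enum

# Hypothetical stand-in for vLLM's CUDAGraphMode enum (illustration only).
class CUDAGraphMode(Enum):
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    FULL_AND_PIECEWISE = 3

def uses_piecewise_graph(mode: CUDAGraphMode) -> bool:
    # A single `in` check replaces two chained `==` comparisons and stays
    # correct if more piecewise-style modes are added to the tuple later.
    return mode in (CUDAGraphMode.PIECEWISE, CUDAGraphMode.FULL_AND_PIECEWISE)

print(uses_piecewise_graph(CUDAGraphMode.FULL_AND_PIECEWISE))  # True
print(uses_piecewise_graph(CUDAGraphMode.FULL))                # False
```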

    self.seq_lens_np[num_reqs:] = 0

    cu_num_tokens, arange = self._get_cumsum_and_arange(num_scheduled_tokens)
    self.query_start_loc_cpu[1:num_reqs + 1] = torch.Tensor(cu_num_tokens)

high

Using torch.Tensor() to convert a NumPy array is discouraged. It copies the data and defaults to creating a torch.float32 tensor. Since self.query_start_loc_cpu is an IntTensor, this will cause an implicit float-to-int conversion. It's better to use torch.from_numpy() to avoid an unnecessary copy and ensure correct data types.

Suggested change

    -self.query_start_loc_cpu[1:num_reqs + 1] = torch.Tensor(cu_num_tokens)
    +self.query_start_loc_cpu[1:num_reqs + 1] = torch.from_numpy(cu_num_tokens)
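The dtype difference can be checked directly. A minimal sketch, assuming torch and numpy are installed and running on a typical 64-bit platform where numpy's default integer dtype is int64:

```python
import numpy as np
import torch

cu_num_tokens = np.array([0, 4, 8, 12])   # int64 on typical 64-bit platforms

copied = torch.Tensor(cu_num_tokens)       # copies and coerces to float32
shared = torch.from_numpy(cu_num_tokens)   # zero-copy view, dtype preserved

print(copied.dtype)  # torch.float32
print(shared.dtype)  # torch.int64
```

Because `torch.from_numpy` shares memory with the source array, writes to `cu_num_tokens` are visible through `shared`, which is exactly why it avoids the extra copy the review comment mentions.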

@momo609 momo609 force-pushed the fullpiecefork branch 3 times, most recently from b48a815 to 2f2d1dc Compare October 23, 2025 10:57
@momo609 momo609 changed the title support FULL graph mode. support FULL_AND_PIECEWISE graph mode. Oct 23, 2025
@momo609 momo609 force-pushed the fullpiecefork branch 3 times, most recently from 49b7875 to 9aad4ba Compare October 24, 2025 06:06
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: wangxiaoxin-sherie <[email protected]>