
Conversation

@louie-tsai
Collaborator

This PR depends on a vLLM PR: vllm-project/vllm#18444.
Since I don't fully understand how the workflow works, this is just an early draft to start the work.

@huydhn
Contributor

huydhn commented Jun 14, 2025

Let me add you to the list of contributors so you do not need to wait for CI approval.

@louie-tsai louie-tsai force-pushed the cpu_vllm_benchmark branch 2 times, most recently from 0671ad5 to 41fa9ce Compare June 26, 2025 06:26
@louie-tsai louie-tsai requested a review from huydhn June 26, 2025 15:50
@louie-tsai louie-tsai changed the title from "[WIP] Draft to enable CPU benchmark for VLLM Perf Dashboard." to "enable CPU benchmark for VLLM Perf Dashboard." Jun 26, 2025
@huydhn
Contributor

huydhn commented Jul 1, 2025

I have the PR to publish the Docker image up at vllm-project/ci-infra#118; I will ask the team for a review.

2: [
    "linux.aws.h100.4",
    "linux.rocm.gpu.mi300.2",
    "intel-cpu-emr",
Contributor
This line means that the runner intel-cpu-emr will only be used when tensor_parallel_size is 2. Is this the expected behavior? From your JSON files, it looks like this should be under 1 and 4?

Collaborator Author

The EMR machine from Chendi only has 2 NUMA nodes, so we put it under the TP 2 case to run only the TP 1 and TP 2 test cases. However, Chendi does plan to stand up a new EMR system, and we will try to get 4 NUMA nodes in that one.
Therefore, I moved it to the TP 4 case for now.
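
For reference, a minimal sketch (in Python, purely illustrative) of what the mapping looks like after the move; the RUNNERS_BY_TP name and the surrounding structure are assumptions based on the snippet above, not the actual file contents:

# Hypothetical sketch: runners grouped by tensor_parallel_size.
# The entries under 2 come from the snippet above; placing
# intel-cpu-emr under 4 reflects the move described in this reply.
RUNNERS_BY_TP = {
    2: [
        "linux.aws.h100.4",
        "linux.rocm.gpu.mi300.2",
    ],
    4: [
        "intel-cpu-emr",  # a 4-NUMA-node EMR could cover the TP 1/2/4 cases
    ],
}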

@louie-tsai louie-tsai force-pushed the cpu_vllm_benchmark branch from e3a9885 to c253948 Compare July 3, 2025 01:23
@louie-tsai louie-tsai requested a review from huydhn July 3, 2025 01:41
@huydhn
Contributor

huydhn commented Jul 7, 2025

> sounds good. will do that once I have the write access.

Oh, you should have it now. Please let me know if it works.

  --to-benchmark-configs-dir vllm-benchmarks/vllm/.buildkite/nightly-benchmarks/tests \
- --models "${MODELS}"
+ --models "${MODELS}" \
+ --device "${DEVICE_NAME// /_}"
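
For context, "${DEVICE_NAME// /_}" is bash parameter expansion that replaces every space in DEVICE_NAME with an underscore before it is passed as the --device value.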
Contributor

@huydhn huydhn Jul 9, 2025

There is a bug here: DEVICE_NAME is set to cuda or rocm for non-CPU cases. In those cases, the logic in .github/scripts/setup_vllm_benchmark.py fails to find the JSON benchmark suites because they don't have a _cuda or _rocm suffix; only the CPU suites have a _cpu suffix. DEVICE_NAME should just be empty in these cases.

You can see that https://github.com/pytorch/pytorch-integration-testing/actions/runs/16163751659/job/45620654542#step:13:71 found no JSON file.
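
For illustration, a minimal sketch of the lookup behavior being described; the helper name and file pattern are assumptions, not the actual contents of setup_vllm_benchmark.py:

import glob

# Hypothetical sketch: CPU suites are named with a _cpu suffix
# (e.g. latency-tests_cpu.json) while cuda/rocm suites carry no suffix,
# so passing device_name="cuda" globs for *_cuda.json and finds nothing.
# DEVICE_NAME must therefore be empty for non-CPU devices.
def find_benchmark_suites(configs_dir: str, device_name: str) -> list[str]:
    suffix = f"_{device_name}" if device_name else ""
    return glob.glob(f"{configs_dir}/*{suffix}.json")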

Collaborator Author

@huydhn you are right. I made a quick change; hopefully it fixes the issue.

@louie-tsai
Collaborator Author

Moved the work into #44. Closing this one as a duplicate.
