[ROCm][FEAT] Integrate AITER tgemm. #23712
Purpose
This PR integrates the aiter tgemm kernel, which supports bfloat16, float16, and fp8 data types, improving model performance.
This PR also introduces `_aiter_ops.py`, as proposed in the RFC here. The `aiter_ops` namespace provides several key benefits:
- Centralized kernel registration: ensures that kernels from the aiter package are properly registered.
- Environment availability checks: encapsulates aiter support detection and environment compatibility validation.
- Reduced code duplication: eliminates duplicate helper functions across different vLLM modules.

This implementation establishes the foundation for future refactoring, in which existing kernels throughout the vLLM repository will be migrated to this unified approach for better maintainability and consistency.
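The pattern described above can be sketched roughly as follows. This is a minimal illustration of the idea, not the actual `_aiter_ops.py` code: all class, function, and kernel names here are placeholders.

```python
# Illustrative sketch of a centralized aiter kernel registry with an
# availability check and fallback dispatch. Names are hypothetical.
import importlib.util
from functools import cache


@cache
def is_aiter_available() -> bool:
    """Detect once whether the aiter package can be imported."""
    return importlib.util.find_spec("aiter") is not None


class AiterOps:
    """Central registry for aiter-backed kernels."""

    def __init__(self) -> None:
        self._kernels = {}

    def register(self, name: str):
        def decorator(fn):
            self._kernels[name] = fn
            return fn
        return decorator

    def dispatch(self, name: str, fallback):
        """Return the registered aiter kernel if usable, else the fallback."""
        if is_aiter_available() and name in self._kernels:
            return self._kernels[name]
        return fallback


aiter_ops = AiterOps()


@aiter_ops.register("tgemm")
def _tgemm_stub(a, b):
    # In the real integration this would call into aiter's tuned GEMM.
    raise NotImplementedError
```

Callers then ask the registry for a kernel and automatically get the vLLM fallback when aiter is absent, so availability checks live in one place instead of being duplicated per module.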
This PR uses commit 7aa65b6 of the `aiter` repo.
Benchmark Results
meta-llama/Llama-3.3-70B-Instruct tp2
amd/Llama-3.3-70B-Instruct-FP8-KV tp2
benchmark setting
`python vllm/benchmarks/benchmark_serving.py --backend vllm --model "$model_name" --dataset-name random --num-prompts 1000 --max-concurrency 32 --random-input-len 1000 --random-output-len 1000`
AITER tgemm tuning guide
Run the `vllm serve` command with the `AITER_TUNE_GEMM=1` environment flag.
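For example, combined with the serve invocation from the Test Plan below (`$MODEL_NAME` and the capture sizes are placeholders; adjust to your setup):

```shell
# Illustrative only: enable GEMM shape recording on top of the usual serve command.
AITER_TUNE_GEMM=1 VLLM_ROCM_USE_AITER=1 vllm serve "$MODEL_NAME" \
  --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE", "cudagraph_capture_sizes": [1,2,4,8,16,24,32]}' \
  -tp 2
```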
The above command records the requested shapes (based on `cudagraph_capture_sizes`) into `aiter/configs/untuned_gemm.csv` in the directory where the `aiter` package/repo is installed or cloned.
Then run `python3 gradlib/gradlib/gemm_tuner.py --tuned_file aiter/configs/tuned_gemm.csv --input_file aiter/configs/untuned_gemm.csv` in the directory where the `aiter` package exists.
For more instructions, follow the documentation here.
Test Plan
Test models that are affected by this change, using lm_eval on the gsm8k dataset.
environment setting
Step 1: run vllm serve
`VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 vllm serve $MODEL_NAME --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE", "cudagraph_capture_sizes": [1,2,4,8,16,24,32]}' -tp 2 --trust-remote-code --swap-space 16 --distributed-executor-backend mp`
Step 2: run lm_eval
`lm_eval --model local-completions --tasks gsm8k --model_args model=$MODEL_NAME,base_url=http://localhost:8000/v1/completions --trust_remote_code --num_fewshot 5 --batch_size 256`
Test Results
meta-llama/Llama-3.3-70B-Instruct tp2
amd/Llama-3.3-70B-Instruct-FP8-KV tp2
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.