Open
Labels
- Scale-out: Multi-GPU and distributed inference scaling issues, tensor/pipeline/data parallelism
- Testing: Continuous integration, build system, and testing infrastructure issues
- bug: Something isn't working
Description
System Info
TensorRT-LLM: today's main branch
CMake 3.27
GCC 11
RHEL 8
CUDA 12.9
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Configure the build with ENABLE_MULTI_DEVICE turned off:
cmake .. -DCMAKE_VERBOSE_MAKEFILE=ON -DBUILD_PYBIND=OFF -DENABLE_MULTI_DEVICE=OFF
The CMake configuration step completes without errors, but the compilation that follows fails.
Expected behavior
Compilation succeeds.
Actual behavior
Compilation fails while building the userbuffers kernel:
In file included from TensorRT-LLM/cpp/tensorrt_llm/kernels/userbuffers/ub_interface.h:20,
from TensorRT-LLM/cpp/tensorrt_llm/kernels/userbuffers/ub_interface.cpp:16:
TensorRT-LLM/cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.h:38:5:
error: ‘ncclWindow_t’ does not name a type
38 | ncclWindow_t window;
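Additional notes
This happens because, when ENABLE_MULTI_DEVICE=OFF, the CMake script does not search for NCCL, so no source code should include or use NCCL types in that configuration. UBBuffer in ub_allocator.h declares an ncclWindow_t member unconditionally.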
Possible solutions:
1. Move the whole UBBuffer struct inside an ENABLE_MULTI_DEVICE block:
#if ENABLE_MULTI_DEVICE
struct UBBuffer
{
void* addr;
int handle;
size_t size;
ncclWindow_t window;
2. Keep the UBBuffer struct available in all builds, but guard the "ncclWindow_t window;" line with #if:
struct UBBuffer
{
void* addr;
int handle;
size_t size;
#if ENABLE_MULTI_DEVICE
ncclWindow_t window;
#endif
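For reference, here is a minimal standalone sketch of option 2 that compiles without NCCL installed. It is not the actual ub_allocator.h: the default member values, the fallback definition of ENABLE_MULTI_DEVICE, and the helper ubBufferHasWindow() are assumptions added for illustration only.

```cpp
// Standalone sketch of option 2 (NOT the real ub_allocator.h).
#include <cstddef>

// Assumption: the build system defines ENABLE_MULTI_DEVICE as 0 or 1.
// Default to 0 here so the sketch compiles on its own.
#ifndef ENABLE_MULTI_DEVICE
#define ENABLE_MULTI_DEVICE 0
#endif

#if ENABLE_MULTI_DEVICE
#include <nccl.h> // only reachable when the build has located NCCL
#endif

struct UBBuffer
{
    // Default values are illustrative, not taken from the original header.
    void* addr = nullptr;
    int handle = -1;
    std::size_t size = 0;
#if ENABLE_MULTI_DEVICE
    ncclWindow_t window{}; // NCCL type, only declared in multi-device builds
#endif
};

// Hypothetical helper: reports whether this build carries the NCCL window member.
constexpr bool ubBufferHasWindow()
{
    return ENABLE_MULTI_DEVICE != 0;
}
```

With this layout, any other code that reads or writes `window` (constructors, call sites) would need the same guard, and sizeof(UBBuffer) differs between the two configurations. Option 1 avoids both points, but only if nothing outside the ENABLE_MULTI_DEVICE block refers to UBBuffer.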
I can prepare a PR if you let me know which solution you prefer.
Best
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.