[Bug]: Compilation error in uballocator when ENABLE_MULTI_DEVICE=0 #6798

@WilliamTambellini

System Info

TensorRT-LLM, main branch as of the day this issue was filed.
cmake 3.27.
gcc 11
RHEL8
CUDA 12.9

Who can help?

@tongyuantongyu

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Try to build without ENABLE_MULTI_DEVICE:

cmake .. -DCMAKE_VERBOSE_MAKEFILE=ON -DBUILD_PYBIND=OFF -DENABLE_MULTI_DEVICE=OFF

The cmake configure step completes without errors, but the subsequent compilation fails.

Expected behavior

Compilation succeeds.

actual behavior

See compilation error when trying to build the userbuffer kernel:

In file included from TensorRT-LLM/cpp/tensorrt_llm/kernels/userbuffers/ub_interface.h:20,
                 from TensorRT-LLM/cpp/tensorrt_llm/kernels/userbuffers/ub_interface.cpp:16:
TensorRT-LLM/cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.h:38:5: error: ‘ncclWindow_t’ does not name a type
   38 |     ncclWindow_t window;

additional notes

This happens because, when ENABLE_MULTI_DEVICE=OFF, the cmake script does not search for NCCL, so no source code compiled in that configuration should include or use NCCL types.

Possible solutions:
1. Move the UBBuffer struct inside the ENABLE_MULTI_DEVICE block:

#if ENABLE_MULTI_DEVICE
struct UBBuffer
{
    void* addr;
    int handle;
    size_t size;
    ncclWindow_t window;

2. Keep the UBBuffer struct but wrap the "ncclWindow_t window;" line in #if ENABLE_MULTI_DEVICE:

struct UBBuffer
{
    void* addr;
    int handle;
    size_t size;
#if ENABLE_MULTI_DEVICE
    ncclWindow_t window;
#endif
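
For reference, option 2 can be sketched as a self-contained header fragment that compiles with or without NCCL. This is only an illustration, not the actual contents of ub_allocator.h: the default member initializers and the fallback definition of ENABLE_MULTI_DEVICE are assumptions added so the sketch builds standalone, and the real struct may have further members that are omitted here.

```cpp
// Sketch of option 2: UBBuffer stays visible in every build, but the NCCL
// window handle exists only when multi-device support is compiled in.
#include <cstddef>

// Assumption: the build system defines ENABLE_MULTI_DEVICE as 0 or 1.
// Fall back to 0 so this sketch compiles standalone without NCCL.
#ifndef ENABLE_MULTI_DEVICE
#define ENABLE_MULTI_DEVICE 0
#endif

#if ENABLE_MULTI_DEVICE
#include <nccl.h> // declares ncclWindow_t; only included when NCCL is available
#endif

struct UBBuffer
{
    // Default initializers are an assumption made for this sketch.
    void* addr = nullptr;
    int handle = -1;
    std::size_t size = 0;
#if ENABLE_MULTI_DEVICE
    ncclWindow_t window{}; // present only in multi-device builds
#endif
};
```

With this layout, single-device code can still declare and pass UBBuffer objects, but any code that reads or writes the window member needs the same #if guard. Option 1 avoids the differing struct layout, at the cost that every use of UBBuffer must then sit inside an ENABLE_MULTI_DEVICE block as well.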

I can prepare a PR if you'd advise which solution you'd prefer.
Best

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Labels: Scale-out, Testing, bug
