KeyError: 6 when getting nvlink_bandwidth #1467

@choyuansu

Description

System Info

GPU: NVIDIA RTX A6000

Who can help?

@Tracin

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Run git clone https://github.com/NVIDIA/TensorRT-LLM.git

  2. Create Dockerfile and docker-compose.yaml in TensorRT-LLM/

    Dockerfile
    # Obtain and start the basic docker image environment.
    FROM nvidia/cuda:12.1.0-devel-ubuntu22.04
    
    # Install dependencies, TensorRT-LLM requires Python 3.10
    RUN apt-get update && apt-get -y install \
        python3.10 \
        python3-pip \
        openmpi-bin \
        libopenmpi-dev
    
    # Install the latest preview version (corresponding to the main branch) of TensorRT-LLM.
    # If you want to install the stable version (corresponding to the release branch), please
    # remove the `--pre` option.
    RUN --mount=type=cache,target=/root/.cache/pip pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
    
    COPY ./examples/qwen/requirements.txt .
    RUN --mount=type=cache,target=/root/.cache/pip pip3 install -r requirements.txt
    
    WORKDIR /workdir
    
    docker-compose.yaml
    services:
      tensorrt:
        image: tensorrt-llm
        volumes:
          - .:/workdir
          - /mnt/models:/mnt/models
        command:
        - bash
        - -ec
        - |
          cd examples/qwen
          pip install -r requirements.txt
          python3 convert_checkpoint.py --model_dir /mnt/models/Large_Language_Model/Qwen-7B-Chat/ \
                    --dtype float32 \
                    --output_dir /mnt/models/Large_Language_Model/Qwen-7B-Chat/trt_ckpt/fp32/1-gpu/
          trtllm-build --checkpoint_dir /mnt/models/Large_Language_Model/Qwen-7B-Chat/trt_ckpt/fp32/1-gpu/ \
                    --gemm_plugin float32 \
                    --output_dir /mnt/models/Large_Language_Model/Qwen-7B-Chat/trt_engines/fp32/1-gpu/
        deploy:
            resources:
              reservations:
                devices:
                  - driver: nvidia
                    count: 1
                    capabilities: [gpu]
    
  3. Run git clone https://huggingface.co/Qwen/Qwen-7B-Chat in /mnt/models/Large_Language_Model

  4. Run docker compose up

Expected behavior

The checkpoint conversion and trtllm-build steps complete without errors.

Actual behavior

[04/16/2024-22:50:23] [TRT-LLM] [I] NVLink is active: True
[04/16/2024-22:50:23] [TRT-LLM] [I] NVLink version: 6
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 411, in main
    cluster_config = infer_cluster_config()
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/auto_parallel/cluster_info.py", line 523, in infer_cluster_config
    cluster_info=infer_cluster_info(),
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/auto_parallel/cluster_info.py", line 487, in infer_cluster_info
    nvl_bw = nvlink_bandwidth(nvl_version)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/auto_parallel/cluster_info.py", line 433, in nvlink_bandwidth
    return nvl_bw_table[nvlink_version]
KeyError: 6
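
For reference, here is a minimal sketch of how the NVLink version the driver reports can be checked directly. This assumes the pynvml API family that the cluster probe appears to build on; the link index 0 is just an example, and the real code presumably iterates over all links:

    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    # Query link 0 as an example; check the link is active before asking
    # for its version, since inactive links may not report one.
    if pynvml.nvmlDeviceGetNvLinkState(handle, 0):
        print(pynvml.nvmlDeviceGetNvLinkVersion(handle, 0))  # prints 6 in this setup
    pynvml.nvmlShutdown()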

Additional notes

Relevant code: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/auto_parallel/cluster_info.py#L427-L433

I can't find any published information about NVLink version 6's bandwidth online, so it is unclear what value the table should contain for it.
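
Until the table is extended, one possible local workaround is to make the lookup fall back to the highest known version instead of indexing directly. This is only a hypothetical sketch of such a guard against the pattern in cluster_info.py; the table entries below are placeholders, not the real values:

    # Hypothetical fallback sketch; the bandwidth numbers are placeholders,
    # not the actual entries from cluster_info.py.
    nvl_bw_table = {
        2: 25,
        3: 50,
        4: 100,
    }

    def nvlink_bandwidth(nvlink_version: int) -> int:
        # Fall back to the largest known version instead of raising KeyError
        # when the driver reports a version the table does not know about.
        if nvlink_version in nvl_bw_table:
            return nvl_bw_table[nvlink_version]
        return nvl_bw_table[max(nvl_bw_table)]

With a guard like this, a reported version 6 would map to the highest known entry rather than aborting the build; whether that bandwidth estimate is actually appropriate for this GPU's links is a separate question for the maintainers.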

Labels

Inference runtime<NV> (general operational aspects of TRT-LLM execution not in other categories), Investigating, bug (something isn't working), triaged (issue has been triaged by maintainers)
