Bus error running t5 conversion script using the latest main #1538

@sc-gr

System Info

GPU: NVIDIA A10G. I have tried both an AWS g5.2xlarge instance and an AWS g5.12xlarge instance.

Who can help?

@byshiue

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I mostly followed the official installation steps:

  1. docker run --shm-size=2g --rm --runtime=nvidia --gpus all --entrypoint /bin/bash -it nvidia/cuda:12.1.0-devel-ubuntu22.04
  2. apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git python-is-python3 vim
  3. pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
  4. git clone https://github.com/NVIDIA/TensorRT-LLM.git (05/02 version)
  5. cd TensorRT-LLM
export MODEL_TYPE="t5"
export MODEL_NAME="google/flan-t5-large"
export INFERENCE_PRECISION="float32"
export TP_SIZE=1
export PP_SIZE=1
export WORLD_SIZE=1

python examples/enc_dec/convert_checkpoint.py --model_type ${MODEL_TYPE} \
        --model_dir ${MODEL_NAME} \
        --output_dir tmp/trt_models/${MODEL_NAME}/${INFERENCE_PRECISION} \
        --tp_size ${TP_SIZE} \
        --pp_size ${PP_SIZE} \
        --weight_data_type float32 \
        --dtype ${INFERENCE_PRECISION}
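
For anyone trying to reproduce this, the same invocation can be wrapped in a small Python driver that reports how the process ended. This is only a sketch: `build_convert_command` and `describe_exit` are hypothetical helper names, not part of TensorRT-LLM, and the flags are copied from the command above. It relies on the documented `subprocess` behavior that a negative return code on POSIX means the child was terminated by that signal.

```python
import signal
import subprocess
import sys


def build_convert_command(model_type, model_name, precision, tp_size=1, pp_size=1):
    """Return the argument list for the conversion command shown above."""
    return [
        sys.executable,
        "examples/enc_dec/convert_checkpoint.py",
        "--model_type", model_type,
        "--model_dir", model_name,
        "--output_dir", f"tmp/trt_models/{model_name}/{precision}",
        "--tp_size", str(tp_size),
        "--pp_size", str(pp_size),
        "--weight_data_type", "float32",
        "--dtype", precision,
    ]


def describe_exit(returncode):
    """Translate a subprocess return code into a readable status."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative code means the child was killed by that signal.
        try:
            return f"killed by {signal.Signals(-returncode).name}"
        except ValueError:
            return f"killed by signal {-returncode}"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    # Assumes the current directory is the TensorRT-LLM checkout (step 5).
    cmd = build_convert_command("t5", "google/flan-t5-large", "float32")
    result = subprocess.run(cmd)
    print("t5 ->", describe_exit(result.returncode))
```

Running it from the repository root prints a single status line for the t5 conversion, which makes it easier to tell a signal-induced crash apart from an ordinary non-zero exit.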

Expected behavior

The model is converted successfully.

Actual behavior

[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024043000
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Bus error (core dumped)

Additional notes

I also ran the same script with a BART model, and it completed successfully. The only changes were export MODEL_TYPE="bart" and export MODEL_NAME="facebook/bart-large-cnn". So this might be a T5-specific problem, or it could be related to the GPU type I'm using (A10G).
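
One more variable that may be worth ruling out, alongside the architecture and the GPU type: the container in step 1 was started with --shm-size=2g, and a bus error is one way a process can die when a shared-memory mapping under /dev/shm cannot be backed. Whether that applies here is an assumption; nothing in the log above confirms it. A minimal Python sketch for checking shared-memory headroom before and after the conversion (`shm_report` is a hypothetical helper name):

```python
import shutil


def shm_report(path="/dev/shm"):
    """Return total, used, and free space of a mount point, in MiB."""
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return {
        "total_mib": usage.total // mib,
        "used_mib": usage.used // mib,
        "free_mib": usage.free // mib,
    }


if __name__ == "__main__":
    try:
        # Inside the container from step 1, the total should reflect --shm-size.
        print(shm_report())
    except FileNotFoundError:
        print("/dev/shm not found on this system")
```

If free space stays comfortable during the t5 run, shared memory is probably not the cause, and the architecture or GPU hypotheses remain the ones to pursue.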

Labels

bug: Something isn't working
