Closed
Labels
bug (Something isn't working)
Description
System Info
GPU: A10G. I have tried with an AWS g5.2xlarge instance and an AWS g5.12xlarge instance.
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I pretty much follow the official installation:
- docker run --shm-size=2g --rm --runtime=nvidia --gpus all --entrypoint /bin/bash -it nvidia/cuda:12.1.0-devel-ubuntu22.04
- apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git python-is-python3 vim
- pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
- git clone https://github.com/NVIDIA/TensorRT-LLM.git (05/02 version)
- cd TensorRT-LLM
```shell
export MODEL_TYPE="t5"
export MODEL_NAME="google/flan-t5-large"
export INFERENCE_PRECISION="float32"
export TP_SIZE=1
export PP_SIZE=1
export WORLD_SIZE=1
python examples/enc_dec/convert_checkpoint.py --model_type ${MODEL_TYPE} \
    --model_dir ${MODEL_NAME} \
    --output_dir tmp/trt_models/${MODEL_NAME}/${INFERENCE_PRECISION} \
    --tp_size ${TP_SIZE} \
    --pp_size ${PP_SIZE} \
    --weight_data_type float32 \
    --dtype ${INFERENCE_PRECISION}
```
Expected behavior
The model is converted successfully.
Actual behavior

```
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024043000
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Bus error (core dumped)
```
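Not part of the original report, but a possible avenue to check: a bus error inside a container is often caused by an exhausted `/dev/shm` tmpfs (the `docker run` above caps it at 2 GB, and a float32 flan-t5-large checkpoint is several gigabytes). A minimal diagnostic sketch, assuming a Linux container, to inspect the shared-memory mount before re-running the conversion:

```python
import os

# Query the filesystem backing /dev/shm. If a memory-mapped file
# (e.g. one created by multiprocessing shared memory) outgrows this
# tmpfs, touching the unbacked pages raises SIGBUS -> "Bus error".
st = os.statvfs("/dev/shm")
total_gb = st.f_blocks * st.f_frsize / 1024**3
free_gb = st.f_bavail * st.f_frsize / 1024**3
print(f"/dev/shm total: {total_gb:.2f} GiB, free: {free_gb:.2f} GiB")
```

If the free space is close to zero during conversion, retrying with a larger `--shm-size` would help confirm or rule out this cause.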
Additional notes

I also tried a BART model with the same script, and it exits successfully; the only change is `export MODEL_TYPE="bart"` and `export MODEL_NAME="facebook/bart-large-cnn"`. So this might be a problem specific to the T5 architecture, or it could be related to the GPU type I'm using (A10G).
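For completeness, the BART control run spelled out (same `convert_checkpoint.py` invocation and environment as in the reproduction above; only the two model variables change):

```shell
export MODEL_TYPE="bart"
export MODEL_NAME="facebook/bart-large-cnn"
python examples/enc_dec/convert_checkpoint.py --model_type ${MODEL_TYPE} \
    --model_dir ${MODEL_NAME} \
    --output_dir tmp/trt_models/${MODEL_NAME}/${INFERENCE_PRECISION} \
    --tp_size ${TP_SIZE} \
    --pp_size ${PP_SIZE} \
    --weight_data_type float32 \
    --dtype ${INFERENCE_PRECISION}
```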