Segmentation fault with pipeline parallelism and `gather_all_token_logits`

### System Info

- NVIDIA H100 DGX
- CUDA 12.1
- TensorRT-LLM 0.8.0

### Who can help?

@byshiue 

### Information

- [X] The official example scripts
- [ ] My own modified scripts

### Tasks

- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction

Starting from the official Falcon example, I enabled pipeline parallelism (`--pp_size 2`) and built the engine with `--gather_all_token_logits`:

```shell
python convert_checkpoint.py --model_dir ./falcon/7b-instruct --dtype bfloat16 --output_dir ./falcon/7b-instruct/trt_ckpt/bf16/2-gpu/ --pp_size 2

trtllm-build --checkpoint_dir ./falcon/7b-instruct/trt_ckpt/bf16/2-gpu/ --gemm_plugin bfloat16 --remove_input_padding enable --gpt_attention_plugin bfloat16 --output_dir ./falcon/7b-instruct/trt_engines/bf16/2-gpu/ --gather_all_token_logits

python ../summarize.py --test_trt_llm --hf_model_dir ./falcon/7b-instruct --engine_dir ./falcon/7b-instruct/trt_engines/bf16/2-gpu/
```

### Expected behavior

The run completes and produces results similar to those obtained without pipeline parallelism and without `gather_all_token_logits`.

### Actual behavior

The `summarize.py` run crashes with a segmentation fault and the following stack trace:

```
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x8
[ 0] Tue Mar 12 12:30:27 2024[1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fa73ade1520]
[ 1] Tue Mar 12 12:30:27 2024[1,0]<stderr>:/virtualenv/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm7runtime10GptSession18executeContextStepERKSt6vectorINS0_15GenerationInputESaIS3_EERKS2_IiSaIiEEPKNS_13batch_manager16kv_cache_manager14KVCacheManagerE+0x5a2)[0x7fa455c9a7c2]
[ 2] Tue Mar 12 12:30:27 2024[1,0]<stderr>:/virtualenv/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm7runtime10GptSession15generateBatchedERSt6vectorINS0_16GenerationOutputESaIS3_EERKS2_INS0_15GenerationInputESaIS7_EERKNS0_14SamplingConfigERKSt8functionIFvibEE+0xc0b)[0x7fa455c9b89b]
[ 3] Tue Mar 12 12:30:27 2024[1,0]<stderr>:/virtualenv/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm7runtime10GptSession8generateERNS0_16GenerationOutputERKNS0_15GenerationInputERKNS0_14SamplingConfigE+0xc43)[0x7fa455c9d2f3]
[ 4] /virtualenv/lib/python3.10/site-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x42f79)[0x7fa484d80f79]
[ 5] /virtualenv/lib/python3.10/site-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x2d19e)[0x7fa484d6b19e]
[ 6] Tue Mar 12 12:30:27 2024[1,0]<stderr>:python(+0x15a10e)[0x55cc0703e10e]
[ 7] python(_PyObject_MakeTpCall+0x25b)[0x55cc07034a7b]
[ 8] python(+0x168acb)[0x55cc0704cacb]
[ 9] python(_PyEval_EvalFrameDefault+0x614a)[0x55cc0702ccfa]
[10] Tue Mar 12 12:30:27 2024[1,0]<stderr>:python(+0x1687f1)[0x55cc0704c7f1]
[11] python(PyObject_Call+0x122)[0x55cc0704d492]
[12] Tue Mar 12 12:30:27 2024[1,0]<stderr>:python(_PyEval_EvalFrameDefault+0x2a27)[0x55cc070295d7]
[13] Tue Mar 12 12:30:27 2024[1,0]<stderr>:python(_PyFunction_Vectorcall+0x7c)[0x55cc0703e9fc]
[14] python(_PyEval_EvalFrameDefault+0x198c)[0x55cc0702853c]
[15] python(_PyFunction_Vectorcall+0x7c)[0x55cc0703e9fc]
Tue Mar 12 12:30:27 2024[1,0]<stderr>:[16] python(_PyEval_EvalFrameDefault+0x6bd)[0x55cc0702726d]
[17] Tue Mar 12 12:30:27 2024[1,0]<stderr>:python(+0x13f9c6)[0x55cc070239c6]
[18] Tue Mar 12 12:30:27 2024[1,0]<stderr>:python(PyEval_EvalCode+0x86)[0x55cc07119256]
[19] Tue Mar 12 12:30:27 2024[1,0]<stderr>:python(+0x260108)[0x55cc07144108]
[20] Tue Mar 12 12:30:27 2024[1,0]<stderr>:python(+0x2599cb)[0x55cc0713d9cb]
[21] Tue Mar 12 12:30:27 2024[1,0]<stderr>:python(+0x25fe55)[0x55cc07143e55]
[22] Tue Mar 12 12:30:27 2024[1,0]<stderr>:python(_PyRun_SimpleFileObject+0x1a8)[0x55cc07143338]
[23] Tue Mar 12 12:30:27 2024[1,0]<stderr>:python(_PyRun_AnyFileObject+0x43)[0x55cc07142f83]
[24] Tue Mar 12 12:30:27 2024[1,0]<stderr>:python(Py_RunMain+0x2be)[0x55cc07135a5e]
[25] Tue Mar 12 12:30:27 2024[1,0]<stderr>:python(Py_BytesMain+0x2d)[0x55cc0710c02d]
[26] Tue Mar 12 12:30:27 2024[1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fa73adc8d90]
[27] Tue Mar 12 12:30:27 2024[1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fa73adc8e40]
[28] Tue Mar 12 12:30:27 2024[1,0]<stderr>:python(_start+0x25)[0x55cc0710bf25]
*** End of error message ***
```

If I add `--use_py_session` to the `summarize.py` command, I get the following error instead:

```
Traceback (most recent call last):
  File "/TensorRT-LLM/examples/falcon/../summarize.py", line 644, in <module>
    main(args)
  File "/TensorRT-LLM/examples/falcon/../summarize.py", line 388, in main
    output, *_ = eval_trt_llm(datapoint,
  File "/TensorRT-LLM/examples/falcon/../summarize.py", line 233, in eval_trt_llm
    outputs = runner.generate(
  File "/virtualenv/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner.py", line 642, in generate
    outputs = self._prepare_outputs(outputs, input_lengths)
  File "/virtualenv/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner.py", line 237, in _prepare_outputs
    context_logits = context_logits.flatten(end_dim=-2)
AttributeError: 'NoneType' object has no attribute 'flatten'
```
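The traceback shows that `context_logits` is `None` when `_prepare_outputs` calls `.flatten(end_dim=-2)` on it. The following is only a minimal sketch of a defensive check that would turn the `AttributeError` into a clearer message. It is not the library's code and not a fix for the root cause; the helper name `flatten_context_logits` is hypothetical, and why the logits are missing in this configuration is an open question in this report.

```python
from typing import Any, Optional


def flatten_context_logits(context_logits: Optional[Any]) -> Any:
    """Hypothetical helper: flatten all leading dimensions, or fail clearly."""
    if context_logits is None:
        # Assumption: in this configuration no context logits were returned
        # to the caller, which is what the traceback above suggests.
        raise RuntimeError(
            "context_logits is None; no context logits were returned "
            "for this rank or configuration"
        )
    # Same call as in the traceback; expects a torch.Tensor-like object.
    return context_logits.flatten(end_dim=-2)
```

A check like this only makes the failure easier to read; the underlying question of why no context logits are available with `--pp_size 2` remains.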

### Additional notes

We first saw this error in several of our own tasks that need gathered logits together with pipeline parallelism. We were then able to reproduce it with the official examples, so for simplicity this report describes only that reproduction.
