
Eval bug: Crashes when model is loaded across a Vega VII card with Mi50s #17086

@optiqal

Description

Name and Version

build: 6963 (6db3d1f) with cc (Ubuntu 12.3.0-1ubuntu1~22.04.2) 12.3.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

HIP

Hardware

Vega VII + 2 x Mi50s

Models

No response

Problem description & steps to reproduce

When I run any model of any size split across the Vega VII and either or both of the Mi50s, this error occurs. Inference runs fine on the Vega VII alone and on either or both of the Mi50s alone, but not with the cards mixed. After tracing it down with AI, the issue seems to be that although all of these cards are gfx906, the Vega VII does not have ECC memory while the Mi50s do, so the devices report different target IDs. It seems a single build cannot cover both the ECC and non-ECC variants of gfx906. ROCm has target-ID features for both (gfx906:sramecc- and gfx906:sramecc+), but perhaps those are not exposed through llama.cpp's build options?
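
For what it's worth, the mismatch is also visible outside llama.cpp; a quick check along these lines (assuming rocminfo from the ROCm install is on PATH and that the ISA names it prints include the sramecc feature) should list the distinct target IDs:

# list the unique gfx906 target IDs reported by the HSA runtime
rocminfo | grep -o 'gfx906[^ )]*' | sort -u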

I do not believe this matters much for the issue, but the Vega VII is an MPX module in a 2019 Mac Pro. I am using Pop!_OS 22 with patches from T2 Linux, and ROCm 7.0.1 with the Tensile fix (copying the Tensile files from a rocBLAS build for gfx906). I have tested this on ROCm 7.1, 6.4, 6.3, and 6.2, and the crash has never changed.

In this particular case, I compiled with:

HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx906 -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16
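
A variant I have not yet tried would be to list both ECC variants explicitly, assuming the GPU_TARGETS list accepts full target IDs with feature suffixes (I have not confirmed that llama.cpp's HIP build forwards them unchanged to the compiler):

HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON \
          -DGPU_TARGETS="gfx906:sramecc-;gfx906:sramecc+" \
          -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16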

First Bad Commit

No response

Relevant log output

~/Desktop/LLAMA_NEW/llama.cpp/build/bin$ ./llama-server -m /home/name/Downloads/MiniMax-M2-UD-IQ3_XXS-00001-of-00002.gguf -ngl 30 -c 128000
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx906:sramecc-:xnack- (0x906), VMM: no, Wave Size: 64
  Device 1: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 2: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
main: setting n_parallel = 4 and kv_unified = true (add -kvu to disable this)
build: 6963 (6db3d1ffe) with cc (Ubuntu 12.3.0-1ubuntu1~22.04.2) 12.3.0 for x86_64-linux-gnu
system info: n_threads = 12, n_threads_batch = 12, total_threads = 24

system_info: n_threads = 12 (n_threads_batch = 12) / 24 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 23
main: loading model
srv    load_model: loading model '/home/name/Downloads/MiniMax-M2-UD-IQ3_XXS-00001-of-00002.gguf'
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) (0000:09:00.0) - 32728 MiB free
llama_model_load_from_file_impl: using device ROCm1 (AMD Radeon Graphics) (0000:10:00.0) - 32728 MiB free
llama_model_load_from_file_impl: using device ROCm2 (AMD Radeon Graphics) (0000:16:00.0) - 32728 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 44 key-value pairs and 809 tensors from /home/name/Downloads/MiniMax-M2-UD-IQ3_XXS-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = minimax-m2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Minimax-M2
llama_model_loader: - kv   3:                           general.basename str              = Minimax-M2
llama_model_loader: - kv   4:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   5:                         general.size_label str              = 256x4.9B
llama_model_loader: - kv   6:                            general.license str              = mit
llama_model_loader: - kv   7:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   8:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv   9:                     minimax-m2.block_count u32              = 62
llama_model_loader: - kv  10:                  minimax-m2.context_length u32              = 196608
llama_model_loader: - kv  11:                minimax-m2.embedding_length u32              = 3072
llama_model_loader: - kv  12:             minimax-m2.feed_forward_length u32              = 1536
llama_model_loader: - kv  13:            minimax-m2.attention.head_count u32              = 48
llama_model_loader: - kv  14:         minimax-m2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                  minimax-m2.rope.freq_base f32              = 5000000.000000
llama_model_loader: - kv  16: minimax-m2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  17:                    minimax-m2.expert_count u32              = 256
llama_model_loader: - kv  18:               minimax-m2.expert_used_count u32              = 8
llama_model_loader: - kv  19:            minimax-m2.attention.key_length u32              = 128
llama_model_loader: - kv  20:          minimax-m2.attention.value_length u32              = 128
llama_model_loader: - kv  21:              minimax-m2.expert_gating_func u32              = 2
llama_model_loader: - kv  22:      minimax-m2.expert_feed_forward_length u32              = 1536
llama_model_loader: - kv  23:            minimax-m2.rope.dimension_count u32              = 64
llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = minimax-m2
llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,200064]  = ["Ā", "ā", "Ă", "ă", "Ą", "ą", ...
llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,200064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,199744]  = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "e r...
llama_model_loader: - kv  29:                tokenizer.ggml.bos_token_id u32              = 200034
llama_model_loader: - kv  30:                tokenizer.ggml.eos_token_id u32              = 200020
llama_model_loader: - kv  31:            tokenizer.ggml.unknown_token_id u32              = 200021
llama_model_loader: - kv  32:            tokenizer.ggml.padding_token_id u32              = 200004
llama_model_loader: - kv  33:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  34:                    tokenizer.chat_template str              = {# Unsloth & community template fixes...
llama_model_loader: - kv  35:               general.quantization_version u32              = 2
llama_model_loader: - kv  36:                          general.file_type u32              = 23
llama_model_loader: - kv  37:                      quantize.imatrix.file str              = MiniMax-M2-GGUF/imatrix_unsloth.gguf
llama_model_loader: - kv  38:                   quantize.imatrix.dataset str              = unsloth_calibration_MiniMax-M2.txt
llama_model_loader: - kv  39:             quantize.imatrix.entries_count u32              = 496
llama_model_loader: - kv  40:              quantize.imatrix.chunks_count u32              = 697
llama_model_loader: - kv  41:                                   split.no u16              = 0
llama_model_loader: - kv  42:                        split.tensors.count i32              = 809
llama_model_loader: - kv  43:                                split.count u16              = 2
llama_model_loader: - type  f32:  373 tensors
llama_model_loader: - type q4_K:    1 tensors
llama_model_loader: - type q5_K:   20 tensors
llama_model_loader: - type q6_K:   11 tensors
llama_model_loader: - type iq3_xxs:  128 tensors
llama_model_loader: - type iq3_s:   44 tensors
llama_model_loader: - type iq4_xs:  232 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = IQ3_XXS - 3.0625 bpw
print_info: file size   = 87.17 GiB (3.27 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 200004 ('<fim_pad>')
load:   - 200005 ('<reponame>')
load:   - 200020 ('[e~[')
load: special tokens cache size = 54
load: token to piece cache size = 1.3355 MB
print_info: arch             = minimax-m2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 196608
print_info: n_embd           = 3072
print_info: n_layer          = 62
print_info: n_head           = 48
print_info: n_head_kv        = 8
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 6
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 1536
print_info: n_expert         = 256
print_info: n_expert_used    = 8
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 196608
print_info: rope_finetuned   = unknown
print_info: model type       = 230B.A10B
print_info: model params     = 228.69 B
print_info: general.name     = Minimax-M2
print_info: vocab type       = BPE
print_info: n_vocab          = 200064
print_info: n_merges         = 199744
print_info: BOS token        = 200034 ']~!b['
print_info: EOS token        = 200020 '[e~['
print_info: UNK token        = 200021 ']!d~['
print_info: PAD token        = 200004 '<fim_pad>'
print_info: LF token         = 10 'Ċ'
print_info: FIM PRE token    = 200001 '<fim_prefix>'
print_info: FIM SUF token    = 200003 '<fim_suffix>'
print_info: FIM MID token    = 200002 '<fim_middle>'
print_info: FIM PAD token    = 200004 '<fim_pad>'
print_info: FIM REP token    = 200005 '<reponame>'
print_info: EOG token        = 200004 '<fim_pad>'
print_info: EOG token        = 200005 '<reponame>'
print_info: EOG token        = 200020 '[e~['
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 30 repeating layers to GPU
load_tensors: offloaded 30/63 layers to GPU
load_tensors:   CPU_Mapped model buffer size = 45174.95 MiB
load_tensors:        ROCm0 model buffer size = 14651.21 MiB
load_tensors:        ROCm1 model buffer size = 14307.85 MiB
load_tensors:        ROCm2 model buffer size = 15129.94 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 4
llama_context: n_ctx         = 128000
llama_context: n_ctx_seq     = 128000
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = true
llama_context: freq_base     = 5000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (128000) < n_ctx_train (196608) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     3.05 MiB
llama_kv_cache:        CPU KV buffer size = 16000.00 MiB
llama_kv_cache:      ROCm0 KV buffer size =  5000.00 MiB
llama_kv_cache:      ROCm1 KV buffer size =  5000.00 MiB
llama_kv_cache:      ROCm2 KV buffer size =  5000.00 MiB
llama_kv_cache: size = 31000.00 MiB (128000 cells,  62 layers,  4/1 seqs), K (f16): 15500.00 MiB, V (f16): 15500.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:      ROCm0 compute buffer size =  1532.56 MiB
llama_context:      ROCm1 compute buffer size =   210.51 MiB
llama_context:      ROCm2 compute buffer size =   210.51 MiB
llama_context:  ROCm_Host compute buffer size =   256.01 MiB
llama_context: graph nodes  = 3975
llama_context: graph splits = 486 (with bs=512), 5 (with bs=1)
common_init_from_params: added <fim_pad> logit bias = -inf
common_init_from_params: added <reponame> logit bias = -inf
common_init_from_params: added [e~[ logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 128000
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 503
ggml_cuda_compute_forward: ADD failed
ROCm error: invalid device function
  current device: 0, in function ggml_cuda_compute_forward at /home/name/Desktop/LLAMA_NEW/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2722
/home/name/Desktop/LLAMA_NEW/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:90: ROCm error
  err
[New LWP 1370285]
[New LWP 1370288]
[New LWP 1370289]
[New LWP 1370290]
[New LWP 1370291]
[New LWP 1370292]
[New LWP 1370293]
[New LWP 1370294]
[New LWP 1370295]
[New LWP 1370296]
[New LWP 1370297]
[New LWP 1370298]
[New LWP 1370299]
[New LWP 1370300]
[New LWP 1370301]
[New LWP 1370302]
[New LWP 1370303]
[New LWP 1370304]
[New LWP 1370305]
[New LWP 1370306]
[New LWP 1370307]
[New LWP 1370308]
[New LWP 1370309]
[New LWP 1370310]
[New LWP 1370311]
[New LWP 1370312]
[New LWP 1370314]
[New LWP 1370326]
[New LWP 1370327]
[New LWP 1370328]
[New LWP 1370329]
[New LWP 1370330]
[New LWP 1370331]
[New LWP 1370332]
[New LWP 1370333]
[New LWP 1370334]
[New LWP 1370335]
[New LWP 1370336]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007313506ea42f in __GI___wait4 (pid=1370353, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0  0x00007313506ea42f in __GI___wait4 (pid=1370353, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x0000731350d7058b in ggml_print_backtrace () from /home/name/Desktop/LLAMA_NEW/llama.cpp/build/bin/libggml-base.so
#2  0x0000731350d70723 in ggml_abort () from /home/name/Desktop/LLAMA_NEW/llama.cpp/build/bin/libggml-base.so
#3  0x000073134f85def2 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /home/name/Desktop/LLAMA_NEW/llama.cpp/build/bin/libggml-hip.so
#4  0x000073134f865a54 in evaluate_and_capture_cuda_graph(ggml_backend_cuda_context*, ggml_cgraph*, bool&, bool&, bool&) () from /home/name/Desktop/LLAMA_NEW/llama.cpp/build/bin/libggml-hip.so
#5  0x000073134f8630bf in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /home/name/Desktop/LLAMA_NEW/llama.cpp/build/bin/libggml-hip.so
#6  0x0000731350d8be57 in ggml_backend_sched_graph_compute_async () from /home/name/Desktop/LLAMA_NEW/llama.cpp/build/bin/libggml-base.so
#7  0x0000731350ea0811 in llama_context::graph_compute(ggml_cgraph*, bool) () from /home/name/Desktop/LLAMA_NEW/llama.cpp/build/bin/libllama.so
#8  0x0000731350ea20cc in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /home/name/Desktop/LLAMA_NEW/llama.cpp/build/bin/libllama.so
#9  0x0000731350ea7cb9 in llama_context::decode(llama_batch const&) () from /home/name/Desktop/LLAMA_NEW/llama.cpp/build/bin/libllama.so
#10 0x0000731350ea8c2f in llama_decode () from /home/name/Desktop/LLAMA_NEW/llama.cpp/build/bin/libllama.so
#11 0x0000561f239cc7a8 in common_init_from_params(common_params&) ()
#12 0x0000561f2389f349 in server_context::load_model(common_params const&) ()
#13 0x0000561f238327e8 in main ()
[Inferior 1 (process 1370284) detached]
Aborted (core dumped)
