
Conversation

@danbev danbev (Member) commented May 9, 2024

This commit changes the value assigned to llama_timings.n_p_eval when ctx->n_p_eval is 0 to be 0 instead of 1, which is the current value.

The motivation for this change is that if session caching is enabled, for example using the --prompt-cache main-session.txt command line argument for the main example, and the same prompt is used on subsequent runs, then the prompt tokens will not actually be passed to llama_decode, and n_p_eval will not be updated by llama_synchronize.

But the value of n_p_eval will be set to 1 by llama_get_timings because ctx->n_p_eval will be 0. This could be interpreted as 1 token having been evaluated for the prompt, which could be misleading for applications using this value.
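The change itself amounts to relaxing the lower bound applied to ctx->n_p_eval in llama_get_timings. A minimal standalone sketch of the before/after behavior (illustration only, not the actual llama.cpp source, where the clamp sits inside the llama_timings initializer):

```cpp
// Standalone sketch of the clamp that llama_get_timings applies to
// ctx->n_p_eval (illustration only, not the actual llama.cpp code).
#include <algorithm>
#include <cstdio>

int main() {
    const int ctx_n_p_eval = 0; // prompt fully restored from the session cache

    const int n_p_eval_master = std::max(1, ctx_n_p_eval); // current: reports 1 token
    const int n_p_eval_pr     = std::max(0, ctx_n_p_eval); // this PR: reports 0 tokens

    printf("master: n_p_eval = %d, this PR: n_p_eval = %d\n",
           n_p_eval_master, n_p_eval_pr);
    return 0;
}
```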


Example using main

Current behavior on master

First run:

./main -m models/tinyllama-1.1b-1t-openorca.Q2_K.gguf --prompt 'Hello world' -n 3 --verbose-prompt --prompt-cache main-session.txt
...
main: prompt: 'Hello world'
main: number of tokens in prompt = 3
     1 -> '<s>'
 15043 -> ' Hello'
  3186 -> ' world'

sampling: 
	repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
	top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800
	mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = 3, n_keep = 1


<s> Hello world, what is
llama_print_timings:        load time =     109,87 ms
llama_print_timings:      sample time =       0,07 ms /     3 runs   (    0,02 ms per token, 44776,12 tokens per second)
llama_print_timings: prompt eval time =     129,70 ms /     3 tokens (   43,23 ms per token,    23,13 tokens per second)
llama_print_timings:        eval time =     109,38 ms /     2 runs   (   54,69 ms per token,    18,29 tokens per second)
llama_print_timings:       total time =     240,37 ms /     5 tokens

Second run:

./main -m models/tinyllama-1.1b-1t-openorca.Q2_K.gguf --prompt 'Hello world' -n 3 --verbose-prompt --prompt-cache main-session.txt
...
main: prompt: 'Hello world'
main: number of tokens in prompt = 3
     1 -> '<s>'
 15043 -> ' Hello'
  3186 -> ' world'

sampling: 
	repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
	top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800
	mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = 3, n_keep = 1


<s> Hello world, this is
llama_print_timings:        load time =     136,40 ms
llama_print_timings:      sample time =       0,07 ms /     3 runs   (    0,02 ms per token, 46153,85 tokens per second)
llama_print_timings: prompt eval time =       0,00 ms /     1 tokens (    0,00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =     101,48 ms /     2 runs   (   50,74 ms per token,    19,71 tokens per second)
llama_print_timings:       total time =     123,87 ms /     3 tokens
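The odd-looking "1 tokens ... inf tokens per second" above follows directly from the clamp: the per-token and tokens-per-second figures are derived by dividing the (here essentially zero) prompt-eval time and the clamped count into each other. A rough standalone illustration of that arithmetic (not the actual printing code in llama.cpp, which may differ):

```cpp
// Rough illustration of how the "prompt eval time" line above ends up with
// "inf tokens per second" when the prompt comes from the session cache
// (illustration only).
#include <cstdio>

int main() {
    const double t_p_eval_ms = 0.0; // prompt restored from cache, ~0 ms measured
    const int    n_p_eval    = 1;   // master clamps the zero token count up to 1

    printf("prompt eval time = %.2f ms / %d tokens (%.2f ms per token, %.2f tokens per second)\n",
           t_p_eval_ms, n_p_eval,
           t_p_eval_ms / n_p_eval,        // 0.00 ms per token
           1e3 / t_p_eval_ms * n_p_eval); // division by ~0 ms -> inf
    return 0;
}
```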

Behavior using this pull request

The first run is the same as the first run above.

Second run:

./main -m models/tinyllama-1.1b-1t-openorca.Q2_K.gguf --prompt 'Hello world' -n 3 --verbose-prompt --prompt-cache main-session.txt
...
main: prompt: 'Hello world'
main: number of tokens in prompt = 3
     1 -> '<s>'
 15043 -> ' Hello'
  3186 -> ' world'

sampling: 
	repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
	top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800
	mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = 3, n_keep = 1


<s> Hello worlds in which
llama_print_timings:        load time =     122,11 ms
llama_print_timings:      sample time =       0,06 ms /     3 runs   (    0,02 ms per token, 50847,46 tokens per second)
llama_print_timings: prompt eval time =       0,00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =     100,23 ms /     2 runs   (   50,12 ms per token,    19,95 tokens per second)
llama_print_timings:       total time =     122,41 ms /     2 tokens
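With n_p_eval reported as 0, a consumer of the timings can distinguish "nothing was evaluated for the prompt" from "one token was evaluated". A hypothetical application-side check (example only, not code from this repository; llama_get_timings and the t_p_eval_ms / n_p_eval fields are the public timings API at the time of this PR):

```cpp
// Hypothetical application-side use of the timings API (example only, not
// part of this PR). With this change, n_p_eval == 0 signals that the prompt
// was served from the session cache rather than evaluated.
#include <cstdio>
#include "llama.h"

static void report_prompt_speed(struct llama_context * ctx) {
    const struct llama_timings timings = llama_get_timings(ctx);

    if (timings.n_p_eval == 0) {
        printf("prompt: no tokens evaluated (restored from session cache)\n");
    } else {
        printf("prompt: %.2f tokens per second\n",
               1e3 * timings.n_p_eval / timings.t_p_eval_ms);
    }
}
```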

github-actions bot (Contributor) commented May 9, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 542 iterations 🚀

Details (performance-related PR only):
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8630.71ms p(95)=20082.53ms fails=, finish reason: stop=472 truncated=70
  • Prompt processing (pp): avg=99.38tk/s p(95)=485.99tk/s
  • Token generation (tg): avg=34.71tk/s p(95)=49.54tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=llama_get_timings_n_p_eval commit=0db386f4ca67350eeee8dd1797f71d22e4daf01d

[chart] llamacpp:prompt_tokens_seconds (bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 542 iterations)
[chart] llamacpp:predicted_tokens_seconds (bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 542 iterations)
[chart] llamacpp:kv_cache_usage_ratio (bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 542 iterations)
[chart] llamacpp:requests_processing (bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 542 iterations)

@ggerganov ggerganov merged commit fd9f92b into ggml-org:master May 9, 2024
@mofosyne mofosyne added the bugfix and Review Complexity : Low labels May 10, 2024
@danbev danbev deleted the llama_get_timings_n_p_eval branch August 13, 2025 09:16