
Conversation

@ggerganov (Member) commented Nov 19, 2025

ref #17004 (comment)

The time spent in common/sampling can be non-negligible in some cases, so measure it and report it as sampling time. The time spent strictly inside llama_sampler is reported separately as samplers time. Also report unaccounted time, equal to total - sampling - peval - eval:

common_perf_print:    sampling time =      70.90 ms
common_perf_print:    samplers time =      45.40 ms /   520 tokens
common_perf_print:        load time =    1226.89 ms
common_perf_print: prompt eval time =      19.04 ms /     8 tokens (    2.38 ms per token,   420.28 tokens per second)
common_perf_print:        eval time =    4852.86 ms /   511 runs   (    9.50 ms per token,   105.30 tokens per second)
common_perf_print:       total time =    4945.97 ms /   519 tokens
common_perf_print: unaccounted time =       3.18 ms /   0.1 %      (total - sampling - peval - eval) / (total)
common_perf_print:    graphs reused =        508
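
For reference, the unaccounted figure in the log above is just the leftover of the wall-clock total after subtracting the three measured phases. A minimal sketch of that arithmetic follows; the struct and field names are illustrative, not the actual data layout used by common_perf_print:

#include <cstdio>

// Illustrative only: these field names are assumptions for the sake of the example,
// not the actual perf data structure.
struct perf_snapshot {
    double t_total_ms;    // wall-clock time since llama_perf_context_reset()
    double t_sampling_ms; // time spent in common/sampling (includes the samplers time)
    double t_p_eval_ms;   // prompt eval time
    double t_eval_ms;     // eval time
};

int main() {
    // numbers taken from the log above
    const perf_snapshot d = { 4945.97, 70.90, 19.04, 4852.86 };

    const double t_unacc_ms = d.t_total_ms - d.t_sampling_ms - d.t_p_eval_ms - d.t_eval_ms;
    const double t_unacc_pc = 100.0 * t_unacc_ms / d.t_total_ms;

    // prints ~3.17 ms / 0.1 %, matching the reported 3.18 ms up to rounding
    printf("unaccounted time = %10.2f ms / %5.1f %%\n", t_unacc_ms, t_unacc_pc);

    return 0;
}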

ggerganov force-pushed the gg/common-sampling-timing branch from 7aae8af to 81f238b on November 19, 2025
Comment on lines +152 to +158

// note: the time for chat template initialization is not negligible:
auto chat_templates = common_chat_templates_init(model, params.chat_template);

// start measuring performance timings from here
llama_perf_context_reset(ctx);

@ggerganov (Member Author):

Didn't realize until now that this chat template initialization call can take a significant amount of time (tens of milliseconds). Accounting for this, now the reported timings for sampling, prompt eval, eval and total add up nicely.

@ORippler (Contributor):

./build-x64-linux-gcc-reldbg/bin/llama-cli -m /mnt/share/gguf/gpt-oss-20b-mxfp4.gguf -n 2000 -p "What is the Capital of Sweden?" -no-cnv --top-k 0 --top-p 1 -fa 1

yields

common_perf_print:    sampling time =     926.95 ms
common_perf_print:    samplers time =     610.69 ms /  2007 tokens
common_perf_print:        load time =   18284.43 ms
common_perf_print: prompt eval time =      23.69 ms /     7 tokens (    3.38 ms per token,   295.43 tokens per second)
common_perf_print:        eval time =    5324.02 ms /  1999 runs   (    2.66 ms per token,   375.47 tokens per second)
common_perf_print:       total time =    6294.84 ms /  2006 tokens
common_perf_print: unaccounted time =      20.18 ms                (total - sampling - peval - eval)

and -n 200 yields

common_perf_print:    sampling time =      94.72 ms
common_perf_print:    samplers time =      61.52 ms /   207 tokens
common_perf_print:        load time =   18876.47 ms
common_perf_print: prompt eval time =      23.51 ms /     7 tokens (    3.36 ms per token,   297.73 tokens per second)
common_perf_print:        eval time =     528.65 ms /   199 runs   (    2.66 ms per token,   376.43 tokens per second)
common_perf_print:       total time =     649.87 ms /   206 tokens
common_perf_print: unaccounted time =       2.99 ms                (total - sampling - peval - eval)

Having only 0.5% unaccounted for is more than enough.

@ORippler (Contributor) left a comment:

The behavior is a bit unintuitive in the case of interactive mode in llama-cli:

  1. The samplers are apparently reset after every return to the user prompt (23 tokens vs. 72 runs).
  2. Unaccounted time includes time spent waiting on user input.
common_perf_print:    sampling time =       1.84 ms
common_perf_print:    samplers time =       0.63 ms /    23 tokens
common_perf_print:        load time =    3598.71 ms
common_perf_print: prompt eval time =      58.43 ms /    58 tokens (    1.01 ms per token,   992.67 tokens per second)
common_perf_print:        eval time =     165.06 ms /    72 runs   (    2.29 ms per token,   436.20 tokens per second)
common_perf_print:       total time =   16975.94 ms /   130 tokens
common_perf_print: unaccounted time =   16750.61 ms                (total - sampling - peval - eval)
common_perf_print:    graphs reused =         71

Comment on lines +31 to +39
struct common_time_meas {
    common_time_meas(int64_t & t_acc, bool disable = false);
    ~common_time_meas();

    const int64_t t_start_us;

    int64_t & t_acc;
};

Contributor:

This struct is effectively a code dup of time_meas defined in llama-impl.h. Not sure if this is something we would like to avoid.

@ggerganov (Member Author):
It's OK to duplicate, as it is quite simple functionality.
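
For context, this is the usual scope-based (RAII) timer pattern. A rough sketch of how such a helper can be implemented, presumably backed by ggml_time_us() like the time_meas it duplicates; the clock function here is a stand-in and the exact body of common_time_meas may differ:

#include <cstdint>

// stand-in for the real microsecond clock (ggml_time_us() in llama.cpp)
int64_t now_us();

struct common_time_meas {
    // record the start time, or -1 when measurement is disabled
    common_time_meas(int64_t & t_acc, bool disable = false)
        : t_start_us(disable ? -1 : now_us()), t_acc(t_acc) {}

    // on scope exit, add the elapsed time to the accumulator
    ~common_time_meas() {
        if (t_start_us >= 0) {
            t_acc += now_us() - t_start_us;
        }
    }

    const int64_t t_start_us;

    int64_t & t_acc;
};

// usage: everything inside the scope is accumulated into t_sampling_us
// {
//     common_time_meas tm(t_sampling_us);
//     ... sampling work ...
// } // destructor runs here and adds the elapsed time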

llama_sampler_reset(gsmpl->grmr);

llama_sampler_reset(gsmpl->chain);
gsmpl->reset();
Contributor:
Previously we did not reset the prev ring buffer, so I presume this is a bugfix?
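
For readers of the diff, a rough sketch of what a combined reset that also clears the prev ring buffer could look like; the member names (grmr, chain, prev) are taken from common/sampling, and this is an assumption about the helper's body rather than a copy of it:

// Sketch only: assumes common_sampler holds a grammar sampler, a sampler chain,
// and a ring buffer of previously sampled tokens, as in common/sampling.
void common_sampler::reset() {
    prev.clear();               // previously not cleared -> the suspected bugfix

    llama_sampler_reset(grmr);  // reset the grammar sampler
    llama_sampler_reset(chain); // reset the sampler chain
}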

LOG_INF("%s: eval time = %10.2f ms / %5d runs (%8.2f ms per token, %8.2f tokens per second)\n",
__func__, data.t_eval_ms, data.n_eval, data.t_eval_ms / data.n_eval, 1e3 / data.t_eval_ms * data.n_eval);
LOG_INF("%s: total time = %10.2f ms / %5d tokens\n", __func__, (t_end_ms - data.t_start_ms), (data.n_p_eval + data.n_eval));
LOG_INF("%s: unaccounted time = %10.2f ms / %5.1f %% (total - sampling - peval - eval) / (total)\n", __func__, t_unacc_ms, t_unacc_pc);
Contributor:
Suggested change:
- LOG_INF("%s: unaccounted time = %10.2f ms / %5.1f %% (total - sampling - peval - eval) / (total)\n", __func__, t_unacc_ms, t_unacc_pc);
+ LOG_INF("%s: unaccounted time = %10.2f ms / %5.1f %% (total - sampling - prompt eval - eval) / (total)\n", __func__, t_unacc_ms, t_unacc_pc);

@ggerganov (Member Author):
> The samplers are apparently reset after every return to the user prompt (23 tokens vs. 72 runs).

This seemed to be incorrect. I changed the logic to no longer reset the timings and the token count when calling llama_sampler_reset(). These can be reset by calling llama_perf_sampler_reset() if needed.

> Unaccounted time includes time spent waiting on user input.

I guess this is OK in some sense - we are not accounting for the wait time. Think we can leave it like this for now.
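
So, after this change, resetting the sampler state and resetting its perf counters are separate operations; roughly as below, where both functions are part of the public llama.h sampler API and the wrapper names are only for illustration:

#include "llama.h"

// start a new generation: clear the sampler state (grammar, penalties, previous
// tokens, ...) while keeping the accumulated timings and token counts
static void restart_generation(struct llama_sampler * smpl) {
    llama_sampler_reset(smpl);
}

// additionally start a fresh measurement window when one is wanted
static void restart_measurement(struct llama_sampler * smpl) {
    llama_perf_sampler_reset(smpl);
}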

ggerganov requested a review from ORippler on November 20, 2025
@ORippler (Contributor) left a comment:

Thanks!

ggerganov merged commit 196f508 into master on Nov 20, 2025 (74 checks passed).
ggerganov deleted the gg/common-sampling-timing branch on November 20, 2025.