
Conversation

@ggerganov (Member) commented Nov 19, 2025

ref #17004 (comment)

The time spent in common/sampling can be non-negligible in some cases, so measure it and report it as sampling time. The time spent strictly inside llama_sampler is reported separately as samplers time. Also report unaccounted time, equal to total - sampling - peval - eval:

common_perf_print:    sampling time =      70.90 ms
common_perf_print:    samplers time =      45.40 ms /   520 tokens
common_perf_print:        load time =    1226.89 ms
common_perf_print: prompt eval time =      19.04 ms /     8 tokens (    2.38 ms per token,   420.28 tokens per second)
common_perf_print:        eval time =    4852.86 ms /   511 runs   (    9.50 ms per token,   105.30 tokens per second)
common_perf_print:       total time =    4945.97 ms /   519 tokens
common_perf_print: unaccounted time =       3.18 ms /   0.1 %      (total - sampling - peval - eval) / (total)
common_perf_print:    graphs reused =        508
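
For reference, the unaccounted figure in the log above is just the leftover of the wall-clock total after subtracting the three measured phases. A minimal sketch of that arithmetic follows; the struct and field names are illustrative, not the actual data layout used by common_perf_print:

#include <cstdio>

// Illustrative only: these field names are assumptions for the sake of the example,
// not the actual perf data structure.
struct perf_snapshot {
    double t_total_ms;    // wall-clock time since llama_perf_context_reset()
    double t_sampling_ms; // time spent in common/sampling (includes the samplers time)
    double t_p_eval_ms;   // prompt eval time
    double t_eval_ms;     // eval time
};

int main() {
    // numbers taken from the log above
    const perf_snapshot d = { 4945.97, 70.90, 19.04, 4852.86 };

    const double t_unacc_ms = d.t_total_ms - d.t_sampling_ms - d.t_p_eval_ms - d.t_eval_ms;
    const double t_unacc_pc = 100.0 * t_unacc_ms / d.t_total_ms;

    // prints ~3.17 ms / 0.1 %, matching the reported 3.18 ms up to rounding
    printf("unaccounted time = %10.2f ms / %5.1f %%\n", t_unacc_ms, t_unacc_pc);

    return 0;
}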

ggerganov force-pushed the gg/common-sampling-timing branch from 7aae8af to 81f238b on November 19, 2025
Comment on lines +152 to +158

// note: the time for chat template initialization is not negligible:
auto chat_templates = common_chat_templates_init(model, params.chat_template);

// start measuring performance timings from here
llama_perf_context_reset(ctx);

@ggerganov (Member Author):

Didn't realize until now that this chat template initialization call can take a significant amount of time (tens of milliseconds). Accounting for this, now the reported timings for sampling, prompt eval, eval and total add up nicely.

@ORippler (Contributor):

./build-x64-linux-gcc-reldbg/bin/llama-cli -m /mnt/share/gguf/gpt-oss-20b-mxfp4.gguf -n 2000 -p "What is the Capital of Sweden?" -no-cnv --top-k 0 --top-p 1 -fa 1

yields

common_perf_print:    sampling time =     926.95 ms
common_perf_print:    samplers time =     610.69 ms /  2007 tokens
common_perf_print:        load time =   18284.43 ms
common_perf_print: prompt eval time =      23.69 ms /     7 tokens (    3.38 ms per token,   295.43 tokens per second)
common_perf_print:        eval time =    5324.02 ms /  1999 runs   (    2.66 ms per token,   375.47 tokens per second)
common_perf_print:       total time =    6294.84 ms /  2006 tokens
common_perf_print: unaccounted time =      20.18 ms                (total - sampling - peval - eval)

and -n 200 yields

common_perf_print:    sampling time =      94.72 ms
common_perf_print:    samplers time =      61.52 ms /   207 tokens
common_perf_print:        load time =   18876.47 ms
common_perf_print: prompt eval time =      23.51 ms /     7 tokens (    3.36 ms per token,   297.73 tokens per second)
common_perf_print:        eval time =     528.65 ms /   199 runs   (    2.66 ms per token,   376.43 tokens per second)
common_perf_print:       total time =     649.87 ms /   206 tokens
common_perf_print: unaccounted time =       2.99 ms                (total - sampling - peval - eval)

Having only 0.5% unaccounted for is more than enough.

@ORippler (Contributor) left a comment:

The behavior is a bit unintuitive in the case of interactive mode in llama-cli:

  1. The samplers are apparently reset after every return to the user prompt (23 tokens vs. 72 runs).
  2. Unaccounted time includes time spent waiting on user input.
common_perf_print:    sampling time =       1.84 ms
common_perf_print:    samplers time =       0.63 ms /    23 tokens
common_perf_print:        load time =    3598.71 ms
common_perf_print: prompt eval time =      58.43 ms /    58 tokens (    1.01 ms per token,   992.67 tokens per second)
common_perf_print:        eval time =     165.06 ms /    72 runs   (    2.29 ms per token,   436.20 tokens per second)
common_perf_print:       total time =   16975.94 ms /   130 tokens
common_perf_print: unaccounted time =   16750.61 ms                (total - sampling - peval - eval)
common_perf_print:    graphs reused =         71

Comment on lines +31 to +39
struct common_time_meas {
    common_time_meas(int64_t & t_acc, bool disable = false);
    ~common_time_meas();

    const int64_t t_start_us;

    int64_t & t_acc;
};

Contributor:

This struct is effectively a code dup of time_meas defined in llama-impl.h. Not sure if this is something we would like to avoid.

@ggerganov (Member Author):
It's OK to duplicate, as it is quite simple functionality.
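
For context, this is the usual scope-based (RAII) timer pattern. A rough sketch of how such a helper can be implemented, presumably backed by ggml_time_us() like the time_meas it duplicates; the clock function here is a stand-in and the exact body of common_time_meas may differ:

#include <cstdint>

// stand-in for the real microsecond clock (ggml_time_us() in llama.cpp)
int64_t now_us();

struct common_time_meas {
    // record the start time, or -1 when measurement is disabled
    common_time_meas(int64_t & t_acc, bool disable = false)
        : t_start_us(disable ? -1 : now_us()), t_acc(t_acc) {}

    // on scope exit, add the elapsed time to the accumulator
    ~common_time_meas() {
        if (t_start_us >= 0) {
            t_acc += now_us() - t_start_us;
        }
    }

    const int64_t t_start_us;

    int64_t & t_acc;
};

// usage: everything inside the scope is accumulated into t_sampling_us
// {
//     common_time_meas tm(t_sampling_us);
//     ... sampling work ...
// } // destructor runs here and adds the elapsed time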

llama_sampler_reset(gsmpl->grmr);

llama_sampler_reset(gsmpl->chain);
gsmpl->reset();
Contributor:
Previously we did not reset the prev ring buffer, so I presume this is a bugfix?
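
For readers of the diff, a rough sketch of what a combined reset that also clears the prev ring buffer could look like; the member names (grmr, chain, prev) are taken from common/sampling, and this is an assumption about the helper's body rather than a copy of it:

// Sketch only: assumes common_sampler holds a grammar sampler, a sampler chain,
// and a ring buffer of previously sampled tokens, as in common/sampling.
void common_sampler::reset() {
    prev.clear();               // previously not cleared -> the suspected bugfix

    llama_sampler_reset(grmr);  // reset the grammar sampler
    llama_sampler_reset(chain); // reset the sampler chain
}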

LOG_INF("%s: eval time = %10.2f ms / %5d runs (%8.2f ms per token, %8.2f tokens per second)\n",
__func__, data.t_eval_ms, data.n_eval, data.t_eval_ms / data.n_eval, 1e3 / data.t_eval_ms * data.n_eval);
LOG_INF("%s: total time = %10.2f ms / %5d tokens\n", __func__, (t_end_ms - data.t_start_ms), (data.n_p_eval + data.n_eval));
LOG_INF("%s: unaccounted time = %10.2f ms / %5.1f %% (total - sampling - peval - eval) / (total)\n", __func__, t_unacc_ms, t_unacc_pc);
Contributor:
Suggested change:
- LOG_INF("%s: unaccounted time = %10.2f ms / %5.1f %% (total - sampling - peval - eval) / (total)\n", __func__, t_unacc_ms, t_unacc_pc);
+ LOG_INF("%s: unaccounted time = %10.2f ms / %5.1f %% (total - sampling - prompt eval - eval) / (total)\n", __func__, t_unacc_ms, t_unacc_pc);

@ggerganov (Member Author):
> The samplers are apparently reset after every return to the user prompt (23 tokens vs. 72 runs).

This seemed to be incorrect. I changed the logic to no longer reset the timings and the token count when calling llama_sampler_reset(). These can be reset by calling llama_perf_sampler_reset() if needed.

> Unaccounted time includes time spent waiting on user input.

I guess this is OK in some sense - we are not accounting for the wait time. Think we can leave it like this for now.
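
So, after this change, resetting the sampler state and resetting its perf counters are separate operations; roughly as below, where both functions are part of the public llama.h sampler API and the wrapper names are only for illustration:

#include "llama.h"

// start a new generation: clear the sampler state (grammar, penalties, previous
// tokens, ...) while keeping the accumulated timings and token counts
static void restart_generation(struct llama_sampler * smpl) {
    llama_sampler_reset(smpl);
}

// additionally start a fresh measurement window when one is wanted
static void restart_measurement(struct llama_sampler * smpl) {
    llama_perf_sampler_reset(smpl);
}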

ggerganov requested a review from ORippler on November 20, 2025
@ORippler (Contributor) left a comment:

Thanks!

ggerganov merged commit 196f508 into master on Nov 20, 2025 (74 checks passed).
ggerganov deleted the gg/common-sampling-timing branch on November 20, 2025.