common : more accurate sampling timing #17382
Conversation
7aae8af to 81f238b (compare)
// note: the time for chat template initialization is not negligible:
auto chat_templates = common_chat_templates_init(model, params.chat_template);

// start measuring performance timings from here
llama_perf_context_reset(ctx);
Didn't realize until now that this chat template initialization call can take a significant amount of time (tens of milliseconds). Accounting for this, now the reported timings for sampling, prompt eval, eval and total add up nicely.
Running with -n 200, having only 0.5% unaccounted for is more than enough.
The behavior is a bit unintuitive in the case of interactive mode in llama-cli:
- The samplers are apparently reset after every return to the user prompt (23 tokens vs. 72 runs).
- Unaccounted time includes time spent waiting on user input.
common_perf_print: sampling time = 1.84 ms
common_perf_print: samplers time = 0.63 ms / 23 tokens
common_perf_print: load time = 3598.71 ms
common_perf_print: prompt eval time = 58.43 ms / 58 tokens ( 1.01 ms per token, 992.67 tokens per second)
common_perf_print: eval time = 165.06 ms / 72 runs ( 2.29 ms per token, 436.20 tokens per second)
common_perf_print: total time = 16975.94 ms / 130 tokens
common_perf_print: unaccounted time = 16750.61 ms (total - sampling - peval - eval)
common_perf_print: graphs reused = 71
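For reference, the unaccounted figure above is simply the difference of the reported timings, and in this run it is dominated by the time spent waiting for user input:

    16975.94 - 1.84 - 58.43 - 165.06 = 16750.61 ms   (total - sampling - prompt eval - eval)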
struct common_time_meas {
    common_time_meas(int64_t & t_acc, bool disable = false);
    ~common_time_meas();

    const int64_t t_start_us;

    int64_t & t_acc;
};
This struct is effectively a code dup of time_meas defined in llama-impl.h. Not sure if this is something we would like to avoid.
It's OK to duplicate as it is quite simple functionality.
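For context, a minimal sketch of how such an RAII timer is typically implemented, assuming it mirrors time_meas from llama-impl.h (the actual definition in this PR may differ):

    common_time_meas::common_time_meas(int64_t & t_acc, bool disable) :
        t_start_us(disable ? -1 : ggml_time_us()),
        t_acc(t_acc) {}

    common_time_meas::~common_time_meas() {
        // on scope exit, add the elapsed time to the referenced accumulator
        if (t_start_us >= 0) {
            t_acc += ggml_time_us() - t_start_us;
        }
    }

The disable flag lets a caller turn the measurement into a no-op without changing the call site.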
llama_sampler_reset(gsmpl->grmr);
llama_sampler_reset(gsmpl->chain);
gsmpl->reset();
Previously we did not reset the prev ring buffer; I presume this is a bugfix?
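For illustration, a rough sketch of what the consolidated reset could look like, assuming prev (mentioned above) is a ring buffer with a clear() method; the exact body is not taken from the diff:

    void common_sampler::reset() {
        // clear the ring buffer of previously sampled tokens (the part that was not reset before)
        prev.clear();

        // reset the grammar sampler and the sampler chain, as before
        llama_sampler_reset(grmr);
        llama_sampler_reset(chain);
    }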
common/sampling.cpp (outdated)
| LOG_INF("%s: eval time = %10.2f ms / %5d runs (%8.2f ms per token, %8.2f tokens per second)\n", | ||
| __func__, data.t_eval_ms, data.n_eval, data.t_eval_ms / data.n_eval, 1e3 / data.t_eval_ms * data.n_eval); | ||
| LOG_INF("%s: total time = %10.2f ms / %5d tokens\n", __func__, (t_end_ms - data.t_start_ms), (data.n_p_eval + data.n_eval)); | ||
| LOG_INF("%s: unaccounted time = %10.2f ms / %5.1f %% (total - sampling - peval - eval) / (total)\n", __func__, t_unacc_ms, t_unacc_pc); |
| LOG_INF("%s: unaccounted time = %10.2f ms / %5.1f %% (total - sampling - peval - eval) / (total)\n", __func__, t_unacc_ms, t_unacc_pc); | |
| LOG_INF("%s: unaccounted time = %10.2f ms / %5.1f %% (total - sampling - prompt eval - eval) / (total)\n", __func__, t_unacc_ms, t_unacc_pc); |
This seemed to be incorrect. I changed the logic to no longer reset the timings and the token count when calling
I guess this is OK in some sense: we are not accounting for the wait time. I think we can leave it like this for now.
ORippler left a comment:
Thanks!
ref #17004 (comment)
The time spent in common/sampling in some cases can be non-negligible, so measure it and report it as sampling time. The time spent just in llama_sampler is reported as samplers time. Also report unaccounted time equal to total - sampling - peval - eval:
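A rough illustration of how such an accumulator could be fed with the RAII timer; the wrapper function and the t_sampling_total_us variable are hypothetical names for the sake of the example, while common_sampler_sample is the existing entry point in common/sampling:

    // assumed accumulator that would later be reported as "sampling time"
    static int64_t t_sampling_total_us = 0;

    static llama_token sample_timed(common_sampler * gsmpl, llama_context * ctx, int idx) {
        // the RAII helper adds the elapsed time to the accumulator when it goes out of scope
        common_time_meas tm(t_sampling_total_us);

        return common_sampler_sample(gsmpl, ctx, idx);
    }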