sampling : optimize samplers by reusing bucket sort #15665
Conversation
Logically this seems correct but I think it would be better to integrate the limit for top_p
into the bucket sort function. You could keep track of the probability content per bucket and only process as many as would be needed to reach the required threshold. Though quite honestly I think the top 256 tokens are going to be enough for all practical applications so this probably doesn't matter much.
```diff
 if (post_sampling) {
-    const auto * cur_p = common_sampler_get_candidates(slot.smpl);
+    const auto * cur_p = common_sampler_get_candidates(slot.smpl, true);
```
In situations where the application requires the candidates to be sorted, using `common_sampler_get_candidates(smpl, true)` will perform the sorting for convenience.
Force-pushed from 37d3dd4 to 2efc7e4
Keeping the buffers to avoid allocations is probably overkill. If there is a significant overhead from creating new vectors on every call, I think it is more likely that the issue is the memory being initialized in the `resize` call.
src/llama-sampling.cpp (outdated)
```cpp
static void llama_token_data_array_sort_inplace(llama_token_data_array * cur_p, int k, llama_sort_data & buf) {
    static const auto comp = [](const llama_token_data & a, const llama_token_data & b) {
        return a.logit > b.logit;
    };

    if (k <= 128) {
        std::partial_sort(cur_p->data, cur_p->data + k, cur_p->data + cur_p->size, comp);
        return;
    }

    llama_token_data_array_sort(*cur_p, k, buf);

    std::memcpy(cur_p->data, buf.data.data(), k*sizeof(llama_token_data));
}
```
`cur_p->sorted = true` could be set here; currently it is done after every call to this function.
The last `memcpy` copying only the first `k` elements does not seem correct. If you do a partial sort, you still need to copy all the elements. If the intention is to reduce the size, then `cur_p->size` should be updated here.
Also, in C++ code `std::copy` may be preferred to `memcpy`, since it is type-safe.
One more thing: it was not obvious to me that the `k` parameter is used to do a partial sort. The name of the argument is not clear enough by itself, so this should be documented. Renaming the function to "partial_sort" could also help.
Force-pushed from d74a6ab to 6d2a38c
Thanks. I removed the helper buffers and updated the code as recommended.
Co-authored-by: Johannes Gäßler <[email protected]>
The major change here is that the `dist` sampler no longer sorts. A lot of the tests in `test-sampling` assume that the outputs are sorted, hence the many reorderings of the values there. If we don't make this change, then using a sampler chain such as the one recommended by OpenAI for `gpt-oss` (`top-p=1 + min-p=0 + top-k=0 (i.e. disabled)`) would result in very slow sorting of the full vocabulary in the `dist` sampler, because there is nothing to cut the low-probability tokens.

- `top-p` and `min-p` samplers
- `dist` sampler (255b070)
- `common_sampler_get_candidates()` can now explicitly return sorted candidates if requested via `bool do_sort`

`libllama` API changes:

- `llama_sampler_init_softmax()`
- the `dist` sampler created with `llama_sampler_init_dist()` will no longer sort the candidates. The old behaviour of implicitly sorting the candidates was not documented, so technically this is not a breaking change, but it's possible that user code assumed that the results will be sorted - therefore making a note.