Replies: 2 comments
- Oh, someone has already opened an issue about this, too: #15518
- Quickly skimming the paper, a few impressions:
  I could be missing something, but it does not feel very relevant for local/edge use cases, where we usually don't have the resources to generate ensembles. But either way, I don't think it requires any changes in
- Hello, first let me clarify that I am no expert on this subject, so I apologize in advance if I get something wrong or accidentally waste your time.
I just saw a post on Twitter by an AI researcher from Meta making these claims:
"Introducing DeepConf: Deep Think with Confidence
🚀 First method to achieve 99.9% on AIME 2025 with open-source models! Using GPT-OSS-120B even without tools, we reached this almost-perfect accuracy while saving up to 85% generated tokens.
It also delivers many strong advantages for parallel thinking:
🔥 Performance boost: ~10% accuracy across models & datasets
⚡ Ultra-efficient: Up to 85% fewer tokens generated
🔧 Plug & play: Works with ANY existing model - zero training needed (no hyperparameter tuning as well!)
⭐ Easy to deploy: Just ~50 lines of code in vLLM (see PR below)"
Here is the link to it: https://x.com/jiawzhao/status/1958982524333678877
To me it sounds almost too good to be true, but maybe someone who understands the llama.cpp project better could take a look at it? I would be happy to test an implementation or contribute in any other way that I can, but unfortunately I do not have C++ programming experience.
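For context, the core idea behind DeepConf, as described in the announcement, is to sample many reasoning traces in parallel, score each trace by the model's own token-level confidence (its log-probabilities), discard low-confidence traces, and majority-vote over the survivors. The sketch below is a simplified illustration of that voting scheme, not the paper's exact algorithm: the function names, the plain mean-log-probability score, and the `keep_ratio` parameter are all assumptions made here for clarity (the paper uses more refined windowed/group confidence measures and can also stop low-confidence traces early to save tokens).

```python
# Illustrative sketch (NOT the paper's exact algorithm): confidence-filtered,
# confidence-weighted majority voting over parallel reasoning traces.
# Assumes each trace carries per-token log-probabilities and a final answer.
from collections import defaultdict
from math import exp

def trace_confidence(token_logprobs):
    """Score a trace by the mean log-probability of its tokens."""
    return sum(token_logprobs) / len(token_logprobs)

def deepconf_vote(traces, keep_ratio=0.5):
    """Drop the least-confident traces, then vote, weighting by confidence.

    traces: list of (answer, token_logprobs) pairs.
    keep_ratio: fraction of traces retained after filtering (assumed knob).
    """
    scored = sorted(((trace_confidence(lp), ans) for ans, lp in traces),
                    reverse=True)
    kept = scored[:max(1, int(len(scored) * keep_ratio))]
    votes = defaultdict(float)
    for conf, ans in kept:
        votes[ans] += exp(conf)  # weight = geometric-mean token probability
    return max(votes, key=votes.get)

# Toy example: three confident traces answer "42"; one low-confidence
# trace answers "41" and is filtered out before the vote.
traces = [
    ("42", [-0.1, -0.2, -0.1]),
    ("42", [-0.3, -0.1, -0.2]),
    ("41", [-2.5, -3.0, -2.8]),
    ("42", [-0.2, -0.2, -0.3]),
]
print(deepconf_vote(traces))  # -> 42
```

Since the filtering only needs per-token log-probabilities, which sampling backends already expose, this is consistent with the tweet's claim that it is a small, training-free change on top of existing inference code.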