
Conversation

ggerganov (Member) commented Aug 13, 2025

ref #15082 (comment)

The server now creates checkpoints of the SWA memory in order to minimize the amount of context reprocessing. A SWA checkpoint captures the state of the cache (both the KV cells and the KV data). Only the SWA part is stored in the checkpoint, so its size is relatively small (proportional to the SWA window that the model uses).

By default, up to 3 checkpoints are kept per slot; this can be configured with --swa-checkpoints N.

A checkpoint is created upon finishing the processing of a prompt:

// make a checkpoint with the SWA memory
// checkpoints are needed only if we are not using "--swa-full"
if (llama_model_n_swa(model) > 0 && !params_base.swa_full) {
    if (slot.swa_checkpoints.size() >= SERVER_MAX_SWA_CHECKPOINTS_PER_SLOT) {
        {
            // log the oldest checkpoint before erasing it
            const auto & cur = slot.swa_checkpoints.front();

            SLT_WRN(slot, "SWA checkpoint erase, pos_min = %d, pos_max = %d, size = %.3f MiB\n", cur.pos_min, cur.pos_max, (float) cur.data.size() / 1024 / 1024);
        }

        slot.swa_checkpoints.erase(slot.swa_checkpoints.begin());
    }

    const size_t swa_size = llama_state_seq_get_size_ext(ctx, slot.id, LLAMA_STATE_SEQ_FLAGS_SWA_ONLY);

    auto & cur = slot.swa_checkpoints.emplace_back(swa_checkpoint{
        /*.pos_min = */ llama_memory_seq_pos_min(llama_get_memory(ctx), slot.id),
        /*.pos_max = */ llama_memory_seq_pos_max(llama_get_memory(ctx), slot.id),
        /*.data    = */ std::vector<uint8_t>(swa_size),
    });

    llama_state_seq_get_data_ext(ctx, cur.data.data(), swa_size, slot.id, LLAMA_STATE_SEQ_FLAGS_SWA_ONLY);

    SLT_WRN(slot, "SWA checkpoint create, pos_min = %d, pos_max = %d, size = %.3f MiB\n", cur.pos_min, cur.pos_max, (float) swa_size / 1024 / 1024);
}

Checkpoints are created only if the --swa-full argument is not specified. With that argument, we can branch from any past position of the context (so checkpoints are unnecessary), but the drawback is that the SWA memory is much larger in that case.
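
To illustrate the reuse path, here is a small self-contained sketch of how a server might choose which checkpoint to restore for a new request: pick the one with the largest pos_max that does not extend past the common prefix between the cached tokens and the new prompt. The struct and function names below are hypothetical, not the actual llama-server code:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical checkpoint record, mirroring the fields stored by the server.
struct swa_checkpoint {
    int32_t pos_min;
    int32_t pos_max;
    std::vector<uint8_t> data; // serialized SWA state
};

// Return the index of the best checkpoint to restore: the one with the
// largest pos_max that does not exceed the common prompt prefix.
// Returns -1 when no checkpoint is usable (full reprocessing is needed).
static int pick_checkpoint(const std::vector<swa_checkpoint> & cps, int32_t common_prefix) {
    int best = -1;
    for (int i = 0; i < (int) cps.size(); ++i) {
        if (cps[i].pos_max > common_prefix) {
            continue; // checkpoint contains tokens that diverge from the new prompt
        }
        if (best == -1 || cps[i].pos_max > cps[best].pos_max) {
            best = i;
        }
    }
    return best;
}
```

Restoring the selected checkpoint still requires reprocessing the tokens between its pos_max and the end of the new prompt, but that is typically far cheaper than starting from position 0.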

libllama API changes

  • Add llama_state_seq_get_size_ext()
  • Add llama_state_seq_get_data_ext()
  • Add llama_state_seq_set_data_ext()
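
The new functions follow the existing two-call convention (query the size, then copy into a caller-provided buffer), with an extra flags argument such as LLAMA_STATE_SEQ_FLAGS_SWA_ONLY. Below is a self-contained sketch of that pattern using stub stand-ins for the real functions; the real ones take a struct llama_context * and serialize the actual KV cache:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Stub "context": just the serialized SWA state of one sequence.
// Stand-in for a real struct llama_context.
struct stub_ctx {
    std::vector<uint8_t> swa_state;
};

const uint32_t STATE_SEQ_FLAGS_SWA_ONLY = 1; // mirrors LLAMA_STATE_SEQ_FLAGS_SWA_ONLY

// Stand-in for llama_state_seq_get_size_ext: exact size needed for the copy.
size_t state_seq_get_size_ext(const stub_ctx & ctx, int /*seq_id*/, uint32_t /*flags*/) {
    return ctx.swa_state.size();
}

// Stand-in for llama_state_seq_get_data_ext: copy state into dst.
size_t state_seq_get_data_ext(const stub_ctx & ctx, uint8_t * dst, size_t size, int /*seq_id*/, uint32_t /*flags*/) {
    if (size < ctx.swa_state.size()) {
        return 0; // buffer too small
    }
    std::memcpy(dst, ctx.swa_state.data(), ctx.swa_state.size());
    return ctx.swa_state.size();
}

// Stand-in for llama_state_seq_set_data_ext: restore state from src.
size_t state_seq_set_data_ext(stub_ctx & ctx, const uint8_t * src, size_t size, int /*seq_id*/, uint32_t /*flags*/) {
    ctx.swa_state.assign(src, src + size);
    return size; // positive return means success
}

// The save half of the round trip, as the server does when creating a checkpoint.
std::vector<uint8_t> save_checkpoint(const stub_ctx & ctx, int seq_id) {
    const size_t n = state_seq_get_size_ext(ctx, seq_id, STATE_SEQ_FLAGS_SWA_ONLY);
    std::vector<uint8_t> buf(n);
    state_seq_get_data_ext(ctx, buf.data(), n, seq_id, STATE_SEQ_FLAGS_SWA_ONLY);
    return buf;
}
```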

TODO:

  • Update libllama interface to specify SWA and non-SWA state saving
  • Sanity-checks that the SWA checkpoint is valid
  • Clean-up llama-server

ggerganov (Member, Author)

@slaren Let me know if this works on your end. I'll look to clean this up and prepare for merge.

Comment on lines 1960 to 1963
if (!hparams.is_swa(il)) {
continue;
}

ggerganov (Member, Author)

Temporary hack to store just the SWA data

slaren (Member) commented Aug 13, 2025

I have been trying this for a while with the 20B and 120B models, and it seems to work as expected. It definitely helps a lot: instead of several minutes reprocessing the entire context before every interaction, it takes only a few seconds before it starts generating the response. This dramatically improves the usability of the 120B model on systems with limited VRAM.

ggerganov (Member, Author)

Suggestions how to update the llama_state_seq_... API to support this use case are welcome:

llama.cpp/include/llama.h

Lines 835 to 857 in 0b64ee5

    // Get the exact size needed to copy the state of a single sequence
    LLAMA_API size_t llama_state_seq_get_size(
            struct llama_context * ctx,
                    llama_seq_id   seq_id);

    // Copy the state of a single sequence into the specified buffer
    LLAMA_API size_t llama_state_seq_get_data(
            struct llama_context * ctx,
                         uint8_t * dst,
                          size_t   size,
                    llama_seq_id   seq_id);

    // Copy the sequence data (originally copied with `llama_state_seq_get_data`) into the specified sequence
    // Returns:
    //  - Positive: Ok
    //  - Zero: Failed to load
    LLAMA_API size_t llama_state_seq_set_data(
            struct llama_context * ctx,
                   const uint8_t * src,
                          size_t   size,
                    llama_seq_id   dest_seq_id);

I'll wrap this up tomorrow.

slaren (Member) commented Aug 13, 2025

> Suggestions on how to update the llama_state_seq_... API to support this use case are welcome

I can't think of anything better than just adding a flag parameter to use only the SWA layers; this use case is too specific to generalize. It could be a generic bit-flags parameter that can be extended with additional flags in the future if necessary.
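
A generic bit-flags parameter along these lines might look like the following sketch. The typedef and the reserved future bits are illustrative; only the SWA-only flag appears in this PR:

```cpp
#include <cassert>
#include <cstdint>

// Extensible bit-flags type for the llama_state_seq_*_ext functions (illustrative).
typedef uint32_t llama_state_seq_flags;

enum : uint32_t {
    LLAMA_STATE_SEQ_FLAGS_SWA_ONLY = 1u << 0, // save/load only the SWA layers
    // future options would claim the next bits: 1u << 1, 1u << 2, ...
};

// Callers combine flags with bitwise-or; implementations test them with bitwise-and.
inline bool state_flag_set(llama_state_seq_flags flags, llama_state_seq_flags f) {
    return (flags & f) != 0;
}
```

Passing 0 would keep the previous behavior (full per-sequence state), so existing callers only need to add one argument.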

@ggerganov ggerganov marked this pull request as ready for review August 14, 2025 08:10
@ggerganov ggerganov requested a review from ngxson as a code owner August 14, 2025 08:10
ggerganov (Member, Author)

This is ready for review and testing

@ggerganov ggerganov merged commit d32e03f into master Aug 14, 2025
45 of 47 checks passed
@ggerganov ggerganov deleted the gg/server-swa-checkpoints branch August 14, 2025 11:59
ddh0 (Contributor) commented Aug 15, 2025

Could the changes in this PR also be applied to fix #14625? (Jamba)

ggerganov (Member, Author)

I think so. Likely the change is as simple as respecting the SWA flag in the hybrid cache implementation.
