WIP: New Foundation Model #404
Draft: tarun-menta wants to merge 87 commits into dev from foundation-update
Conversation
Allows us to share it between models more easily later.
Custom cache with a sliding window, static shape, and special handling of image tokens.
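The commit message above describes the cache only at a high level. A minimal sketch of the idea, with static shape and a text sliding window, might look like the following. All names (`SlidingWindowCache`, `text_sliding_window`, the image/text split) are assumptions drawn from the commit messages, not the PR's actual implementation:

```python
# Illustrative only: a static-shape cache where image tokens are kept whole
# and text tokens live in a fixed-size ring buffer (the sliding window).
class SlidingWindowCache:
    def __init__(self, image_len, text_sliding_window):
        self.image = [None] * image_len          # image tokens, never evicted
        self.text = [None] * text_sliding_window # fixed-size text window
        self.text_len = 0                        # total text tokens seen
        self.window = text_sliding_window

    def append_text(self, token):
        # Once the window is full, overwrite the oldest slot (static shape:
        # the underlying buffer never grows).
        self.text[self.text_len % self.window] = token
        self.text_len += 1

    def valid_text(self):
        # Tokens currently attendable, oldest first.
        if self.text_len < self.window:
            return self.text[: self.text_len]
        start = self.text_len % self.window
        return self.text[start:] + self.text[:start]
```

In a real implementation the buffers would be preallocated tensors rather than Python lists, so shapes stay static across decode steps.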
Force-pushed from 5688df6 to ca9137d
The decode update is much faster now. It leverages the fact that flash attention now supports both left- and right-padding of the cache.
During decode, if the sliding window is not full, we update the attention mask in the last `sliding_window` positions so that only valid tokens are attended to. This update was not offset by `text_cache_start`, so we were actually writing into the image cache space. Simple change to include this offset.
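The offset bug above can be shown in a few lines. This is a hypothetical reconstruction assuming the cache layout is `[image tokens | text sliding window]` with text positions starting at `text_cache_start`; names are illustrative, not the PR's code:

```python
def mask_update_positions_buggy(text_cache_start, n_valid):
    # Missing offset: these indices land in the image cache space.
    return list(range(n_valid))

def mask_update_positions_fixed(text_cache_start, n_valid):
    # Offset by text_cache_start so the mask updates land in the
    # text sliding window, where the valid tokens actually live.
    return [text_cache_start + i for i in range(n_valid)]
```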
Do the top-k on GPU before moving to CPU; this avoids an expensive and slow GPU<->CPU memory transfer of the full logits.
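The data-size argument behind that change can be sketched as follows. In the real (PyTorch) code this would be `logits.topk(k)` on the GPU tensor followed by `.cpu()` on only the k values and indices; plain Python stands in here to keep the sketch runnable:

```python
def topk_then_transfer(logits, k):
    # Select the k largest (index, value) pairs first on the "device"...
    pairs = sorted(enumerate(logits), key=lambda p: p[1], reverse=True)[:k]
    # ...then "transfer" only k items instead of the full vocab-size vector.
    indices = [i for i, _ in pairs]
    values = [v for _, v in pairs]
    return values, indices
```

With a vocabulary of tens of thousands of tokens and k in the single digits, the transferred payload shrinks by several orders of magnitude per decode step.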
We wanted to limit the text token count to a maximum of `text_sliding_window`, but were clamping to a minimum instead, which broke the downstream logic in a lot of places. Also removed the dependence on Hugging Face caching.
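The clamp bug fits in one line each way. "Limit to a maximum of `text_sliding_window`" means taking `min`; the buggy version took `max`, which acts as a floor rather than a ceiling. Names are illustrative:

```python
def clamp_text_count_buggy(count, text_sliding_window):
    # Wrong: max() never shrinks count, it inflates small counts
    # up to the window size.
    return max(count, text_sliding_window)

def clamp_text_count_fixed(count, text_sliding_window):
    # Right: min() caps the count at the sliding window size.
    return min(count, text_sliding_window)
```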
The cache now supports "ragged" `input_ids`, where each batch element can have a different number of "true" tokens plus padding. This helps in many scenarios, including MTP and beacons, where some sequences have shorter predictions than others. Can be improved further.
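One way to picture the ragged-batch support is as a per-batch validity mask over right-padded sequences. This is a hypothetical helper for illustration, not the PR's code:

```python
def ragged_valid_mask(true_lengths, padded_len):
    # mask[b][t] is True where token t of batch element b is a real token,
    # False where it is padding. true_lengths gives each sequence's number
    # of "true" tokens; padded_len is the common padded length.
    return [[t < n for t in range(padded_len)] for n in true_lengths]
```

The cache update would then write only at `True` positions, so sequences with shorter predictions (as with MTP or beacons) do not pollute the cache with padding tokens.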
Integration of the new foundation model. Major changes are:
TODOs: