
Conversation

@tarun-menta tarun-menta commented Jun 27, 2025

Integration of the new foundation model. Major changes are:

  • Moving the foundation model into its own reusable class. The prediction loop and cache now use a static shape with our custom sliding-window logic
  • Refactor layout and recognition to use the foundation predictor

TODOs:

  • Implement multi-token prediction
  • Cache logic is slow due to batch-wise looping. This needs to be vectorized
  • Surface top-k for layout

Allows us to share it between models more easily later
Custom cache with sliding window, static shape, and special handling of image tokens
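As a rough sketch of the idea (all names here are hypothetical, not the PR's actual implementation): a preallocated buffer keeps the image tokens in a fixed prefix, while text tokens roll through a fixed-size window after it, so the cache shape never changes during decode.

```python
# Hypothetical sketch of a static-shape cache with a sliding text window
# behind a fixed image-token region. Names are illustrative only.
import torch

class SlidingWindowCache:
    def __init__(self, batch, image_len, text_window, dim):
        self.image_len = image_len      # image tokens are never evicted
        self.text_window = text_window  # text region has a fixed size
        # one preallocated buffer; the static shape avoids reallocation
        self.keys = torch.zeros(batch, image_len + text_window, dim)
        self.text_len = torch.zeros(batch, dtype=torch.long)  # valid text tokens

    def append_text(self, new_keys):  # new_keys: (batch, 1, dim)
        # batch-wise loop: the slow part called out in the TODOs above
        for b in range(new_keys.shape[0]):
            if self.text_len[b] < self.text_window:
                pos = self.image_len + self.text_len[b]
                self.keys[b, pos] = new_keys[b, 0]
                self.text_len[b] += 1
            else:
                # window full: shift left within the text region only,
                # leaving the image prefix untouched
                text = self.keys[b, self.image_len:]
                self.keys[b, self.image_len:-1] = text[1:].clone()
                self.keys[b, -1] = new_keys[b, 0]
```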
Decode update is much faster now. Leverages the fact that flash attention now has an option to both left- and right-pad the cache
During decode, if the sliding window is not full, we should update the attention mask in the last `sliding_window` positions to only attend to valid tokens. This update was not offset by `text_cache_start`, so we were actually making updates in the image cache space.

Simple change to include this offset
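A minimal illustration of the bug class (hypothetical helper and parameter names, not the PR's code): without adding the `text_cache_start` offset, the mask writes land in the image region of the cache.

```python
# Hypothetical sketch: mask updates for a partially filled text window
# must be offset by `text_cache_start`, or they hit the image region.
import torch

def update_decode_mask(mask, text_cache_start, sliding_window, valid_text_len):
    # mask: (batch, cache_len), 1 = attend, 0 = masked
    for b in range(mask.shape[0]):
        start = text_cache_start  # the buggy version omitted this offset
        mask[b, start:start + sliding_window] = 0   # mask the text window...
        mask[b, start:start + valid_text_len[b]] = 1  # ...then unmask valid tokens
    return mask
```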
Do the top-k on GPU before moving to CPU; this avoids an expensive and slow GPU<->CPU memory transfer of the full logits
We wanted to limit the text token count to a max of `text_sliding_window`, but were clamping to a min instead, which messed up the logic in a lot of places downstream

Also removed dependence on Hugging Face caching
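The bug in miniature, using `torch.clamp`: `min=` enforces a floor, so short counts get inflated, while capping a count at the window size needs `max=`.

```python
import torch

text_sliding_window = 128
counts = torch.tensor([50, 128, 300])

# buggy: clamp(min=...) raises small counts instead of capping large ones
buggy = counts.clamp(min=text_sliding_window)
# fixed: clamp(max=...) caps the count at the window size
fixed = counts.clamp(max=text_sliding_window)
```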
Cache now supports "ragged" input_ids, where each batch element can have a different number of "true tokens" plus padding. This helps in lots of scenarios, including MTP and beacons, where some sequences have shorter predictions than others

Can be improved further
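A hedged sketch of the ragged append (all names hypothetical): each batch element contributes a different number of valid tokens from a padded step, as when multi-token prediction accepts more tokens for some sequences than others.

```python
# Hypothetical sketch of appending a ragged, right-padded step to a cache.
import torch

def append_ragged(cache, lengths, input_ids, num_valid):
    # cache:     (batch, max_len) preallocated token buffer
    # lengths:   valid tokens already in the cache, per batch element
    # input_ids: (batch, step) right-padded step of new tokens
    # num_valid: how many of the `step` tokens are real, per batch element
    for b in range(cache.shape[0]):
        n = int(num_valid[b])
        cache[b, lengths[b]:lengths[b] + n] = input_ids[b, :n]
        lengths[b] += n  # padding positions are simply skipped
    return cache, lengths
```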
@github-actions github-actions bot locked and limited conversation to collaborators Jul 29, 2025
@VikParuchuri VikParuchuri reopened this Jul 29, 2025