WIP: New Foundation Model #404
Draft: tarun-menta wants to merge 87 commits into dev from foundation-update
Conversation
Allows us to share it between models more easily later.
Custom cache with a sliding window, static shape, and special handling of image tokens.
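The commit message above describes the cache only at a high level. A minimal sketch of the idea, with static shape and a text sliding window, might look like the following. All names (`SlidingWindowCache`, `text_sliding_window`, the image/text split) are assumptions drawn from the commit messages, not the PR's actual implementation:

```python
# Illustrative only: a static-shape cache where image tokens are kept whole
# and text tokens live in a fixed-size ring buffer (the sliding window).
class SlidingWindowCache:
    def __init__(self, image_len, text_sliding_window):
        self.image = [None] * image_len          # image tokens, never evicted
        self.text = [None] * text_sliding_window # fixed-size text window
        self.text_len = 0                        # total text tokens seen
        self.window = text_sliding_window

    def append_text(self, token):
        # Once the window is full, overwrite the oldest slot (static shape:
        # the underlying buffer never grows).
        self.text[self.text_len % self.window] = token
        self.text_len += 1

    def valid_text(self):
        # Tokens currently attendable, oldest first.
        if self.text_len < self.window:
            return self.text[: self.text_len]
        start = self.text_len % self.window
        return self.text[start:] + self.text[:start]
```

In a real implementation the buffers would be preallocated tensors rather than Python lists, so shapes stay static across decode steps.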
Force-pushed from 5688df6 to ca9137d
The decode update is much faster now. It leverages the fact that flash attention now supports both left- and right-padding of the cache.
During decode, if the sliding window is not full, we update the attention mask in the last `sliding_window` positions so that only valid tokens are attended to. This update was not offset by `text_cache_start`, so we were actually writing into the image cache space. Simple change to include this offset.
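The offset bug above can be shown in a few lines. This is a hypothetical reconstruction assuming the cache layout is `[image tokens | text sliding window]` with text positions starting at `text_cache_start`; names are illustrative, not the PR's code:

```python
def mask_update_positions_buggy(text_cache_start, n_valid):
    # Missing offset: these indices land in the image cache space.
    return list(range(n_valid))

def mask_update_positions_fixed(text_cache_start, n_valid):
    # Offset by text_cache_start so the mask updates land in the
    # text sliding window, where the valid tokens actually live.
    return [text_cache_start + i for i in range(n_valid)]
```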
Do the top-k on GPU before moving to CPU; this avoids an expensive and slow GPU<->CPU memory transfer of the full logits.
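The data-size argument behind that change can be sketched as follows. In the real (PyTorch) code this would be `logits.topk(k)` on the GPU tensor followed by `.cpu()` on only the k values and indices; plain Python stands in here to keep the sketch runnable:

```python
def topk_then_transfer(logits, k):
    # Select the k largest (index, value) pairs first on the "device"...
    pairs = sorted(enumerate(logits), key=lambda p: p[1], reverse=True)[:k]
    # ...then "transfer" only k items instead of the full vocab-size vector.
    indices = [i for i, _ in pairs]
    values = [v for _, v in pairs]
    return values, indices
```

With a vocabulary of tens of thousands of tokens and k in the single digits, the transferred payload shrinks by several orders of magnitude per decode step.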
We wanted to limit the text token count to a maximum of `text_sliding_window`, but were clamping to a minimum instead, which broke the downstream logic in a lot of places. Also removed the dependence on Hugging Face caching.
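The clamp bug fits in one line each way. "Limit to a maximum of `text_sliding_window`" means taking `min`; the buggy version took `max`, which acts as a floor rather than a ceiling. Names are illustrative:

```python
def clamp_text_count_buggy(count, text_sliding_window):
    # Wrong: max() never shrinks count, it inflates small counts
    # up to the window size.
    return max(count, text_sliding_window)

def clamp_text_count_fixed(count, text_sliding_window):
    # Right: min() caps the count at the sliding window size.
    return min(count, text_sliding_window)
```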
The cache now supports "ragged" `input_ids`, where each batch element can have a different number of "true" tokens plus padding. This helps in many scenarios, including MTP and beacons, where some sequences have shorter predictions than others. Can be improved further.
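One way to picture the ragged-batch support is as a per-batch validity mask over right-padded sequences. This is a hypothetical helper for illustration, not the PR's code:

```python
def ragged_valid_mask(true_lengths, padded_len):
    # mask[b][t] is True where token t of batch element b is a real token,
    # False where it is padding. true_lengths gives each sequence's number
    # of "true" tokens; padded_len is the common padded length.
    return [[t < n for t in range(padded_len)] for n in true_lengths]
```

The cache update would then write only at `True` positions, so sequences with shorter predictions (as with MTP or beacons) do not pollute the cache with padding tokens.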
Integration of the new foundation model. Major changes are:
TODOs: