[BPE PR 2.2] All BPE Utils #7770
Merged
49 commits
3a655f4
Add Tokenizer base class.
pforderique 85e34c3
Update licence to 2023
pforderique 8121219
Fix lint errors.
pforderique 7b9c1f2
Only expose WhiteSpaceTokenizer in tests
pforderique cd2fc6e
Rename WhitespaceTokenizer to SimpleTokenizer
pforderique 22e7b81
Add example to Tokenizers docstring.
pforderique 992fb99
BytePairEncoding implementation started.
pforderique 2f3e944
Update test name to Tokenizer
pforderique cf8c974
Destructure TokenizerOptions to assign default mode value
pforderique 9eb393a
Use destructured mode in call method
pforderique 06b4f26
Wrap tokenizer in beforeEach clause.
pforderique b4912ab
Make TokenizerOptions optional in call()
pforderique d805bb8
Add utils file for tokenizers.
pforderique 9a3ef53
Update example in Tokenizers to print output.
pforderique 2bd1943
Don't register test SimpleTokenizer
pforderique af76234
Bring in changes from Tokenizer
pforderique a00f45c
Add bytesToUnicode and HashTable for tensors.
pforderique b792a41
Bring in changes from main
pforderique e5b9a59
Merge in tokenizers_utils file
pforderique 0a83a76
Add utils for BPE
pforderique 9b909c2
Change test name to createStaticHashtable
pforderique 836a5be
Add BytePairTokenizer Cache
pforderique f1a72d9
Add tests for BytePairTokenizerCache
pforderique ebed82c
Move BytePairTokenizerCache to correct location
pforderique 509de9b
Add removeStringsFromInputs
pforderique 3aebe00
Fix test case for removes strings successfully
pforderique 48fd42b
Add createAltsForUnsplittableTokens.
pforderique a52b406
Fix regex pattern in createAlts
pforderique 72b1836
Fix test case for createAlt
pforderique 8819bdd
Switch to using await data() rather than dataSync().
pforderique b4b316d
Fix lint errors.
pforderique c28bb38
Merge branch 'orderique' into all-bpe-utils
pforderique e11a6a8
Remove dataSyncs().
pforderique a17937d
Switch to using tensor instead of tensor1d
pforderique 9e77b5d
Add whitespace Regex strings
pforderique 4817427
Add regexSplit
pforderique f902617
Fix removeStringsFromInputs
pforderique 3a81199
Implement and fix regexSplit
pforderique 57637b0
splitStringsForBpe progress
pforderique 30a1ffd
splitStringForBpe passing 1/2 tests
pforderique 6b6a348
Add mergeLastTwoDims
pforderique b60b52f
Implement splitStringsForBpe and add tests.
pforderique 18018d8
Merge branch 'main' into all-bpe-utils
pforderique 39de746
Replace lookahead regex due to Safari not supporting it
pforderique 3bde5d3
address comments
pforderique ed3c41a
Merge branch 'main' into all-bpe-utils
pforderique 911b41e
Use polyfill matchAll instead
pforderique d6b639d
Merge branch 'main' into all-bpe-utils
pforderique 1b28b0e
Utilize tensor transforms
pforderique