Closed
Description
With the changes in "Correctly split text into sentences" (#204), SudachiTokenizer
now analyzes all characters of the input (previously only the first 4096).
The change itself is fine, but since it was merged I see an OOM issue.
SudachiTokenizer.reset() analyzes the entire text and stores the result in an ArrayList<MorphemeList>. For large inputs the list grows large enough to cause an OOM.
I think it would be better to analyze the text gradually in SudachiTokenizer.incrementToken(), instead of all at once in SudachiTokenizer.reset(), as StandardTokenizer.java does.
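To illustrate the idea, here is a minimal sketch in plain Java. It is not the Sudachi or Lucene API: the class LazyChunkTokenizer, its next() method, and the whitespace-based analyze() stand-in are all hypothetical, and a real implementation would call the morphological analyzer on each chunk and cut at sentence boundaries. The point is only that each call reads and analyzes one more chunk when the pending queue is empty, so memory holds one chunk's worth of results rather than the whole document.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of lazy, chunk-by-chunk tokenization (not the Sudachi/Lucene API). */
final class LazyChunkTokenizer {
    private final Reader input;
    private final char[] buffer;
    // Tokens of the most recently analyzed chunk only.
    private final Deque<String> pending = new ArrayDeque<>();
    // Text read so far that does not yet end at a safe boundary.
    private final StringBuilder carry = new StringBuilder();
    private boolean exhausted = false;

    LazyChunkTokenizer(Reader input, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.input = input;
        this.buffer = new char[chunkSize];
    }

    /** Returns the next token, or null at end of input (plays the role of incrementToken()). */
    String next() {
        while (pending.isEmpty() && !exhausted) {
            fill();
        }
        return pending.poll();
    }

    /** Reads one more chunk and analyzes the part that ends at a boundary. */
    private void fill() {
        int n;
        try {
            n = input.read(buffer, 0, buffer.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (n < 0) {
            // End of input: analyze whatever is left over.
            exhausted = true;
            analyze(carry.toString());
            carry.setLength(0);
            return;
        }
        carry.append(buffer, 0, n);
        int cut = lastBoundary(carry);
        if (cut >= 0) {
            analyze(carry.substring(0, cut + 1));
            carry.delete(0, cut + 1);
        }
        // Note: if the input never contains a boundary, carry keeps growing;
        // a real implementation would need a hard limit here.
    }

    /** Assumption: whitespace is a safe cut point. A real tokenizer would use sentence boundaries. */
    private static int lastBoundary(CharSequence text) {
        for (int i = text.length() - 1; i >= 0; i--) {
            if (Character.isWhitespace(text.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    /** Stand-in for morphological analysis: splits on whitespace. */
    private void analyze(String text) {
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                pending.add(token);
            }
        }
    }
}
```

In this sketch, reset() would only need to clear pending and carry and reset the exhausted flag, while the reading and analysis happen inside next().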