
Huge input causes OOM. #131

@kenmasumitsu

Description

With the change in "Correctly split text into sentences" (#204), SudachiTokenizer now analyzes all characters of the input (previously only the first 4096).

The change itself is fine, but it introduces an OOM issue: SudachiTokenizer.reset() analyzes the entire text up front and stores the result in an ArrayList<MorphemeList>, which runs out of memory when the input is huge.

I think it would be better to perform the analysis incrementally in SudachiTokenizer.incrementToken(), instead of all at once in SudachiTokenizer.reset(), the same way Lucene's StandardTokenizer.java works. A sketch of the idea is below.
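
Something like the following (a minimal sketch, not a patch against the actual SudachiTokenizer; the `tokenizeSentences(mode, Reader)` call returning a lazy `Iterable<MorphemeList>`, and the whole-input `begin()`/`end()` offsets, are assumptions here):

```java
import java.io.IOException;
import java.util.Collections;
import java.util.Iterator;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

import com.worksap.nlp.sudachi.Morpheme;
import com.worksap.nlp.sudachi.MorphemeList;

// Sketch: reset() only prepares a lazy cursor; incrementToken() pulls one
// morpheme at a time, so no ArrayList<MorphemeList> covering the whole
// input is ever materialized.
public final class StreamingSudachiTokenizer extends Tokenizer {

  private final com.worksap.nlp.sudachi.Tokenizer sudachi;
  private final com.worksap.nlp.sudachi.Tokenizer.SplitMode mode;

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

  private Iterator<MorphemeList> sentences = Collections.emptyIterator();
  private Iterator<Morpheme> morphemes = Collections.emptyIterator();
  private int lastOffset = 0;

  public StreamingSudachiTokenizer(com.worksap.nlp.sudachi.Tokenizer sudachi,
                                   com.worksap.nlp.sudachi.Tokenizer.SplitMode mode) {
    this.sudachi = sudachi;
    this.mode = mode;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    // Assumed API: a sentence-by-sentence Iterable over the Reader, so the
    // analyzer never holds more than one sentence's result at a time.
    sentences = sudachi.tokenizeSentences(mode, input).iterator();
    morphemes = Collections.emptyIterator();
    lastOffset = 0;
  }

  @Override
  public boolean incrementToken() throws IOException {
    // Advance to the next sentence only when the current one is exhausted.
    while (!morphemes.hasNext()) {
      if (!sentences.hasNext()) {
        return false; // whole input consumed
      }
      morphemes = sentences.next().iterator();
    }
    Morpheme m = morphemes.next();
    clearAttributes();
    termAtt.append(m.surface());
    // Assumes begin()/end() index into the whole input; if they are
    // sentence-relative, a running base offset would have to be added.
    offsetAtt.setOffset(correctOffset(m.begin()), correctOffset(m.end()));
    lastOffset = m.end();
    return true;
  }

  @Override
  public void end() throws IOException {
    super.end();
    int off = correctOffset(lastOffset);
    offsetAtt.setOffset(off, off);
  }
}
```

This mirrors how StandardTokenizer keeps only scanner state between incrementToken() calls instead of a buffered token list, so memory stays bounded by the largest sentence rather than the whole input.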
