
Bug for subwords related to utf-8 #8

@linetor

Hi, I'm a Korean developer and I have been using your library with good results.
However, when I trained a fastText model with subword n-grams (minn 3 and maxn 6), the model's predictions differed between the original library (via Python) and your library (Java).
I debugged the problem and found that the cause is charMatches.
I see that you ported the C++ code to Java.
The original code ( https://github.com/facebookresearch/fastText/blob/0c6db7c2d6ba9e0ff81713ed9f50c3142e4ba700/src/dictionary.cc#L172-L195 ) reads the word by string index, so each element it sees is a byte. It therefore has to work out whether a character takes 3 bytes (like Korean or Japanese) or 1 byte (like a digit or an English letter).
In Java, however, a char is not a byte. We can read a whole character (such as a Korean syllable) directly, so there is no need for the byte-level check ( (c & 0xC0) == 0x80 ) to find character boundaries. Applied to the chars of a word, that check causes the bug.
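To illustrate the difference, here is a small sketch. The class name `CharMatchesDemo` and the helper `looksLikeContinuation` are made up for this example; the helper only assumes that charMatches performs the same `(c & 0xC0) == 0x80` test as the C++ code, which I have not copied from the library.

```java
import java.nio.charset.StandardCharsets;

class CharMatchesDemo {
    // Hypothetical stand-in for charMatches: the UTF-8 continuation-byte test
    // used by the C++ code, applied to whatever unit is passed in.
    static boolean looksLikeContinuation(int unit) {
        return (unit & 0xC0) == 0x80;
    }

    public static void main(String[] args) {
        String eo = "\uC5B4"; // one Hangul syllable, code point U+C5B4
        byte[] utf8 = eo.getBytes(StandardCharsets.UTF_8);

        // In UTF-8 the syllable is 3 bytes (EC 96 B4); in Java it is 1 char.
        System.out.println(utf8.length + " bytes, " + eo.length() + " char");

        // On bytes the test is meaningful: lead byte false, the rest true.
        for (byte b : utf8) {
            System.out.println(looksLikeContinuation(b));
        }

        // On a Java char the test only looks at the low bits of the UTF-16
        // value, so this whole syllable is reported as a "continuation".
        System.out.println(looksLikeContinuation(eo.charAt(0))); // true
        // Another syllable, U+D55C, is not, so the result depends on the character.
        System.out.println(looksLikeContinuation('\uD55C'));     // false
    }
}
```

So with a check like this, some Korean characters would be glued onto the previous character and others would not, which would explain n-grams that differ from the original library.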
So I think you need to change your code, for example by removing the lines that call the charMatches function (in

    protected void computeSubwords(String word, List<Integer> ngrams, List<String> substrings) {
        for(int i = 0; i < word.length(); i++) {
            StringBuilder ngram = new StringBuilder();
            if (!charMatches(word.charAt(i))) {
                for (int j = i, n = 1; j < word.length() && n <= args.getMaxn(); n++) {
                    ngram.append(word.charAt(j++));
                    while (j < word.length() && charMatches(word.charAt(j))) {
                        ngram.append(word.charAt(j++));
                    }
                    if (n >= args.getMinn() && !(n == 1 && (i == 0 || j == word.length()))) {
                        UnsignedLong h = UnsignedLong.valueOf(hash(ngram.toString()));
                        h = h.mod(UnsignedLong.valueOf(args.getBucketNumber()));
                        ngrams.add(nWords + h.intValue());
                        substrings.add(ngram.toString());
                    }
                }
            }
        }
    }
or maybe in other places too?)
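As a rough idea of what I mean, here is a minimal sketch of the same loop without charMatches, stepping over Unicode code points instead. It is not a patch for the library: the class `SubwordSketch` and the method `charNgrams` are hypothetical names, `minn` and `maxn` are passed in directly instead of coming from args, and the hashing step is left out because it depends on the library's own hash function.

```java
import java.util.ArrayList;
import java.util.List;

class SubwordSketch {
    // Collects the character n-grams of `word` with lengths minn..maxn,
    // counting one code point as one character (no byte-level checks).
    static List<String> charNgrams(String word, int minn, int maxn) {
        List<String> substrings = new ArrayList<>();
        int[] cps = word.codePoints().toArray();
        for (int i = 0; i < cps.length; i++) {
            StringBuilder ngram = new StringBuilder();
            for (int j = i, n = 1; j < cps.length && n <= maxn; n++) {
                ngram.appendCodePoint(cps[j++]);
                // Same condition as the quoted code: skip 1-grams that sit
                // at the very start or end of the word.
                if (n >= minn && !(n == 1 && (i == 0 || j == cps.length))) {
                    substrings.add(ngram.toString());
                }
            }
        }
        return substrings;
    }

    public static void main(String[] args) {
        // "<" and ">" are the word boundary markers fastText adds.
        // The three escapes are the Hangul syllables U+D55C, U+AD6D, U+C5B4.
        for (String s : charNgrams("<\uD55C\uAD6D\uC5B4>", 3, 6)) {
            System.out.println(s);
        }
    }
}
```

For that 5-character input with minn 3 and maxn 6 this yields six n-grams, each made of whole characters. Using code points rather than `charAt` also keeps surrogate pairs together, although that is an extra choice on my side and not something the quoted code does.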

I hope you can look into this. Let me know if you need anything.
Thanks
