Bug for subwords related to UTF-8 #8

Hi, I'm a Korean developer and I have been using your library with good results.
But when I trained a model with fastText using subword n-grams (minn=3 and maxn=6), the model's predictions differed between the original library (via Python) and your library (Java).
I debugged the situation and found that the cause is `charMatches`.
I see that you ported the C++ code to Java.
The original code ( https://github.com/facebookresearch/fastText/blob/0c6db7c2d6ba9e0ff81713ed9f50c3142e4ba700/src/dictionary.cc#L172-L195 ) indexes into a string, so each element it reads is a byte. It therefore has to work out whether a character takes 3 bytes (like Korean or Japanese) or 1 byte (like digits or English letters).
But a Java `char` is not a byte. In Java, `charAt` already returns a whole character (like one Korean syllable) rather than a single byte, so there is no need for the byte comparison `(c & 0xC0) == 0x80` to find character boundaries. Applying that comparison to a Java `char` is what causes the bug.
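
To illustrate the mismatch, here is a minimal sketch that applies the same mask to UTF-8 bytes and to Java `char` values. It assumes `charMatches` is equivalent to `(c & 0xC0) == 0x80`; the class and method names below are hypothetical and not part of fastText4j.

```java
import java.nio.charset.StandardCharsets;

class ContinuationMaskDemo {
    // The UTF-8 continuation-byte test: true for values whose low byte is 10xxxxxx.
    static boolean isUtf8Continuation(int unit) {
        return (unit & 0xC0) == 0x80;
    }

    // Counts the UTF-8 bytes of s that the mask flags (correct use of the mask).
    static int continuationBytes(String s) {
        int count = 0;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (isUtf8Continuation(b)) count++;
        }
        return count;
    }

    // Counts the Java chars (UTF-16 code units) of s that the mask flags (incorrect use).
    static int flaggedChars(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (isUtf8Continuation(s.charAt(i))) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Two Hangul syllables, U+AC00 and U+AC80, written as escapes.
        String word = "\uAC00\uAC80";
        System.out.println("continuation bytes in UTF-8: " + continuationBytes(word));
        System.out.println("chars wrongly flagged: " + flaggedChars(word));
    }
}
```

In UTF-8 each of these syllables is one lead byte plus two continuation bytes, so the mask flags four bytes, which is what the C++ code relies on. On Java chars the mask flags U+AC80 only because its low byte happens to be `0x80`, so that syllable would be glued onto the previous character when building n-grams.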
So I think you should change the code, for example by removing the lines that call the `charMatches` function (at https://github.com/linkfluence/fastText4j/blob/c3eb898005f08e7d4a801d970330f1a49a24cab4/src/main/java/fasttext/BaseDictionary.java#L321-L339 or maybe in more places?).

I hope you can take care of this. Let me know if you need anything.
Thanks

	protected void computeSubwords(String word, List<Integer> ngrams, List<String> substrings) {
		for (int i = 0; i < word.length(); i++) {
			StringBuilder ngram = new StringBuilder();
			if (!charMatches(word.charAt(i))) {
				for (int j = i, n = 1; j < word.length() && n <= args.getMaxn(); n++) {
					ngram.append(word.charAt(j++));
					while (j < word.length() && charMatches(word.charAt(j))) {
						ngram.append(word.charAt(j++));
					}
					if (n >= args.getMinn() && !(n == 1 && (i == 0 || j == word.length()))) {
						UnsignedLong h = UnsignedLong.valueOf(hash(ngram.toString()));
						h = h.mod(UnsignedLong.valueOf(args.getBucketNumber()));
						ngrams.add(nWords + h.intValue());
						substrings.add(ngram.toString());
					}
				}
			}
		}
	}
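
As a rough idea of what the loop could look like without `charMatches`, here is a minimal sketch that walks the word by Unicode code points. It is only an illustration under assumptions, not the library's actual fix. `SubwordSketch` and `subwords` are hypothetical names, `minn` and `maxn` are passed in directly instead of being read from `args`, and the hashing and bucket steps are left out. To match models trained with the C++ code, the hash would presumably still have to be computed over the UTF-8 bytes of each n-gram.

```java
import java.util.ArrayList;
import java.util.List;

class SubwordSketch {
    // Returns the character n-grams of word with lengths in [minn, maxn],
    // counting length in code points so each character is one unit.
    static List<String> subwords(String word, int minn, int maxn) {
        List<String> substrings = new ArrayList<>();
        int[] cps = word.codePoints().toArray();
        for (int i = 0; i < cps.length; i++) {
            StringBuilder ngram = new StringBuilder();
            for (int j = i, n = 1; j < cps.length && n <= maxn; n++) {
                ngram.appendCodePoint(cps[j++]);
                // Same condition as the quoted code: skip a lone first or last
                // character (the word boundary markers).
                if (n >= minn && !(n == 1 && (i == 0 || j == cps.length))) {
                    substrings.add(ngram.toString());
                }
            }
        }
        return substrings;
    }

    public static void main(String[] args) {
        // "<" and ">" stand for the word boundary markers around two Hangul syllables.
        System.out.println(subwords("<\uAC80\uAC00>", 3, 6));
    }
}
```

Iterating over code points rather than `char` values also keeps characters outside the Basic Multilingual Plane (stored as surrogate pairs in Java) together as one unit, which is what the byte-level check achieves in the C++ version.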
