Hi, I'm a Korean developer and I have been using your library with good results.
However, when I train a fastText model with subword n-grams enabled (for example minn=3 and maxn=6), the model's predictions differ between the original library (via Python) and your library (Java).
I debugged the problem and found that the cause is charMatches.
I see that you ported the C++ code to Java.
The original code ( https://github.com/facebookresearch/fastText/blob/0c6db7c2d6ba9e0ff81713ed9f50c3142e4ba700/src/dictionary.cc#L172-L195 ) indexes into a std::string, so each element it reads is a byte of the UTF-8 encoded word. It therefore has to work out whether a character occupies 3 bytes (like Korean or Japanese) or 1 byte (like digits or English letters).
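To make the byte-level view concrete, here is a minimal Java sketch of the continuation-byte test that the C++ code relies on. The class and method names (`Utf8ContinuationDemo`, `isContinuationByte`, `countCodePoints`) are made up for illustration and are not part of fastText or fastText4j.

```java
import java.nio.charset.StandardCharsets;

final class Utf8ContinuationDemo {
  /** True when b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
  static boolean isContinuationByte(byte b) {
    return (b & 0xC0) == 0x80;
  }

  /** Counts characters by counting every byte that is not a continuation byte. */
  static int countCodePoints(String s) {
    int count = 0;
    for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
      if (!isContinuationByte(b)) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    String word = "a\uD55C"; // 'a' (1 byte) followed by one Hangul syllable (3 bytes)
    System.out.println(word.getBytes(StandardCharsets.UTF_8).length); // prints 4
    System.out.println(countCodePoints(word)); // prints 2
  }
}
```

On bytes, the mask correctly groups the three bytes of the Hangul syllable into one character.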
But a Java char is not a byte. In Java we can read a whole character (such as a Korean syllable) directly, so there is no need to inspect bytes with a check like (c & 0xC0) == 0x80 to find character boundaries. Applied to the chars of a word, that check causes the bug.
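A small sketch of why the byte mask misfires on Java chars. I am assuming here that charMatches applies the same `(c & 0xC0) == 0x80` test to a `char`; its body is not quoted in this issue, so `maskMatches` below is a hypothetical stand-in rather than the library's actual method.

```java
final class CharMaskDemo {
  // Hypothetical stand-in for charMatches: the UTF-8 continuation-byte
  // test applied to a 16-bit UTF-16 code unit instead of a byte.
  static boolean maskMatches(char c) {
    return (c & 0xC0) == 0x80;
  }

  public static void main(String[] args) {
    // U+AC00 and U+AC80 are both complete Hangul syllables.
    System.out.println(maskMatches('\uAC00')); // prints false (low byte is 0x00)
    System.out.println(maskMatches('\uAC80')); // prints true (low byte is 0x80)
    System.out.println("\uAC80".length()); // prints 1: a single Java char
  }
}
```

Under that assumption, a syllable such as U+AC80 would be treated as the tail of the previous character, so the n-grams around it would differ from the ones the C++ code produces.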
So I think you need to change your code, for example by removing the lines that call the charMatches function, located at
fastText4j/src/main/java/fasttext/BaseDictionary.java
Lines 321 to 339 in c3eb898
```java
protected void computeSubwords(String word, List<Integer> ngrams, List<String> substrings) {
  for(int i = 0; i < word.length(); i++) {
    StringBuilder ngram = new StringBuilder();
    if (!charMatches(word.charAt(i))) {
      for (int j = i, n = 1; j < word.length() && n <= args.getMaxn(); n++) {
        ngram.append(word.charAt(j++));
        while (j < word.length() && charMatches(word.charAt(j))) {
          ngram.append(word.charAt(j++));
        }
        if (n >= args.getMinn() && !(n == 1 && (i == 0 || j == word.length()))) {
          UnsignedLong h = UnsignedLong.valueOf(hash(ngram.toString()));
          h = h.mod(UnsignedLong.valueOf(args.getBucketNumber()));
          ngrams.add(nWords + h.intValue());
          substrings.add(ngram.toString());
        }
      }
    }
  }
}
```
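For reference, a rough sketch of what the subword loop could look like without the charMatches checks. This is only an illustration, not a patch: `SubwordSketch` and `subwords` are hypothetical names, hashing and bucket lookup are left out, and iterating over code points (rather than chars) is my own assumption so that characters stored as surrogate pairs also count as one character.

```java
import java.util.ArrayList;
import java.util.List;

final class SubwordSketch {
  /** Collects the n-grams of word whose length in characters is between minn and maxn. */
  static List<String> subwords(String word, int minn, int maxn) {
    List<String> out = new ArrayList<>();
    int[] cps = word.codePoints().toArray(); // one entry per character
    for (int i = 0; i < cps.length; i++) {
      StringBuilder ngram = new StringBuilder();
      for (int j = i, n = 1; j < cps.length && n <= maxn; n++) {
        ngram.appendCodePoint(cps[j++]);
        // Same filter as the quoted code: skip a 1-gram at either end of the word.
        if (n >= minn && !(n == 1 && (i == 0 || j == cps.length))) {
          out.add(ngram.toString());
        }
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // '<' and '>' stand for the word boundary markers.
    System.out.println(subwords("<ab>", 3, 3)); // prints [<ab, ab>]
    System.out.println(subwords("<\uAC80\uAC00>", 3, 3).size()); // prints 2
  }
}
```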
I hope you can take care of this. Let me know if you need anything.
Thanks