Translation from Chinese to English or Italian is not working #58
Comments
@ceilican I was able to reproduce this error in the following way. ...
!userBlacklistedWords.test(word.toLowerCase()); // no blacklisted words
}));
console.debug('Filtered List:', countedWordsList);
var targetLength = Math.floor((length(countedWords) * translationProbability) / 100);
... Add a debug line to print counted words in ...
console.log('starting translation');
var countedWords = getAllWords(ngramMin, ngramMax);
console.debug('countedWords', countedWords);
... Run the extension on http://worldcrm.org/About/index/id/3. As you can see, the counted words object has elements, but the filtered list has zero elements.
Error
var countedWordsList = shuffle(toList(countedWords, function(word, count) {
return !!word && word.length >= minimumSourceWordLength && // no words that are too short
word !== '' && !/\d/.test(word) && // no empty words
word.charAt(0) != word.charAt(0).toUpperCase() && // no proper nouns
!userBlacklistedWords.test(word.toLowerCase()); // no blacklisted words
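}));
This callback function is returning false for all the Chinese words, as they are non-ASCII. Chinese characters have no upper or lower case, so word.charAt(0).toUpperCase() returns the character unchanged and the "no proper nouns" check fails for every such word.
Solution: add a check for non-ASCII characters. ...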
!userBlacklistedWords.test(word.toLowerCase()) || // no blacklisted words
/[^\x00-\x7F]+/.test(word));
... This did get me all the translations, but for some reason not all words are getting replaced; only Pan is. It will require a little more work. I cannot work on this issue today, but I will submit a PR by tomorrow.
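To illustrate why the filter drops Chinese words, here is a minimal sketch that reproduces the "no proper nouns" comparison on its own and contrasts it with a case-aware variant. The helper names (originalCheck, hasCase, looksLikeProperNoun) are hypothetical and are not part of the extension; this is one possible way to express the check, not necessarily the fix that was merged.

```javascript
// Hypothetical helpers for illustration only; none of these names exist in the extension.

// The comparison used in the filter, isolated: true means "keep the word".
function originalCheck(word) {
  return word.charAt(0) != word.charAt(0).toUpperCase();
}

// True when the character has distinct upper and lower case forms.
function hasCase(ch) {
  return ch.toLowerCase() !== ch.toUpperCase();
}

// Treat a word as a proper noun only if its first character has case
// and is written in upper case.
function looksLikeProperNoun(word) {
  var first = word.charAt(0);
  return hasCase(first) && first === first.toUpperCase();
}

console.log(originalCheck('hello'));        // true  (kept)
console.log(originalCheck('Hello'));        // false (dropped as a proper noun)
console.log(originalCheck('中文'));          // false (dropped, although it is not a proper noun)
console.log(!looksLikeProperNoun('中文'));   // true  (kept by the case-aware variant)
```

Compared with appending a non-ASCII test with ||, a case-aware check leaves the other conditions (minimum length, digits, blacklist) in force for Chinese words as well.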
That's great, @ankit-m!
@ceilican the problem is more complicated than I thought. The reason why not all words are getting replaced is spaces. Look at the following lines of code:
sortedSourceWords.forEach(function concatRExp(sourceWord) {
rExp += '(\\s' + escapeRegExp(sourceWord) + '\\s)|';
});
The variable ... Herein lies the problem. Languages like Chinese do not always have space-separated words. For example ... So as a solution I tried to change the ...
rExp += '(\\s*' + escapeRegExp(sourceWord) + '\\s*)|';
i.e. it should have ... With a few other changes, this seemed to solve the problem, and I got Chinese translations working. But soon I realized that this will break English translations, as it will not translate word by word but will instead match all substrings. For example, if there was the word ...
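The trade-off described above can be sketched as follows: require surrounding whitespace only for ASCII source words, and match non-ASCII source words wherever they occur. This is an assumption-laden illustration, not the code from the merge request; buildPattern and isNonAscii are hypothetical names, and escapeRegExp is re-implemented here only so the snippet is self-contained.

```javascript
// Hypothetical sketch; not the extension's actual implementation.

// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Assumption: any word containing a non-ASCII character comes from a script
// that may not separate words with spaces.
function isNonAscii(word) {
  return /[^\x00-\x7F]/.test(word);
}

// Build one alternation: ASCII words need whitespace on both sides,
// non-ASCII words are matched anywhere in the text.
function buildPattern(sourceWords) {
  var parts = sourceWords.map(function (word) {
    var escaped = escapeRegExp(word);
    return isNonAscii(word) ? '(' + escaped + ')' : '(\\s' + escaped + '\\s)';
  });
  return new RegExp(parts.join('|'), 'g');
}

var re = buildPattern(['cat', '中文']);
console.log(' concatenate a cat '.match(re)); // [ ' cat ' ]
console.log('我学习中文很久了'.match(re));       // [ '中文' ]
```

A known limitation of this assumption is that accented Latin words such as "café" would also be matched as substrings; detecting specific Unicode ranges instead of "any non-ASCII" would be stricter.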
That is a very interesting observation, @ankit-m. The spaces that I added in ...
issue #58 fix chinese translations,detection using unicode values See merge request !206
Zhenghui Chen reported: