Translation from Chinese to English or Italian is not working #58

ceilican · 2015-12-12T10:11:15Z

Zhenghui Chen reported:

When I try to translate Chinese into Italian or English (I am Chinese), it doesn't work.
It works when I translate Italian or English into Chinese.

ankit-m · 2016-03-16T09:00:40Z

@ceilican I was able to reproduce this error in the following way.
Add a debug line to print filtered words in filterSourceWords

...
      !userBlacklistedWords.test(word.toLowerCase()); // no blacklisted words
  }));
  console.debug('Filtered List:', countedWordsList);
  var targetLength = Math.floor((length(countedWords) * translationProbability) / 100);
...

Add a debug line to print counted words in main function

...
  console.log('starting translation');
  var countedWords = getAllWords(ngramMin, ngramMax);
  console.debug('countedWords', countedWords);
...

Run the extension on http://worldcrm.org/About/index/id/3

Console Debug:

As you can see, the counted words object has elements, but the filtered list has zero elements.

Error

The filterSourceWords function does not account for non-ASCII characters. As a result, it skips all Chinese words. It will also skip other languages that use non-ASCII characters. This is why we are having issue #24.

var countedWordsList = shuffle(toList(countedWords, function(word, count) {
    return !!word && word.length >= minimumSourceWordLength && // no words that are too short
      word !== '' && !/\d/.test(word) && // no empty words
      word.charAt(0) != word.charAt(0).toUpperCase() && // no proper nouns
      !userBlacklistedWords.test(word.toLowerCase()); // no blacklisted words
  }));

This callback function is returning false for all the Chinese words as they are non-ASCII.
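
A quick check suggests which clause is responsible. Chinese characters have no separate upper-case form, so the "no proper nouns" test treats every Chinese word as if it were capitalized. The snippet below is a minimal sketch; `looksLikeProperNoun` is a hypothetical helper that isolates that one clause, not a function from the extension.

```javascript
// Hypothetical helper isolating the "no proper nouns" clause of the filter.
function looksLikeProperNoun(word) {
  return word.charAt(0) === word.charAt(0).toUpperCase();
}

console.log(looksLikeProperNoun('board'));  // false: 'b' !== 'B'
console.log(looksLikeProperNoun('Board'));  // true
console.log(looksLikeProperNoun('董事會')); // true: '董'.toUpperCase() is still '董'
```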

Solution

Add a check for non-ASCII characters.

...
 !userBlacklistedWords.test(word.toLowerCase()) || // no blacklisted words 
  /[^\x00-\x7F]+/.test(word)); 
...

This did get me all the translations. But for some reason not all words are getting replaced: only Pan is getting replaced. It will require a little more work. I cannot work on this issue today, but I will submit a PR by tomorrow.
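
For reference, the same idea can be written as a standalone predicate with the grouping made explicit. This is only a sketch under assumptions: `isUsableSourceWord`, `minLength` and `blacklist` are illustrative names, and the blacklist is assumed to be a RegExp as in the snippet above. Unlike the one-line change, this variant exempts non-ASCII words only from the proper-noun clause, not from the other checks.

```javascript
// Sketch only: names are illustrative, not the extension's actual API.
function isUsableSourceWord(word, minLength, blacklist) {
  if (!word || word.length < minLength) return false;   // empty or too short
  if (/\d/.test(word)) return false;                    // contains a digit
  if (blacklist.test(word.toLowerCase())) return false; // blacklisted

  var hasNonAscii = /[^\x00-\x7F]/.test(word);
  var startsUpperCase = word.charAt(0) === word.charAt(0).toUpperCase();
  // Apply the proper-noun heuristic only to plain ASCII words.
  return hasNonAscii || !startsUpperCase;
}

var blacklist = /^(the|and)$/;
console.log(isUsableSourceWord('董事會', 3, blacklist));  // true
console.log(isUsableSourceWord('London', 3, blacklist));  // false
console.log(isUsableSourceWord('somehow', 3, blacklist)); // true
```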

ceilican · 2016-03-16T09:07:26Z

That's great, @ankit-m !

ankit-m · 2016-03-17T15:23:59Z

@ceilican the problem is more complicated than I thought. The reason not all words are getting replaced is spaces. Look at the following lines of code in replaceAll()

  sortedSourceWords.forEach(function concatRExp(sourceWord) {
    rExp += '(\\s' + escapeRegExp(sourceWord) + '\\s)|';
  });

The variable rExp is set to find all patterns like space + word + space i.e. translate word by word.

Herein lies the problem. Languages like Chinese do not always separate words with spaces. For example, 董事會 means The Board of Trustees. It has no spaces anywhere, neither at the beginning nor at the end.
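
The effect is easy to reproduce in isolation. The sketch below builds the same space-delimited pattern for a single word. The sample sentences are made up for illustration, and escapeRegExp is left out because these words contain no special characters.

```javascript
// Same shape as one alternative of rExp: whitespace + word + whitespace.
function spaceDelimited(word) {
  return new RegExp('(\\s' + word + '\\s)');
}

// English: the word is surrounded by spaces, so the pattern matches.
console.log(spaceDelimited('board').test('the board of trustees')); // true

// Chinese: no whitespace around the word, so the pattern never matches.
console.log(spaceDelimited('董事會').test('本公司董事會成員')); // false
```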

So as a solution I tried to change the rExp to

rExp += '(\\s*' + escapeRegExp(sourceWord) + '\\s*)|';

i.e. it should have zero or more spaces on both sides.

With a few other changes, this seemed to solve the problem. I got Chinese translations working. But I soon realized that this will break English translations, as it will no longer translate word by word but will match every substring instead.

For example, if the word how is in the regular expression, it will also translate the word somehow as some <space> <translation of how>. I am now thinking about how to solve this problem. We do need a different rExp for these types of languages.
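
The regression can be shown with a plain `String.prototype.replace` call. This is a simplified sketch: the single-word patterns and the `[how]` placeholder stand in for the full rExp and a real translation.

```javascript
var strict = /(\show\s)/g;  // original pattern: whitespace required
var loose = /(\s*how\s*)/g; // relaxed pattern: whitespace optional

var text = 'it works somehow and how it works';

console.log(text.replace(strict, ' [how] '));
// it works somehow and [how] it works

console.log(text.replace(loose, ' [how] '));
// it works some [how] and [how] it works
```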

ceilican · 2016-03-17T22:20:08Z

That is a very interesting observation, @ankit-m .

The spaces that I added in rExp a long time ago were a hack to deal with issues like the one you described. But I have never been fully satisfied with this "hacky" solution. Maybe the ideal solution would not be to have a different rExp for different languages, but to make an rExp without spaces work for English too.
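
One possible direction, sketched here under stated assumptions and not necessarily what the eventual fix does, is to drop the literal spaces and add a `\b` word boundary only on the sides of a word that start or end with an ASCII word character. `boundedPattern` and `buildRegExp` are hypothetical names, and the `escapeRegExp` below is a stand-in for the extension's own helper.

```javascript
// Sketch only: helper names are hypothetical, not the extension's code.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Add \b only on sides where the word starts or ends with an ASCII word
// character; without the unicode flag, \b is defined in terms of
// [A-Za-z0-9_] and is of no use next to CJK characters.
function boundedPattern(word) {
  var left = /^\w/.test(word) ? '\\b' : '';
  var right = /\w$/.test(word) ? '\\b' : '';
  return left + escapeRegExp(word) + right;
}

function buildRegExp(words) {
  return new RegExp(words.map(boundedPattern).join('|'), 'g');
}

var re = buildRegExp(['how', '董事會']);
console.log('somehow, how? 本公司董事會成員'.replace(re, '[$&]'));
// somehow, [how]? 本公司[董事會]成員
```

With this shape, how no longer matches inside somehow, while 董事會 is still found without surrounding spaces. Words with accented letters at either end (common in Italian) get no boundary on that side, so they would need further thought.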

ceilican added the HighPriority label Jul 19, 2016

ceilican pushed a commit that referenced this issue Sep 29, 2017

issue #58 fix chinese translations,detection using unicode values

0859fd8

ceilican added a commit that referenced this issue Sep 29, 2017

Merge branch 'chinese_translations' into 'master'

91fa98c

issue #58 fix chinese translations,detection using unicode values See merge request !206
