Translation from Chinese to English or Italian is not working #58

Open
ceilican opened this issue Dec 12, 2015 · 4 comments

Comments

@ceilican
Contributor

Zhenghui Chen reported:

When I try to translate Chinese into Italian or English (I am Chinese), it doesn't work.
It works when I translate Italian or English into Chinese.

@ankit-m
Contributor

ankit-m commented Mar 16, 2016

@ceilican I was able to reproduce this error in the following way.
Add a debug line to print filtered words in filterSourceWords

...
      !userBlacklistedWords.test(word.toLowerCase()); // no blacklisted words
  }));
  console.debug('Filtered List:', countedWordsList);
  var targetLength = Math.floor((length(countedWords) * translationProbability) / 100);
...

Add a debug line to print counted words in main function

...
  console.log('starting translation');
  var countedWords = getAllWords(ngramMin, ngramMax);
  console.debug('countedWords', countedWords);
...

Run the extension on http://worldcrm.org/About/index/id/3

Console Debug:
[screenshot: chinese_error]

As you can see, the counted words object has elements, but the filtered list has zero elements.

Error

The filterSourceWords function does not account for non-ASCII characters. As a result, it skips all Chinese words. It will also skip other languages written in non-ASCII scripts that have no letter case. This is why we are having issue #24.

var countedWordsList = shuffle(toList(countedWords, function(word, count) {
    return !!word && word.length >= minimumSourceWordLength && // no words that are too short
      word !== '' && !/\d/.test(word) && // no empty words, no words containing digits
      word.charAt(0) != word.charAt(0).toUpperCase() && // no proper nouns
      !userBlacklistedWords.test(word.toLowerCase()); // no blacklisted words
  }));

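This callback function returns false for every Chinese word. The "no proper nouns" check is what fails: a Chinese character has no uppercase form, so it is equal to its own toUpperCase() result.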
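
To see which condition fails, the proper-noun check can be run on its own. The helper name startsLowercase is hypothetical; it only isolates that one condition from the filter callback.

```javascript
// Hypothetical helper: the "no proper nouns" condition, copied from the filter.
function startsLowercase(word) {
  return word.charAt(0) != word.charAt(0).toUpperCase();
}

console.log(startsLowercase('board'));  // true: 'b' differs from 'B'
console.log(startsLowercase('Board'));  // false: treated as a proper noun
console.log(startsLowercase('董事會'));  // false: '董' has no uppercase form
```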

Solution

Add a check for non-ASCII characters:

...
 !userBlacklistedWords.test(word.toLowerCase()) || // no blacklisted words 
  /[^\x00-\x7F]+/.test(word)); 
...

[screenshot: working_chinese]

This did get me all the translations. But for some reason not all words are getting replaced; only Pan is. It will require a little more work. I cannot work on this issue today, but I will submit a PR by tomorrow.
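
One possible refinement, offered as a sketch and not as the change that was eventually committed: if the non-ASCII test is appended to the existing return expression with `||`, it binds more loosely than the `&&` chain, so any word containing a non-ASCII character bypasses every other condition, including the blacklist. The version below relaxes only the capitalization test. The function name keepWord, the options object and the sample values are assumptions made for illustration.

```javascript
// Sketch only: keepWord and its options object are illustrative names,
// not part of the extension's code.
function keepWord(word, options) {
  var first = word.charAt(0);
  // A character with no case distinction (e.g. Chinese) cannot mark a proper noun.
  var hasCase = first.toLowerCase() !== first.toUpperCase();
  var isCapitalized = hasCase && first === first.toUpperCase();
  return !!word &&
    word.length >= options.minimumSourceWordLength &&       // no words that are too short
    !/\d/.test(word) &&                                     // no words containing digits
    !isCapitalized &&                                       // no proper nouns
    !options.userBlacklistedWords.test(word.toLowerCase()); // no blacklisted words
}

// Sample settings, invented for this example.
var options = { minimumSourceWordLength: 2, userBlacklistedWords: /^(spam)$/ };
console.log(keepWord('董事會', options)); // true
console.log(keepWord('board', options));  // true
console.log(keepWord('Rome', options));   // false
console.log(keepWord('spam', options));   // false
```

Unlike the original check, this sketch also admits words that start with punctuation or another caseless ASCII character, so it would need further filtering before real use.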

@ceilican
Contributor Author

That's great, @ankit-m !

@ankit-m
Contributor

ankit-m commented Mar 17, 2016

@ceilican the problem is more complicated than I thought. The reason not all words are getting replaced is spaces. Look at the following lines of code in replaceAll():

  sortedSourceWords.forEach(function concatRExp(sourceWord) {
    rExp += '(\\s' + escapeRegExp(sourceWord) + '\\s)|';
  });

The variable rExp is built to match patterns of the form space + word + space, i.e. to translate word by word.

Herein lies the problem. Languages like Chinese do not always separate words with spaces. For example, 董事會 means "the Board of Trustees". It has no spaces anywhere, neither at the beginning nor at the end.

So as a solution I tried to change the rExp to

rExp += '(\\s*' + escapeRegExp(sourceWord) + '\\s*)|';

i.e. it should have zero or more spaces on both sides.

With a few other changes, this seemed to solve the problem, and I got Chinese translations working. But I soon realized that this will break English translations, as it no longer translates word by word but matches any substring instead.

For example, if the word how is in the regular expression, the word somehow will also be translated, as some <space> <translation of how>. I am now thinking about how to solve this problem. We do need a different rExp for these types of languages.
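
Both sides of this trade-off can be reproduced with a reduced, single-word version of the pattern. The sketch below is not the extension's replaceAll(): escapeRegExp is a stand-in with assumed behavior, and buildPattern is a hypothetical helper.

```javascript
// Stand-in for the extension's escapeRegExp (assumed: escapes regex metacharacters).
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: quantifier '' gives the original pattern, '*' the relaxed one.
function buildPattern(word, quantifier) {
  return new RegExp(
    '(\\s' + quantifier + escapeRegExp(word) + '\\s' + quantifier + ')', 'g');
}

var english = 'I somehow know how to';
console.log(english.match(buildPattern('how', '')));  // [ ' how ' ]
console.log(english.match(buildPattern('how', '*'))); // [ 'how ', ' how ' ]

var chinese = '本會董事會成員';
console.log(chinese.match(buildPattern('董事會', '')));  // null
console.log(chinese.match(buildPattern('董事會', '*'))); // [ '董事會' ]
```

The relaxed pattern finds the Chinese word but also picks up the tail of somehow, which is the regression described above.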

@ceilican
Contributor Author

That is a very interesting observation, @ankit-m .

The spaces that I added to rExp a long time ago were a hack to deal with issues like the one you described. But I have never been fully satisfied with this "hacky" solution. Maybe the ideal solution would be not to have a different rExp for different languages, but to make an rExp without spaces work for English too.
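
One way to explore this idea, purely as a sketch and not a description of what the referenced commit does: drop the literal spaces and require a Unicode-aware boundary only around words that are not written in Han characters. The names patternFor and buildRegExp are hypothetical, escapeRegExp is a stand-in with assumed behavior, and the code assumes an engine with ES2018 regex support (lookbehind and \p{...} property escapes), which browsers at the time of this discussion generally did not have.

```javascript
// Stand-in for the extension's escapeRegExp (assumed: escapes regex metacharacters).
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Han text is written without spaces, so such words get no boundary requirement.
var HAN = /\p{Script=Han}/u;

function patternFor(sourceWord) {
  var escaped = escapeRegExp(sourceWord);
  if (HAN.test(sourceWord)) {
    return escaped;
  }
  // Must not be preceded or followed by a letter or digit.
  return '(?<![\\p{L}\\p{N}])' + escaped + '(?![\\p{L}\\p{N}])';
}

function buildRegExp(sourceWords) {
  return new RegExp(sourceWords.map(patternFor).join('|'), 'gu');
}

var rExp = buildRegExp(['how', '董事會']);
console.log('I somehow know how to'.match(rExp)); // [ 'how' ]
console.log('本會董事會成員'.match(rExp));          // [ '董事會' ]
```

The sketch only special-cases Han characters. Other scripts written without spaces, such as Thai or Japanese kana, would need similar treatment, and a Chinese source word can still match inside a longer Chinese word.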

ceilican added a commit that referenced this issue Sep 29, 2017
issue #58 fix chinese translations,detection using unicode values

See merge request !206