Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Custom mappings #70

Closed
wants to merge 39 commits into from
Closed

Custom mappings #70

wants to merge 39 commits into from

Conversation

Geggles
Copy link
Contributor

@Geggles Geggles commented Oct 22, 2017

So after discussing issue #68, the toKana or rather the splitIntoKana function seemed to be too complex to allow for changes like having a custom mapping from romaji to kana. That's why I took it upon myself to create a new method for converting romaji to kana, in order to make things easier in the long run.

High level descriptions of the most important changes:

  • the mapping from romaji to kana is not a flat object anymore (FROM_ROMAJI) but a tree where each level is a romaji character (so instead of map['cha'] you now do map['c']['h']['a']['']); toKana still functions to the outside world as it did before
  • the options now have a customKanaMapping property, that defaults to the identity function; it takes as input a mapping tree from romaji to kana and returns a new tree (it does not modify the input in place) that is the updated mapping (the result has to be the complete mapping, not just the updates)
  • IMEMode and useObsoleteKana are also implemented as custom mappings; for compatibility, they are still used via of the options object, so you do not handle those mappings explicitly
  • double consonants like ttsu now form a unit by themselves, so they are only replaced once you write out the whole thing

Some issues:

The script works fine in Google Chrome, but npm run test gives me an error saying that Object.entries is not a function. I did manage to run the tests though by adding

const entries = require('object.entries');
if (!Object.entries) {
    entries.shim();
}

to the top of the constants.js file, but I'm not sure how this works (I have no experience with node.js whatsoever, I just used the example from object.entries).

I had to change the tests for the edge cases. I did that, because for example I think that nn should be replaced by in IME mode, but んん without IME mode. Is that wrong? Also, I replaced lwe -> with lwa -> , I think that was just I typo by someone, since neither Microsoft, nor Google's IMEs do that. Speaking of which, just like ltsu get replaced by , they transliterate la to instead of , which is what WanaKana was doing. I changed that, too. I also removed some tests that are checking for conversion of double consonants of ones that are not supposed to take a sokuon, according to Wikipedia, like mma -> っま, but because it was explicitly checked by the test, I don't know if this incorrect behaviour was intended or not.

Because these changes are quit severe and only one direction has been adapted so far (the other one is still implemented like it used to), I made them in a new branch (customMappings). Is that how you're supposed to use branches? This is the first time I'm contributing to a project that is not my own, so I'm not really sure how to use GitHub in this regard.

@DJTB
Copy link
Collaborator

DJTB commented Oct 23, 2017

At a glance, looks like there's some good things happening here.

double consonants like ttsu now form a unit by themselves, so they are only replaced once you write out the whole thing

This seems fine, but the static tests aren't adequate to ensure IME mode truly works correctly, will have to confirm manually with some different devices/keyboards.

The script works fine in Google Chrome, but npm run test gives me an error saying that Object.entries is not a function.

Is your local/dev environment node 6 or below? Object.entries is 7+
The CI tests run on node 7 & latest (currently 8.7.0)
n is a good version manager for local node installation.

I had to change the tests for the ん edge cases. I did that, because for example I think that nn should be replaced by ん in IME mode, but んん without IME mode. Is that wrong?

Hmm, good question. I guess that makes perfect sense.

Also, I replaced lwe -> ゎ with lwa -> ゎ, I think that was just I typo by someone

Agreed, nice catch.

Speaking of which, just like ltsu get replaced by っ, they transliterate la to ぁ instead of ら, which is what WanaKana was doing.

This was before my time, but I guess the reasoning was the whole L/R combo sound of the R kanas. I prefer to follow the IME's here as well. The other day I was struggling to write a little ぁ and it should be possible to do with Wanakana. Supporting non-standard whacky romaji doesn't seem reasonable to me.

I also removed some tests that are checking for conversion of double consonants of ones that are not supposed to take a sokuon, according to Wikipedia, like mma -> っま, but because it was explicitly checked by the test, I don't know if this incorrect behaviour was intended or not.

I believe it was intentional since the other IMEs still convert these and I'd say it falls under the umbrella of expected behaviour, disregarding whether the Japanese is nonsense.
katta => かった, mumma => むっま

We should probably keep this in to keep backwards compat, and I don't really see a downside to allowing it.

Is that how you're supposed to use branches? This is the first time I'm contributing to a project that is not my own, so I'm not really sure how to use GitHub in this regard.

A branch per feature/change/fix is a reasonable rule.
It doesn't matter so much in this scenario since the branch is in your fork (pre-PR), not the Wanakana repo itself.

… to https://en.wikipedia.org/wiki/Sokuon"

This reverts commit 1a0c293.

Revert commit 1a0c293 that removed the
superfluous double consonants tests, because they seem to be intended
@DJTB
Copy link
Collaborator

DJTB commented Oct 23, 2017

Would you be willing to update toRomaji to work in the same fashion rather than having two different approaches?

#65 (dumb/simple conversion) and other potential future requests (revised hepburn etc) could be solved via customRomajiMapping option.

@Geggles
Copy link
Contributor Author

Geggles commented Oct 23, 2017

Sure. Right now I'm first going to revert the double consonant test commit and then implement the sokuon even for consonants that don't actually take one in Japanese, by simply changing the sokuonConsonants variable to include those consonants as well. Then I'll go to university (it's 8:42 right now) and when I have time on my hands, I will implement the toRomaji conversion.

@DJTB
Copy link
Collaborator

DJTB commented Nov 14, 2017

I really appreciate what you're doing here, I've just been a bit busy.

Are you waiting on me to take a look at the katakanaToHiragana stuff that's going on?
Or you've just been busy as well! 😄

@Geggles
Copy link
Contributor Author

Geggles commented Nov 17, 2017

Both, I guess 😆
For one, I've been extremely busy with university work right now, but it was also unclear as to what still needs to be done (aside from documentation, which needs to wait until we ultimately agree on the specification for the new functions), so I didn't really get around to doing any more work on this project lately.
As for the long dash thingy, I fail to see how it would be a problem if you can convert ゲーム to GEEMU, but not GEEMU to ゲーム, because the toRomaji and toKana functions do not claim to be inverses of each other, that wouldn't even make sense, as the romaji -> kana scheme that is being used by the common IMEs is a mixture of all kinds of romanization systems, so they can't be inverses. Obviously it would be nice if GEEMU would transliterate to ゲーム and such a feature could easily be achieved through the postprocessing functions, but I don't currently have the time on my hands to implement this. Is it really such a big deal?

@DJTB
Copy link
Collaborator

DJTB commented Nov 26, 2017

I'm with you on this, I can't see a valid reason either that it has to convert xEー to xEE. As you suggested, a postprocessing function could add this if needed.

I'm fairly busy at the moment as well; hoping to go through this soon since I prefer the approach.

I think it would also be a good idea once this has been merged (in a separate PR, probably by me) to simplify the punctuation conversion to just match Google IME. Since users will now be able to pass their own custom processing for wonderfully confusing choices like (the currently implemented...)

 ‘ ->「
 ’ -> 」
 [ ->[
 ] -> ]

@DJTB
Copy link
Collaborator

DJTB commented Dec 12, 2017

Merged to dev for completion.
Re-fork from there for additions.

@DJTB DJTB closed this Dec 12, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants