Custom mappings #70

Geggles · 2017-10-22T15:17:59Z

So after discussing issue #68, the toKana or rather the splitIntoKana function seemed to be too complex to allow for changes like having a custom mapping from romaji to kana. That's why I took it upon myself to create a new method for converting romaji to kana, in order to make things easier in the long run.

High level descriptions of the most important changes:

the mapping from romaji to kana is not a flat object anymore (FROM_ROMAJI) but a tree where each level is a romaji character (so instead of map['cha'] you now do map['c']['h']['a']['']); toKana still functions to the outside world as it did before
the options now have a customKanaMapping property, that defaults to the identity function; it takes as input a mapping tree from romaji to kana and returns a new tree (it does not modify the input in place) that is the updated mapping (the result has to be the complete mapping, not just the updates)
IMEMode and useObsoleteKana are also implemented as custom mappings; for compatibility, they are still used via of the options object, so you do not handle those mappings explicitly
double consonants like ttsu now form a unit by themselves, so they are only replaced once you write out the whole thing

Some issues:

The script works fine in Google Chrome, but npm run test gives me an error saying that Object.entries is not a function. I did manage to run the tests though by adding

const entries = require('object.entries');
if (!Object.entries) {
    entries.shim();
}

to the top of the constants.js file, but I'm not sure how this works (I have no experience with node.js whatsoever, I just used the example from object.entries).

I had to change the tests for the ん edge cases. I did that, because for example I think that nn should be replaced by ん in IME mode, but んん without IME mode. Is that wrong? Also, I replaced lwe -> ゎ with lwa -> ゎ, I think that was just I typo by someone, since neither Microsoft, nor Google's IMEs do that. Speaking of which, just like ltsu get replaced by っ, they transliterate la to ぁ instead of ら, which is what WanaKana was doing. I changed that, too. I also removed some tests that are checking for conversion of double consonants of ones that are not supposed to take a sokuon, according to Wikipedia, like mma -> っま, but because it was explicitly checked by the test, I don't know if this incorrect behaviour was intended or not.

Because these changes are quit severe and only one direction has been adapted so far (the other one is still implemented like it used to), I made them in a new branch (customMappings). Is that how you're supposed to use branches? This is the first time I'm contributing to a project that is not my own, so I'm not really sure how to use GitHub in this regard.

…omaji. This module is supposed to replace the constants in constants.js like FROM_ROMAJI. For now only romaji -> kana is implemented this way.

…s://en.wikipedia.org/wiki/Sokuon

…s translitterate la to a little a instead

…o a single token; Change expected results for n edge cases, because without IME mode, all ns should be transliterated

…ining

…te parameter of toKana

DJTB · 2017-10-23T05:25:34Z

At a glance, looks like there's some good things happening here.

double consonants like ttsu now form a unit by themselves, so they are only replaced once you write out the whole thing

This seems fine, but the static tests aren't adequate to ensure IME mode truly works correctly, will have to confirm manually with some different devices/keyboards.

The script works fine in Google Chrome, but npm run test gives me an error saying that Object.entries is not a function.

Is your local/dev environment node 6 or below? Object.entries is 7+
The CI tests run on node 7 & latest (currently 8.7.0)
n is a good version manager for local node installation.

I had to change the tests for the ん edge cases. I did that, because for example I think that nn should be replaced by ん in IME mode, but んん without IME mode. Is that wrong?

Hmm, good question. I guess that makes perfect sense.

Also, I replaced lwe -> ゎ with lwa -> ゎ, I think that was just I typo by someone

Agreed, nice catch.

Speaking of which, just like ltsu get replaced by っ, they transliterate la to ぁ instead of ら, which is what WanaKana was doing.

This was before my time, but I guess the reasoning was the whole L/R combo sound of the R kanas. I prefer to follow the IME's here as well. The other day I was struggling to write a little ぁ and it should be possible to do with Wanakana. Supporting non-standard whacky romaji doesn't seem reasonable to me.

I also removed some tests that are checking for conversion of double consonants of ones that are not supposed to take a sokuon, according to Wikipedia, like mma -> っま, but because it was explicitly checked by the test, I don't know if this incorrect behaviour was intended or not.

I believe it was intentional since the other IMEs still convert these and I'd say it falls under the umbrella of expected behaviour, disregarding whether the Japanese is nonsense.
katta => かった, mumma => むっま

We should probably keep this in to keep backwards compat, and I don't really see a downside to allowing it.

Is that how you're supposed to use branches? This is the first time I'm contributing to a project that is not my own, so I'm not really sure how to use GitHub in this regard.

A branch per feature/change/fix is a reasonable rule.
It doesn't matter so much in this scenario since the branch is in your fork (pre-PR), not the Wanakana repo itself.

… to https://en.wikipedia.org/wiki/Sokuon" This reverts commit 1a0c293. Revert commit 1a0c293 that removed the superfluous double consonants tests, because they seem to be intended

DJTB · 2017-10-23T06:39:17Z

Would you be willing to update toRomaji to work in the same fashion rather than having two different approaches?

#65 (dumb/simple conversion) and other potential future requests (revised hepburn etc) could be solved via customRomajiMapping option.

Geggles · 2017-10-23T06:44:44Z

Sure. Right now I'm first going to revert the double consonant test commit and then implement the sokuon even for consonants that don't actually take one in Japanese, by simply changing the sokuonConsonants variable to include those consonants as well. Then I'll go to university (it's 8:42 right now) and when I have time on my hands, I will implement the toRomaji conversion.

…ng function

…n mapping, without making any assumptions about IMEMode etc.

…abic_n], `ちゃっちゃ` should be transliterated as `chatcha`

… romanization methods enum

DJTB · 2017-11-14T10:28:11Z

I really appreciate what you're doing here, I've just been a bit busy.

Are you waiting on me to take a look at the katakanaToHiragana stuff that's going on?
Or you've just been busy as well! 😄

Geggles · 2017-11-17T21:44:34Z

Both, I guess 😆
For one, I've been extremely busy with university work right now, but it was also unclear as to what still needs to be done (aside from documentation, which needs to wait until we ultimately agree on the specification for the new functions), so I didn't really get around to doing any more work on this project lately.
As for the long dash thingy, I fail to see how it would be a problem if you can convert ゲーム to GEEMU, but not GEEMU to ゲーム, because the toRomaji and toKana functions do not claim to be inverses of each other, that wouldn't even make sense, as the romaji -> kana scheme that is being used by the common IMEs is a mixture of all kinds of romanization systems, so they can't be inverses. Obviously it would be nice if GEEMU would transliterate to ゲーム and such a feature could easily be achieved through the postprocessing functions, but I don't currently have the time on my hands to implement this. Is it really such a big deal?

DJTB · 2017-11-26T01:43:45Z

I'm with you on this, I can't see a valid reason either that it has to convert xEー to xEE. As you suggested, a postprocessing function could add this if needed.

I'm fairly busy at the moment as well; hoping to go through this soon since I prefer the approach.

I think it would also be a good idea once this has been merged (in a separate PR, probably by me) to simplify the punctuation conversion to just match Google IME. Since users will now be able to pass their own custom processing for wonderfully confusing choices like (the currently implemented...)

 ‘ ->「
 ’ -> 」
 [ ->［
 ] -> ］

DJTB · 2017-12-12T00:10:50Z

Merged to dev for completion.
Re-fork from there for additions.

Geggles added 14 commits October 22, 2017 14:03

Create new module for anything related to the maps between kana and r…

42cca56

…omaji. This module is supposed to replace the constants in constants.js like FROM_ROMAJI. For now only romaji -> kana is implemented this way.

Adapt for the new mapping system

5e81867

Remove the old FROM_ROMAJI constant and adapt the default settings

c21a2c1

Remove double consonants that do not take a sokuon, according to http…

1a0c293

…s://en.wikipedia.org/wiki/Sokuon

Remove conversion of la to ra etc. because Google and Microsoft's IME…

bc2d0a0

…s translitterate la to a little a instead

Replace lwe by lwa, suspect is was a typo by somebody

41e4da4

Without IME mode, 'nn' should be translitterated as double n

d81b49d

Change specification of splitIntoKana, because it now groups ttsu int…

6e6b8ea

…o a single token; Change expected results for n edge cases, because without IME mode, all ns should be transliterated

Return an empty list when there is nothing to commit and nothing rema…

67cc4f9

…ining

Ignore case is now a part of the options object, rather than a separa…

1f16bfe

…te parameter of toKana

Update IME_MODE_MAP to transliterate n[space] and nn to a single n

372d6d6

Remove unnecessary branch

f23226e

Remove branch that is never used

771fd82

Built

4fd7d75

Revert "Remove double consonants that do not take a sokuon, according…

56af4a0

… to https://en.wikipedia.org/wiki/Sokuon" This reverts commit 1a0c293. Revert commit 1a0c293 that removed the superfluous double consonants tests, because they seem to be intended

Geggles added 12 commits October 23, 2017 09:03

Add sokuon to everything

a0228cf

Double l is still wrong, though

2a304d6

Started with the kana -> romaji conversion

21980bb

Freeze the result tree

c925a95

Simplify the splitIntoKana function to use the new generic applyMappi…

c99d9dc

…ng function

Create the new applyMapping function that parses a string with a give…

21a0de0

…n mapping, without making any assumptions about IMEMode etc.

Some cleaning up

1025416

Built

8bb40b7

Commenting out the kana->hepburn map, as there are no tests yet

c839582

Split up the kana mapping into multiple files

e5200c2

Add additional options

b6f4741

Little ke another oversight?

30d0f92

Geggles and others added 12 commits November 2, 2017 19:20

why should small tsu not transliterate?

e33c4f6

According to [https://en.wikipedia.org/wiki/Hepburn_romanization#Syll…

ad0d133

…abic_n], `ちゃっちゃ` should be transliterated as `chatcha`

Why do you not want to convert the ー?

6012c90

Add tests for custom mapping

69753ed

Make toRomaji more similar to toKana

7103978

Add post processing to toKana

d9468de

Add getKanaPostProcessing function

e49cc6d

Complete the hepburn mapping

13c9f7d

add romanization method option

8ce3acc

expose the function to create a custom mapping from an object and the…

31dea71

… romanization methods enum

built

a55a7f7

Merge branch 'master' into customMappings

3c8c291

DJTB mentioned this pull request Nov 3, 2017

MacOS native IME input composition #71

Closed

DJTB closed this Dec 12, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Custom mappings #70

Custom mappings #70

Geggles commented Oct 22, 2017

DJTB commented Oct 23, 2017

DJTB commented Oct 23, 2017

Geggles commented Oct 23, 2017

DJTB commented Nov 14, 2017

Geggles commented Nov 17, 2017

DJTB commented Nov 26, 2017 •

edited

Loading

DJTB commented Dec 12, 2017

Custom mappings #70

Custom mappings #70

Conversation

Geggles commented Oct 22, 2017

High level descriptions of the most important changes:

Some issues:

DJTB commented Oct 23, 2017

DJTB commented Oct 23, 2017

Geggles commented Oct 23, 2017

DJTB commented Nov 14, 2017

Geggles commented Nov 17, 2017

DJTB commented Nov 26, 2017 • edited Loading

DJTB commented Dec 12, 2017

DJTB commented Nov 26, 2017 •

edited

Loading