Wednesday, May 5, 2010

Kuten code to Unicode

So how can I get Unicode from this Ku-Ten code?

In this case, Ku is 16 and Ten is 01. The Ku-Ten system arranges characters in a 94-by-94 matrix.
A Ku-Ten code is not identical to a JIS code; you need a mapping.
JIS code avoids the byte values 0x00-0x20, so the mapping is to add 0x20 to both Ku and Ten.
For 16-01, the JIS code is not 3601, because Ku-Ten is usually written in decimal. I always make this mistake and end up with the wrong character.
So the first step is to convert the pair to hex (10-01), then add 0x20 to each part. The result is 3021, which is the JIS code value.

perl -e 'printf("%x%x\n", 16+32, 1+32)'
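The same arithmetic can be sketched as a small Python function. The name kuten_to_jis and the range check are my own additions for illustration; the only rule taken from the text above is "add 0x20 to Ku and Ten each".

```python
def kuten_to_jis(ku, ten):
    """Return the two-byte JIS code for a decimal Ku-Ten pair.

    Hypothetical helper: it only applies the "add 0x20 to each part"
    rule described in this post.
    """
    # Ku and Ten index a 94-by-94 matrix, so both must be in 1..94.
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("Ku and Ten must both be in the range 1..94")
    # High byte comes from Ku, low byte from Ten.
    return ((ku + 0x20) << 8) | (ten + 0x20)


print(f"{kuten_to_jis(16, 1):04X}")  # prints 3021
```

Passing Ku and Ten as decimal integers avoids the 3601 mistake, because the hex conversion happens only when the result is formatted.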

Once you have the JIS code, there is a mapping table available to Unicode (JIS X 0213 to Unicode).

3-3021 U+4E9C #

The prefix '3' identifies the plane; it is followed by the JIS code and then the Unicode code point.
At the top of the above page, there is an explanation of the planes.

## 0-XX ISO/IEC 646 IRV (designated by '1b 28 42')
## 3-XXXX JIS X 0213:2004 plane 1 (designated by '1b 24 28 51')
## 4-XXXX JIS X 0213:2000 plane 2 (designated by '1b 24 28 50')

Somehow, '3' means plane 1 in this table.
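As a rough sketch, a line of the mapping table can be split into prefix, JIS code and Unicode code point like this. The names PREFIX_TO_SET and parse_mapping_line are hypothetical, and the pattern assumes only the simple "prefix-code U+XXXX" form quoted in this post; any other line shape is skipped rather than guessed at.

```python
import re

# Prefix meanings copied from the header lines quoted above.
PREFIX_TO_SET = {
    "0": "ISO/IEC 646 IRV",
    "3": "JIS X 0213:2004 plane 1",
    "4": "JIS X 0213:2000 plane 2",
}

# Assumed line shape: "<prefix>-<hex code> U+<hex code point> ..."
LINE_RE = re.compile(r"^([0-9A-F])-([0-9A-F]+)\s+U\+([0-9A-F]+)(?=\s|$)")


def parse_mapping_line(line):
    """Return (prefix, jis_code, code_point), or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None  # header/comment lines and any other form are ignored
    prefix, code, code_point = m.groups()
    return prefix, int(code, 16), int(code_point, 16)


prefix, jis, cp = parse_mapping_line("3-3021 U+4E9C #")
print(PREFIX_TO_SET[prefix], f"{jis:04X}", f"U+{cp:04X}")
# prints JIS X 0213:2004 plane 1 3021 U+4E9C
```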
From Wikipedia:
Plane 1 is a superset of JIS X 0208 containing kanji sets level 1 to 3 and non-kanji characters such as Hiragana, Katakana (including letters used to write the Ainu language), Latin, Greek and Cyrillic alphabets, digits, symbols and so on. Plane 2 contains only level 4 kanji set. Total number of the defined characters is 11,233.

So common characters are included in plane 1, and so are infrequently used characters like this one.
This is a Level 3 character but it is still in plane 1: Ku 90, Ten 17.
Its JIS code is 0x7A31 and its Unicode code point is U+7E11.
3-7A31 U+7E11 # [2000]
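Going the other way is a handy sanity check: subtract 0x20 from each byte of the JIS code to recover Ku and Ten. This is a minimal sketch; jis_to_kuten is a name I made up for illustration.

```python
def jis_to_kuten(jis):
    """Return (ku, ten) in decimal for a two-byte JIS code.

    Hypothetical helper: it simply undoes the "add 0x20" rule.
    """
    ku = (jis >> 8) - 0x20     # high byte -> Ku
    ten = (jis & 0xFF) - 0x20  # low byte -> Ten
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("not a valid two-byte JIS code")
    return ku, ten


print(jis_to_kuten(0x7A31))  # prints (90, 17)
```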

Some of the characters are outside the Unicode BMP (i.e. above U+FFFF).
3-776C U+247F1 # [2000] [Unicode3.1]
4-2177 U+20381 # [2000] [Unicode3.1]
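One practical consequence is that these code points do not fit in a single 16-bit unit, so UTF-16 stores them as a surrogate pair. The sketch below shows the standard UTF-16 arithmetic; utf16_units is a hypothetical helper name, and the result is cross-checked against Python's built-in utf-16-be codec.

```python
def utf16_units(code_point):
    """Return the UTF-16 code units for a Unicode code point."""
    if code_point <= 0xFFFF:
        return [code_point]  # inside the BMP: a single unit
    v = code_point - 0x10000
    # High surrogate carries the top 10 bits, low surrogate the bottom 10.
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


for cp in (0x4E9C, 0x247F1, 0x20381):
    units = utf16_units(cp)
    # Cross-check against the built-in codec.
    encoded = chr(cp).encode("utf-16-be")
    assert encoded == b"".join(u.to_bytes(2, "big") for u in units)
    print(f"U+{cp:04X}", [f"{u:04X}" for u in units])
# prints:
# U+4E9C ['4E9C']
# U+247F1 ['D851', 'DFF1']
# U+20381 ['D840', 'DF81']
```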