Wednesday, February 27, 2013

Surrogate pair from a code point

For example, search for the code point U+23103.

Tap the screen to open the detail info for the character.

UTF-16BE shows the surrogate pair, D84C DD03.
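
For anyone who wants to check the pair by hand, here is a minimal Python sketch of the standard UTF-16 surrogate calculation. The function name to_surrogate_pair is only for illustration; it is not something the app exposes.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF have a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x23103)
print(f"{high:04X} {low:04X}")  # D84C DD03
```

Python's built-in encoder agrees: chr(0x23103).encode("utf-16-be").hex() gives the same four bytes.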

Wednesday, January 30, 2013

Version 1.13 is on the App Store.

Update for iPhone 5.

Tuesday, March 6, 2012

Version 1.12 is on the App Store.

Update for Unicode 6.1: new blocks and characters. Glyphs for them are not available on iOS 5 yet.

Monday, September 19, 2011

Wednesday, February 23, 2011

Version 1.9 is on the App Store.

Update for Unicode 6.0

Tuesday, July 20, 2010

Wednesday, May 5, 2010

Kuten code to Unicode

So how can I get the Unicode code point from this Ku-Ten code?

In this case, Ku is 16 and Ten is 01. The Ku-Ten system represents characters in a 94-by-94 matrix.
A Ku-Ten code is not identical to the JIS code; you need a mapping.
JIS code avoids the values 0x00-0x20 in the encoding, so the mapping is to add 0x20 to both Ku and Ten.
For 16-01, the JIS code is not 3601, because Ku-Ten is usually written in decimal. I always make this mistake and end up with the wrong character.
So the first step is to convert them to hex, 10-01, and then add 0x20 to each, which gives 3021. That is the JIS code value.

perl -e 'printf("%x%x\n", 16+32, 1+32)'
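
The same arithmetic as a small Python sketch, with a range check added so that a Ku or Ten outside 1-94 is caught early. The name kuten_to_jis is made up for this example.

```python
def kuten_to_jis(ku: int, ten: int) -> int:
    """Convert a decimal Ku-Ten pair to the two-byte JIS code value."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("Ku and Ten must each be in the range 1-94")
    # Add 0x20 to each part, then pack Ku into the high byte and Ten into the low byte.
    return ((ku + 0x20) << 8) | (ten + 0x20)


print(f"{kuten_to_jis(16, 1):04X}")  # 3021
```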

Once you have the JIS code, there is a mapping to Unicode available (JIS X 0213 to Unicode).

3-3021 U+4E9C #

The prefix '3' indicates the plane; it is followed by the JIS code and then the Unicode code point.
At the top of the above page, there is an explanation of the planes.

## 0-XX ISO/IEC 646 IRV (designated by '1b 28 42')
## 3-XXXX JIS X 0213:2004 plane 1 (designated by '1b 24 28 51')
## 4-XXXX JIS X 0213:2000 plane 2 (designated by '1b 24 28 50')

Somehow, '3' means plane 1 in this table.
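
To pull the values out of a mapping line, something like the following Python sketch should do. It assumes lines have exactly the shape quoted in this post (prefix, hyphen, four hex digits, whitespace, then a single U+ value); anything else, including the ## header lines, is skipped. The names PREFIX_TO_PLANE and parse_mapping_line are my own.

```python
import re

# Prefix digit used in the table -> JIS X 0213 plane, as the table header describes.
PREFIX_TO_PLANE = {"3": 1, "4": 2}

# Assumed line shape: "3-3021 U+4E9C # ..." with a single code point.
LINE_RE = re.compile(r"^([34])-([0-9A-F]{4})\s+U\+([0-9A-F]{4,6})(?=\s|$)")


def parse_mapping_line(line):
    """Return (plane, jis_code, code_point), or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    prefix, jis_hex, cp_hex = m.groups()
    return PREFIX_TO_PLANE[prefix], int(jis_hex, 16), int(cp_hex, 16)


plane, jis, cp = parse_mapping_line("3-3021 U+4E9C #")
print(plane, f"{jis:04X}", f"U+{cp:04X}")  # 1 3021 U+4E9C
```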
From Wikipedia:
Plane 1 is a superset of JIS X 0208 containing kanji sets level 1 to 3 and non-kanji characters such as Hiragana, Katakana (including letters used to write the Ainu language), Latin, Greek and Cyrillic alphabets, digits, symbols and so on. Plane 2 contains only level 4 kanji set. Total number of the defined characters is 11,233.

So common characters are included in plane 1, and so are infrequently used characters like this one.
This is a level 3 character but it is still in plane 1: Ku 90, Ten 17.
Its JIS code is 0x7A31 and its Unicode code point is U+7E11.
3-7A31 U+7E11 # [2000]

Some of the characters are outside the Unicode BMP (i.e. above U+FFFF).
3-776C U+247F1 # [2000] [Unicode3.1]
4-2177 U+20381 # [2000] [Unicode3.1]
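
As a cross-check that does not need the mapping file, Python ships an euc_jis_2004 codec. The sketch below assumes the usual EUC layout (each JIS byte with the high bit set, and a 0x8F prefix for plane 2) and that the codec's table agrees with the mapping file above; I have not compared every entry. The function names are made up for this example.

```python
def jis_to_unicode(plane: int, jis_code: int) -> str:
    """Decode a JIS X 0213 code by way of its EUC-JIS-2004 byte sequence."""
    hi = (jis_code >> 8) | 0x80      # set the high bit on each JIS byte
    lo = (jis_code & 0xFF) | 0x80
    if plane == 1:
        euc = bytes([hi, lo])
    elif plane == 2:
        euc = bytes([0x8F, hi, lo])  # 0x8F selects plane 2 in EUC
    else:
        raise ValueError("plane must be 1 or 2")
    return euc.decode("euc_jis_2004")


def is_outside_bmp(ch: str) -> bool:
    """True if the character's code point is above U+FFFF."""
    return ord(ch) > 0xFFFF


for plane, jis in [(1, 0x3021), (1, 0x776C), (2, 0x2177)]:
    ch = jis_to_unicode(plane, jis)
    print(f"plane {plane} {jis:04X} U+{ord(ch):04X} outside BMP: {is_outside_bmp(ch)}")
```

Characters that report True here are the ones that need a surrogate pair in UTF-16.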