Date: 2010-11-09 22:50:00
kindle 3 and chinese text
A few weeks ago I started a beginner level Chinese (Mandarin) evening class at the local community college. We're about four weeks in and I'm really enjoying it, will definitely do more.

I thought it would be an interesting project to load a Chinese-English dictionary onto my Kindle for reference. I've already played with kindlegen, which takes a collection of HTML files along with some additional metadata and creates a MobiPocket .mobi format file for the Kindle. (I've got several ideas for reference documents that could be loaded onto the Kindle, more on that later.)

Anyway, I've now got a prototype Chinese dictionary loaded onto the Kindle. However, although many of the characters are displayed correctly, quite a few are just empty boxes (the ⬜ missing-glyph placeholder). I couldn't find any particular pattern to the missing characters; for example, 他 (he) displayed correctly but 她 (she) did not.

Some research with Google showed that some people had hacked the Kindle and found a way to load different font files onto the device. I didn't try doing that, because I thought there should be a better solution. After all, Chinese text is displayed correctly in the Kindle web browser! Furthermore, if I copy a UTF-8 format file directly to the Kindle for display, it will have the missing character problem. But if I email the file to Amazon where they automatically reformat it to a .azw book file and deliver it to the Kindle, the characters show up correctly. If I convert to a MobiPocket book file myself, missing characters.

After laboriously paging through dozens of threads on mobileread.com (typical internet "forum" software really is awful), I finally found a solution. It uses an undocumented debug feature of the Kindle. Press <Home> <Search> and enter the following commands:

    ;debugOn <Enter>
    ~setLocale zh-CN <Enter>
That's it, that is all that was required. I have no idea what this actually does or how it changes the Kindle's interpretation of UTF-8 documents (UTF-8 is supposed to be an unambiguous encoding of Unicode). But I now have a very basic Chinese-English dictionary in my Kindle.
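To convince myself the encoding really is unambiguous, here's a quick sanity check (Python used purely for illustration, nothing Kindle-specific): the two characters above map to distinct code points and distinct UTF-8 byte sequences, so whatever goes wrong happens after decoding, at the font/rendering level.

```python
# 他 (he) and 她 (she) are distinct Unicode code points with
# distinct, unambiguous UTF-8 encodings.
he, she = "他", "她"

print(f"U+{ord(he):04X}", he.encode("utf-8"))    # U+4ED6 b'\xe4\xbb\x96'
print(f"U+{ord(she):04X}", she.encode("utf-8"))  # U+5979 b'\xe5\xa5\xb9'

# The bytes round-trip exactly -- no locale is involved at the
# encoding level, only when a glyph is chosen for display.
assert he.encode("utf-8").decode("utf-8") == he
assert she.encode("utf-8").decode("utf-8") == she
```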
[info]pne
2010-11-09T10:31:35Z
a way to load different font files onto the device. I didn't try doing that, because I thought there should be a better solution. After all, Chinese text is displayed correctly in the Kindle web browser!

I wonder whether the reason is that the Kindle is not displaying Chinese text, but rather Japanese text!

That would explain why characters that are used in Japanese (such as 他) would display but characters that aren't used in Japanese (such as 她) would not.

And might explain why setting the locale to P.R. China would work: it might select a different default font, one with coverage of Simplified Chinese characters rather than Japanese ones.
[info]ghewgill
2010-11-09T18:09:03Z
That's an interesting theory. The Japanese class I took years ago didn't introduce any kanji, just hiragana and some katakana. As a result, I know next to nothing about Japanese kanji and didn't realise this difference!
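There is a rough way to test this theory (my own sketch, not anything Kindle-specific): Japanese fonts typically cover the JIS X 0208 repertoire, which is what Python's shift_jis codec encodes. A character that fails to encode is one a Japanese-only font is unlikely to carry, and the two characters from the post come out exactly as the theory predicts.

```python
# Rough proxy for "is this character in a typical Japanese font?":
# try to encode it as Shift JIS (the JIS X 0208 repertoire).
def in_japanese_repertoire(ch: str) -> bool:
    try:
        ch.encode("shift_jis")
        return True
    except UnicodeEncodeError:
        return False

print(in_japanese_repertoire("他"))  # True  -- displayed on the Kindle
print(in_japanese_repertoire("她"))  # False -- showed as an empty box
```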
[info]lithiana
2010-11-09T10:48:35Z
My understanding is that Unicode is not unambiguous for CJK characters -- it has a single codepoint that refers to a character written slightly differently in each language, and the application is supposed to detect which character to display based on the user's or document's indicated locale. However, I don't see why that would result in a character not displaying at all...

[info]pne
2010-11-09T12:31:18Z
My understanding is that Unicode is not unambiguous for CJK characters -- it has a single codepoint that refers to a character written slightly differently in each language, and the application is supposed to detect which character to display based on the user's or document's indicated locale.

That's true, too, due to "Han unification".

Most shapes are identical or at least very similar, but in some cases, there are noticeable differences. (Perhaps a comparison in Latin script terms, which allows for different character shapes, too, is hand-written vs. printed lower-case "a" or "g".)

However, I don't see why that would result in a character not displaying at all...

Because the font it uses doesn't cover every single Unicode character but only the ones it thinks will be needed?

If that were the reason, though, then it would mean that Amazon-provided eBooks (where the characters show up) must use a different font than self-made ones or plain-text UTF-8 files (where certain characters don't).
[info]ghewgill
2010-11-09T18:18:15Z
I have a vague understanding of Han unification; the 28 MB CJK Unified Ideographs PDF at unicode.org shows six columns with slightly different renderings for each code point (and many places where a code point is shown for just one or a few languages).

Perhaps the Kindle prefers to show characters in a consistent style based on the language setting, rather than mixing up different styles because not all characters are available in each style. Since it's a reading device, I can appreciate that.
[info]goulo
2010-11-10T09:50:07Z
Surely it's better to see some of the characters in a different style/font/whatever than to see an empty box...? (Like sometimes I visit a poorly designed webpage that has text in one font, except all the Esperanto characters are in some other font. Ugly, but far better than not even showing the Esperanto characters.)

I don't know much about this particular issue. It sounds like a disaster, if there are Unicode characters that don't have a single meaning but depend on having barely documented settings set correctly on one's device. Was there really not a better way for Unicode to be designed? I thought the whole point of it was to avoid this kind of alternate character set problem...?
[info]mskala
2010-11-10T17:22:49Z
Han unification has always been controversial, but there are some reasons for it. The claim of its proponents is that having two different styles of the same character is a much different thing from having two actually different characters. Look, for instance, at the lowercase letter "a" in English. In some typefaces that's a basically circular shape with a line added along the right-hand side (single-decker); a vertical line down the middle passes through the letter twice. In other typefaces, the loop is pushed to the bottom and the line along the right-hand side also curves up and over, forming a second, open, loop (double-decker). A vertical line through that kind of "a" passes through the letter three times. Single-decker is more common in handwriting and double-decker is more common in print. This is a bigger difference than other differences (like that between C and G) that are considered to represent "different letters" by users of English; nonetheless, the two versions of "a" are considered to be "the same letter" by users of English, and learners of the language just have to learn both shapes.

What if a Chinese person decreed that because it sounds like a disaster if there are characters that don't have a single meaning, everybody in the world should be required to use different code points for those "two" "different" letters?

Now, the counter-argument would be that because the differences among styles of Han characters correlate with language, there's a bigger gap involved - as might be the situation if English nearly always used the single-decker "a," French nearly always used the double-decker, and many people routinely learned only the one for their own language and didn't recognize the other. It's apparent that there is a continuum on which you can draw a line to say "things on this side represent really different pairs of characters that should get different code points; things on that side represent different styles that should be handled by the font and not the code." Maybe Unicode drew their line in the wrong place. But my point is that it's not by any means a no-brainer.
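Where Unicode drew that line is visible from Python's unicodedata (a sketch of my own): the unified ideograph 直, a commonly cited example that renders noticeably differently in Chinese and Japanese typefaces, is a single code point -- just as both shapes of "a" share U+0061 -- while characters judged genuinely different stay distinct.

```python
import unicodedata

# One code point per "unified" character, whatever the regional shape:
# 直 (U+76F4) looks different in Chinese and Japanese fonts, but it is
# one character to Unicode -- the font (chosen by locale) picks the glyph.
assert ord("直") == 0x76F4
print(unicodedata.name("直"))  # CJK UNIFIED IDEOGRAPH-76F4

# Single-decker and double-decker "a" are likewise one code point:
assert ord("a") == 0x61

# Characters Unicode judged to be actually different stay distinct:
assert ord("他") != ord("她")
```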
[info]goulo
2010-11-10T17:58:58Z
Thanks for the explanation! That is helpful, but also leaves me confused. I always thought (perhaps quite erroneously) that at its core, Unicode was not so much about mere appearance as about the logical function/purpose of the characters. The zillions of different possible appearances of the letters "a" (depending on which typeface one uses) are all still the same character, surely. I cannot imagine the disaster if every distinct "a" from every typeface (new ones being created daily) should be considered a distinct character...

I agree it's blurry though, e.g. there are Unicode symbols for "arrow pointing right" or whatever which have some inherent definitional restrictions on how they could reasonably look.

But thinking about this Han unification thing (about which I know virtually nothing but what I've read here), if it's analogous to the different appearances of "a" in different typefaces, then again, I don't see what the problem was in Greg's Kindle. It would be as if "a" showed up in Times Roman but not in Arial - we'd say "WTF, that font is broken, it doesn't have all the characters it should have!" :) Sure, you can't expect every font to have every character, but you'd expect it when they're kind of related in that way. E.g. a Polish font still has "v" and "q" even though the Polish language doesn't use those letters. Ah well, I can see it's one of those annoying "the real world language situation is too chaotic and messy to be captured in an elegant mathematical model" type deals. :)
[info]mskala
2010-11-10T18:34:28Z
Unicode was not so much about mere appearance as about the logical function/purpose of the characters. - Yes, and that's why they did Han unification; to give the same numbers to characters they decided were "the same" in some important way even if they looked different in different languages. The consequence is that if you use a font designed for one language to write another (within CJK), it ends up looking wrong even if it has glyphs for all the necessary code points; then there's the separate issue of whether it DOES have glyphs for all the necessary code points.

It sounds like Greg's Kindle, for whatever reason, was looking for a Japanese font first, and then when that font didn't have glyphs for some of the code points in the Chinese text, it was failing to substitute them from some other font but just showing missing characters. I don't know why; he proposes it's to avoid mixing styles. Another possibility might be that it (or that particular piece of software) simply doesn't have font substitution implemented at all, as a matter of corner-cutting rather than intelligent design.
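The difference between those two possibilities is easy to sketch (a toy model of my own; real renderers are far more involved, and the "fonts" here are just hypothetical sets of covered characters): a renderer with substitution walks a chain of fonts and only gives up after all of them, while one without it stops at the first font -- which matches the empty-box behaviour exactly.

```python
# Hypothetical fonts modelled as sets of the characters they cover.
JAPANESE_FONT = {"他", "本", "日"}        # shared/Japanese characters only
CHINESE_FONT = {"他", "她", "本", "日"}   # also covers Simplified Chinese

def render(text, font_chain):
    """Pick a glyph from the first font in the chain that has one;
    fall back to an empty box only when no font covers the character."""
    out = []
    for ch in text:
        glyph = next((ch for font in font_chain if ch in font), "⬜")
        out.append(glyph)
    return "".join(out)

# No substitution (Japanese font only): 她 becomes an empty box.
print(render("他她", [JAPANESE_FONT]))                # 他⬜
# With a fallback chain, every character finds a glyph.
print(render("他她", [JAPANESE_FONT, CHINESE_FONT]))  # 他她
```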

The number of characters used by Chinese and not Japanese is in the thousands, so we can't really expect a font intended for use with Japanese (which will contain the Japanese styles of the shared characters, and thus look wrong for Chinese whether it has full character coverage or not) to also include all the Chinese characters just for completeness; it's a much taller order than hoping for a Polish font to contain "v" and "q."
Greg Hewgill <greg@hewgill.com>