Yuccablog

Saturday, September 02, 2006

The paradox of Unicode adoption

Unicode works in casual memos but not in books


Everyone can use Unicode these days, as long as you work with a reasonably new computer system and use software like a common word processor (that’s the PC phrase for MS Word) or even Notepad. Everyone and his dog can also create web pages using Unicode without trying hard. You can compose e-mail in Unicode, though the odds are that many recipients will see it distorted by the software they use; webmail systems are particularly primitive when it comes to anything beyond ASCII characters. Among Internet discussion forums, there are already many alternatives that let you write and read Unicode easily, as long as you’ve learned how to type the characters.

But if you try to write a book or an article for a printed publication, you will typically be in deep trouble if you use anything beyond the Latin 1 repertoire. Everything works fine in your word processor, but as soon as the text reaches the publisher’s system, characters will get munged in imaginative ways. Widely used publishing software like FrameMaker or InDesign just doesn’t grok Unicode yet. Trouble also lies ahead if you try to enter characters beyond Latin 1 into a database in the naïve expectation that databases are generally Unicode-enabled.
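Before sending a manuscript off, it can help to know exactly which characters are at risk. Here is a minimal sketch, assuming Python 3 and taking "beyond Latin 1" to mean any code point above U+00FF; the function names and the sample string are made up for illustration.

```python
import unicodedata


def chars_beyond_latin1(text):
    """Return the distinct characters above U+00FF, sorted by code point."""
    return sorted({ch for ch in text if ord(ch) > 0xFF})


def report(text):
    """Describe each at-risk character as 'U+XXXX NAME'."""
    lines = []
    for ch in chars_beyond_latin1(text):
        # Fall back to a placeholder for code points without a Unicode name.
        name = unicodedata.name(ch, "UNNAMED")
        lines.append(f"U+{ord(ch):04X} {name}")
    return lines


# Hypothetical sample: the i-diaeresis and e-acute are inside Latin 1,
# while s-caron, pi and the "almost equal" sign are not.
sample = "na\u00efve caf\u00e9, \u0161koda, \u03c0 \u2248 3.14"

for line in report(sample):
    print(line)
```

A list like this at least tells you which characters to check in the proofs, or which ones will have to become images.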

In practice, you should probably accept the fact that in a printed publication, anything beyond Latin 1 needs to be rendered as images. This is fairly stupid, especially if you write about extended character repertoires, as I often do: you cannot show examples of special characters in running text.

I guess there is a possible solution in many cases, but it’s not acceptable to many publishing houses and typographers: the author prepares the entire material in MS Word and converts it to PDF format. That way the author can check the result easily and fix it as needed. You may need to create the PDF file using font embedding techniques, so that the file contains the fonts it needs. There are probably pitfalls, and many authors wouldn’t know how to handle the process, but I think the main objection is that such approaches are “primitive.” The real primitiveness, however, is in the limitations of current publishing software. Software that cannot handle characters beyond an 8-bit set in any reasonable way is comparable to a system that cannot handle the letters “x” and “y,” since to many languages and cultures, some “special” or “extra” characters are just as essential as “x” and “y” are in English.
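To make the 8-bit limitation concrete, here is a small sketch of what happens when text is forced through Latin 1. It assumes Python 3 and uses the standard `errors="replace"` option of `str.encode`; the function name is hypothetical, and real publishing systems may mangle characters in other ways.

```python
def squeeze_into_latin1(text):
    """Round-trip text through Latin 1, as an 8-bit-only system might.

    Characters outside the Latin 1 repertoire cannot be encoded, so
    the 'replace' error handler substitutes a question mark for each.
    """
    encoded = text.encode("latin-1", errors="replace")
    return encoded.decode("latin-1")


# The a-acute survives, but the r-caron (U+0159) is lost.
print(squeeze_into_latin1("Dvo\u0159\u00e1k"))  # prints "Dvo?ák"
```

For a Czech reader, that lost letter is no more optional than “x” or “y” is for an English one.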