To paraphrase Jane Austen: “It is a truth universally acknowledged, that a successful application in possession of a good customer base must be in want of an internationalization strategy.”
All joking aside, internationalization and localization are key parts of an application's maturation, whether it is deployed throughout a single enterprise or to a diverse customer base.
The Problem
The history of computing is littered with good ideas for encoding characters from various languages. Even for something as “straightforward” as English, I’ve had to deal with several different encodings in my career: ASCII, EBCDIC, and FIELDATA. When you expand your domain of interest to Asian languages such as Japanese, you find a wide variety of choices, including SJIS (Shift JIS) and JEUC (Japanese EUC, also known as EUC-JP).
The basic problem is that the same “glyph” (or character) has a different representation (encoding) depending on your language locale. The Unicode manifesto offers a helpful starting place to avoid the mire of this alphabet soup: “Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.”
More or less, this means a unique 16-bit number as opposed to an 8-bit number, though there is a lot more to it than just that: Unicode code points now extend beyond the 16-bit range, and in UTF-16 the characters outside that range take two 16-bit code units. This article looks at how the International Components for Unicode (ICU) library can keep you from being hopelessly mired in the glyph alphabet soup.
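To make the problem concrete, here is a minimal sketch of how one character ends up with different byte values under different legacy encodings while keeping a single Unicode number. It uses Python's built-in codecs rather than ICU, purely for illustration; the helper names (byte_forms, code_point, utf16_units) are invented for this example and are not part of any library.

```python
# Illustration only: Python's built-in codecs stand in for ICU here.
# All helper names below are hypothetical, chosen for this sketch.

def byte_forms(char, encodings):
    """Map each encoding name to the hex bytes it uses for one character."""
    return {name: char.encode(name).hex() for name in encodings}

def code_point(char):
    """Format the single Unicode number assigned to a character."""
    return f"U+{ord(char):04X}"

def utf16_units(text):
    """Split text into its 16-bit UTF-16 code units (big-endian, as hex)."""
    data = text.encode("utf-16-be")
    return [data[i:i + 2].hex() for i in range(0, len(data), 2)]

hiragana_a = "\u3042"  # HIRAGANA LETTER A

# One character, two different legacy byte sequences.
print(byte_forms(hiragana_a, ["shift_jis", "euc_jp"]))
# {'shift_jis': '82a0', 'euc_jp': 'a4a2'}

# Even the English letter A differs between ASCII and EBCDIC (code page 037).
print(byte_forms("A", ["ascii", "cp037"]))
# {'ascii': '41', 'cp037': 'c1'}

# Unicode assigns one number regardless of platform or language.
print(code_point(hiragana_a))      # U+3042

# That number fits in a single 16-bit unit here ...
print(utf16_units(hiragana_a))     # ['3042']
# ... but a character beyond the 16-bit range needs two units.
print(utf16_units("\U0001F600"))   # ['d83d', 'de00']
```

The last two lines show why “16-bit” is only an approximation: any code that counts or indexes UTF-16 units has to allow for characters that occupy a pair of them.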