September 22, 2014

Internationalize and Localize Your C/C++ Code with ICU

  • February 10, 2006
  • By Victor Volkman

To paraphrase Jane Austen: "It is a truth universally acknowledged, that a successful application in possession of a good customer base must be in want of an internationalization strategy."

All joking aside, internationalization and localization are key parts of the maturation process of an application, whether it is deployed throughout a single enterprise or through a diverse customer base.

The Problem

The history of computing is littered with good ideas for encoding characters from various languages. Even for something as "straightforward" as English, I've had to deal with several different encodings in my career: ASCII, EBCDIC, and FIELDATA. When you expand your domain of interest to Asian languages—for example, Japanese—you find a wide variety of choices, including SJIS and JEUC.

The basic problem is that the same "glyph" (or character) has a different representation (coding) depending on your language locale. The Unicode manifesto offers a helpful starting place to avoid the mire of this alphabet soup: "Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language."

More or less, this means a unique number wider than the familiar 8 bits. ICU stores text as 16-bit units, though there is a lot more to it than just that: Unicode code points run up to U+10FFFF, so some characters need two 16-bit units. This article looks at how the International Components for Unicode (ICU) library can keep you from being hopelessly mired in the glyph alphabet soup.
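
As a rough illustration of the "lot more to it": Unicode code points run up to U+10FFFF, so the 16-bit form that ICU's UChar holds (UTF-16) stores the larger ones as a pair of 16-bit units. The sketch below uses only the standard C++ library; the name encodeUtf16 is made up for this example and is not an ICU function.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode code point as UTF-16 code units.
// Code points up to U+FFFF take a single 16-bit unit; anything
// above that is split into a high and a low surrogate.
std::u16string encodeUtf16(char32_t codePoint) {
    assert(codePoint <= 0x10FFFF);                     // Unicode's upper limit
    assert(codePoint < 0xD800 || codePoint > 0xDFFF);  // surrogates are not characters
    std::u16string out;
    if (codePoint <= 0xFFFF) {
        out.push_back(static_cast<char16_t>(codePoint));
    } else {
        char32_t v = codePoint - 0x10000;              // 20 significant bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}
```

Calling encodeUtf16(0x41) yields one unit, whereas encodeUtf16(0x1F600) yields two, which is why code that indexes a 16-bit string is not always indexing whole characters.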

ICU: International Components for Unicode

ICU is a mature, portable set of open source C/C++ and Java libraries for Unicode support and software internationalization. (Don't be put off by the fact that IBM provides the funding and support for ICU; it really is an open source initiative in the best sense.) It gives applications the same results on all platforms. Its major features revolve around locale-sensitive string comparison, formatting, text boundary detection, and character set conversion. So, what exactly does ICU offer?

  • Text: Unicode text handling, full character properties, and character set conversions (500+ code pages)
  • Analysis: Unicode regular expressions, full Unicode sets, and character, word, and line boundaries
  • Comparison: Language-sensitive collation and searching
  • Transformations: Normalization, upper/lowercase, and script transliterations
  • Locales: Comprehensive data and resource bundle architecture (200+ locales)
  • Complex Text Layout: Arabic, Hebrew, Indic, and Thai
  • Formatting and Parsing: Multi-calendar and time zone, dates, times, numbers, currencies, and messages

Examples in this article require that you download a version of ICU. Although they show code in Win32 environments, rest assured that ICU has precompiled binaries for AIX 5.2, HPUX 11.11, Red Hat Linux 3.0, Solaris 9, and Visual Studio .NET 2003. With source code in hand, you can build your own binaries for these environments plus Mac OS X, Cygwin, MinGW, BSD, QNX, and many other popular platforms. Running the built-in test suite is highly recommended whenever creating your own binaries.

ICU Text Boundary Analysis

Although many people will use ICU just for codeset conversion and localization, both are difficult to show in a short demo. Instead, this article delves into another useful piece of functionality that ICU provides: text boundary analysis.

Text boundary analysis is the process of locating linguistic indicators when formatting or parsing text. For example, if the user double-clicks into the middle of a word in a word processor, you have to be able to figure out where the word starts and ends to highlight it. Other applications might include automatic capitalization, word counts, or constructing a concordance. The demo shows how a BreakIterator object can solve these problems independent of language encoding. An ICU BreakIterator can locate boundaries of characters, words, line-breaks, and sentences.
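
To make the double-click case concrete, here is a deliberately naive, ASCII-only sketch of the lookup involved. It is not how ICU does it: a BreakIterator applies language-aware rules, whereas this stand-in just treats runs of letters and digits as words. The names isWordChar and wordAt are invented for the illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Naive definition of a "word character": ASCII letters and digits only.
bool isWordChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) != 0;
}

// Return the half-open range [start, end) of the word containing pos.
// If pos is not inside a word, return the empty range {pos, pos}.
std::pair<std::size_t, std::size_t> wordAt(const std::string& text,
                                           std::size_t pos) {
    if (pos >= text.size() || !isWordChar(text[pos])) {
        return {pos, pos};
    }
    std::size_t start = pos;
    while (start > 0 && isWordChar(text[start - 1])) {
        --start;                       // walk back to the word's first character
    }
    std::size_t end = pos + 1;
    while (end < text.size() && isWordChar(text[end])) {
        ++end;                         // walk forward past the word's last character
    }
    return {start, end};
}
```

This breaks down as soon as the text contains apostrophes, combining marks, or a script that does not separate words with spaces, which is exactly the gap a locale-aware iterator is meant to fill.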

To begin, install ICU into a convenient location; this example uses D:\ICU. The Visual Studio .NET 2003 command-line to build the application looks like this:

cl wrap.cpp /EHsc /Z7 /ID:\icu\include /link /LIBPATH:d:\icu\lib icuuc.lib /debug

And, of course, you must placate your old friend, the DOS PATH:

path=%path%;d:\icu\bin

The first thing the demo program needs to do is establish a locale. A "locale" includes information about the user's language, his or her country, and possibly other preferences. For example, "en_US" specifies English and USA conventions (for collation, currency, and calendar format), whereas "en_IE_PREEURO" specifies English in Ireland ("IE" is the ISO 3166 country code) with the pre-euro currency (the Irish pound). The demo simply hardcodes the arguments to the Locale constructor, although the proper method would be to read the locale out of the machine environment:

Locale myLoc("en", "US");
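
The language_COUNTRY_VARIANT shape of these identifiers can be sketched in a few lines of standard C++. This is only an illustration of the naming scheme; ICU's own Locale class does the real parsing, and the LocaleParts type and splitLocaleId function are hypothetical names introduced here.

```cpp
#include <string>

// Hypothetical helper type for this sketch (not part of ICU).
struct LocaleParts {
    std::string language;  // e.g. "en"
    std::string country;   // e.g. "IE"
    std::string variant;   // e.g. "PREEURO"
};

// Split an identifier of the form language[_COUNTRY[_VARIANT]] on '_'.
// Missing parts are left as empty strings.
LocaleParts splitLocaleId(const std::string& id) {
    LocaleParts parts;
    const std::string::size_type first = id.find('_');
    parts.language = id.substr(0, first);    // whole string if there is no '_'
    if (first == std::string::npos) {
        return parts;
    }
    const std::string::size_type second = id.find('_', first + 1);
    if (second == std::string::npos) {
        parts.country = id.substr(first + 1);
    } else {
        parts.country = id.substr(first + 1, second - first - 1);
        parts.variant = id.substr(second + 1);
    }
    return parts;
}
```

With this, splitLocaleId("en_IE_PREEURO") separates into "en", "IE", and "PREEURO", while "en_US" leaves the variant empty.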

Although you probably are used to thinking in terms of one character = one glyph (displayable symbol), this is not the case when you wander very far from English. For example, the glyph "ö" (lowercase "o" with an umlaut) could legally be represented by a single Unicode character (U+00F6) or, less obviously, by the "o" followed by a second code for the umlaut (U+006F U+0308). Indeed, this method of construction is what allows many Asian languages to be represented without their permutations overrunning the 65,536 values available in a 16-bit representation.
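
The two spellings of "ö" can be written out directly to show that code-unit count and glyph count diverge. The sketch below is a simplification: it recognizes only the Combining Diacritical Marks block (U+0300 to U+036F), whereas Unicode defines many more combining marks, and the function names are made up for the example.

```cpp
#include <cstddef>
#include <string>

// Precomposed form: one code unit, U+00F6.
const std::u16string precomposed = u"\u00F6";
// Decomposed form: "o" (U+006F) followed by combining diaeresis (U+0308).
const std::u16string decomposed = u"o\u0308";

// True for code units in the Combining Diacritical Marks block only.
// Assumption for this sketch: no other combining marks appear in the text.
bool isBasicCombiningMark(char16_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Count the code units that are not basic combining marks, which is a
// rough proxy for the number of glyphs a reader would see.
std::size_t countBaseCharacters(const std::u16string& text) {
    std::size_t count = 0;
    for (char16_t c : text) {
        if (!isBasicCombiningMark(c)) {
            ++count;
        }
    }
    return count;
}
```

Here precomposed.size() is 1 and decomposed.size() is 2, yet countBaseCharacters() reports one base character for each. A word-wrap routine that counts columns has to allow for this, which is why the demo code skips non-spacing marks.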

This demo's code uses a line-break iterator to figure out how to word-wrap a paragraph:

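#include "unicode/uchar.h"      // u_isspace(), u_charType()
#include "unicode/brkiter.h"    // BreakIterator
#include "unicode/unistr.h"     // UnicodeString
#include "unicode/locid.h"      // Locale
#include <iostream>     // for cout
using namespace std;    // for cout
U_NAMESPACE_USE         // needed when the ICU headers do not import namespace icu automatically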
int32_t wrapParagraph(const UnicodeString& s,
                   const Locale& locale,
                   int32_t lineStarts[],
                   int32_t trailingwhitespace[],
                   int32_t maxLines,
                   int32_t maxCharsPerLine,
                   UErrorCode &status) {
    int32_t        numLines = 0;
    int32_t        p=0, q;
    UChar          c;
    BreakIterator *bi =
       BreakIterator::createLineInstance(locale, status);
    if (U_FAILURE(status)) {
        delete bi;
        return 0;
    }
    bi->setText(s);
    while (p < s.length()) {
        // jump ahead in the paragraph by the maximum number
        // of characters that will fit
        q = p + maxCharsPerLine;
        // if this puts us on a white space character, a
        // control character (which includes newlines),
        // or a non-spacing mark, seek forward and stop on
        // the next character that is not any of these
        // things (since none of these characters will be
        // visible at the end of a line, we can ignore them
        // for the purposes of figuring out how many
        // characters will fit on the line)
        if (q < s.length()) {
            c = s[q];
            while (q < s.length() && (u_isspace(c)
                       || u_charType(c) == U_CONTROL_CHAR
                       || u_charType(c) == U_NON_SPACING_MARK)) {
                ++q;
                c = s[q];
            }
        }
        // then locate the last legal line-break decision
        // at or before the current position
        // ("at or before" is what causes the "+ 1")
        q = bi->preceding(q + 1);
        // if this causes us to wind back to where we
        // started, then the line has no legal
        // line-break positions. Break the line at the
        // maximum number of characters
        if (q == p) {
            // record where this line starts before moving on
            lineStarts[numLines] = p;
            trailingwhitespace[numLines] = 0;
            ++numLines;
            p += maxCharsPerLine;
        }
        // otherwise, we got a good line-break position.
        // Record the start of this line (p) and then seek
        // back from the end of this line (q) until you find
        // a non-white space character (same criteria as
        // above) and record the number of white space
        // characters at the end of the line in the other
        // results array
        else {
            lineStarts[numLines] = p;
            int32_t nextLineStart = q;
            for (q--; q > p; q--) {
                c = s[q];
                if (!(u_isspace(c)
                       || u_charType(c) == U_CONTROL_CHAR
                       || u_charType(c) == U_NON_SPACING_MARK)) {
                    break;
                }
            }
            trailingwhitespace[numLines] = nextLineStart - q -1;
            p = nextLineStart;
            ++numLines;
        }
        if (numLines >= maxLines) {
            break;
        }
    }
    delete bi;
    return numLines;
}

int main(int argc, char **argv)
{
  const int MAX_LINES=255;
  int32_t numLines, maxLines=MAX_LINES, lineStarts[MAX_LINES],
     trailingwhitespace[MAX_LINES];
  UErrorCode status=U_ZERO_ERROR;
  UnicodeString s1 = "Eschew obfuscation intentionally "
                     "recreating penultimate epistemological "
                     "valuation";
  UnicodeString s2 = s1 + s1 + s1;    // create a somewhat
                                      // longer string for
                                      // testing
  Locale myLoc("en", "US");
  numLines = wrapParagraph(s2, myLoc, lineStarts,
                           trailingwhitespace, maxLines,
                           70, status);
  for (int ii=0; ii<numLines; ii++)
    cout << "Line " << ii << " starts at pos "
         << lineStarts[ii] << endl;
}

The main() program simply builds up a string called s2, which contains the nonsense phrase that you are going to format into word-wrapped lines. Next, you create your locale as mentioned previously. Then, you call the wrapParagraph() function, which contains a lot of inline comments to explain its purpose.

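You can see from the output below that the lines do not all fill the requested 70 characters; the first one breaks after only 56. This is because the iterator breaks only at legal line-break positions and the test words are long (the shortest is six letters), so a word that does not fit moves whole to the next line: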

D:\icu>wrap
Line 0 starts at pos 0
Line 1 starts at pos 56
Line 2 starts at pos 125
Line 3 starts at pos 195
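
To turn those start positions into line lengths, subtract each start from the next one. The helper below is a small standard-C++ sketch (lineLengths is a name invented here, not part of the demo). It assumes the total text length is known: counting the sample phrase gives 81 characters, so the tripled test string is 243.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Given the start offset of every line and the total text length,
// return each line's length. Trailing white space is included in the
// count, because the next line starts only after it.
std::vector<std::int32_t> lineLengths(const std::vector<std::int32_t>& lineStarts,
                                      std::int32_t textLength) {
    std::vector<std::int32_t> lengths;
    for (std::size_t i = 0; i < lineStarts.size(); ++i) {
        // A line ends where the next one starts; the last line ends at the text's end.
        const std::int32_t end =
            (i + 1 < lineStarts.size()) ? lineStarts[i + 1] : textLength;
        lengths.push_back(end - lineStarts[i]);
    }
    return lengths;
}
```

For the starts 0, 56, 125, and 195 with a length of 243, this gives 56, 69, 70, and 48: no line exceeds the 70-character limit, and the short first line is the one that could not take the 15-letter "epistemological".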

Just the Tip of the ICU-berg

The ICU library can do a lot more than this, of course. It has a powerful regular expression library, a system for optimizing localization resources, a way to produce locale-friendly messages and numeric strings, normalization and transliteration services, and a whole lot more. Check it out.

Book Recommendation

Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard by Richard Gillam is a great way to start your education in globalization issues. The book covers the history of Unicode, normalization forms, storage, and serialization. It pays particular attention to implementation techniques such as conversion, searching and sorting, and rendering. Of course, no Unicode book would be complete without covering salient aspects of the world languages of Europe, the Middle East, Africa, and Asia.

About the Author

Victor Volkman has been writing for C/C++ Users Journal and other programming journals since the late 1980s. He is a graduate of Michigan Tech and a faculty advisor board member for the Washtenaw Community College CIS department. Volkman is the editor of numerous books, including C/C++ Treasure Chest, and is the owner of Loving Healing Press. He can help you in your quest for open source tools and libraries; just drop an e-mail to sysop@HAL9K.com.





