Working with Textual Data: Be Prepared for Unexpected Problems

  • February 8, 2007
  • By Alex Gusev

Objectives

The objectives of this article are straightforward. It brings together several different aspects of textual data processing, including conversions between encodings and BSTR usage.

It's All about UNICODE

As you probably know, Windows CE is built as a native UNICODE system, just like Windows NT. Symbian OS does the same. This means that every character occupies exactly two bytes; in other words, the system uses UCS-2 encoding. This isn't the only possible choice; for example, wide characters on Linux are typically four bytes each (UCS-4). How does this relate to your application? Well, you have at least two cases here.

The simpler one is when you start developing a brand-new, shiny application for a mobile platform from scratch. When you design it, you need to keep in mind what kind of data it will work with so that you can predict possible clashes. Besides, the system is UNICODE inside whether you like it or not.

The second scenario is when you try to compile or port your existing desktop code for Windows Mobile. The very first outcome is that it might cost you a few hours (or maybe even days) to get it working if you weren't lucky enough to have wrapped all your text literals in encoding-safe macros, such as _T(), TEXT(), or whatever else you use for this purpose. The same is true for character-based types such as LPSTR or char, which have to be replaced by LPTSTR and TCHAR respectively.
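To see why these macros make code encoding-safe, here is a minimal, portable sketch of the mechanism. It does not use the real Windows headers: DEMO_UNICODE, DemoChar, DEMO_T, DemoLength, and kTitle are hypothetical names that only play the roles of the UNICODE switch, TCHAR, and _T().

```cpp
#include <cstddef>

// Hypothetical switch; stands in for the UNICODE/_UNICODE defines.
#define DEMO_UNICODE 1

#if DEMO_UNICODE
typedef wchar_t DemoChar;      // plays the role of TCHAR
#define DEMO_T(s) L##s         // plays the role of _T() / TEXT()
#else
typedef char DemoChar;
#define DEMO_T(s) s
#endif

// Counts characters up to the terminator; compiles for either character type.
std::size_t DemoLength(const DemoChar* text) {
    std::size_t n = 0;
    while (text[n] != 0) {
        n++;
    }
    return n;
}

// One source line, two possible meanings depending on the switch above.
const DemoChar* const kTitle = DEMO_T("Hello");
```

With the switch set to 1, kTitle is a wide string; set it to 0 and the same source compiles as a narrow one. Code that hardcodes char and unprefixed literals has no such escape hatch, which is where the porting hours go.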

Another aspect of the same "type-change" problem is that, in most cases, there are two versions of the same Win32 API function: one for wide character arguments (with a "W" suffix) and one for multibyte ones (with an "A" at the end). It shouldn't be a big problem, though. A number of fixes here and there, and everything takes off happily. But what if you handle non-UNICODE textual data, such as ASCII? In this case, you have no choice but to either perform the required conversions or continue processing the data in its original encoding. The latter might cost you additional time to adjust the code. This is where conversions to and from UNICODE for different types of text encodings come in.

ASCII, UTF-8, What's More?

Consider the scenario in which your application has to process text input that is not UNICODE. This is quite a common case: the data may be prepared on a PC as a plain and simple UTF-8 (or ASCII) file, a field in a database may be declared as a single-byte char type, and you can think up any number of examples of your own. Moreover, if you use files, they may have localized names. The bottom line is that you have to decide how to deal with all this stuff.

Because a binary tree pattern for use-cases seems to be employed in this article, you might consider two possible solutions for conversions: using the existing CRT functions or the Win32 APIs. The former may be a natural choice for your application, but the real problem is that CRT functions such as mbstowcs() don't work correctly with all code pages; for example, they fail on part of the narrow Katakana table. Hence, if you have to target such languages, the Win32 APIs are the only choice left:

WINBASEAPI
int
WINAPI
MultiByteToWideChar(
   IN UINT     CodePage,
   IN DWORD    dwFlags,
   IN LPCSTR   lpMultiByteStr,
   IN int      cbMultiByte,
   OUT LPWSTR  lpWideCharStr,
   IN int      cchWideChar);

WINBASEAPI
int
WINAPI
WideCharToMultiByte(
   IN UINT     CodePage,
   IN DWORD    dwFlags,
   IN LPCWSTR  lpWideCharStr,
   IN int      cchWideChar,
   OUT LPSTR   lpMultiByteStr,
   IN int      cbMultiByte,
   IN LPCSTR   lpDefaultChar,
   OUT LPBOOL  lpUsedDefaultChar);

Focus on the first two parameters of the functions above. CodePage defines the code page you're going to convert to or from. For most cases, you may choose CP_UTF8. As the documentation says, both functions work faster if no dwFlags are set. Unless you have to work with more complex characters, such as "è", that's all you need for back and forth text conversions. One more useful feature of these functions is that they return the buffer size required for the conversion if you pass 0 as the size of the output buffer. So, a typical call may look like this:

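// First, get the required buffer length in wide characters
// (it includes the terminating null because cbMultiByte is -1)
int nSize = ::MultiByteToWideChar(CP_UTF8, 0, pMultibyteBuffer,
                                  -1, NULL, 0);
if (nSize > 0)
{
   // Allocate it
   WCHAR* pWideBuffer = new WCHAR[nSize];
   // Second, make the conversion
   ::MultiByteToWideChar(CP_UTF8, 0, pMultibyteBuffer, -1,
                         pWideBuffer, nSize);
   // ... use pWideBuffer here, then release it
   delete [] pWideBuffer;
}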
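For comparison, the CRT route follows the same measure-then-convert pattern. The sketch below is only an illustration: CrtToWide is a hypothetical helper name, and it relies on the common extension (specified by POSIX and also supported by Microsoft's CRT) where a null destination makes mbstowcs() return the required length. The result depends on the current C locale, which is exactly the limitation that makes this route unsuitable for some code pages.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: converts a multibyte string to wide characters
// with the CRT. Returns false if the input is not valid in the
// current C locale.
bool CrtToWide(const char* src, std::wstring& out) {
    // First pass: a null destination only asks for the required length
    // (the count excludes the terminating null).
    std::size_t needed = std::mbstowcs(nullptr, src, 0);
    if (needed == static_cast<std::size_t>(-1)) {
        return false;
    }

    // Second pass: convert into a buffer with room for the terminator.
    std::vector<wchar_t> buffer(needed + 1);
    std::size_t written = std::mbstowcs(buffer.data(), src, buffer.size());
    if (written == static_cast<std::size_t>(-1)) {
        return false;
    }

    out.assign(buffer.data(), written);
    return true;
}
```

Note one difference from the Win32 call: here the measured length excludes the terminator, so the buffer needs one extra element.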

The second parameter, dwFlags, controls how those functions handle composite characters (like "è", which consists of "e" as the base character and a "grave accent" as a nonspacing character) and invalid characters. You can experiment with the dwFlags value on your own.
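To make the composite character case concrete, the following sketch spells "è" both ways in UTF-8: as the single precomposed code point U+00E8, and as "e" followed by U+0300 (combining grave accent). The names kPrecomposed, kDecomposed, and SameBytes are made up for this illustration.

```cpp
#include <string>

// "è" as one precomposed code point, U+00E8, encoded in UTF-8.
const std::string kPrecomposed = "\xC3\xA8";

// "è" as the base letter "e" followed by U+0300 COMBINING GRAVE ACCENT.
const std::string kDecomposed = "e\xCC\x80";

// Plain byte-wise comparison: it treats the two spellings as different.
bool SameBytes(const std::string& a, const std::string& b) {
    return a == b;
}
```

Both strings render as the same character, yet they differ in length (2 bytes versus 3) and compare as unequal. This is why it matters which form a conversion produces.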

Verifying Text Input

Now, consider the following situation: You have some multibyte text input, which should be treated as UTF-8, but you need to check it so that you can reject other encodings. This is not so uncommon; you may have a file in some national encoding, such as Japanese Shift-JIS. Hence a natural requirement: verify the input before passing it to the conversion functions. According to the UTF-8 table, it can be done similarly to the following code snippet:

BOOL IsCorrectUTF8Buffer(LPCSTR pMultiByteBuf, DWORD dwNumBytes)
{
   // Quick check of adjacent byte pairs; not a full UTF-8 validator
   for (DWORD i = 1; i < dwNumBytes; i++)
   {
      // 1. current byte is a continuation byte (10xxxxxx)
      if ( (pMultiByteBuf[i] & 0xC0) == 0x80 )
      {
         // 2. ...but the previous byte is plain ASCII, so no
         //    multibyte sequence was started
         if ( (pMultiByteBuf[i-1] & 0x80) == 0x00 )
         {
            return FALSE;
         }
      }
      else
      {
         // 3. current byte is not a continuation byte, but the
         //    previous byte is a lead byte (11xxxxxx) that needs one
         if ( (pMultiByteBuf[i-1] & 0xC0) == 0xC0 )
         {
            return FALSE;
         }
      }
   }

   return TRUE;
}

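The function above simply scans the input buffer and stops at the first pair of adjacent bytes that cannot occur in UTF-8 text. It is a quick check rather than a full validation; for example, it does not look at the first byte on its own or detect a sequence that is cut off at the end of the buffer. Even so, such an approach may help you detect an input error at an early stage and respond appropriately.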
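If you need a stricter check, one option is to decode each sequence and verify its length, its continuation bytes, and the resulting code point. The following is a portable sketch under that assumption, not a replacement taken from any library; IsWellFormedUtf8 is a hypothetical name.

```cpp
#include <cstddef>

// Returns true if buf[0..len) is structurally valid UTF-8.
bool IsWellFormedUtf8(const unsigned char* buf, std::size_t len) {
    std::size_t i = 0;
    while (i < len) {
        unsigned char lead = buf[i];
        if (lead < 0x80) {            // plain ASCII byte
            i++;
            continue;
        }

        std::size_t extra;            // continuation bytes expected
        unsigned long minValue;       // smallest code point for this length
        unsigned long cp;             // decoded code point
        if ((lead & 0xE0) == 0xC0)      { extra = 1; minValue = 0x80;    cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; minValue = 0x800;   cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; minValue = 0x10000; cp = lead & 0x07; }
        else return false;            // stray continuation or invalid lead byte

        if (len - i <= extra) return false;   // sequence cut off at buffer end

        for (std::size_t k = 1; k <= extra; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < minValue) return false;                 // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate code point

        i += extra + 1;
    }
    return true;
}
```

Keep in mind that a structural check can only reject input; text in another encoding may occasionally pass by coincidence, so a TRUE result means "could be UTF-8", not "is certainly UTF-8".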





Page 1 of 2


