Application servers must be able to process international text. That is, they must be able to process non-English content that is encoded using standards other than ISO 8859-1 or US-ASCII. The HTTP 1.1, Servlet 2.3, and JSP 1.2 specifications each have specific requirements that address character encoding issues such that non-ISO 8859-1 encoded characters will be handled properly. This article looks at these requirements and discusses how servlet and JSP containers, referred to as servlet containers, process arbitrarily encoded JSPs, and how servlets produce arbitrarily encoded content.
Even before the World Wide Web and Web-based applications, traditional client-server systems presented unique internationalization challenges for software engineers. For today’s Web-based J2EE applications, aimed at marketplaces that are increasingly global in scope, the internationalization challenges are exponentially greater. For example, the client and the server may well be operating in different geographical locations, each with different linguistic and cultural conventions. Yet, the two (or more) pieces of the application are expected to interoperate smoothly. Developers of Web-based applications must be able to ensure that a server can simultaneously service multiple clients, each in distinct locales, or a community of users with the same linguistic and cultural expectations.
Increasing the level of complexity for application developers, Web content (HTML) can be stored and delivered in diverse “character encoding” schemes beyond the familiar US-ASCII encoding. Fortunately, major Web browsers support the detection of these diverse character encodings and can display them properly. The Unicode Universal Character Set (UCS) and its associated character encodings expand the versatility of dynamically generated content by enabling the creation of multi-lingual applications.
Character encoding is a method or algorithm for representing characters in digital form by mapping sequences of character code points into sequences of bytes. Character encoding schemes result in varying storage requirements, depending on the character sets that they encode. For example, in the Unicode UTF-16 encoding, a character in the 16-bit Basic Multilingual Plane is encoded in two bytes. In the ISO 8859-1 encoding (often referred to as “Latin-1”, the default character encoding in many Web-related specifications), the size of an encoded character is one byte. In the Unicode UTF-8 multi-byte encoding, however, the size of a character can be one, two, or three bytes (for the 16-bit Basic Multilingual Plane).
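The storage differences described above can be checked with a few lines of Java. This is a minimal sketch using only the JDK; the class name EncodedSizes and the sample characters are chosen for illustration and do not come from any specification.

```java
import java.nio.charset.StandardCharsets;

class EncodedSizes {
    // Number of bytes the text occupies in UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes in UTF-16; the big-endian variant is used so no byte-order mark is added.
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // Number of bytes in ISO 8859-1 (only meaningful for characters Latin-1 can represent).
    static int latin1Length(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    public static void main(String[] args) {
        // Letter A, e with acute accent, and one Hangul syllable, written as escapes
        // so the example does not depend on the encoding of this source file.
        String[] samples = { "A", "\u00E9", "\uD55C" };
        for (String s : samples) {
            System.out.printf("U+%04X utf8=%d utf16=%d%n",
                    (int) s.charAt(0), utf8Length(s), utf16Length(s));
        }
    }
}
```

The three samples need one, two, and three bytes in UTF-8 respectively, while each needs two bytes in UTF-16.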
Unfortunately, the term “character set” and the common contraction “charset” are often overused, overloaded, and ambiguous. The most common meaning, typically used in Request For Comment (RFC) specifications, is “character encoding.” In the context of JavaServer Pages (JSP) and servlet development, “charset” means “character encoding.”
To be compliant with the JSP 1.2, Servlet 2.3, and J2EE 1.3 specifications, servlet containers must be able to handle the different character encoding schemes that are used to represent data. For example, a JSP that contains or generates Korean-encoded text must be processed such that the content it produces is also encoded properly, so that the browser can display it correctly.
There are two distinct areas in which character encoding must be handled properly: sending data to the client and receiving data from the client. This paper discusses the I18N requirements of servlet-container implementations in these areas.
Sending Data to the Client
Application developers create applications using a mixture of servlets and JSPs. Handling character encodings in servlets is straightforward. The servlet specification provides two APIs to set the character encoding of the data that is being returned to the client: ServletResponse.setLocale() and ServletResponse.setContentType(). Using these methods causes the HTTP Content-Type header of the response message to be set with the appropriate charset attribute, thus allowing the browser to display the content correctly.
JSPs are much trickier. A JSP is transformed into a servlet and then executed to produce content. The steps for processing a JSP are:
- Translate JSP to XML
- Translate XML to Java
- Compile Java to Class
- Execute Class to produce content (HTML, for example)
Prior to processing the JSP, however, the character encoding of the JSP itself must be determined; in other words, the container needs to find out the character encoding the author used when the file was created. Prior to the JSP 1.2 specification, the contentType attribute of the <%@ page … %> directive was used to indicate both the encoding of the page and the encoding of the output response. This overloaded use of contentType for both the page encoding and the output response encoding is resolved in the JSP 1.2 specification with the introduction of the new “pageEncoding” attribute. “pageEncoding” allows the developer to indicate the encoding of the JSP file separately from the contentType.
Sections 3.1 and 3.2 of the JSP 1.2 specification deal specifically with the character encoding of the JSP file. One algorithm for determining a JSP’s character encoding is as follows:
- Assume that the encoding is ISO 8859-1 until the appearance of either the pageEncoding or contentType attributes of the JSP page directive.
- The pageEncoding attribute defines the encoding of the page. If it is not present, the behavior defaults to pre-JSP 1.2: The contentType attribute’s “charset” defines the encoding of the page. If neither is present, the encoding is ISO 8859-1.
- The contentType attribute indicates the encoding of the entire response. If it is not present, the response encoding is ISO 8859-1.
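The rules above can be sketched as a small resolver. This is an illustration only, assuming the page directive has already been parsed into its attribute values; the class and method names (JspEncodingResolver, pageEncoding, responseEncoding) are hypothetical and not part of any container API.

```java
import java.util.Locale;

class JspEncodingResolver {
    static final String DEFAULT = "ISO-8859-1";

    // Extracts the charset parameter from a contentType value, or null if there is none.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                return p.substring("charset=".length()).trim();
            }
        }
        return null;
    }

    // Encoding used to read the JSP file: pageEncoding, else the contentType charset, else the default.
    static String pageEncoding(String pageEncodingAttr, String contentTypeAttr) {
        if (pageEncodingAttr != null) {
            return pageEncodingAttr;
        }
        String cs = charsetOf(contentTypeAttr);
        return cs != null ? cs : DEFAULT;
    }

    // Encoding of the response: the contentType charset, else the default.
    static String responseEncoding(String contentTypeAttr) {
        String cs = charsetOf(contentTypeAttr);
        return cs != null ? cs : DEFAULT;
    }

    public static void main(String[] args) {
        System.out.println(pageEncoding(null, "text/html;charset=EUC-KR"));
        System.out.println(responseEncoding("text/html"));
    }
}
```

Passing null for an attribute stands for "attribute not present in the directive".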
The following three examples show JSP character encoding statements.
Example 1: The page will be read as UTF-8 and the response will be UTF-8 encoded.
<%@ page language="java" import="com.myco.*" pageEncoding="utf-8" contentType="text/html;charset=utf-8" %>
Example 2: The page will be read as ISO 8859-1 and the response will be ISO 8859-1 encoded.
<%@ page language="java" import="com.myco.*" contentType="text/html" %>
Example 3: The page will be read as EUC-KR and the response will be EUC-KR encoded.
<%@ page language="java" import="com.myco.*" contentType="text/html;charset=EUC-KR" %>
Processing a JSP
Step 1 — Translate the JSP to XML
The JSP is initially read in via a stream using the Java platform’s default character encoding. When the JSP page directive is encountered, determine the page character encoding according to the algorithm stated above. If a character encoding other than ISO 8859-1 has been specified, then re-read the JSP page using that character encoding.
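- FileReader fr = new FileReader( "foo.jsp" ); // first pass: platform default encoding
- FileInputStream fis = new FileInputStream( "foo.jsp" ); // second pass: reopen the file as a byte stream
- String encoding = "ISO-8859-1"; // replace with the encoding named in the page directive
- InputStreamReader ir = new InputStreamReader( fis, encoding );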
Included JSPs present an interesting problem, especially if there is more than one occurrence of the pageEncoding attribute due to multiple occurrences of the page directive from included JSPs. If pageEncoding were a per-page attribute, not a per-translation unit attribute, then you could conceivably have included pages in differing encodings. However, according to the JSP 1.2 specification, section 2.10.1:
“The page directive defines a number of page-dependent properties and communicates these to the JSP container. A translation unit (JSP source file and any files included via the include directive) can contain more than one instance of the page directive, all the attributes will apply to the complete translation unit (i.e., page directives are position independent). However, there shall be only one occurrence of any attribute/value defined by this directive in a given translation unit with the exception of the “import” attribute; multiple uses of this attribute are cumulative (with ordered set union semantics). Other such multiple attribute/value (re)definitions result in a fatal translation error.”
Thus, the JSP 1.2 specification does not permit multiple pageEncodings in a translation unit. A fatal translation error must be raised if any attribute other than the “import” attribute is specified more than once in a translation unit, for example once in a JSP and again in a JSP that it includes.
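One way a translator might enforce this rule is to collect page directive attributes for the whole translation unit and reject redefinitions. The sketch below is an assumption about how that could look; the class PageDirectiveAttributes is hypothetical, and IllegalStateException stands in for whatever translation error a real container raises.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class PageDirectiveAttributes {
    private final Map<String, String> single = new HashMap<>();
    private final Set<String> imports = new LinkedHashSet<>();

    // Records one attribute from any page directive in the translation unit.
    void add(String name, String value) {
        if (name.equals("import")) {
            imports.add(value); // cumulative, with ordered set union semantics
            return;
        }
        if (single.containsKey(name)) {
            throw new IllegalStateException(
                    "fatal translation error: attribute redefined: " + name);
        }
        single.put(name, value);
    }

    String get(String name) {
        return single.get(name);
    }

    Set<String> imports() {
        return imports;
    }

    public static void main(String[] args) {
        PageDirectiveAttributes attrs = new PageDirectiveAttributes();
        attrs.add("import", "com.myco.*");
        attrs.add("pageEncoding", "EUC-KR");   // from the including page
        try {
            attrs.add("pageEncoding", "UTF-8"); // from an included page
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```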
Once the XML is generated from the JSP, you need to decide what character encoding should be used to save the generated XML to the file system. In processing the JSP file from its native encoding to an internal Java String representation, a conversion to Unicode was performed implicitly. Therefore, using UTF-8 may be the best choice (<?xml version="1.0" encoding="UTF-8"?>). This avoids the step of converting back to the native encoding. Since most trans-coding operations involve table lookups, they will generally be slower than conversion to UTF-8, which is algorithmic.
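Writing the intermediate XML as UTF-8 needs nothing more than an OutputStreamWriter with an explicit charset. The sketch below writes to memory rather than to the file system to stay self-contained; the class name XmlViewWriter is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class XmlViewWriter {
    // Serializes the XML body as UTF-8 bytes, preceded by a matching XML declaration.
    static byte[] toUtf8Xml(String body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            w.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            w.write(body);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = toUtf8Xml("<jsp:root>\uD55C</jsp:root>");
        System.out.println(bytes.length + " bytes written");
    }
}
```

Naming the charset explicitly matters here: an OutputStreamWriter created without one falls back to the platform default, which is exactly the dependency this step is trying to avoid.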
Step 2 — Translate the XML to Java
When the Java source file is produced from the intermediate XML, the issue of the encoding of the output file comes into play again. With Java source files, however, the situation is complicated by the varying capabilities of Java compilers, as well as the “default” encoding of the platform on which the compiler is executed. Not all Java compilers can handle Java source files that have been created using encodings other than ISO 8859-1 or the native encoding of the platform (e.g. EBCDIC).
To mitigate the risk that a less capable compiler may be used, all characters in the Java source file that fall outside the range of readable US-ASCII characters (less than 0x0020 or greater than 0x007E) can be converted to their Java “Unicode escape sequence” equivalent (i.e. “\uXXXX”). This applies to static content (HTML) as well as Java code, so that string literals, character literals, and Java code itself that contains non-US-ASCII characters are converted properly.
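A conversion of this kind can be sketched as follows. The class name UnicodeEscaper is hypothetical. As a deliberate simplification, the sketch leaves tab, carriage return, and line feed untouched, on the assumption that line structure is handled separately when the Java code is generated; a Unicode escape for a line terminator inside a string literal would end the line and break compilation.

```java
class UnicodeEscaper {
    // Replaces every char outside printable US-ASCII (0x20 to 0x7E) with a
    // backslash-u escape of four hex digits. Tab, CR, and LF are kept as they are.
    static String escape(String source) {
        StringBuilder sb = new StringBuilder(source.length());
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            boolean lineStructure = c == '\t' || c == '\r' || c == '\n';
            if (!lineStructure && (c < 0x20 || c > 0x7E)) {
                sb.append(String.format("\\u%04X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A generated source line containing two Hangul syllables.
        System.out.println(escape("String s = \"\uD55C\uAE00\";"));
    }
}
```

Because the loop works on Java chars, a character outside the Basic Multilingual Plane comes out as two escapes (its surrogate pair), which is the form Java source expects.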
Special characters such as quotes (single and double), tabs, and new lines must be considered when generating the Java code. When we process static content and create Java code for it, we are placing the content in a Java “out.print()” statement. When we process scriptlet code, we are creating (echoing) Java statements. The quotes and new lines must be handled differently in each case. If we are handling a Java scriptlet, we must echo the character; but if we are handling static content, we must “backslash” escape the character. Here is an example.
Start with the following JSP:
<%@page %><HTML> <BODY>"Hello"<% String a = "\"hello\""; out.println( a );%></BODY></HTML>
When we convert this to XML, we get something like:
<?xml version="1.0"?><jsp:root> <jsp:directive.page … /> <jsp:cdata><![CDATA[<HTML>\r\n <BODY>"Hello"\r\n]]></jsp:cdata> <jsp:scriptlet>String a="\"hello\"";\r\n out.println( a );\r\n</jsp:scriptlet> <jsp:cdata><![CDATA[\r\n</BODY>\r\n</HTML>]]></jsp:cdata></jsp:root>
When we produce the Java code, we should get something like:
out.print( "<HTML>\r\n <BODY>\"Hello\"\r\n" );String a="\"hello\"";out.println( a );out.print( "\r\n</BODY>\r\n</HTML>" );
Note that even though it looks like we escaped the quotes in the statement that assigned “hello” to the variable a (String a="\"hello\"";), we actually only echoed the original scriptlet. Also note that we backslash-escaped the quotes around “Hello” in the HTML content in the first out.print statement.
This is similar in concept to double indirection in C and C++. The HTML content is translated into Java code that produces HTML content as its output, whereas the Java scriptlet code is only one level of indirection as it translates directly to Java code. The set of characters that must be backslash escaped contains:
- ' (single quote)
- " (double quote)
- \ (backslash)
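The escaping of static content can be sketched like this. The class and method names are hypothetical; in addition to the three characters listed above, the sketch also escapes carriage return, line feed, and tab, since the earlier discussion notes that tabs and new lines need the same care.

```java
class StaticContentEscaper {
    // Turns template text into the body of a Java string literal.
    static String toJavaLiteral(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\'': sb.append("\\'");  break;
                case '\r': sb.append("\\r");  break;
                case '\n': sb.append("\\n");  break;
                case '\t': sb.append("\\t");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // Wraps escaped static content in an out.print() statement.
    // Scriptlet code would NOT go through this method; it is echoed unchanged.
    static String toPrintStatement(String text) {
        return "out.print( \"" + toJavaLiteral(text) + "\" );";
    }

    public static void main(String[] args) {
        System.out.println(toPrintStatement("<BODY>\"Hello\"\r\n"));
    }
}
```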
Step 3 — Compile Java to Class
As we mentioned previously, there are two ways to compile the generated Java code to a class file. One way is to use a compiler that supports a range of encodings for Java source files and specify the encoding in which the source file was generated. Because not all compilers support this feature, a better way is to use the platform’s default encoding and have all the strings in the source rendered as Java Unicode escape sequences. This is the more portable way, as it should work on all platforms regardless of the character encoding that was used in the original JSP.
Step 4 — Execute Class to Produce HTML
If the translation and compilation processes described in the previous sections are implemented correctly, the HTML output from the executing servlet will be encoded properly. The HTTP Content-Type header will be set to the encoding that was specified in the “contentType” attribute of the page directive (recall that the “pageEncoding” attribute has no effect on the HTTP Content-Type header). Of course, the HTTP Content-Type header can be overridden by using standard servlet APIs.
Receiving Data from the Client
HTTP POST Requests
One of the “dark corners” of JSP/servlet programming is the handling of form data from HTTP POST requests. Many Web-based applications are form-driven and use POST operations to submit the form data. When the user submits the form data, the browser packages the form data for a POST message and sends it to the server. Form data values are URL (%HH) encoded (“application/x-www-form-urlencoded”) for transport, but typically no information (i.e., no “charset” specification in the HTTP Content-Type header) is sent describing the “underlying” encoding of the form data as entered by the user. Most browsers take a hint from the server and will use the same encoding for form POST data as the encoding in which the form page was delivered.
Without any specification of the character encoding used for the form data, by the time the application server has processed the request and the data is accessed via the HttpServletRequest.getParameter() method, it has already been converted to String values from the underlying byte stream with an assumed ISO 8859-1 encoding — which may very likely be the wrong encoding. Prior to the Servlet 2.3 specification, the only recourse the developer had was to process the raw POST request directly. Attempts to convert the data from ISO 8859-1 to another encoding may not have worked.
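The effect of assuming the wrong encoding is easy to reproduce outside a container with java.net.URLDecoder. In this sketch the POST value is a hypothetical one (two Hangul syllables submitted as UTF-8), and the class name FormValueDecoding is invented for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class FormValueDecoding {
    // Decodes one application/x-www-form-urlencoded value using the named charset.
    static String decode(String rawValue, String charset) {
        try {
            return URLDecoder.decode(rawValue, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String raw = "%ED%95%9C%EA%B8%80"; // hypothetical form value, six UTF-8 bytes

        String intended = decode(raw, "UTF-8");      // what the user typed
        String assumed = decode(raw, "ISO-8859-1");  // what a Latin-1 assumption yields

        System.out.println(intended.length() + " characters when read as UTF-8");
        System.out.println(assumed.length() + " characters when read as ISO 8859-1");
    }
}
```

Read as UTF-8, the six bytes become two characters; read as ISO 8859-1, every byte becomes a separate and unrelated character.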
The Servlet 2.3 specification provides an API specifically addressing this shortcoming in browsers. The ServletRequest.setCharacterEncoding() method must be called by the servlet/JSP developer prior to reading request parameters or reading input using the getReader() method (i.e., prior to calling any APIs which process the HTTP message) in order for the data to be read in the proper character encoding.
Existing Web-based applications (applications that were written before the Servlet 2.3 specification) present a challenge to servlet container implementers. In order to accommodate existing web applications that do not make use of the setCharacterEncoding() method, servlet containers should provide the ability to specify a default character encoding for incoming POST data. Robust J2EE application servers such as the HP Bluestone Total-E-Server and the HP Internet Server/HP Application Server provide configurable options to address this issue on a per-application basis.
UTF-8 encoded URLs (HTTP GET Requests)
Over time, the use of URLs has evolved from a simple scheme to describe a path to a resource to a generic mechanism for passing request data to a service via the HTTP protocol. In order to increase the usefulness of the URL mechanism, some Web-server vendors now provide support for UTF-8 encoded URLs. UTF-8 URLs are supported by Microsoft IIS and Apache.
As an example of the use of UTF-8 encoded URLs, consider the following URL:
Here a UTF-8 encoded sequence of characters is being sent to an application using the %HH “URL encoding” format. Upon decoding the URL, the Web server would recognize the characters as outside of the US-ASCII range and treat them as UTF-8 encoded Unicode characters. In this case, there are two characters, each encoded with three bytes.
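To illustrate how two characters become six %HH-encoded bytes, the following sketch encodes a hypothetical two-character value with java.net.URLEncoder. The characters and the class name Utf8UrlExample are chosen for the example and are not taken from any particular URL.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class Utf8UrlExample {
    // Converts the value to UTF-8 and then applies %HH encoding to the unsafe octets.
    static String encodeQueryValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // Two Han characters, each three bytes long in UTF-8.
        System.out.println(encodeQueryValue("\u65E5\u672C"));
    }
}
```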
One problem with earlier versions of the Unicode UTF-8 specification was that it did not explicitly disallow multiple encodings for the same Unicode character. Because UTF-8 is a multi-byte encoding, a given Unicode character like “\” (U+005C REVERSE SOLIDUS, the backslash) could be encoded several ways: using the correct single-byte encoding, a double-byte encoding, or a triple-byte encoding:
The double-byte and triple-byte encodings carry additional leading zero-value bits, which a lenient decoder ignores.
The ability to encode Unicode characters several ways with UTF-8 proved to be a security vulnerability. Normally, Web servers check for “malicious” URLs, which, for example, might be requests for documents outside of the server’s document root, such as in the following URL:
This type of URL may pass the Web server’s Intrusion Detection System (IDS) checks but upon UTF-8 decoding, would allow access to “…./path/somefile.exe” — which is outside of the document root.
Web server and commercial IDS vendors have had to bolster their IDSs against these types of attacks. The Unicode Consortium also contributed to the effort by tightening the UTF-8 specification to clearly define illegal UTF-8 sequences.
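A strict decoder that follows the tightened definition rejects such sequences outright. The sketch below uses the JDK's own UTF-8 decoder configured to report malformed input; the class name StrictUtf8 is hypothetical, and the byte values in main are overlong forms of U+005C computed for this illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {
    // Returns the decoded text, or null if the bytes are not well-formed UTF-8.
    static String decodeOrNull(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] shortest = { 0x5C };                                  // the only legal form
        byte[] twoByte = { (byte) 0xC1, (byte) 0x9C };               // overlong form
        byte[] threeByte = { (byte) 0xE0, (byte) 0x81, (byte) 0x9C }; // overlong form

        System.out.println(decodeOrNull(shortest) != null);
        System.out.println(decodeOrNull(twoByte) != null);
        System.out.println(decodeOrNull(threeByte) != null);
    }
}
```

Checking a path for traversal sequences only after a strict decode of this kind closes the gap between what the filter sees and what the file system eventually receives.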
RFC 2718 (Guidelines for new URL Schemes) suggests the following in Section 2.2.5:
“When describing URL schemes in which (some of) the elements of the URL are actually representations of sequences of characters, care should be taken not to introduce unnecessary variety in the ways in which characters are encoded into octets and then into URL characters. Unless there is some compelling reason for a particular scheme to do otherwise, translating character sequences into UTF-8 (RFC 2279)  and then subsequently using the %HH encoding for unsafe octets is recommended.”
Because UTF-8 is the recommended encoding scheme for URLs, this feature (along with the requisite IDS implementation) will have to be provided by all commercial Web servers.
The ServletResponse.setLocale() method is unclear on how the “charset” attribute of the HTTP Content-Type header is supposed to be set. According to the Servlet 2.3 specification, when the setLocale() method is called, the Content-Type’s charset must be updated “appropriately.” However, the specification does not recommend or mandate a mapping of particular character encoding schemes for Locales, and the Locale class does not provide methods for getting and setting character encodings.
The servlet container should provide a mechanism that allows the application server system administrator to define a Locale-to-character encoding mapping. A properties file or an XML file can be used to define the country code and language code that map to a particular character encoding. The country code should be an uppercase ISO 3166 2-letter code and the language code should be a lowercase ISO 639 code. The character encoding should be an IANA encoding name. When the servlet executes the setLocale() method, the servlet container will consult the mapping and set the HTTP Content-Type header appropriately.
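A properties-based mapping of this kind could look like the sketch below. The key format (language, or language_COUNTRY), the fallback order, and the class name LocaleCharsetMap are all assumptions made for illustration; a real container defines its own configuration format.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.Properties;

class LocaleCharsetMap {
    private final Properties map = new Properties();

    // Loads entries such as "ko_KR=EUC-KR" or "ja=Shift_JIS" from properties text.
    LocaleCharsetMap(String propertiesText) {
        try {
            map.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Looks up language_COUNTRY first, then language alone, then falls back to ISO-8859-1.
    String charsetFor(Locale locale) {
        String lang = locale.getLanguage();
        String country = locale.getCountry();
        String cs = country.isEmpty() ? null : map.getProperty(lang + "_" + country);
        if (cs == null) {
            cs = map.getProperty(lang);
        }
        return cs != null ? cs : "ISO-8859-1";
    }

    public static void main(String[] args) {
        LocaleCharsetMap mapping = new LocaleCharsetMap("ja=Shift_JIS\nko_KR=EUC-KR\n");
        // The value a container might place in the Content-Type header after setLocale().
        System.out.println("text/html; charset=" + mapping.charsetFor(Locale.KOREA));
    }
}
```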
This article has presented the internationalization issues faced by J2EE Web application developers and servlet container implementers. JSP/servlet container implementations must be able to handle arbitrary character encodings in both the processing of JSP files and the creation of an appropriate output response stream, as well as in the handling of form data and encoded URLs. Robust, full-featured JSP/servlet containers, such as those provided in the HP Application Server (HP-AS), provide these internationalization features to enable the development of internationalized Web applications. Web application developers can create and save JSP files in their native encodings and multi-lingual applications can be developed for international electronic commerce sites that tailor their content to the locale of the client.
Developers dealing with internationalization challenges will find additional resources from Web sites or developer portals that some vendors support. For example, the developer resource guide at the HP developers site is one such portal, providing various reports and subject-specific presentations by HP developers that can facilitate development projects. Also, HP Middleware contains more application-server information.
- The list of Java’s supported encodings and canonical names: Java’s encoding site
- A description of how Java handles encodings and their aliases: Java’s packaging site
- The Unicode Consortium Web site
- The UTF-8 Corrigendum to Unicode 3.0.1
- Hacker, Eric, “IDS Evasion with Unicode”
- The following specifications contain specific requirements concerning character encoding:
- RFC 2616: Hypertext Transfer Protocol — HTTP/1.1. Sections 3.4.1, 3.7.1, 14.2, 14.17
- RFC 822 section 3.1.2
- JSP 1.2 sections 3.1, 3.2, 3.3
- Servlet 2.3 sections 4.9, 5.4 and the following API’s:
About the Authors
Scott W. Ruch is an Architect in the HP OpenView Research and Development organization. During his tenure as I18n Architect for the HP Middleware Division, Scott’s work included the development of I18n trail maps and best practices for HP software engineers. Scott has over ten years of software design and development experience in operating systems and telecommunications network management systems. Scott holds a Bachelor of Science in Electrical Engineering from Drexel University and a Master of Science in Software Engineering from Monmouth University.
JJ Snyder is a Senior Architect in the HP Middleware Division. He has been programming in C/C++ and Java for over 15 years. He was the Chief Architect on Bluestone’s award-winning Total-e-Mobile 1.0 and HP Application Server, concentrating on the Servlet 2.3 container. Currently he is working as a Senior Architect in HP Middleware’s Rich Media group. JJ holds a Bachelor of Science in Computer Science from the Pennsylvania State University.
In addition to Project.net’s Roger Bly, the authors would also like to thank the following HP Middleware engineers for their contributions: Jessica Sant, Jason Kinner, Erik Bergenholtz, and Ryan Moquin.
Project.net Builds Winning International Solution
What began as a kind of super project-management system for a large defense contractor has grown into an online business system for any organization needing enterprise-wide project-management and collaboration tools. Built on the solid foundation provided by HP’s application server (HP-AS), Project.net enables executives, project managers, even project-team members to see the big picture and collaborate more effectively than ever before. One client in Japan is using Project.net as the backbone of a collaborative sales-automation system, using i-Mode phones to communicate, check inventory, and book orders.
Project.net provides software for project-based companies. Project.net’s products enable project management, team collaboration, project portals, project extranets, project offices, portfolio management, and executive dashboards. Created in 1999, the privately held company is based in San Diego, CA.
A Japanese customer asked Project.net to help it develop a collaborative sales-force automation solution based on the Project.net J2EE collaboration engine. The client wanted its sales representatives to be able to access the system through i-Mode wireless phones and to be able to view information in the system in Japanese character sets.
Using HP-AS features, XSL style sheets, and Tag libraries, Project.net was able to develop just the solution that its customer required. Project.net successfully created its solution even using i-Mode wireless phones, which present particular technical challenges because, unlike WAP phones, they do not present unique session identifiers.
The flexibility of HP-AS made it possible for Project.net developers to create a new session-management server that could work with custom tags and custom tag libraries to manage individual i-Mode phone interactions. And because HP-AS pushes XML messages out to its client devices, the use of XSL style sheets enabled Project.net’s client to present the information to end users using i-Mode phones. Moreover, the use of TagLibs in the presenting JSPs enabled Project.net to add Japanese language support, which in turn enabled Project.net to deliver a breakthrough solution that was precisely what its client needed.
Internationalizing the Solution
Customizing Project.net’s solution to accommodate the Japanese language requirement was also a challenge Project.net solved through the use of TagLibs and through close work with HP. “We created a token lookup that manages the translation of any piece of text that may appear in a different language,” says Roger Bly, CEO of Project.net. “We don’t put the words themselves on the Java server page (JSP), just the tag for the lookup. So when the JSP displays the page, it contains the user’s selected language and character set — in this case Japanese.”
To ensure a successful implementation, Project.net worked closely with HP to help internationalize HP-AS’s predecessor, Total-E-Server version 7.3. HP-AS 8.0 benefited from this collaboration when it debuted with a robust, configurable servlet container that is fully compliant with the I18n requirements of JSP 1.2 and Servlet 2.3.
“When we initially decided to bring our product to market,” recalls Bly, “we knew we wanted to use the best technology we could get. Hewlett-Packard was the technical leader. The way HP’s Application Server handles XML messages is perfectly consistent with our strategy for handling them. It was a good match at the beginning and remains a good match today.”