Internationalization Requirements for Servlet Containers
Application servers must be able to process international text. That is, they must be able to process non-English content that is encoded using standards other than ISO 8859-1 or US-ASCII. The HTTP 1.1, Servlet 2.3, and JSP 1.2 specifications each have specific requirements that address character encoding issues such that non-ISO 8859-1 encoded characters will be handled properly. This article looks at these requirements and discusses how servlet and JSP containers, referred to as servlet containers, process arbitrarily encoded JSPs, and how servlets produce arbitrarily encoded content.
Adding to the complexity for application developers, Web content (HTML) can be stored and delivered in diverse "character encoding" schemes beyond the familiar US-ASCII encoding. Fortunately, major Web browsers can detect these encodings and display the content properly. The Unicode Universal Character Set (UCS) and its associated character encodings expand the versatility of dynamically generated content by enabling the creation of multi-lingual applications.
Character encoding is a method or algorithm for representing characters in digital form by mapping sequences of character code points into sequences of bytes. Character encoding schemes result in varying storage requirements, depending on the character sets that they encode. For example, in the Unicode UTF-16 encoding, a character in the 16-bit Basic Multilingual Plane (BMP) occupies two bytes. In the ISO 8859-1 encoding (often referred to as "Latin-1", the default character encoding in many Web-related specifications), an encoded character occupies one byte. In the Unicode UTF-8 multi-byte encoding, however, a BMP character can occupy one, two, or three bytes.
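The storage differences described above can be observed directly in Java. The following is a minimal sketch, not part of any specification; the class name EncodedSizes and the helper byteLength are hypothetical names chosen for illustration. It uses UTF-16BE rather than plain UTF-16 so that no byte-order mark is added to the count.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: EncodedSizes and byteLength are hypothetical names.
class EncodedSizes {

    // Number of bytes needed to store the text in the given encoding.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String ascii = "A";        // U+0041, in US-ASCII
        String latin = "\u00E9";   // e with acute accent, in ISO 8859-1
        String hangul = "\uD55C";  // a Hangul syllable, in the BMP but not in ISO 8859-1

        // ISO 8859-1: one byte per character.
        System.out.println(byteLength(latin, StandardCharsets.ISO_8859_1)); // 1

        // UTF-16 (big-endian, no byte-order mark): two bytes per BMP character.
        System.out.println(byteLength(ascii, StandardCharsets.UTF_16BE));   // 2
        System.out.println(byteLength(hangul, StandardCharsets.UTF_16BE));  // 2

        // UTF-8: one, two, or three bytes per BMP character.
        System.out.println(byteLength(ascii, StandardCharsets.UTF_8));      // 1
        System.out.println(byteLength(latin, StandardCharsets.UTF_8));      // 2
        System.out.println(byteLength(hangul, StandardCharsets.UTF_8));     // 3
    }
}
```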
Unfortunately, the term "character set" and the common contraction "charset" are often overused, overloaded, and ambiguous. The most common meaning, typically used in Request for Comments (RFC) specifications, is "character encoding." In the context of JavaServer Pages (JSP) and servlet development, "charset" means "character encoding."
To be compliant with the JSP 1.2, Servlet 2.3, and J2EE 1.3 specifications, servlet containers must be able to handle the different character encoding schemes that are used to represent data. For example, a JSP that contains or generates Korean-encoded text must be processed such that the content it produces is also encoded properly, so that the browser can display it correctly.
There are two distinct areas in which character encoding must be handled properly: sending data to the client and receiving data from the client. This article discusses the internationalization (I18N) requirements of servlet-container implementations in these areas.
Sending Data to the Client
Application developers create applications using a mixture of servlets and JSPs. Handling character encodings in servlets is straightforward. The servlet specification provides two APIs to set the character encoding of the data that is being returned to the client: ServletResponse.setLocale() and ServletResponse.setContentType(). Using these methods causes the HTTP Content-Type header of the response message to be set with the appropriate charset attribute, thus allowing the browser to display the content correctly.
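To make the effect of the content type concrete without requiring a servlet container, the following sketch models what happens when a charset is supplied. ResponseSketch is a hypothetical stand-in written for this illustration; it is not the servlet API and only mimics the behavior described above: the Content-Type header carries the charset, and the body is encoded with that same charset. It assumes an unquoted charset value and falls back to ISO 8859-1 when none is given.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a container's response object; NOT the servlet API.
class ResponseSketch {
    private String contentType = "text/html";
    private Charset charset = StandardCharsets.ISO_8859_1; // assumed default
    private final ByteArrayOutputStream body = new ByteArrayOutputStream();

    // Records the content type and picks up its charset parameter, if present.
    // Assumes the charset value is not quoted.
    void setContentType(String type) {
        contentType = type;
        for (String part : type.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                charset = Charset.forName(p.substring(8).trim());
            }
        }
    }

    // Encodes the text with the response charset and appends it to the body.
    void write(String text) {
        byte[] encoded = text.getBytes(charset);
        body.write(encoded, 0, encoded.length);
    }

    // The header line a browser would use to decode the body.
    String header() {
        return "Content-Type: " + contentType;
    }

    byte[] bytes() {
        return body.toByteArray();
    }

    public static void main(String[] args) {
        ResponseSketch response = new ResponseSketch();
        response.setContentType("text/html; charset=UTF-8");
        response.write("<p>\u00E9</p>");
        System.out.println(response.header());
        System.out.println(response.bytes().length + " bytes in body");
    }
}
```

In a real servlet, the equivalent step is a single call to ServletResponse.setContentType() before the response writer is obtained; the container then takes care of the header and the encoding.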
JSPs are much trickier. A JSP is transformed into a servlet and then executed to produce content. The steps for processing a JSP are:
- Translate JSP to XML
- Translate XML to Java
- Compile Java to Class
- Execute Class to produce content (HTML, for example)
Prior to processing the JSP, however, the character encoding of the JSP itself must be determined; in other words, the container needs to find out the character encoding the author used when the file was created. Prior to the JSP 1.2 specification, the contentType attribute of the <%@ page ... %> directive was used to indicate both the encoding of the page and the encoding of the output response. This overloaded use of contentType is resolved in the JSP 1.2 specification with the introduction of the new "pageEncoding" attribute, which allows the developer to indicate the encoding of the JSP file separately from the contentType.
Sections 3.1 and 3.2 of the JSP 1.2 specification deal specifically with the character encoding of the JSP file. One algorithm for determining a JSP's character encoding is as follows:
- Assume that the encoding is ISO 8859-1 until either the pageEncoding or the contentType attribute of the JSP page directive appears.
- The pageEncoding attribute defines the encoding of the page. If it is not present, the behavior defaults to pre-JSP 1.2: The contentType attribute's "charset" defines the encoding of the page. If neither is present, the encoding is ISO 8859-1.
- The contentType attribute indicates the encoding of the entire response. If it is not present, the response encoding is ISO 8859-1.
The following three examples show JSP character encoding statements.
Example 1: The page will be read as UTF-8 and the response will be UTF-8 encoded.
<%@ page language="java" import="com.myco.*" pageEncoding="utf-8" contentType="text/html;charset=utf-8" %>
Example 2: The page will be read as ISO 8859-1 and the response will be ISO 8859-1 encoded.
<%@ page language="java" import="com.myco.*" contentType="text/html" %>
Example 3: The page will be read as EUC-KR and the response will be EUC-KR encoded.
<%@ page language="java" import="com.myco.*" contentType="text/html;charset=EUC-KR" %>
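The rules listed earlier, applied to the three example directives, can be sketched in a few lines of Java. This is an illustration of the algorithm as described in this article, not the code of any particular container; the class JspEncodingRules and its methods are hypothetical, and the attribute values are passed in directly rather than parsed from a JSP file. A null argument stands for an attribute that is absent from the page directive.

```java
import java.util.Locale;

// Illustrative only: JspEncodingRules and its methods are hypothetical names.
class JspEncodingRules {
    static final String DEFAULT_ENCODING = "ISO-8859-1";

    // Returns the charset parameter of a contentType value, or null if absent.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                return p.substring("charset=".length()).trim();
            }
        }
        return null;
    }

    // Encoding used to read the JSP file: pageEncoding if present,
    // otherwise the contentType charset, otherwise ISO 8859-1.
    static String pageEncoding(String pageEncodingAttr, String contentTypeAttr) {
        if (pageEncodingAttr != null) {
            return pageEncodingAttr;
        }
        String charset = charsetOf(contentTypeAttr);
        return charset != null ? charset : DEFAULT_ENCODING;
    }

    // Encoding of the response: the contentType charset, otherwise ISO 8859-1.
    static String responseEncoding(String contentTypeAttr) {
        String charset = charsetOf(contentTypeAttr);
        return charset != null ? charset : DEFAULT_ENCODING;
    }

    public static void main(String[] args) {
        // Example 1: pageEncoding and contentType both specify utf-8.
        System.out.println(pageEncoding("utf-8", "text/html;charset=utf-8")); // utf-8
        // Example 2: no pageEncoding, no charset in contentType.
        System.out.println(pageEncoding(null, "text/html"));                  // ISO-8859-1
        // Example 3: no pageEncoding, so the contentType charset is used.
        System.out.println(pageEncoding(null, "text/html;charset=EUC-KR"));   // EUC-KR
        System.out.println(responseEncoding("text/html;charset=EUC-KR"));     // EUC-KR
    }
}
```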
