Internationalization Requirements for Servlet Containers, Page 3
When the character encoding of the form data is not specified, the application server has already converted the underlying byte stream to String values, assuming ISO 8859-1, by the time the data is accessed through the HttpServletRequest.getParameter() method -- and ISO 8859-1 is quite likely the wrong encoding. Prior to the Servlet 2.3 specification, the developer's only recourse was to process the raw POST request directly. Attempts to convert the data from ISO 8859-1 to another encoding after the fact may not have worked.
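As a rough illustration of the problem, the following sketch simulates a container that decodes request bytes as ISO 8859-1 and then attempts the after-the-fact conversion. The class and method names are hypothetical, and the sample text (two Japanese characters) is an assumption chosen for the example. The recovery step works here only because the simulated container maps every byte one-to-one; a container that used a different default, or that replaced unmappable bytes, would lose information.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1Misread {
    // Simulates a container that assumes ISO 8859-1 for the request bytes.
    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Turns the misread String back into bytes and decodes them again
    // with the encoding the browser actually used.
    static String redecode(String misread, Charset actual) {
        return new String(misread.getBytes(StandardCharsets.ISO_8859_1), actual);
    }

    public static void main(String[] args) {
        String original = "\u65E5\u672C"; // two characters, three UTF-8 bytes each
        String misread = decodeAsLatin1(original.getBytes(StandardCharsets.UTF_8));
        System.out.println(misread.length()); // prints 6
        System.out.println(redecode(misread, StandardCharsets.UTF_8).equals(original)); // prints true
    }
}
```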
The Servlet 2.3 specification provides an API specifically addressing this shortcoming in browsers. The servlet/JSP developer must call the ServletRequest.setCharacterEncoding() method before reading request parameters or reading input using the getReader() method (i.e., before calling any API that processes the HTTP message) in order for the data to be read in the proper character encoding.
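The effect of naming the encoding before the parameters are parsed can be sketched without a servlet container. The helper below is hypothetical and is not the container's actual implementation; it only shows that the same form body yields different String values depending on the encoding supplied. It uses the standard java.net.URLDecoder class, keeps only the last value of a repeated parameter, and the parameter names and sample body are assumptions.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

class FormBodyDecoder {
    // Splits an application/x-www-form-urlencoded body into name/value
    // pairs and decodes each one with the encoding named by the caller.
    static Map<String, String> parse(String body, String encoding) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : body.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            try {
                params.put(URLDecoder.decode(name, encoding),
                           URLDecoder.decode(value, encoding));
            } catch (UnsupportedEncodingException e) {
                throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
            }
        }
        return params;
    }

    public static void main(String[] args) {
        String body = "city=%E6%97%A5%E6%9C%AC";
        System.out.println(parse(body, "UTF-8").get("city").length());      // prints 2
        System.out.println(parse(body, "ISO-8859-1").get("city").length()); // prints 6
    }
}
```

In a real servlet, the equivalent step is a single call to request.setCharacterEncoding() with the appropriate encoding name, made before the first call to getParameter().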
Existing Web-based applications (applications that were written before the Servlet 2.3 specification) present a challenge to servlet container implementers. In order to accommodate existing web applications that do not make use of the setCharacterEncoding() method, servlet containers should provide the ability to specify a default character encoding for incoming POST data. Robust J2EE application servers such as the HP Bluestone Total-E-Server and the HP Internet Server/HP Application Server provide configurable options to address this issue on a per-application basis.
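One way a container might choose the encoding for incoming POST data is sketched below. The order of precedence (a charset named in the request's Content-Type header, then a per-application default, then ISO 8859-1) is an assumption made for illustration, and the class and method names are hypothetical; the article does not describe how any particular product implements this option.

```java
import java.util.Locale;

class RequestEncodingResolver {
    static final String SPEC_DEFAULT = "ISO-8859-1";

    // Returns the charset parameter of a Content-Type header value,
    // or null when the header is absent or names no charset.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // Header charset first, then the configured application default,
    // then the ISO 8859-1 assumption described in the text.
    static String resolve(String contentType, String appDefault) {
        String fromHeader = charsetOf(contentType);
        if (fromHeader != null) {
            return fromHeader;
        }
        if (appDefault != null && !appDefault.isEmpty()) {
            return appDefault;
        }
        return SPEC_DEFAULT;
    }
}
```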
UTF-8 encoded URLs (HTTP GET Requests)
Over time, the use of URLs has evolved from a simple scheme to describe a path to a resource to a generic mechanism for passing request data to a service via the HTTP protocol. In order to increase the usefulness of the URL mechanism, some Web-server vendors now provide support for UTF-8 encoded URLs. UTF-8 URLs are supported by Microsoft IIS and Apache.
As an example of the use of UTF-8 encoded URLs, consider the following URL:
Here a UTF-8 encoded sequence of characters is being sent to an application using the %HH "URL encoding" format. Upon decoding the URL, the Web server would recognize the characters as outside of the US-ASCII range and treat them as UTF-8 encoded Unicode characters. In this case, there are two characters, each encoded with three bytes.
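The decoding step can be sketched as follows. The escaped value used here is a hypothetical stand-in with the same shape as the case described above (two characters, three bytes each); it is not the URL from the original article. The class name is also an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class PercentDecoder {
    // Converts %HH escapes to raw bytes. Unescaped characters must be
    // US-ASCII and are copied through unchanged.
    static byte[] toBytes(String escaped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < escaped.length(); i++) {
            char c = escaped.charAt(i);
            if (c == '%') {
                if (i + 2 >= escaped.length()) {
                    throw new IllegalArgumentException("Truncated escape at index " + i);
                }
                int hi = Character.digit(escaped.charAt(i + 1), 16);
                int lo = Character.digit(escaped.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("Bad escape at index " + i);
                }
                out.write((hi << 4) | lo);
                i += 2;
            } else if (c > 0x7F) {
                throw new IllegalArgumentException("Unescaped non-ASCII character at index " + i);
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = toBytes("%E6%97%A5%E6%9C%AC");
        System.out.println(raw.length); // prints 6
        System.out.println(new String(raw, StandardCharsets.UTF_8).length()); // prints 2
    }
}
```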
One problem with earlier versions of the Unicode UTF-8 specification was that they did not explicitly disallow multiple encodings for the same Unicode character. Because UTF-8 is a multi-byte encoding, a given Unicode character like "\" (U+005C REVERSE SOLIDUS, the backslash) could be encoded several ways -- using the correct single-byte encoding, a double-byte encoding or a triple-byte encoding:
The double-byte and triple-byte encodings simply pad the character's value with additional leading zero bits, which a lenient decoder ignores.
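The arithmetic behind this can be sketched with a deliberately lenient decoder that strips the UTF-8 prefix bits and concatenates the payload bits without checking for the shortest form. The byte values used below (5C, C1 9C, and E0 81 9C) are derived from the UTF-8 bit layout for this example and are not taken from the article's original listing; the class name is hypothetical.

```java
class OverlongDemo {
    // Lenient decoding of ONE UTF-8 sequence of one to three bytes.
    // It never checks that the shortest possible form was used, which
    // is exactly the weakness described in the text.
    static int lenientDecode(int... b) {
        switch (b.length) {
            case 1:
                return b[0] & 0x7F;
            case 2:
                return ((b[0] & 0x1F) << 6) | (b[1] & 0x3F);
            case 3:
                return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F);
            default:
                throw new IllegalArgumentException("Expected one to three bytes");
        }
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(lenientDecode(0x5C)));             // prints 5c
        System.out.println(Integer.toHexString(lenientDecode(0xC1, 0x9C)));       // prints 5c
        System.out.println(Integer.toHexString(lenientDecode(0xE0, 0x81, 0x9C))); // prints 5c
    }
}
```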
The ability to encode Unicode characters several ways with UTF-8 proved to be a security vulnerability. Normally, Web servers check for "malicious" URLs, which, for example, might be requests for documents outside of the server's document root, such as in the following URL:
This type of URL may pass the Web server's Intrusion Detection System (IDS) checks but, upon UTF-8 decoding, would allow access to "..\../path/somefile.exe" -- which is outside of the document root.
Web server and commercial IDS vendors have had to bolster their IDSs against these types of attacks. The Unicode Consortium also contributed to the effort by tightening the UTF-8 specification to clearly define illegal UTF-8 sequences.
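A strict decoder closes this hole by rejecting any sequence that is not in shortest form. The sketch below uses the standard java.nio.charset.CharsetDecoder configured to report malformed input rather than replace it; the class and method names are hypothetical, and the overlong byte sequences in the usage example are the derived values C1 9C and E0 81 9C, not sequences quoted from the article.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {
    // Decodes the bytes as UTF-8 and returns null when the input is
    // malformed, which includes overlong (non-shortest-form) sequences.
    static String decodeOrNull(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeOrNull(new byte[] {0x5C}));                             // prints \
        System.out.println(decodeOrNull(new byte[] {(byte) 0xC1, (byte) 0x9C}));         // prints null
        System.out.println(decodeOrNull(new byte[] {(byte) 0xE0, (byte) 0x81, (byte) 0x9C})); // prints null
    }
}
```

Path checks such as the document-root test are safest when applied to the decoded string, after malformed input has already been rejected.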
RFC 2718 (Guidelines for new URL Schemes) suggests the following in Section 2.2.5:
"When describing URL schemes in which (some of) the elements of the URL are actually representations of sequences of characters, care should be taken not to introduce unnecessary variety in the ways in which characters are encoded into octets and then into URL characters. Unless there is some compelling reason for a particular scheme to do otherwise, translating character sequences into UTF-8 (RFC 2279) and then subsequently using the %HH encoding for unsafe octets is recommended."
Because UTF-8 is the recommended encoding scheme for URLs, this feature (along with the requisite IDS implementation) will have to be provided by all commercial Web servers.
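The two-step translation that the RFC recommends (characters to UTF-8 octets, then %HH for unsafe octets) can be sketched as below. Which octets count as unsafe depends on the URL scheme; this example assumes that only ASCII letters, digits, and the characters "-", ".", "_" and "~" are left unescaped. The class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

class Utf8PercentEncoder {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Encodes the text as UTF-8, then writes every octet outside the
    // assumed safe set as a %HH escape.
    static String encode(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            boolean safe = (v >= 'A' && v <= 'Z') || (v >= 'a' && v <= 'z')
                    || (v >= '0' && v <= '9')
                    || v == '-' || v == '.' || v == '_' || v == '~';
            if (safe) {
                sb.append((char) v);
            } else {
                sb.append('%').append(HEX[v >> 4]).append(HEX[v & 0x0F]);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("\u65E5\u672C")); // prints %E6%97%A5%E6%9C%AC
    }
}
```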
The specification of the ServletResponse.setLocale() method is unclear on how the "charset" attribute of the HTTP Content-Type header is supposed to be set. According to the Servlet 2.3 specification, when the setLocale() method is called, the Content-Type's charset must be updated "appropriately." However, the specification neither recommends nor mandates a mapping from Locales to particular character encoding schemes, and the Locale class does not provide methods for getting and setting character encodings.
The servlet container should provide a mechanism that allows the application server system administrator to define a Locale-to-character encoding mapping. A properties file or an XML file can be used to define the country code and language code that map to a particular character encoding. The country code should be an uppercase ISO 3166 2-letter code and the language code should be a lowercase ISO 639 code. The character encoding should be an IANA encoding name. When the servlet executes the setLocale() method, the servlet container will consult the mapping and set the HTTP Content-Type header appropriately.
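A properties-based mapping of this kind might look like the following sketch. The key format (language_COUNTRY, with a language-only key as a broader match), the fallback behavior, and the sample entries are all assumptions made for illustration; the class is hypothetical and does not reflect the configuration format of any particular container.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Locale;
import java.util.Properties;

class LocaleCharsetMap {
    private final Properties map = new Properties();
    private final String fallback;

    // propertiesText holds lines such as "ja_JP=Shift_JIS"; in a real
    // container this would be read from an administrator-edited file.
    LocaleCharsetMap(String propertiesText, String fallback) {
        try {
            map.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalArgumentException(e);
        }
        this.fallback = fallback;
    }

    // Looks up language_COUNTRY first, then the language alone.
    public String charsetFor(Locale locale) {
        String language = locale.getLanguage();
        String country = locale.getCountry();
        String charset = country.isEmpty() ? null : map.getProperty(language + "_" + country);
        if (charset == null) {
            charset = map.getProperty(language);
        }
        return charset != null ? charset : fallback;
    }

    // Builds the Content-Type header value that setLocale() would imply.
    public String contentType(String mimeType, Locale locale) {
        return mimeType + "; charset=" + charsetFor(locale);
    }

    public static void main(String[] args) {
        LocaleCharsetMap m = new LocaleCharsetMap("ja_JP=Shift_JIS\nja=EUC-JP\n", "ISO-8859-1");
        System.out.println(m.contentType("text/html", Locale.JAPAN)); // prints text/html; charset=Shift_JIS
    }
}
```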
This article has presented the internationalization issues faced by J2EE Web application developers and servlet container implementers. JSP/servlet container implementations must be able to handle arbitrary character encodings in both the processing of JSP files and the creation of an appropriate output response stream, as well as in the handling of form data and encoded URLs. Robust, full-featured JSP/servlet containers, such as those provided in the HP Application Server (HP-AS), provide these internationalization features to enable the development of internationalized Web applications. Web application developers can create and save JSP files in their native encodings and multi-lingual applications can be developed for international electronic commerce sites that tailor their content to the locale of the client.
Developers dealing with internationalization challenges will find additional resources from Web sites or developer portals that some vendors support. For example, the developer resource guide at the HP developers site is one such portal, providing various reports and subject-specific presentations by HP developers that can facilitate development projects. Also, HP Middleware contains more application-server information.
- The list of Java's supported encodings and canonical names: Java's encoding site
- A description of how Java handles encodings and their aliases: Java's packaging site
- The Unicode Consortium Web site
- The UTF-8 Corrigendum to Unicode 3.0.1
- Hacker, Eric, "IDS Evasion with Unicode"
- The following specifications contain specific requirements concerning character encoding:
  - RFC 2616: Hypertext Transfer Protocol -- HTTP/1.1, sections 3.4.1, 3.7.1, 14.2, 14.17
  - RFC 822, section 3.1.2
  - JSP 1.2, sections 3.1, 3.2, 3.3
  - Servlet 2.3, sections 4.9, 5.4 and the following APIs:
About the Authors
Scott W. Ruch is an Architect in the HP OpenView Research and Development organization. During his tenure as I18n Architect for the HP Middleware Division, Scott's work included the development of I18n trail maps and best practices for HP software engineers. Scott has over ten years of software design and development experience in operating systems and telecommunications network management systems. Scott holds a Bachelor of Science in Electrical Engineering from Drexel University and a Master of Science in Software Engineering from Monmouth University.
JJ Snyder is a Senior Architect in the HP Middleware Division. He has been programming in C/C++ and Java for over 15 years. He was the Chief Architect on Bluestone's award-winning Total-e-Mobile 1.0 and on the HP Application Server, concentrating on the Servlet 2.3 container. Currently he is working as a Senior Architect in HP Middleware's Rich Media group. JJ holds a Bachelor of Science in Computer Science from the Pennsylvania State University.
In addition to Project.net's Roger Bly, the authors would also like to thank the following HP Middleware engineers for their contributions: Jessica Sant, Jason Kinner, Erik Bergenholtz, and Ryan Moquin.
See page 4 of this article for information on Project.net Builds Winning International Solution.