Internationalization Requirements for Servlet Containers, Page 2
Processing a JSP
Step 1 -- Translate the JSP to XML
The JSP is initially read in via a stream using the Java platform's default character encoding. When the JSP page directive is encountered, determine the page character encoding according to the algorithm stated above. If a character encoding other than ISO 8859-1 has been specified, then re-read the JSP page using that character encoding.
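- FileReader fr = new FileReader( "foo.jsp" ); // first pass: platform default encoding
- FileInputStream fis = new FileInputStream( "foo.jsp" ); // second pass: re-open the raw bytes
- String encoding = "ISO-8859-1"; // replace with the encoding named by the page directive
- InputStreamReader ir = new InputStreamReader( fis, encoding );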
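The two-pass read described above can be sketched as follows. This is a minimal illustration, not a container implementation: the class name JspPageDecoder and its methods are hypothetical, the first pass uses ISO-8859-1 instead of the platform default (so that every byte maps to exactly one character), and only the pageEncoding attribute is examined. The charset fallback via contentType is omitted, and the page directive itself is assumed to be written in ASCII-compatible bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any JSP container API.
class JspPageDecoder {
    // Matches pageEncoding="..." or pageEncoding='...' in the raw page text.
    private static final Pattern PAGE_ENCODING =
        Pattern.compile("pageEncoding\\s*=\\s*[\"']([^\"']+)[\"']");

    /** Returns the declared page encoding, or ISO-8859-1 if none is declared. */
    static Charset detectEncoding(byte[] rawPage) {
        // Assumption: ISO-8859-1 is used for the first pass because it maps
        // every byte to one char and leaves ASCII directive text intact.
        String firstPass = new String(rawPage, StandardCharsets.ISO_8859_1);
        Matcher m = PAGE_ENCODING.matcher(firstPass);
        return m.find() ? Charset.forName(m.group(1)) : StandardCharsets.ISO_8859_1;
    }

    /** Decodes the page a second time using the detected encoding. */
    static String decode(byte[] rawPage) {
        return new String(rawPage, detectEncoding(rawPage));
    }
}
```

Working from the raw bytes rather than from a Reader means the second pass never has to undo a wrong decoding from the first pass.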
Included JSPs present an interesting problem, especially if there is more than one occurrence of the pageEncoding attribute due to multiple occurrences of the page directive from included JSPs. If pageEncoding were a per-page attribute, not a per-translation unit attribute, then you could conceivably have included pages in differing encodings. However, according to the JSP 1.2 specification, section 2.10.1:
"The page directive defines a number of page-dependent properties and communicates these to the JSP container. A translation unit (JSP source file and any files included via the include directive) can contain more than one instance of the page directive, all the attributes will apply to the complete translation unit (i.e., page directives are position independent). However, there shall be only one occurrence of any attribute/value defined by this directive in a given translation unit with the exception of the "import" attribute; multiple uses of this attribute are cumulative (with ordered set union semantics). Other such multiple attribute/value (re)definitions result in a fatal translation error."
Thus, the JSP 1.2 specification does not permit multiple pageEncoding values in a translation unit. A fatal translation error must be raised if an included JSP defines any page directive attribute, other than "import", that is already defined elsewhere in the translation unit.
Once the XML is generated from the JSP, you need to decide which character encoding to use when saving the generated XML to the file system. Reading the JSP file from its native encoding into Java's internal String representation already converted the content to Unicode implicitly. UTF-8 may therefore be the best choice (<?xml version="1.0" encoding="UTF-8"?>), because it eliminates the step of converting back to the native encoding. Since most transcoding operations involve table lookups, they will generally be slower than conversion to UTF-8, which is algorithmic.
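As a rough sketch of saving the intermediate XML as UTF-8, the helper below encodes the XML declaration and body through an OutputStreamWriter. The class name XmlViewWriter and the method toUtf8Bytes are hypothetical, and a real container would write to a file stream, not to an in-memory buffer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of any container API.
class XmlViewWriter {
    /** Encodes the XML declaration plus the given body as UTF-8 bytes. */
    static byte[] toUtf8Bytes(String xmlBody) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // The writer converts Java's internal UTF-16 chars to UTF-8 directly.
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            w.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            w.write(xmlBody);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```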
Step 2 -- Translate the XML to Java
When the Java source file is produced from the intermediate XML, the issue of the encoding of the output file comes into play again. With Java source files, however, the situation is complicated by the varying capabilities of Java compilers, as well as the "default" encoding of the platform on which the compiler is executed. Not all Java compilers can handle Java source files that have been created using encodings other than ISO 8859-1 or the native encoding of the platform (e.g. EBCDIC).
To mitigate the risk that a less capable compiler may be used, all characters in the Java source file that fall outside the range of readable US-ASCII characters (less than 0x0020 or greater than 0x007E) can be converted to their Java "Unicode escape sequence" equivalent (i.e. "\uXXXX"). This applies to static content (HTML) as well as Java code, so that string literals, character literals, and Java code itself that contains non-US-ASCII characters are converted properly.
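The conversion to Unicode escape sequences might look like the sketch below. The class name UnicodeEscaper is hypothetical. As an assumption beyond the text, tab, carriage return, and line feed are passed through unchanged so that the generated file keeps its line structure; the Java compiler translates Unicode escapes before it tokenizes the source, so an escaped line feed inside a string literal would still end the line and break compilation.

```java
// Hypothetical helper for illustration; not part of any container API.
class UnicodeEscaper {
    /** Replaces characters outside readable US-ASCII with Unicode escapes. */
    static String escape(String javaSource) {
        StringBuilder sb = new StringBuilder(javaSource.length());
        for (int i = 0; i < javaSource.length(); i++) {
            char c = javaSource.charAt(i);
            // Assumption: source-level whitespace is kept as-is.
            boolean whitespace = (c == '\t' || c == '\r' || c == '\n');
            if (!whitespace && (c < 0x0020 || c > 0x007E)) {
                // Four uppercase hex digits, zero padded.
                sb.append(String.format("\\u%04X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Characters outside the Basic Multilingual Plane are stored as two chars in a Java String, so this loop emits two escapes for them, which is the form the Java compiler expects.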
Special characters such as quotes (single and double), tabs, and new lines must be considered when generating the Java code. When we process static content and create Java code for it, we are placing the content in a Java "out.print()" statement. When we process scriptlet code, we are creating (echoing) Java statements. The quotes and new lines must be handled differently in each case. If we are handling a Java scriptlet, we must echo the character; but if we are handling static content, we must "backslash" escape the character. Here is an example.
Start with the following JSP:
<%@page %><HTML>
 <BODY>"Hello"
<% String a = "\"hello\"";
 out.println( a );
%>
</BODY>
</HTML>
When we convert this to XML, we get something like:
<?xml version="1.0"?>
<jsp:root>
  <jsp:directive.page ... />
  <jsp:cdata><![CDATA[<HTML>\r\n <BODY>"Hello"\r\n]]></jsp:cdata>
  <jsp:scriptlet>String a="\"hello\"";\r\n out.println( a );\r\n</jsp:scriptlet>
  <jsp:cdata><![CDATA[\r\n</BODY>\r\n</HTML>]]></jsp:cdata>
</jsp:root>
When we produce the Java code, we should get something like:
out.print( "<HTML>\r\n <BODY>\"Hello\"\r\n" );
String a="\"hello\"";
 out.println( a );
out.print( "\r\n</BODY>\r\n</HTML>" );
Note that even though it looks like we escaped the quotes in the statement that assigned "hello" to the variable a (String a="\"hello\"";), we actually only echoed out the original scriptlet. Also note that we backslash escaped the quotes around the "Hello" in the HTML content in the out.print statement.
This is similar in concept to double indirection in C and C++. The HTML content is translated into Java code that produces HTML content as its output, whereas the Java scriptlet code is only one level of indirection as it translates directly to Java code. The set of characters that must be backslash escaped contains:
- ' (single quote)
- " (double quote)
- \ (back slash)
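A minimal sketch of the static-content case follows. The class name LiteralEscaper and the method toPrintStatement are hypothetical. In addition to the three characters listed above, the sketch assumes that carriage returns, line feeds, and tabs are written as \r, \n, and \t, as in the generated code shown earlier. Scriptlet code would bypass this helper and be echoed unchanged.

```java
// Hypothetical helper for illustration; not part of any container API.
class LiteralEscaper {
    /** Wraps static page content in an out.print() statement. */
    static String toPrintStatement(String staticContent) {
        StringBuilder sb = new StringBuilder("out.print( \"");
        for (char c : staticContent.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break; // backslash
                case '"':  sb.append("\\\""); break; // double quote
                case '\'': sb.append("\\'");  break; // single quote
                case '\r': sb.append("\\r");  break; // carriage return
                case '\n': sb.append("\\n");  break; // line feed
                case '\t': sb.append("\\t");  break; // tab
                default:   sb.append(c);
            }
        }
        return sb.append("\" );").toString();
    }
}
```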
Step 3 -- Compile Java to Class
As we mentioned previously, there are two ways to compile the generated Java code to a class file. One way is to use a compiler that supports a range of encodings for Java source files and specify the encoding in which the source file was generated. Because not all compilers support this feature, a better way is to use the platform's default encoding and have all the strings in the source rendered as Java Unicode escape sequences. This is the more portable way, as it should work on all platforms regardless of the character encoding that was used in the original JSP.
Step 4 -- Execute Class to Produce HTML
If the translation and compilation processes described in the previous sections are implemented correctly, the HTML output from the executing servlet will be encoded properly. The HTTP Content-Type header will be set to the encoding that was specified in the "contentType" attribute of the page directive (recall that the "pageEncoding" attribute has no effect on the HTTP Content-Type header). Of course, the HTTP Content-Type header can be overridden by using standard servlet APIs.
Receiving Data from the Client
HTTP POST Requests
One of the "dark corners" of JSP/servlet programming is the handling of form data from HTTP POST requests. Many Web-based applications are form-driven and use POST operations to submit the form data. When the user submits the form data, the browser packages the form data for a POST message and sends it to the server. Form data values are URL (%HH) encoded ("application/x-www-form-urlencoded") for transport, but typically no information (i.e., no "charset" specification in the HTTP Content-Type header) is sent describing the "underlying" encoding of the form data as entered by the user. Most browsers take a hint from the server and will use the same encoding for form POST data as the encoding in which the form page was delivered.
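To illustrate why the missing charset matters, the sketch below decodes the same %HH-encoded value under two different encodings using the standard java.net.URLDecoder class. The class name FormValueDecoder is hypothetical, and the sketch assumes the server already knows which encoding the form page was delivered in.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper for illustration; not part of the servlet API.
class FormValueDecoder {
    /** Decodes one URL-encoded form value using the encoding of the form page. */
    static String decode(String rawValue, String pageEncoding) {
        try {
            return URLDecoder.decode(rawValue, pageEncoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + pageEncoding, e);
        }
    }
}
```

For example, a browser that posts the word "café" from a UTF-8 page sends caf%C3%A9. Decoding that value as UTF-8 restores the original four characters, while decoding it as ISO-8859-1 yields five characters, with the final letter replaced by two unrelated ones.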