PDF and Java

"Java and PDF provide a nice solution for many types of applications. "

HTML continues to be the leading format for creating Web content, for reasons that include its relative simplicity. For many types of content, HTML offers a sufficient set of tags for an effective presentation. There are, however, document types that are too rich for HTML. Documents in which the precise positioning of text and non-text elements matters are usually poor candidates for HTML. For example, it would be rather difficult to create a typical IRS tax form in HTML. The Portable Document Format (PDF) is often used for creating and displaying such rich content. The Acrobat Reader plug-in from Adobe allows browsers to display PDF files.

Java servlets are an effective mechanism for creating Web applications. These applications often need to manipulate HTML documents before serving them to the browser. Such manipulation is common for servlets, CGI programs, and other server-side technologies, and it often involves extracting data by using HTML tags as delimiters. Text-processing tools (e.g., AWK, scripting languages, and regular expressions) can complement the capabilities of servlets and CGI programs. But what about PDF? This article is an overview of using Java to interact with PDF files.

"Despite additions and advancements of HTML, PDF continues to be the most popular mean for sharing rich documents. "

What is PDF?

PDF uses objects to describe a page. Everything you see (and some things that you don’t see) in a PDF page is an object. The objects making up a document are expressed in a sequential manner. At the end, there is a cross-reference table that lists the byte offset of each object within the file. The trailing piece of a PDF document also indicates which object is the “root” object. The trailer also contains a byte offset, which points to the beginning of the cross-reference table. The structure, once mapped out, is somewhat similar to an XML document with a “containment” hierarchy; that is, the document is composed of “page” objects, the page objects are composed of other objects like fonts, streams of text, etc.
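To make the trailer layout concrete, here is a minimal sketch in plain Java (no PDF library) that looks for the last "startxref" keyword in a file and reads the byte offset that follows it. The class name PdfTrailerPeek and the tiny built-in sample document are made up for illustration, and the sketch assumes a simple, well-formed file; it is not a substitute for a real PDF parser.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PdfTrailerPeek {

    // Hypothetical, heavily simplified PDF-like sample used when no file is given.
    static final String SAMPLE =
        "%PDF-1.4\n1 0 obj\n<< >>\nendobj\nxref\n0 1\ntrailer\n"
        + "<< /Root 1 0 R >>\nstartxref\n30\n%%EOF\n";

    /**
     * Returns the number that follows the last "startxref" keyword,
     * or -1 if the keyword or the number is missing.
     */
    public static long findXrefOffset(byte[] pdfBytes) {
        // ISO-8859-1 maps each byte to exactly one char, so string
        // indexes stay equal to byte offsets even for binary content.
        String text = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        int keyword = text.lastIndexOf("startxref");
        if (keyword < 0) {
            return -1;
        }
        int i = keyword + "startxref".length();
        // Skip the whitespace between the keyword and the number.
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        int start = i;
        while (i < text.length() && text.charAt(i) >= '0' && text.charAt(i) <= '9') {
            i++;
        }
        if (i == start) {
            return -1; // keyword present but no digits after it
        }
        return Long.parseLong(text.substring(start, i));
    }

    public static void main(String[] args) throws Exception {
        // Read the file named on the command line, or fall back to the sample.
        byte[] bytes = args.length > 0
            ? Files.readAllBytes(Paths.get(args[0]))
            : SAMPLE.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("Cross-reference table starts at byte " + findXrefOffset(bytes));
    }
}
```

A reader that wanted to go further would seek to that offset, read the cross-reference table, and use its entries to locate each object, starting with the root object named in the trailer.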

If you have not done so, use a text editor to take a look at a PDF file (for simplicity, try a document that contains no images). You’ll see that the instructions are expressed in plain text. The PDF language specification describes the syntax of all the instructions and is available, along with other documents, on the Adobe site. The specification is a fairly large document, which is a testimony to the relative complexity of PDF.

PDF documents typically use a compression algorithm (such as LZW) to reduce the size of text and binary streams in the document. That’s why you will most likely see unreadable characters instead of the text contained in the document. One way to extract information from a PDF file is by simply reading the “text-based” instructions and extracting the appropriate data. This requires a fair amount of understanding of the PDF language specification and given the format of PDF, I doubt that was the intended mechanism for manipulating PDF files.

Adobe provides a variety of tools for creating and reading PDF documents. It also provides a Development Kit with an API for interacting with PDF documents programmatically. I searched the Adobe site hoping to find a Java API, but could not find any mention of one. Adobe does provide a Java API for the Forms Data Format (FDF), but not for PDF. I suppose you could use JNI to call the C++ API from within a Java program, but that would be complex and cumbersome.

A Java-based PDF API

I discovered a Java library for PDF from Etymon Consulting. Although it does not cover the full specification, it does provide a convenient approach for reading, changing, and writing PDF files from within Java programs. As with any Java library, the API is organized into packages. The main package is com.etymon.pj.object. Here, you’ll find an object representation of all the PDF core objects, which are array, boolean, dictionary, name, null, number, reference, stream, and string. Where the Java language provides an equivalent object, it is used, but with a wrapper around it for consistency. So, for example, the string object is represented by PjString.

When you read a PDF file, the Java equivalents of the PDF objects are created. You can then manipulate the objects using their methods and write the result back to the PDF file. You do need knowledge of PDF language to effectively do some of the manipulations. The following lines, for example, create a Font object:

    PjFontType1 font = new PjFontType1();
    font.setBaseFont(new PjName("Helvetica-Bold"));
    font.setEncoding(new PjName("PDFDocEncoding"));
    int fontId = pdf.registerObject(font);

where pdf is the Pdf object that represents the PDF file.

One thing I wanted to do was change parts of the text in a PDF file to create a “customized” PDF. While I have access to the PjStream object, the byte array containing the text is compressed, and the current library does not support LZW decompression. It does support Flate decompression.
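As an illustration of what Flate decompression involves, the following sketch uses only the standard java.util.zip classes rather than the pj library. It assumes the stream bytes are zlib-wrapped deflate data, which is what PDF's FlateDecode filter produces. The class name FlateStreamDemo, the helper names, and the sample content string are made up for the example; the deflate helper exists only to build an input to decompress.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class FlateStreamDemo {

    /** Inflates zlib-wrapped (FlateDecode-style) data into raw bytes. */
    public static byte[] inflate(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // No progress and nothing left to read means the data is cut short.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException("truncated or unsupported Flate data");
                }
                out.write(buffer, 0, n);
            }
        } finally {
            inflater.end(); // release native resources
        }
        return out.toByteArray();
    }

    /** Compresses raw bytes; used here only to produce sample input. */
    public static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws DataFormatException {
        // Hypothetical page content: PDF text-drawing operators.
        byte[] content = "BT /F1 12 Tf (Hello, PDF) Tj ET"
            .getBytes(StandardCharsets.ISO_8859_1);
        byte[] roundTrip = inflate(deflate(content));
        System.out.println(new String(roundTrip, StandardCharsets.ISO_8859_1));
    }
}
```

Once a stream has been inflated this way, the text between the parentheses could in principle be replaced and the result deflated again before it is written back.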

Despite some limitations, you can still do many useful things. If you need to append to a number of PDF documents programmatically, you can create a page and then append it to the existing PDF documents, all from Java. The API also provides information about the document, such as the number of pages, author, keywords, and title. This would allow a Java servlet to dynamically create a page containing the document information with links to the actual PDF files. As new PDF files are added and old ones deleted, the servlet would update the page to reflect the latest collection.
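The index page idea can be sketched without the servlet API or the pj library. In the following example, DocInfo is a hypothetical holder for the metadata a servlet might collect for each file, and the file names, titles, and authors in main are invented sample values. In a real servlet the rendered string would be written to the response instead of the console.

```java
import java.util.Arrays;
import java.util.List;

public class PdfIndexPage {

    /** Hypothetical holder for the metadata gathered from one PDF file. */
    public static final class DocInfo {
        final String fileName;
        final String title;
        final String author;
        final int pageCount;

        public DocInfo(String fileName, String title, String author, int pageCount) {
            this.fileName = fileName;
            this.title = title;
            this.author = author;
            this.pageCount = pageCount;
        }
    }

    /** Escapes the characters that have special meaning in HTML. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Renders one list item per document, linking to the PDF file. */
    public static String render(List<DocInfo> docs) {
        StringBuilder html = new StringBuilder("<ul>\n");
        for (DocInfo d : docs) {
            html.append("  <li><a href=\"").append(escape(d.fileName)).append("\">")
                .append(escape(d.title)).append("</a> by ")
                .append(escape(d.author))
                .append(" (").append(d.pageCount).append(" pages)</li>\n");
        }
        return html.append("</ul>\n").toString();
    }

    public static void main(String[] args) {
        // Invented sample data standing in for values read from real files.
        List<DocInfo> docs = Arrays.asList(
            new DocInfo("form-a.pdf", "Form A", "J. Doe", 2),
            new DocInfo("guide.pdf", "User <Guide>", "A. Smith", 14));
        System.out.print(render(docs));
    }
}
```

Rebuilding the list on each request, rather than caching it, is the simplest way to keep the page in step with files being added and deleted, at the cost of re-reading the metadata every time.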

Listing 1 shows a simple program that uses the pj library to extract information from a PDF file and print that information to the console.

Listing 1.

    import com.etymon.pj.*;
    import com.etymon.pj.object.*;

    public class GetPDFInfo {
        public static void main(String[] args) {
            try {
                // Parse the PDF file named on the command line.
                Pdf pdf = new Pdf(args[0]);
                System.out.println("# of pages is " + pdf.getPageCount());

                // Walk every object in the file, looking for the Info object.
                int y = pdf.getMaxObjectNumber();
                for (int x = 1; x <= y; x++) {
                    PjObject obj = pdf.getObject(x);
                    if (obj instanceof PjInfo) {
                        PjInfo info = (PjInfo) obj;
                        System.out.println("Author: " + info.getAuthor());
                        System.out.println("Creator: " + info.getCreator());
                        System.out.println("Subject: " + info.getSubject());
                        System.out.println("Keywords: " + info.getKeywords());
                    }
                }
            } catch (java.io.IOException ex) {
                System.out.println(ex);
            } catch (com.etymon.pj.exception.PjException ex) {
                System.out.println(ex);
            }
        }
    }

Before you compile the above program, you need to download the pj library, which includes the pj.jar file. Make sure your CLASSPATH includes the pj.jar file.

The program reads the PDF file specified on the command line and parses it using the following line:

           Pdf pdf = new Pdf(args[0]);

It then goes through all the objects that were created as a result of parsing the PDF file and searches for a PjInfo object. That object encapsulates information such as the author, subject, and keywords, which are extracted using the appropriate methods. You can also “set” those values, which saves them permanently in the PDF file.

A number of sample programs ship with the pj library, along with the standard javadoc-style documentation. The library is distributed under the GNU General Public License.

Conclusion

Despite additions and advancements in HTML, PDF continues to be the most popular means of sharing rich documents. As a programming language, Java needs to be able to interact with data. The pj library shown here is a preview of how PDF objects can be modeled in Java so that Java’s familiar constructs can be used to manipulate seemingly complex PDF documents. With this type of interaction, applications that need to serve rich documents can actually “personalize” the content before sending out the document. This scenario can be applied, for example, to many legal forms where a handwritten signature is still required and the form is too complex to be drawn entirely in HTML. Java and PDF provide a nice solution for these types of applications.

About the Author

Piroz Mohseni is president of Bita Technologies, which focuses on business improvement through effective use of technology. His areas of interest include enterprise Java, XML, and e-commerce applications.
