OpenOffice.org Writer, a no-cost, open-source alternative to pricey commercial word processing applications, stores its files with an “odt” extension. Even users with casual familiarity with Writer may be surprised to learn that this file is nothing more than a standard “zip” archive full of XML files. The implication is this: armed with a little knowledge of these internal files, you can create and edit them programmatically.
In this article, I will discuss some of the basic concepts relating to the ODT file itself. I will not discuss how to use OpenOffice.org Writer; the lessons needed to gain efficient competency with any word processing application would provide ample material to fill a book.
Opening the ODT File
To get started, you of course need an ODT file. Once you have one, unzipping the file gives you (among other things) four XML files:
- content.xml: The actual content of a document.
- meta.xml: Meta-data such as creation date, editor, and statistics (word count, and so forth).
- settings.xml: OpenOffice.org program settings and preferences local to the document itself.
- styles.xml: Formatting styles (for paragraphs, characters, and so on) defined by OpenOffice.org and by the author.
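To make the "it is just a zip file" point concrete, here is a minimal Python sketch that looks for those four files inside a package. So that it runs without a real document on hand, it first builds a tiny stand-in archive in memory; the helper names (build_sample_odt, list_core_parts) and the placeholder file bodies are my own assumptions, not anything Writer produces.

```python
import io
import zipfile

CORE_PARTS = ("content.xml", "meta.xml", "settings.xml", "styles.xml")


def build_sample_odt() -> bytes:
    """Build a tiny stand-in for an ODT package, entirely in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        # Placeholder bodies only; a real ODT holds full XML documents here.
        for name in CORE_PARTS:
            package.writestr(name, "<placeholder/>")
        package.writestr("Pictures/logo.png", b"not really an image")
    return buffer.getvalue()


def list_core_parts(odt_bytes: bytes) -> list[str]:
    """Return which of the four core XML files are present in the package."""
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as package:
        names = set(package.namelist())
    return [name for name in CORE_PARTS if name in names]


core_parts = list_core_parts(build_sample_odt())
print(core_parts)
```

For a document on disk, passing its path straight to zipfile.ZipFile works the same way as the in-memory buffer used here.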
Additional files and directories will turn up (some of them depending on just what is in the document, such as a possible “Pictures” directory). However, for the purposes of this article, I will mainly discuss the content.xml and styles.xml files. The main reason for dismissing the presence of the additional files is this: If you’re reading or editing material in an existing ODT file, their presence does not generally matter anyway, and if you are creating a new document programmatically, the simplest way is to “edit” an empty template file and save it to a new name. This is perhaps the safest strategy to use because it lets you focus entirely on the content of the document you want to process.
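The "edit an empty template and save it to a new name" strategy can be sketched in Python as follows. The function name replace_content and the three-file demo template are hypothetical. One detail that comes from the OpenDocument packaging rules rather than from anything above: the "mimetype" entry is expected to be the first member of the archive and stored uncompressed, so this sketch copies members in their original order and keeps each member's original compression setting.

```python
import io
import zipfile


def replace_content(template_bytes: bytes, new_content_xml: str) -> bytes:
    """Copy every member of a template package, swapping in a new content.xml."""
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(template_bytes)) as source, \
            zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as target:
        for info in source.infolist():
            data = source.read(info.filename)
            if info.filename == "content.xml":
                data = new_content_xml.encode("utf-8")
            # Passing the original ZipInfo keeps the member's name, order,
            # and compression type (so "mimetype" stays uncompressed).
            target.writestr(info, data)
    return output.getvalue()


# Hypothetical stand-in for a blank template saved from Writer.
template = io.BytesIO()
with zipfile.ZipFile(template, "w") as package:
    package.writestr("mimetype", "application/vnd.oasis.opendocument.text")
    package.writestr("content.xml", "<old/>", compress_type=zipfile.ZIP_DEFLATED)
    package.writestr("styles.xml", "<styles/>", compress_type=zipfile.ZIP_DEFLATED)

result = replace_content(template.getvalue(), "<new/>")
with zipfile.ZipFile(io.BytesIO(result)) as check:
    names = check.namelist()
    new_body = check.read("content.xml")
    mimetype_compression = check.getinfo("mimetype").compress_type
print(names, new_body)
```

In real use, the new content.xml must of course be a complete, well-formed document rather than the one-element placeholder shown here.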
The content.xml file, as mentioned above, is the real meat of an ODT file: Your actual document material is stored in this file. Feel free to open the content.xml file in any text or, better yet, XML editor you have available. You should see something similar to this:
<office:document-content ... >
  <office:automatic-styles> ... </office:automatic-styles>
  <office:body>
    <office:text>
      <!-- YOUR STUFF HERE -->
    </office:text>
  </office:body>
</office:document-content>
Note that I am taking some liberties to omit the elements I won’t discuss, and I am sure you will forgive me for not listing all twenty-two namespace declarations in the root element. Once you have navigated to this point, note that the XML structure of your actual document material is stored in just a handful of “block-level” elements:
<text:h ... >Your Heading Here</text:h>
<text:p ... >Some paragraph of stuff here.</text:p>
<text:list ... >
  <text:list-item ... >
    <text:p ... >First list item text here.</text:p>
  </text:list-item>
</text:list>
The “inline-level” (that is, the character-level) formatting entities occur in span elements:
<text:p ... >Some <text:span ... >fancy text</text:span> here.</text:p>
Aside from the container elements for tables and images, you know enough of the basic document structure to read or edit an OpenOffice.org file. Granted, there are still a few things worth knowing. Notably, in the above snippet examples, the ellipses are omitting the rather important “text:style-name” attribute that defines the formatting information for the material contained in that element.
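As a rough illustration of how little is needed to read this structure, the following Python sketch walks the direct children of the office:text element and reports each block's element name, its text:style-name, and its flattened text. The sample document is a trimmed-down stand-in that declares only two of the namespaces, and the style names in it ("Text_20_body", "T1", "L1", "P1") are merely plausible examples; describe_blocks is a name I made up.

```python
import xml.etree.ElementTree as ET

NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}

SAMPLE_CONTENT = """<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <office:body>
    <office:text>
      <text:h text:style-name="Heading_20_1">Your Heading Here</text:h>
      <text:p text:style-name="Text_20_body">Some <text:span text:style-name="T1">fancy text</text:span> here.</text:p>
      <text:list text:style-name="L1">
        <text:list-item>
          <text:p text:style-name="P1">First list item text here.</text:p>
        </text:list-item>
      </text:list>
    </office:text>
  </office:body>
</office:document-content>"""


def describe_blocks(content_xml: str) -> list[tuple[str, str, str]]:
    """Return (element name, style name, text) for each block-level element."""
    root = ET.fromstring(content_xml)
    body = root.find("office:body/office:text", NS)
    style_attr = "{%s}style-name" % NS["text"]
    blocks = []
    for element in body:
        kind = element.tag.split("}")[1]  # drop the "{namespace}" part
        # Join all nested text, then collapse runs of whitespace.
        text = " ".join("".join(element.itertext()).split())
        blocks.append((kind, element.get(style_attr, ""), text))
    return blocks


blocks = describe_blocks(SAMPLE_CONTENT)
for block in blocks:
    print(block)
```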
Before focusing attention on formatting styles, however, I want to mention a couple of other pieces of information. As in all other word processors, Writer allows a “line break” that puts the cursor at the beginning of the next line without starting a new paragraph. The element for this is simply:
<text:p ... >Blah blah blah<text:line-break/>yadda yadda.</text:p>
You can get a line break in Writer by pressing SHIFT+ENTER instead of just ENTER.
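If you are pulling plain text out of a paragraph programmatically, the line-break element needs a little care because it carries no text of its own. Here is one possible Python sketch that flattens a paragraph, descending into spans and emitting a newline for each line break; paragraph_text is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
LINE_BREAK = "{%s}line-break" % TEXT_NS


def paragraph_text(element: ET.Element) -> str:
    """Flatten a paragraph to plain text, turning line breaks into newlines."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == LINE_BREAK:
            parts.append("\n")
        else:
            # Spans (and anything else nested) are flattened recursively.
            parts.append(paragraph_text(child))
        # Text that follows the child still belongs to this paragraph.
        parts.append(child.tail or "")
    return "".join(parts)


paragraph = ET.fromstring(
    '<text:p xmlns:text="%s">Blah blah blah<text:line-break/>'
    'yadda <text:span>yadda</text:span>.</text:p>' % TEXT_NS
)
flattened = paragraph_text(paragraph)
print(repr(flattened))
```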
Also, the “list item” element above shows a single paragraph element. However, multiple paragraph elements are possible here. (Visually, you get a separate paragraph without an additional bullet or number.) You can get these in Writer in the following way: When you type text in a given list item, pressing ENTER takes you to the next list item; however, immediately pressing BACKSPACE once returns you to the previous list item but leaves you in a new paragraph. (And immediately pressing BACKSPACE a second time takes you out of list mode entirely, returning you to a new standard paragraph.)
Finally, as far as the content.xml file is concerned, bulleted and numbered lists are both just lists. The formatting style of the given list and list items determines whether a bullet or a number is used to represent it in the interface to the author.
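To tell the two apart programmatically, you have to look at the list style that the list refers to. The Python sketch below assumes the list style is defined as a text:list-style element whose per-level children are named text:list-level-style-bullet or text:list-level-style-number; the style names "L1" and "L2", the sample fragment, and the helper list_kind are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"

SAMPLE_LIST_STYLES = """<office:automatic-styles
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <text:list-style style:name="L1">
    <text:list-level-style-bullet text:level="1" text:bullet-char="*"/>
  </text:list-style>
  <text:list-style style:name="L2">
    <text:list-level-style-number text:level="1" style:num-format="1"/>
  </text:list-style>
</office:automatic-styles>"""

KINDS = {
    "list-level-style-bullet": "bulleted",
    "list-level-style-number": "numbered",
}


def list_kind(root: ET.Element, list_style_name: str, level: int = 1) -> str:
    """Report whether a list style uses a bullet or a number at one level."""
    for list_style in root.iter("{%s}list-style" % TEXT_NS):
        if list_style.get("{%s}name" % STYLE_NS) != list_style_name:
            continue
        for level_style in list_style:
            if level_style.get("{%s}level" % TEXT_NS) == str(level):
                local_name = level_style.tag.split("}")[1]
                return KINDS.get(local_name, "other")
    return "unknown"


styles_root = ET.fromstring(SAMPLE_LIST_STYLES)
kind_l1 = list_kind(styles_root, "L1")
kind_l2 = list_kind(styles_root, "L2")
print(kind_l1, kind_l2)
```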
One thing that is not obvious to casual users of Writer is this: The most “correct” way to format your content in OpenOffice.org is to use pre-defined or user-defined styles from the “Styles and Formatting” tool. From here on, I will simply say “defined style” to mean both pre-defined and user-defined styles. There are style families for page-level entities (such as page and margin sizes), paragraph-level entities (which include headers and titles), character-level entities (such as emphasis on just one word in a paragraph), and list entities. Note that there are toolbar buttons in the Styles and Formatting tool window of OpenOffice.org that correspond to these families. The actual definitions of these defined styles (which font to use, what color the background is, and the like) are stored in the styles.xml file.
Moreover, defined styles behave in an inheritance fashion within their family: A given defined style is related to a parent style. In nearly object-oriented fashion, a change to a top-level defined style formatting entity (such as text color) propagates downward through related descendant styles until those descendant styles override that formatting entity. Direct and controlled use of defined styles can give some nice control and semantic meaning to content that is otherwise totally lost in the hapless application of the usual formatting buttons. Sometimes, it is important to know “why” something is in a red font face, and a defined style preserves and conveys this meaning … not to mention gives you one central location to change all the instances in the document to a blue font face instead.
The name of the defined style in the styles.xml file generally matches the name the author sees in OpenOffice.org’s interface. The biggest caveat to that statement is that special characters, including the space character, are converted to their hex-code value representation, which is then surrounded by underscores. (There is also the notable exception that the style seen as “Default” in the user interface is labeled as “Standard”.) In other words, if an author applied the “Heading 1” defined style to a paragraph of text, the content.xml file would include:
<text:h text:style-name="Heading_20_1" ... >Your Heading Here</text:h>
The “Heading_20_1” style, as mentioned above, is detailed in the styles.xml file. These details include:
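The name conversion can be approximated in a few lines of Python. This is only a sketch: I am assuming that every character other than an ASCII letter or digit gets escaped, which matches the “Heading 1” example but may not match exactly which characters OpenOffice.org chooses to escape. The function names are my own, and the “Default” versus “Standard” exception is not handled.

```python
import re


def encode_style_name(display_name: str) -> str:
    """Turn a user-interface style name into its assumed stored form."""
    def escape(match: re.Match) -> str:
        # Hex code of the character, surrounded by underscores.
        return "_%x_" % ord(match.group(0))

    return re.sub(r"[^A-Za-z0-9]", escape, display_name)


def decode_style_name(stored_name: str) -> str:
    """Reverse encode_style_name for names produced by it."""
    def unescape(match: re.Match) -> str:
        return chr(int(match.group(1), 16))

    return re.sub(r"_([0-9a-fA-F]+)_", unescape, stored_name)


encoded = encode_style_name("Heading 1")
decoded = decode_style_name("Heading_20_1")
print(encoded, "|", decoded)
```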
- That it belongs to the “paragraph” family.
- That its parent style is a style simply called “Heading” (which is a valid style to be used in the user interface, but which functions mostly like an abstract base class in programming parlance).
- That, when the author presses ENTER, the style for the next section of text will automatically be set to the Text Body style.
- That its text properties involve being bold and sized at 115% of whatever the parent style (Heading, in this case) is set to.
… and so on. In other words, all the values and settings necessary to reconstitute the look, feel, and behavior of that style (and its parent styles all the way back to the Standard/Default style) are present in the styles.xml file.
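Walking that parent chain is straightforward once styles.xml is parsed. The Python sketch below assumes each defined style is a style:style element carrying style:name, style:family, style:parent-style-name, and style:next-style-name attributes. The sample XML is a hand-written stand-in (the 14pt size on “Heading” is invented for illustration), and style_ancestry is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
STYLE_TAG = "{%s}style" % STYLE_NS
NAME = "{%s}name" % STYLE_NS
FAMILY = "{%s}family" % STYLE_NS
PARENT = "{%s}parent-style-name" % STYLE_NS
NEXT = "{%s}next-style-name" % STYLE_NS

SAMPLE_STYLES = """<office:document-styles
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
    xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0">
  <office:styles>
    <style:style style:name="Standard" style:family="paragraph"/>
    <style:style style:name="Heading" style:family="paragraph"
        style:parent-style-name="Standard" style:next-style-name="Text_20_body">
      <style:text-properties fo:font-size="14pt"/>
    </style:style>
    <style:style style:name="Heading_20_1" style:family="paragraph"
        style:parent-style-name="Heading" style:next-style-name="Text_20_body">
      <style:text-properties fo:font-size="115%" fo:font-weight="bold"/>
    </style:style>
  </office:styles>
</office:document-styles>"""


def style_ancestry(styles_root: ET.Element, family: str, name: str) -> list[str]:
    """Follow parent-style-name links from a style up to its root ancestor."""
    by_name = {
        style.get(NAME): style
        for style in styles_root.iter(STYLE_TAG)
        if style.get(FAMILY) == family
    }
    chain = []
    # Stop at a missing parent, or if a malformed file loops back on itself.
    while name in by_name and name not in chain:
        chain.append(name)
        name = by_name[name].get(PARENT)
    return chain


styles_root = ET.fromstring(SAMPLE_STYLES)
chain = style_ancestry(styles_root, "paragraph", "Heading_20_1")
heading_1 = next(s for s in styles_root.iter(STYLE_TAG) if s.get(NAME) == "Heading_20_1")
next_style = heading_1.get(NEXT)
print(chain, next_style)
```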
Here’s the minor rub: Any time a document author drags through a paragraph or series of characters and uses a toolbar button—the Bold button, for example—they are not applying a defined style but are instead implicitly creating an “automatic style.” Conceptually speaking, they store the same information. However, while the pre-defined and user-defined styles are stored in the styles.xml file, the automatic styles are stored in the content.xml file. (Go back to the XML snippet I showed for the content.xml file above and you’ll see where the automatic styles element is located.) Even so, at some point (which is likely immediately), the automatic style references a defined “parent” style which is, as I’ve mentioned, present in the styles.xml file.
As for their style names, automatic styles are created by combining a letter (such as “P” for paragraph style) with some integer. There is nothing particularly “special” about the naming convention … it just seems to be a simple thing OpenOffice.org itself does when it saves information out to a file. You can carefully change the “P1” style to “foo_bar” and, so long as you update all such instances, the file should work. However, OpenOffice.org will likely rename it for you next time. (Beware the assumption that you can programmatically create an arbitrarily named automatic style and have its name survive a few editing sessions.)
What this means is that, if you want to (programmatically) understand the formatting applied to some piece of text in the content.xml file, you’ll probably have to first read through the automatic style and then trace back into the defined styles.
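That two-step lookup might be sketched in Python like this: index the style:style elements from both files, follow the parent links starting at the automatic style, and merge the text properties so that the nearest style wins. Both XML samples are hand-written stand-ins, the property values in them are invented, and the helper names are hypothetical. A fuller version would also need to consult default styles and the other property groups.

```python
import xml.etree.ElementTree as ET

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
STYLE_TAG = "{%s}style" % STYLE_NS
TEXT_PROPS = "{%s}text-properties" % STYLE_NS
NAME = "{%s}name" % STYLE_NS
FAMILY = "{%s}family" % STYLE_NS
PARENT = "{%s}parent-style-name" % STYLE_NS

XMLNS = (
    'xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" '
    'xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0" '
    'xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"'
)

# Stand-in for content.xml: one automatic style created by the Bold button.
CONTENT = """<office:document-content %s><office:automatic-styles>
  <style:style style:name="P1" style:family="paragraph"
      style:parent-style-name="Text_20_body">
    <style:text-properties fo:font-weight="bold"/>
  </style:style>
</office:automatic-styles></office:document-content>""" % XMLNS

# Stand-in for styles.xml: the defined styles that P1 ultimately rests on.
STYLES = """<office:document-styles %s><office:styles>
  <style:style style:name="Standard" style:family="paragraph">
    <style:text-properties fo:font-size="12pt" fo:color="#000000"/>
  </style:style>
  <style:style style:name="Text_20_body" style:family="paragraph"
      style:parent-style-name="Standard"/>
</office:styles></office:document-styles>""" % XMLNS


def index_styles(*roots: ET.Element) -> dict:
    """Map (family, name) to the style element, across both files."""
    return {
        (style.get(FAMILY), style.get(NAME)): style
        for root in roots
        for style in root.iter(STYLE_TAG)
    }


def effective_text_properties(index: dict, family: str, name: str):
    """Return the style chain and the merged text properties for a style."""
    chain = []
    while (family, name) in index and name not in chain:
        chain.append(name)
        name = index[(family, name)].get(PARENT)
    merged = {}
    # Apply the most distant ancestor first so nearer styles override it.
    for style_name in reversed(chain):
        props = index[(family, style_name)].find(TEXT_PROPS)
        if props is not None:
            for key, value in props.attrib.items():
                merged[key.rpartition("}")[2]] = value
    return chain, merged


index = index_styles(ET.fromstring(CONTENT), ET.fromstring(STYLES))
chain, merged = effective_text_properties(index, "paragraph", "P1")
print(chain, merged)
```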
Digging Beyond the Surface
As may be rather evident, the above discussion barely scratches the surface of the ODT file format. However, with the knowledge I have hopefully imparted thus far, you should have little trouble “reverse engineering” the parts you need for yourself: Simply start with a blank document, add (only) the element you would like to understand, such as a table, and then save the document, unzip the ODT file, and open the content.xml file in an editor. Search through the file for a piece of text you inserted and then pick things apart. I assembled a thin folder of printouts relating to how an ODT file implements various aspects of a document; I have found the ODT file to be remarkably accessible.
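The "search for a piece of text and pick things apart" step can be partly automated. Here is a small Python sketch that finds every element whose own text contains a marker string and prints the chain of enclosing elements. The sample is a hand-written approximation of a one-cell table, and find_text_paths and the "FINDME" marker are hypothetical; only three of the namespaces are mapped back to prefixes.

```python
import xml.etree.ElementTree as ET

PREFIX_BY_URI = {
    "urn:oasis:names:tc:opendocument:xmlns:office:1.0": "office",
    "urn:oasis:names:tc:opendocument:xmlns:text:1.0": "text",
    "urn:oasis:names:tc:opendocument:xmlns:table:1.0": "table",
}

SAMPLE_CONTENT = """<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
    xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0">
  <office:body><office:text>
    <table:table table:name="Table1">
      <table:table-column/>
      <table:table-row>
        <table:table-cell><text:p>FINDME</text:p></table:table-cell>
      </table:table-row>
    </table:table>
  </office:text></office:body>
</office:document-content>"""


def short_name(tag: str) -> str:
    """Convert "{namespace-uri}local" into "prefix:local"."""
    uri, _, local = tag[1:].partition("}")
    return "%s:%s" % (PREFIX_BY_URI.get(uri, "?"), local)


def find_text_paths(content_xml: str, needle: str) -> list[str]:
    """Return the element path to each element whose text contains needle."""
    root = ET.fromstring(content_xml)
    # ElementTree keeps no parent links, so build a child-to-parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    paths = []
    for element in root.iter():
        # Only text directly inside the element is checked, not text that
        # follows a nested child such as a span.
        if needle in (element.text or ""):
            node, names = element, []
            while node is not None:
                names.append(short_name(node.tag))
                node = parent_of.get(node)
            paths.append("/".join(reversed(names)))
    return paths


paths = find_text_paths(SAMPLE_CONTENT, "FINDME")
print(paths)
```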
Real-World Applications
At my workplace, the content material we manage, which turns into deliverable PDF files for customers, is stored in DITA XML topic files. (DITA stands for Darwin Information Typing Architecture.) This topic-based storage and management has served our in-house tech editors rather well, affording such things as content re-use, single-sourcing, and conditional content filtering. However, the content owners, those technical people closer to the product itself who own and author the raw content material, are not proficient with the DITA schemas, nor should they be.
To address the gap, we implemented an XSLT-driven process that converts the same DITA material that feeds the PDF files into a fresh “content.xml” file. This lone content.xml file is then zipped up into an otherwise blank ODT file and is made available to the content owners as part of the nightly build process. The content owners are then able to easily make edits to their content and, thanks to Writer’s change-tracking feature, the tech editors can locate these changes, finesse the grammar and wording, and incorporate the changes back into the DITA XML files. I will admit right now that this particular XSLT was very complex to write (it involved step-debugging a highly recursive transform) and does not fully reproduce the look and feel of the material as seen in the PDF file. But, for in-house editing purposes, this isn’t strictly relevant: the content owners can review and approve the finalized appearance from the same PDF files that are delivered to customers.
Alternatively, you may have a process that you need to take in the other direction. Another use case involves creating a true “template” in Writer (an “ott” file) that can be given to a working group as a rather fancy fill-in form. The material in the resulting content.xml file can be scanned for (either by direct location, or by some applied style name, or via other mechanisms) and converted out to some other format. Consider the possibility of turning a “requirements” Writer document into a working skeleton for some test cases.
Hopefully, I have shown you just enough of OpenOffice.org’s Writer file format to open up some possibilities for you to use it in new ways. By taking a blank document or document template, you can edit or replace the body section of the embedded content.xml file. By taking an existing document, you can find the content and transform it for other purposes. As the material inside the ODT file is readable text and reasonably well-structured XML, it is wide open to full, external, programmatic assault. Being able to take control of your own documents in this fashion is nothing short of powerful.
About the Author
Rob Lybarger works in a small IT shop in the greater Houston, TX area. Among other duties related to Ant and Java, he wrote the XSLT mentioned in the Real-World Applications section of this article and also performed various in-house customizations of the stock DITA processing and formatting stylesheets. At home, Rob enjoys spending time with his four-month-old daughter and being a highly satisfied owner of a Mac computer.