A Brief Introduction to OpenOffice.org Writer Files
OpenOffice.org Writer, a no-cost, open-source alternative to pricey commercial word processing applications, stores its files with an "odt" extension. Even users with casual familiarity with Writer may be surprised to learn that this file is nothing more than a standard "zip" archive full of XML files. The implication is this: armed with a little knowledge of these internal files, you can create and edit them programmatically.
In this article, I will discuss some of the basic concepts relating to the ODT file itself. I will not discuss how to use OpenOffice.org Writer; the lessons needed to gain competency with any word processing application would fill a book.
Opening the ODT File
To get started, you of course need an ODT file. Once you have one, unzipping the file gives you (among other things) four XML files:
- content.xml: The actual content of a document.
- meta.xml: Meta-data such as creation date, editor, and statistics (word count, and so forth).
- settings.xml: OpenOffice.org program settings and preferences local to the document itself.
- styles.xml: Formatting styles (for paragraphs, characters, and so on) defined by OpenOffice.org and by the author.
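Because an ODT file is an ordinary zip archive, the four files listed above can be located without unzipping anything to disk. The following is a minimal sketch using Python's standard zipfile module; the function names list_core_parts and read_part are made up for this illustration.

```python
import zipfile

# The four XML parts named in the list above.
CORE_PARTS = ("content.xml", "meta.xml", "settings.xml", "styles.xml")


def list_core_parts(odt_file):
    """Return the core XML parts present in an ODT archive.

    odt_file may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(odt_file) as archive:
        names = set(archive.namelist())
    return [name for name in CORE_PARTS if name in names]


def read_part(odt_file, part_name):
    """Return the raw bytes of one member of the archive, e.g. content.xml."""
    with zipfile.ZipFile(odt_file) as archive:
        return archive.read(part_name)
```

Calling list_core_parts on the path of a document saved by Writer should report all four names; read_part then hands back the bytes of whichever one you want to inspect.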
Additional files and directories will turn up (some of them depending on just what is in the document, such as a possible "Pictures" directory). However, for the purposes of this article, I will mainly discuss the content.xml and styles.xml files. The main reason for dismissing the presence of the additional files is this: If you're reading or editing material in an existing ODT file, their presence does not generally matter anyway, and if you are creating a new document programmatically, the simplest way is to "edit" an empty template file and save it to a new name. This is perhaps the safest strategy to use because it lets you focus entirely on the content of the document you want to process.
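The "edit an empty template and save it under a new name" strategy can be sketched as follows. This is one possible approach, not the only one: it copies every member of the template archive unchanged and swaps in a new content.xml. The function name save_with_new_content is invented for this example. The sketch keeps the "mimetype" member uncompressed and in its original position, which is what the OpenDocument packaging rules expect.

```python
import zipfile


def save_with_new_content(template, output, new_content_xml):
    """Copy an ODT template to a new archive, replacing only content.xml.

    template and output may be file paths or binary file-like objects;
    new_content_xml may be bytes or str.
    """
    with zipfile.ZipFile(template) as src, zipfile.ZipFile(output, "w") as dst:
        # infolist() preserves the template's member order, so a leading
        # "mimetype" entry stays first in the copy.
        for info in src.infolist():
            if info.filename == "content.xml":
                data = new_content_xml
            else:
                data = src.read(info.filename)
            if info.filename == "mimetype":
                compression = zipfile.ZIP_STORED  # must stay uncompressed
            else:
                compression = zipfile.ZIP_DEFLATED
            dst.writestr(info.filename, data, compress_type=compression)
```

All the other files and directories in the template ride along untouched, which is exactly why their presence does not need to be understood in detail.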
The content.xml file, as mentioned above, is the real meat of an ODT file: Your actual document material is stored in this file. Feel free to open the content.xml file in any text or, better yet, XML editor you have available. You should see something similar to this:
<office:document-content ... >
  <office:automatic-styles> ... </office:automatic-styles>
  <office:body>
    <office:text>
      <!-- YOUR STUFF HERE -->
    </office:text>
  </office:body>
</office:document-content>
Note that I am taking some liberties to omit the elements I won't discuss, and I am sure you will forgive me for not listing all twenty-two namespace declarations in the root element. Once you have navigated to this point, note that the XML structure of your actual document material is stored in just a handful of "block-level" elements:
<text:h ... >Your Heading Here</text:h>
<text:p ... >Some paragraph of stuff here.</text:p>
<text:list ... >
  <text:list-item ... >
    <text:p ... >First list item text here.</text:p>
  </text:list-item>
</text:list>
The "inline-level" (that is, the character-level) formatting entities occur in span elements:
<text:p ... >Some <text:span ... >fancy text</text:span> here.</text:p>
Aside from the container elements for tables and images, you now know enough of the basic document structure to read or edit an OpenOffice.org file. Granted, there are still a few things worth knowing. Notably, in the snippets above, the ellipses omit the rather important "text:style-name" attribute, which names the formatting style applied to the material contained in that element.
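To make the structure concrete, here is a small sketch that walks the children of the office:text element and reports each block-level element along with its "text:style-name" attribute. It uses Python's xml.etree.ElementTree and the standard OpenDocument namespace URIs; the function name block_elements is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}
# ElementTree spells namespaced attribute names as "{uri}local-name".
STYLE_NAME = "{%s}style-name" % NS["text"]


def block_elements(content_xml):
    """Return (element kind, style name, plain text) for each child of office:text."""
    root = ET.fromstring(content_xml)
    body_text = root.find("office:body/office:text", NS)
    if body_text is None:
        return []
    blocks = []
    for child in body_text:
        kind = child.tag.split("}", 1)[-1]  # "h", "p", "list", ...
        text = "".join(child.itertext())    # flattens spans into plain text
        blocks.append((kind, child.get(STYLE_NAME), text))
    return blocks
```

A real content.xml also carries a few housekeeping children inside office:text, so expect some entries besides "h", "p", and "list" in the result.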
Before focusing attention on formatting styles, however, I want to mention a couple other pieces of information. As in all other word processors, Writer allows a "line break" that puts the cursor at the beginning of the next line without starting a new paragraph. The element for this is simply:
<text:p ... >Blah blah blah<text:line-break/>yadda yadda.</text:p>
You can get a line break in Writer by pressing SHIFT+ENTER instead of just ENTER.
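When pulling plain text out of a paragraph, the line-break element has no text of its own, so a naive text join silently glues the two lines together. A minimal sketch of one way to handle it, with a hypothetical helper named paragraph_text:

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
LINE_BREAK = "{%s}line-break" % TEXT_NS


def paragraph_text(element):
    """Return the text of a paragraph, turning text:line-break into a newline."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == LINE_BREAK:
            parts.append("\n")
        else:
            # Spans and other inline elements are flattened recursively.
            parts.append(paragraph_text(child))
        # Text that follows the child element, up to the next sibling.
        parts.append(child.tail or "")
    return "".join(parts)
```

Applied to the paragraph shown above, this yields the two phrases separated by a newline rather than run together.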
Also, the "list item" element above shows a single paragraph element. However, multiple paragraph elements are possible here. (Visually, you get a separated paragraph without an additional bullet or number.) You can get these in Writer in the following way: When you type text in a given list item, pressing ENTER takes you to the next list item; however, immediately pressing BACKSPACE once returns you to the previous list item but leaves you in a new paragraph. (And immediately pressing BACKSPACE a second time takes you out of list mode entirely, returning you to a new standard paragraph.)
Finally, as far as the content.xml file is concerned, bulleted and numbered lists are both just lists. The formatting style of the given list and list items determines whether a bullet or a number is used to represent it in the interface to the author.
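Since a list item may hold more than one paragraph, code that reads lists should treat each item as a sequence of paragraphs rather than a single string. The sketch below does that for one text:list element; the function name list_items is made up, and nested lists are deliberately ignored to keep the example short.

```python
import xml.etree.ElementTree as ET

NS = {"text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0"}


def list_items(list_element):
    """Return one list of paragraph strings per text:list-item.

    Bulleted and numbered lists look the same here; only the style
    referenced by the list decides how it is rendered.
    """
    items = []
    for item in list_element.findall("text:list-item", NS):
        paragraphs = ["".join(p.itertext()) for p in item.findall("text:p", NS)]
        items.append(paragraphs)
    return items
```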
One thing that is not obvious to casual users of Writer is this: The most "correct" way of formatting your content in OpenOffice.org involves the use of pre-defined or user-defined styles in the "Styles and Formatting" tool. From here on, I will simply say "defined style" to mean both pre-defined and user-defined styles. There are style families for page-level entities (such as page and margin sizes), paragraph-level entities (which include headers and titles), character-level entities (such as emphasis on just one word in a paragraph), and list entities. Note that there are toolbar buttons in the Styles and Formatting tool window of OpenOffice.org that correspond to these families. The actual definitions of these defined styles (which font to use, what color the background is, and the like) are stored in the styles.xml file.
Moreover, defined styles inherit from one another within their family: Each defined style is related to a parent style. In nearly object-oriented fashion, a change to a formatting property (such as text color) in a top-level defined style propagates downward through its descendant styles until a descendant overrides that property. Direct and controlled use of defined styles can give some nice control and semantic meaning to content that is otherwise totally lost in the hapless application of the usual formatting buttons. Sometimes, it is important to know "why" something is in a red font, and a defined style preserves and conveys this meaning ... not to mention gives you one central location to change all the instances in the document to a blue font instead.
The name of the defined style in the styles.xml file generally matches the name the author sees in OpenOffice.org's interface. The biggest caveat to that statement is that special characters, including the space character, are converted to their hex-code value representation, which is then surrounded by underscores. (There is also the notable exception that the style seen as "Default" in the user interface is labeled as "Standard".) In other words, if an author applied the "Heading 1" defined style to a paragraph of text, the content.xml file would include:
<text:h text:style-name="Heading_20_1" ... >Your Heading Here</text:h>
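The name conversion described above can be approximated in a few lines. This is a sketch under stated assumptions, not OpenOffice.org's actual routine: it assumes that every ASCII character other than a letter or digit is replaced by its hex code between underscores, and that non-ASCII characters pass through unchanged. The article does not spell out the exact set of "special" characters, and the "Default"/"Standard" exception is not handled. Both function names are hypothetical.

```python
import re


def encode_style_name(display_name):
    """Approximate the internal form of a style name, e.g. for a lookup key."""
    def to_hex(match):
        return "_%02x_" % ord(match.group(0))
    # Assumption: only ASCII non-alphanumerics are escaped.
    return re.sub(r"[^A-Za-z0-9\u0080-\U0010FFFF]", to_hex, display_name)


def decode_style_name(internal_name):
    """Turn _xx_ hex escapes back into the characters they stand for."""
    def from_hex(match):
        return chr(int(match.group(1), 16))
    return re.sub(r"_([0-9A-Fa-f]{2,4})_", from_hex, internal_name)
```

With these, "Heading 1" encodes to "Heading_20_1" and decodes back again. A name such as "foo_bar" contains no complete escape sequence, so decoding leaves it alone.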
The "Heading_20_1" style, as mentioned above, is detailed in the styles.xml file. These details include:
- That it belongs to the "paragraph" family.
- That its parent style is a style simply called "Heading" (which is a valid style to be used in the user interface, but which functions mostly like an abstract base class in programming parlance).
- That, when the author presses ENTER, the style for the next section of text will automatically be set to the Text Body style.
- That its text properties make it bold and set its size to 115% of whatever the parent style (Heading, in this case) uses.
... and so on. In other words, all the values and settings necessary to reconstitute the look, feel, and behavior of that style (and its parent styles all the way back to the Standard/Default style) are present in the styles.xml file.
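The parent relationship described in the list above is recorded in the "style:parent-style-name" attribute of each style:style element, so the chain back to Standard can be followed mechanically. A minimal sketch, with hypothetical function names load_styles and parent_chain:

```python
import xml.etree.ElementTree as ET

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
STYLE_TAG = "{%s}style" % STYLE_NS
NAME = "{%s}name" % STYLE_NS
FAMILY = "{%s}family" % STYLE_NS
PARENT = "{%s}parent-style-name" % STYLE_NS


def load_styles(styles_xml):
    """Index every style:style element by (family, name)."""
    root = ET.fromstring(styles_xml)
    return {(el.get(FAMILY), el.get(NAME)): el for el in root.iter(STYLE_TAG)}


def parent_chain(styles, family, name):
    """Return the style's name followed by each ancestor's name, nearest first."""
    chain = []
    seen = set()  # guards against a malformed file with a parent loop
    while name is not None and (family, name) in styles and name not in seen:
        seen.add(name)
        chain.append(name)
        name = styles[(family, name)].get(PARENT)
    return chain
```

To work out an effective property such as font weight, you would read the property elements of each style in the chain in order and take the first one that sets it.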
Here's the minor rub: Any time a document author drags through a paragraph or series of characters and uses a toolbar button (the Bold button, for example), they are not applying a defined style but are instead implicitly creating an "automatic style." Conceptually speaking, the two kinds of style store the same information. However, while the pre-defined and user-defined styles are stored in the styles.xml file, the automatic styles are stored in the content.xml file. (Go back to the XML snippet I showed for the content.xml file above and you'll see where the automatic styles element is located.) Even so, at some point (most likely immediately), the automatic style references a defined "parent" style which is, as I've mentioned, present in the styles.xml file.
As for their style names, automatic styles are created by combining a letter (such as "P" for paragraph style) with some integer. There is nothing particularly "special" about the naming convention ... it just seems to be a simple thing OpenOffice.org itself does when it saves information out to a file. You can carefully change the "P1" style to "foo_bar" and, so long as you update every reference to it, the file should work. However, OpenOffice.org will likely rename it for you the next time it saves the document. (Beware the assumption that you can programmatically create an arbitrarily named automatic style and have its name survive a few editing sessions.)
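Renaming an automatic style "carefully" means changing both its definition and every reference to it. The sketch below does this on a parsed content.xml tree. It is deliberately narrow: it only looks at style:style definitions and at "text:style-name" references, and it does not check the style family, so it suits paragraph and character styles such as "P1" or "T1" but would miss references made through other attributes. The function name rename_automatic_style is hypothetical.

```python
import xml.etree.ElementTree as ET

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
STYLE_TAG = "{%s}style" % STYLE_NS
STYLE_NAME = "{%s}name" % STYLE_NS
TEXT_STYLE_NAME = "{%s}style-name" % TEXT_NS


def rename_automatic_style(root, old_name, new_name):
    """Rename a style definition and its text:style-name references in place.

    Returns the number of attributes that were changed.
    """
    changed = 0
    for element in root.iter():
        # The definition itself: <style:style style:name="P1" ...>
        if element.tag == STYLE_TAG and element.get(STYLE_NAME) == old_name:
            element.set(STYLE_NAME, new_name)
            changed += 1
        # Any paragraph, heading, or span that uses the style.
        if element.get(TEXT_STYLE_NAME) == old_name:
            element.set(TEXT_STYLE_NAME, new_name)
            changed += 1
    return changed
```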
What this means is that, if you want to (programmatically) understand the formatting applied to some piece of text in the content.xml file, you'll probably have to first read through the automatic style and then trace back into the defined styles.
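That two-step trace can be sketched as a single lookup that checks the automatic styles in content.xml first and then falls back to the defined styles in styles.xml, following "style:parent-style-name" at each step. The function names index_styles and trace_style are invented for this illustration, and the sketch assumes a parent always belongs to the same family as its child.

```python
import xml.etree.ElementTree as ET

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
STYLE_TAG = "{%s}style" % STYLE_NS
NAME = "{%s}name" % STYLE_NS
FAMILY = "{%s}family" % STYLE_NS
PARENT = "{%s}parent-style-name" % STYLE_NS


def index_styles(xml_text):
    """Index every style:style element in a document by (family, name)."""
    root = ET.fromstring(xml_text)
    return {(el.get(FAMILY), el.get(NAME)): el for el in root.iter(STYLE_TAG)}


def trace_style(content_xml, styles_xml, family, name):
    """Return [(style name, "automatic" or "defined"), ...], nearest first."""
    automatic = index_styles(content_xml)
    defined = index_styles(styles_xml)
    chain = []
    seen = set()  # stop if a malformed file contains a parent loop
    while name and name not in seen:
        seen.add(name)
        key = (family, name)
        if key in automatic:
            element, origin = automatic[key], "automatic"
        elif key in defined:
            element, origin = defined[key], "defined"
        else:
            break  # dangling reference; report what was found so far
        chain.append((name, origin))
        name = element.get(PARENT)
    return chain
```

The order of the result matters: properties set by an earlier entry in the chain take precedence over the same properties set by a later one.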