A Brief Introduction to OpenOffice.org Writer Files
By Rob Lybarger

OpenOffice.org Writer, a no-cost, open-source alternative to pricey commercial word processing applications, stores its files with an "odt" extension. Even users with casual familiarity with Writer may be surprised to learn that this file is nothing more than a standard "zip" file full of XML files. The implication is this: armed with a little knowledge of these internal files, you can programmatically create and edit them.

In this article, I will discuss some of the basic concepts relating to the ODT file itself. I will not discuss how to use OpenOffice.org Writer; the lessons needed to gain efficient competency with any word processing application would provide ample material to fill a book.

Opening the ODT File

To get started, you of course need an ODT file. Once you have one, unzipping the file gives you (among other things) four XML files:

  • content.xml: The actual content of a document.
  • meta.xml: Meta-data such as creation date, editor, and statistics (word count, and so forth).
  • settings.xml: OpenOffice.org program settings and preferences local to the document itself.
  • styles.xml: Formatting styles (for paragraphs, characters, and so on) defined by OpenOffice.org and by the author.
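
Because the ODT file is an ordinary zip archive, any zip library can open it. The following is a minimal sketch using Python's standard zipfile module; the function names are made up for this illustration and are not part of any OpenOffice.org API.

```python
import zipfile

def list_odt_members(odt):
    """Return the names of all entries in an ODT file (path or file object)."""
    with zipfile.ZipFile(odt) as archive:
        return archive.namelist()

def read_odt_member(odt, name="content.xml"):
    """Return the raw bytes of one entry, content.xml by default."""
    with zipfile.ZipFile(odt) as archive:
        return archive.read(name)
```

Calling list_odt_members on a document saved from Writer should show the four XML files listed above alongside whatever else the document needs.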

Additional files and directories will turn up (some of them depending on just what is in the document, such as a possible "Pictures" directory). However, for the purposes of this article, I will mainly discuss the content.xml and styles.xml files. The main reason for dismissing the presence of the additional files is this: If you're reading or editing material in an existing ODT file, their presence does not generally matter anyway, and if you are creating a new document programmatically, the simplest way is to "edit" an empty template file and save it to a new name. This is perhaps the safest strategy to use because it lets you focus entirely on the content of the document you want to process.
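
The "edit an empty template" strategy might look something like the sketch below, which copies every entry from a template archive into a new one and swaps in a replacement content.xml. The function name and its parameters are hypothetical, and the sketch assumes you have already produced valid content.xml bytes by some other means.

```python
import zipfile

def write_from_template(template, output, new_content_xml):
    """Copy every entry of the template ODT into output, swapping content.xml.

    new_content_xml is the replacement document content, as bytes.
    """
    with zipfile.ZipFile(template) as src, zipfile.ZipFile(output, "w") as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == "content.xml":
                data = new_content_xml
            # Reusing the original ZipInfo keeps each entry's position and
            # compression method (the "mimetype" entry is normally first
            # and stored uncompressed).
            dst.writestr(info, data)
```

Copying the entries in their original order, rather than building the archive from scratch, means the supporting files never have to be understood in detail.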

Content

The content.xml file, as mentioned above, is the real meat of an ODT file: Your actual document material is stored in this file. Feel free to open the content.xml file in any text or, better yet, XML editor you have available. You should see something similar to this:

<office:document-content ... >
   <office:automatic-styles>
      ...
   </office:automatic-styles>
   <office:body>
      <office:text>
         <!-- YOUR STUFF HERE -->
      </office:text>
   </office:body>
</office:document-content>

Note that I am taking some liberties to omit the elements I won't discuss, and I am sure you will forgive me for not listing all twenty-two namespace declarations in the root element. Once you have navigated to this point, note that the XML structure of your actual document material is stored in just a handful of "block-level" elements:

<text:h ... >Your Heading Here</text:h>
<text:p ... >Some paragraph of stuff here.</text:p>
<text:list ... >
   <text:list-item ... >
      <text:p ... >First list item text here.</text:p>
   </text:list-item>
</text:list>

The "inline-level" (that is, the character-level) formatting entities occur in span elements:

<text:p ... >Some <text:span ... >fancy text</text:span> here.</text:p>

Aside from the container elements for tables and images, you know enough of the basic document structure to read or edit an OpenOffice.org file. Granted, there are still a few things worth knowing. Notably, in the above snippet examples, the ellipses are omitting the rather important "text:style-name" attribute that defines the formatting information for the material contained in that element.
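
To make the structure concrete, here is a small sketch that parses content.xml with Python's xml.etree.ElementTree and reports each block-level element directly under office:text along with its text:style-name attribute. The function name is an invention for this example. Expect real documents to contain a few additional child elements that this article does not cover.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as declared in the root element of content.xml.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}
STYLE_NAME = "{%s}style-name" % NS["text"]

def block_elements(content_xml):
    """Yield (local tag name, style name) for each direct child of office:text."""
    root = ET.fromstring(content_xml)
    body_text = root.find("office:body/office:text", NS)
    for child in body_text:
        local_name = child.tag.split("}", 1)[-1]  # drop the "{namespace}" prefix
        yield local_name, child.get(STYLE_NAME)
```

For a document with one heading, one paragraph, and one list, this yields entries such as ("h", "Heading_20_1") followed by one "p" and one "list" entry.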

Before focusing attention on formatting styles, however, I want to mention a couple of other pieces of information. Like other word processors, Writer allows a "line break" that puts the cursor at the beginning of the next line without starting a new paragraph. The line break is an empty element placed inside the paragraph:

<text:p ... >Blah blah blah<text:line-break/>yadda yadda.</text:p>

You can get a line break in Writer by pressing SHIFT+ENTER instead of just ENTER.
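
When extracting plain text from a paragraph, the line-break and span elements both need attention, because the text is split across the element's own text, its children, and the "tail" text that follows each child. The sketch below shows one way to flatten a paragraph; the function name is hypothetical, and it deliberately ignores other inline elements (such as those for tabs and repeated spaces), which are outside the scope of this article.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
LINE_BREAK = "{%s}line-break" % TEXT_NS

def paragraph_text(element):
    """Flatten a text:p or text:h element into a plain string."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == LINE_BREAK:
            parts.append("\n")                   # SHIFT+ENTER in Writer
        else:
            parts.append(paragraph_text(child))  # e.g. a nested text:span
        parts.append(child.tail or "")           # text following the child
    return "".join(parts)
```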

Also, the "list item" element above shows a single paragraph element. However, multiple paragraph elements are possible here. (Visually, you get a separate paragraph without an additional bullet or number.) You can get these in Writer in the following way: When you type text in a given list item, pressing ENTER takes you to the next list item; however, immediately pressing BACKSPACE once returns you to the previous list item but leaves you in a new paragraph. (And immediately pressing BACKSPACE a second time takes you out of the list mode entirely, returning you to a new standard paragraph.)

Finally, as far as the content.xml file is concerned, bulleted and numbered lists are both just lists. The formatting style of the given list and list items determines whether a bullet or a number is used to represent it in the interface to the author.

Formatting Styles

Defined styles

One thing that is not obvious to casual users of Writer is this: The most "correct" way of formatting your content in OpenOffice.org involves the use of pre-defined or user-defined styles in the "Styles and Formatting" tool. From here on, I will simply say "defined style" to mean both pre-defined and user-defined styles. There are style families for page-level entities (such as page and margin sizes), paragraph-level entities (which include headings and titles), character-level entities (such as emphasis on just one word in a paragraph), and list entities. Note there are toolbar buttons in the Styles and Formatting tool window of OpenOffice.org that correspond to these families. The actual definitions of these defined styles (which font to use, what color the background is, and the like) are stored in the styles.xml file.

Moreover, defined styles behave in an inheritance fashion within their family: A given defined style is related to a parent style. In nearly object-oriented fashion, a change to a top-level defined style formatting entity (such as text color) propagates downward through related descendant styles until those descendant styles override that formatting entity. Direct and controlled use of defined styles can give some nice control and semantic meaning to content that is otherwise totally lost in the hapless application of the usual formatting buttons. Sometimes, it is important to know "why" something is in a red font face, and a defined style preserves and conveys this meaning ... not to mention gives you one central location to change all the instances in the document to a blue font face instead.

The name of the defined style in the styles.xml file generally matches the name the author sees in OpenOffice.org's interface. The biggest caveat to that statement is that special characters, including the space character, are converted to their hex-code value representation, which is then surrounded by underscores. (There is also the notable exception that the style seen as "Default" in the user interface is labeled as "Standard".) In other words, if an author applied the "Heading 1" defined style to a paragraph of text, the content.xml file would include:

<text:h text:style-name="Heading_20_1" ... >Your Heading Here</text:h>
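
The name conversion can be sketched in a few lines of Python. Note the assumptions: the article only says that "special characters, including the space character" are converted, so the exact set of characters escaped below (everything except ASCII letters, digits, periods, and hyphens) is a guess, as is the lowercase hex formatting. Both function names are made up for this example.

```python
import re

def encode_style_name(display_name):
    """Turn a user-interface style name into its assumed internal form.

    Assumption: every character other than ASCII letters, digits, "." and
    "-" is replaced by its hex code point surrounded by underscores.
    """
    return re.sub(r"[^A-Za-z0-9.-]",
                  lambda m: "_%x_" % ord(m.group()),
                  display_name)

def decode_style_name(internal_name):
    """Reverse the conversion: "_20_" becomes a space, and so on."""
    return re.sub(r"_([0-9a-fA-F]+)_",
                  lambda m: chr(int(m.group(1), 16)),
                  internal_name)
```

With these, encode_style_name("Heading 1") gives "Heading_20_1" and decode_style_name reverses it. Remember that the "Default" style is a special case and is stored as "Standard", which no character-level conversion will tell you.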

The "Heading_20_1" style, as mentioned above, is detailed in the styles.xml file. These details include:

  • That it belongs to the "paragraph" family.
  • That its parent style is a style simply called "Heading" (which is a valid style to be used in the user interface, but which functions mostly like an abstract base class in programming parlance).
  • That, when the author presses ENTER, the style for the next section of text will automatically be set to the Text Body style.
  • That its text properties involve being bold and sized at 115% of whatever the parent style (Heading, in this case) is set to.

... and so on. In other words, all the values and settings necessary to reconstitute the look, feel, and behavior of that style (and its parent styles all the way back to the Standard/Default style) are present in the styles.xml file.
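
As an illustration of walking that inheritance chain, the sketch below reads styles.xml and follows each style's parent reference until it runs out. It assumes the styles are recorded as style:style elements carrying style:name, style:family, and style:parent-style-name attributes, which is how they appear when you open the file; the function and helper names are invented here.

```python
import xml.etree.ElementTree as ET

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"

def _attr(local_name):
    """Build the namespace-qualified name ElementTree expects."""
    return "{%s}%s" % (STYLE_NS, local_name)

def parent_chain(styles_xml, style_name, family="paragraph"):
    """Return [style_name, its parent, grandparent, ...] within one family."""
    root = ET.fromstring(styles_xml)
    parents = {}
    for style in root.iter(_attr("style")):
        if style.get(_attr("family")) == family:
            parents[style.get(_attr("name"))] = style.get(_attr("parent-style-name"))
    chain = []
    current = style_name
    # Stop at a style with no parent; the membership test guards against loops.
    while current is not None and current not in chain:
        chain.append(current)
        current = parents.get(current)
    return chain
```

For the heading discussed above, the result would be expected to read "Heading_20_1", then "Heading", then "Standard".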

Automatic styles

Here's the minor rub: Any time a document author drags through a paragraph or series of characters and uses a toolbar button (the Bold button, for example), they are not applying a defined style but are instead implicitly creating an "automatic style." Conceptually speaking, automatic styles and defined styles store the same kind of information. However, while the pre-defined and user-defined styles are stored in the styles.xml file, the automatic styles are stored in the content.xml file. (Go back to the XML snippet I showed for the content.xml file above and you'll see where the automatic styles element is located.) Even so, at some point (which is likely immediately), the automatic style references a defined "parent" style which is, as I've mentioned, present in the styles.xml file.

As for their style names, automatic styles are created by combining a letter (such as "P" for paragraph style) with some integer. There is nothing particularly "special" about the naming convention ... it just seems to be a simple thing OpenOffice.org itself does when it saves information out to a file. You can carefully change the "P1" style to "foo_bar" and, so long as you update all such instances, the file should work. However, OpenOffice.org will likely rename it for you next time. (Beware the assumption that you can programmatically create an arbitrarily named automatic style and have its name survive a few editing sessions.)

What this means is that, if you want to (programmatically) understand the formatting applied to some piece of text in the content.xml file, you'll probably have to first read through the automatic style and then trace back into the defined styles.
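
That trace might be sketched as follows: start at the automatic style in content.xml, collect its text properties, then keep following parent references into styles.xml, letting the nearest definition of each property win. This is a simplified illustration with a hypothetical function name. It assumes text formatting lives in a style:text-properties child of each style:style element, it ignores style families (so identically named styles in different families would collide), and it returns relative values such as "115%" unevaluated.

```python
import xml.etree.ElementTree as ET

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
STYLE = "{%s}style" % STYLE_NS
NAME = "{%s}name" % STYLE_NS
PARENT = "{%s}parent-style-name" % STYLE_NS
TEXT_PROPS = "{%s}text-properties" % STYLE_NS

def effective_text_properties(content_xml, styles_xml, style_name):
    """Merge text properties from a style and all of its ancestors."""
    styles = {}
    # Defined styles first, then automatic styles from content.xml.
    for document in (styles_xml, content_xml):
        for style in ET.fromstring(document).iter(STYLE):
            styles[style.get(NAME)] = style
    resolved = {}
    seen = set()
    current = style_name
    while current in styles and current not in seen:
        seen.add(current)
        style = styles[current]
        props = style.find(TEXT_PROPS)
        if props is not None:
            for key, value in props.attrib.items():
                resolved.setdefault(key, value)  # the closest style wins
        current = style.get(PARENT)
    return resolved
```

The keys of the returned dictionary are namespace-qualified attribute names in ElementTree's "{uri}local" form.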
