Figure 2: DOM tree example
Specifically, you’ll dissect the tree1.c example program and identify some common programming paradigms it uses. The purpose of this program is to parse a file into a tree, use xmlDocGetRootElement() to get the root element, and then walk the document and print all the element names in document order. This is about the easiest non-trivial sort of thing you can do in XML. For simplicity’s sake, you’ll assume that the XML file you want to parse is the first argument on the command line and that output goes to stdout (console). The program listing follows:
 1  #include <stdio.h>
 2  #include <libxml/parser.h>
 3  #include <libxml/tree.h>
 4  #include <stdlib.h>
 5  static void print_element_names(xmlNode * a_node)
 6  {
 7      xmlNode *cur_node = NULL;
 8
 9      for (cur_node = a_node; cur_node; cur_node = cur_node->next) {
10          if (cur_node->type == XML_ELEMENT_NODE) {
11              printf("node type: Element, name: %s\n", cur_node->name);
12          }
13          print_element_names(cur_node->children);
14      }
15  }
16
17  int main(int argc, char **argv)
18  {
19      xmlDoc *doc = NULL;
20      xmlNode *root_element = NULL;
21
22      if (argc != 2) return(1);
23
24      LIBXML_TEST_VERSION  // Macro to check API for match with the DLL we are using
25
26      /* parse the file and get the DOM */
27      if ((doc = xmlReadFile(argv[1], NULL, 0)) == NULL) {
28          printf("error: could not parse file %s\n", argv[1]);
29          exit(-1);
30      }
31
32      /* Get the root element node */
33      root_element = xmlDocGetRootElement(doc);
34      print_element_names(root_element);
35      xmlFreeDoc(doc);       // free document
36      xmlCleanupParser();    // Free globals
37      return 0;
38  }
To run the program, libxml2.dll, iconv.dll, and zlib.dll must all be locatable in the path or the current directory. To compile your test program, you use the following command line:
cl tree1.c /MD /Id:\iconv-1.9.2.win32\include /Id:\libxml2-2.6.30+.win32\include /link /libpath:d:\libxml2-2.6.30+.win32\lib libxml2.lib
You feed it the following test data file as input:
<!DOCTYPE doc [
<!ELEMENT doc (src | dest)*>
<!ELEMENT src EMPTY>
<!ELEMENT dest EMPTY>
<!ATTLIST src ref IDREF #IMPLIED>
<!ATTLIST dest id ID #IMPLIED>
]>
<doc>
  <src ref="foo"/>
  <dest id="foo"/>
  <src ref="foo"/>
</doc>
Which yields the following as output:
node type: Element, name: doc
node type: Element, name: src
node type: Element, name: dest
node type: Element, name: src
The program starts on Line #17 in the main() function. The LIBXML_TEST_VERSION macro is a safety check to make sure the libxml2.dll you are using is in fact compatible with the version you compiled against (in other words, the headers you used).
The actual parsing takes place inside xmlReadFile() on Line #27, which returns a pointer to an xmlDoc object if successful and NULL otherwise. The first parameter is the local filename or an HTTP document path (URL). The second parameter names the document encoding; the program passes NULL, which leaves the parser to detect the encoding itself. The last parameter is a set of option flags combined with bitwise OR, using one or more of the following:
XML_PARSE_RECOVER | Recover on errors |
XML_PARSE_NOENT | Substitute entities |
XML_PARSE_DTDLOAD | Load the external subset |
XML_PARSE_DTDATTR | Default DTD attributes |
XML_PARSE_DTDVALID | Validate with the DTD |
XML_PARSE_NOERROR | Suppress error reports |
XML_PARSE_NOWARNING | Suppress warning reports |
XML_PARSE_PEDANTIC | Pedantic error reporting |
XML_PARSE_NOBLANKS | Remove blank nodes |
XML_PARSE_SAX1 | Use the SAX1 interface internally |
XML_PARSE_XINCLUDE | Implement xinclude substitution |
XML_PARSE_NONET | Forbid network access |
XML_PARSE_NODICT | Do not reuse the context dictionary |
XML_PARSE_NSCLEAN | Remove redundant namespace declarations |
XML_PARSE_NOCDATA | Merge CDATA as text nodes |
XML_PARSE_NOXINCNODE | Do not generate XINCLUDE START/END nodes |
XML_PARSE_COMPACT | Compact small text nodes; no modification of the tree allowed afterwards (will possibly crash if you try to modify the tree) |
Most notably, XML_PARSE_COMPACT can make up for some of the memory overhead that DOM parsers are known for, provided you need the XML tree for read-only purposes. Note also that you can turn on DTD validation at this point.
Next, in Line #33, you call xmlDocGetRootElement(doc) which, as you would expect, gives you the top of the tree that you then can traverse easily using the recursive print_element_names() function. Inside print_element_names(), Lines #5-15, there are only two things to do for each node in the sibling list: if the node is an XML_ELEMENT_NODE, you print its name; then, whatever its type, you call yourself again, this time with the children of the current node. There are actually 21 different node types, so it’s worth seeing the complete list of choices:
XML_ELEMENT_NODE = 1
XML_ATTRIBUTE_NODE = 2
XML_TEXT_NODE = 3
XML_CDATA_SECTION_NODE = 4
XML_ENTITY_REF_NODE = 5
XML_ENTITY_NODE = 6
XML_PI_NODE = 7
XML_COMMENT_NODE = 8
XML_DOCUMENT_NODE = 9
XML_DOCUMENT_TYPE_NODE = 10
XML_DOCUMENT_FRAG_NODE = 11
XML_NOTATION_NODE = 12
XML_HTML_DOCUMENT_NODE = 13
XML_DTD_NODE = 14
XML_ELEMENT_DECL = 15
XML_ATTRIBUTE_DECL = 16
XML_ENTITY_DECL = 17
XML_NAMESPACE_DECL = 18
XML_XINCLUDE_START = 19
XML_XINCLUDE_END = 20
XML_DOCB_DOCUMENT_NODE = 21
In addition to the “type” value, the xmlNode contains all the critical information about each node in the tree, including navigational pointers (next, prev, parent, children), pointers to the namespace, the properties list, the node name, and of course the content itself (if any).
Conclusion
In an introductory article such as this, you can only hope to scratch the surface of what a versatile tool such as libxml2 can do for you. Libxml2 supports DTDs, Schemas, XPath, internationalization, and lots more that can make your application XML standards-compliant. As mentioned before, libxml2 comes with many language bindings so you can work in C, C++, C#, Python, Perl, or whatever you need to get the job done. Best of all, it’s freely available to integrate into your apps today.
About the Author
Victor Volkman has been writing for C/C++ Users Journal and other programming journals since the late 1980s. He is a graduate of Michigan Tech and a faculty advisor board member for Washtenaw Community College CIS department. Volkman is the editor of numerous books, including C/C++ Treasure Chest and is the owner of Loving Healing Press. He can help you in your quest for open source tools and libraries; just drop an e-mail to sysop@HAL9K.com.