Introduction
VTD-XML is a suite of innovative XML processing technologies centered around a non-extractive XML parsing technique called Virtual Token Descriptor (VTD). VTD-XML provides interfaces for C, C#, and Java. It solves a number of problems inherent in the existing DOM and SAX models in a way that makes it ideal for Service Oriented Architecture (SOA) applications.
Depending on the perspective, VTD-XML can be viewed as one of the following:
- A “document-centric” XML parser
- A native XML indexer or a file format that uses binary data to enhance the text XML
- An incremental XML content modifier
- An XML slicer/splitter/assembler
- An XML editor/eraser
- A way to port XML processing onto a chip
VTD-XML is highly memory efficient; benchmarks show a typical overhead of only 1.3 to 1.5 times the size of an XML document (in bytes) to achieve random access. In benchmarks, VTD-XML also parses 1.5 to 2.0 times faster than SAX parsers. A report comparing the performance of the C, C#, and Java interfaces is available. In this article, you’ll look at the theory behind VTD-XML and some sample apps that show off its best features.
How It Works
A typical DOM parser allocates one unit of memory for each token in the XML input file tree. This is costly in both memory (due to fragmentation) and time (because of the sheer quantity of allocation requests). VTD-XML simply keeps a verbatim, unparsed copy of the XML in memory and then adds an “index” in front of it to allow for simple navigation and access. Because reading an XML file is by definition a read-only process, you do not need the flexibility of per-token allocation at this point in the parsing. Finally, keep in mind that VTD-XML is technically a processing model rather than an API, and you can build your own API on top of a VTD-XML model. Through the remainder of this article, I’ll demonstrate the XimpleWare implementation available from SourceForge.
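To make the "non-extractive" idea concrete, here is a minimal sketch in which a token is just an offset and a length into the untouched document buffer. The record layout (32-bit offset, 32-bit length) and every name in it (token_record, make_record, token_equals, token_copy) are assumptions made for illustration; they are not the real VTD record format or part of the VTD-XML API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record layout, for illustration only (not the real VTD
   layout): high 32 bits = byte offset into the document,
   low 32 bits = token length in bytes. */
typedef uint64_t token_record;

token_record make_record(uint32_t offset, uint32_t length) {
    return ((uint64_t)offset << 32) | (uint64_t)length;
}

uint32_t record_offset(token_record r) { return (uint32_t)(r >> 32); }
uint32_t record_length(token_record r) { return (uint32_t)(r & 0xFFFFFFFFu); }

/* Compare a token against a C string without copying it out of the document. */
int token_equals(const char *doc, token_record r, const char *s) {
    size_t n = strlen(s);
    return n == record_length(r) &&
           memcmp(doc + record_offset(r), s, n) == 0;
}

/* Copy a token into a caller-supplied buffer only when a string is needed. */
void token_copy(const char *doc, token_record r, char *out, size_t out_size) {
    assert(out_size > record_length(r));   /* room for the terminating NUL */
    memcpy(out, doc + record_offset(r), record_length(r));
    out[record_length(r)] = '\0';
}
```

With the document "<a><b>hello</b></a>", make_record(6, 5) describes the text "hello" without allocating anything per token; an array of such fixed-size records is the kind of "index" that can sit alongside the verbatim XML.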
Making libvtd-xml.lib for Visual Studio 2005
VTD-XML 2.2.1 for C (as released on 10/27/2007) includes only makefiles designed for GNU CC (GCC). To build for Visual Studio 2005, I just improvised by compiling all the code in the directory and shoving it into a library, like this:
del *.lib
lib /out:libvtd-xml.lib arrayList.obj autoPilot.obj binaryExpr.obj bookMark.obj contextBuffer.obj decoder.obj elementFragmentNs.obj fastIntBuffer.obj fastLongBuffer.obj filterExpr.obj funcExpr.obj helper.obj indexHandler.obj intHash.obj l8.tab.obj lex.yy.obj literalExpr.obj locationPathExpr.obj nodeRecorder.obj numberExpr.obj pathExpr.obj RSSReader.obj textIter.obj unaryExpr.obj unionExpr.obj vtdGen.obj vtdNav.obj XMLChar.obj XMLModifier.obj
Hello, VTD-XML!
In the longstanding Computer Science tradition of showing the simplest possible example first, you’ll start off by looking at the shortest C program you can reasonably write to parse an XML file. Its task: Find one element in an XML file and echo its text to stdout.
 1  #include "everything.h"
 2  struct exception_context the_exception_context[1];
 3  int main(){
 4      exception e;
 5      VTDGen *vg = NULL;
 6      VTDNav *vn = NULL;
 7      UCSChar *string = NULL;
 8      Try{
 9          vg = createVTDGen();
10          if (parseFile(vg,TRUE,"input.xml")){          // TRUE = namespace-aware
11              vn = getNav(vg);
12              if (toElementNS(vn,FIRST_CHILD,L"someURL",L"b")){
13                  int i = getText(vn);                  // index of the text node, or -1
14                  if (i!=-1){
15                      string = toString(vn,i);
16                      wprintf(L"the text node value is %d ==> %s \n", i,string);
17                      free(string);
18                  }
19              }
20              free(vn->XMLDoc);
21          } else {
22              free(vg->XMLDoc);
23          }
24      }Catch(e){ // handle various types of exceptions here
25      }
26      freeVTDGen(vg);
27      freeVTDNav(vn);
28      return 0;
29  }
The first step is to create yourself a VTD Generator instance via createVTDGen() as you do in Line 9. The VTDGen object parses the DTD but doesn’t resolve declared entities. Next, you use the VTDGen object to parse the XML input file. Your input file for this test is appropriately simple:
<ns1:a xmlns:ns1="someURL">
  <ns1:b> hello world! </ns1:b>
</ns1:a>
Listing 1: Input.xml for the hello program
You then pass in the input filename and a flag indicating whether the parser should be namespace-aware to parseFile(). For Internet feeds, such as RSS files, you can use parseHttpUrl() in a similar manner, except that you pass in the “http://..” URI.
Next, you initialize a VTDNav navigation cursor object with getNav(); now, you can use toElementNS() to begin traversal. After the cursor itself, the first parameter of toElementNS() is the direction of travel, which can be one of the enumerated values ROOT, PARENT, FIRST_CHILD, LAST_CHILD, NEXT_SIBLING, or PREV_SIBLING. The second parameter is the namespace URL to match, which here is “someURL”, the value bound to the ns1 prefix in input.xml. The third and final parameter is the local name of the element of interest, which in the example is “b” (that is, <ns1:b>).
Assuming the navigation worked, you then can call getText() to get the VTD index of the text node and then toString() to pull out the actual node data. In accordance with your input.xml, above, the output when run from the DOS prompt is:
C:\ximpleware\demo> hello_world.exe
the text node value is 5 ==> hello world!
The whole code block is surrounded by a Try/Catch macro, an approximation of C++ style exception handling, courtesy of Adam M. Costello’s cexcept.
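If you have not seen exception handling emulated in C before, the following sketch shows the general setjmp/longjmp technique that such macros can be built on. It is an illustration under stated assumptions, not cexcept's implementation: the TRY/CATCH/END_TRY macros, throw_code(), and run_step() are hypothetical names, and the real library uses the Try{ ... }Catch(e){ ... } syntax seen above and carries a richer exception type.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* All names here are hypothetical; this is not cexcept's implementation. */
static jmp_buf *active_handler = NULL;  /* innermost TRY currently running */
static int pending_code = 0;            /* value passed to throw_code() */

#define TRY                                         \
    {                                               \
        jmp_buf env_;                               \
        jmp_buf *outer_ = active_handler;           \
        active_handler = &env_;                     \
        if (setjmp(env_) == 0) {

#define CATCH(e)                                    \
            active_handler = outer_;                \
        } else {                                    \
            active_handler = outer_;                \
            (e) = pending_code;

#define END_TRY                                     \
        }                                           \
    }

/* Jump to the innermost CATCH, carrying an error code. */
void throw_code(int code) {
    assert(active_handler != NULL);  /* throwing outside TRY is a bug */
    pending_code = code;
    longjmp(*active_handler, 1);
}

/* Example: returns 0 on success or the thrown code on failure. */
int run_step(int should_fail) {
    int e = 0;
    TRY
        if (should_fail)
            throw_code(42);
    CATCH(e)
        return e;          /* reached only after throw_code() */
    END_TRY
    return 0;
}
```

One caveat of this simplified version: leaving the TRY body with return or goto would skip the line that restores the outer handler, so the body must always run to its end or throw.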
Inserting an Attribute
Of course, to do anything really useful in XML, you need more than the ability to simply read XML. As mentioned in the intro, VTD-XML can splice, dice, insert, and even build templates. Because space is limited here, you’ll just push into one more example: inserting an attribute into an input XML file and dumping out an output XML file. Basically, your goal is to transform input.xml
<a > <b> hello world! </b> </a>
into new.xml
<a > <b attr1='val'> hello world! </b> </a>
Just to make things interesting, you’re going to work with a pre-indexed XML file, which you will denote as .VXL (think “VTD-XML”). The advantage of indexing a file is that you save tons of time by parsing the input file just once in scenarios that might involve multiple passes through the input.
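Why does a stored index save time? Because fixed-size records can be written to disk and read back without any parsing at all. The sketch below illustrates that idea with a made-up container format; it is not the real .vxl layout, and save_indexed() and load_indexed() are hypothetical helpers rather than VTD-XML functions. Values are written in the machine's native byte order, so the file is only portable between like machines.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical on-disk format, NOT the real .vxl layout:
   [uint32 doc_len][uint32 n_records][doc bytes][n_records x uint64]. */
int save_indexed(FILE *f, const char *doc, uint32_t doc_len,
                 const uint64_t *records, uint32_t n_records) {
    assert(f != NULL && doc != NULL && records != NULL);
    if (fwrite(&doc_len, sizeof doc_len, 1, f) != 1) return -1;
    if (fwrite(&n_records, sizeof n_records, 1, f) != 1) return -1;
    if (fwrite(doc, 1, doc_len, f) != doc_len) return -1;
    if (fwrite(records, sizeof *records, n_records, f) != n_records) return -1;
    return 0;
}

/* Read the document and its records back; no XML parsing takes place.
   On success the caller owns *doc (NUL-terminated) and *records. */
int load_indexed(FILE *f, char **doc, uint32_t *doc_len,
                 uint64_t **records, uint32_t *n_records) {
    assert(f && doc && doc_len && records && n_records);
    *doc = NULL;
    *records = NULL;
    if (fread(doc_len, sizeof *doc_len, 1, f) != 1) return -1;
    if (fread(n_records, sizeof *n_records, 1, f) != 1) return -1;
    *doc = malloc((size_t)*doc_len + 1);
    *records = malloc(((size_t)*n_records + 1) * sizeof **records);
    if (*doc == NULL || *records == NULL ||
        fread(*doc, 1, *doc_len, f) != *doc_len ||
        fread(*records, sizeof **records, *n_records, f) != *n_records) {
        free(*doc);
        free(*records);
        *doc = NULL;
        *records = NULL;
        return -1;
    }
    (*doc)[*doc_len] = '\0';
    return 0;
}
```

Loading is a handful of bulk reads whose cost depends on file size rather than on markup complexity, which is why a second or third pass over the same input can skip the parser entirely. The article's own program, which relies on the library's real index functions, follows.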
 1  #include "everything.h"
 2  struct exception_context the_exception_context[1];
 3
 4  void makeIndexedFile(char *inputName, char *outputName){   // returns nothing, so declared void
 5      exception e;
 6      VTDGen *vg = NULL;
 7      Try{
 8          vg = createVTDGen();
 9          if (parseFile(vg, TRUE, inputName)){
10              writeIndex2(vg, outputName);                   // write the XML plus its index
11          }
12          free(vg->XMLDoc);
13      }Catch(e){
14      }
15      freeVTDGen(vg);
16  }
17
18  int main(){
19      exception e;
20      VTDGen *vg = NULL;
21      VTDNav *vn = NULL;
22      AutoPilot *ap = NULL;
23      XMLModifier *xm = NULL;
24      FILE *f = NULL;
25      UCSChar *string = NULL;
26      int i;
27
28      makeIndexedFile("input.xml", "output.vxl");
29      if ((f = fopen("output.vxl","rb")) == NULL)
30          return 0;
31      Try{
32          xm = createXMLModifier();
33          ap = createAutoPilot2();
34          selectXPath(ap,L"/a/b");
35          vg = createVTDGen();
36          vn = loadIndex (vg,f);                             // no re-parsing needed
37          bind(ap,vn);
38          bind4XMLModifier(xm,vn);
39          while((i=evalXPath(ap))!=-1){                      // one pass per matching node
40              insertAttribute(xm,L" attr1='val'");
41          }
42          output2(xm,"new.xml");
43          free(vn->XMLDoc);
44      }Catch(e){ // handle various types of exceptions here
45      }
46      fclose(f);
47      freeAutoPilot(ap);
48      freeXMLModifier(xm);
49      freeVTDGen(vg);
50      freeVTDNav(vn);
51      return 0;
52  }
Briefly, you convert input.xml into a VTD indexed XML file in the makeIndexedFile() function up top. Later, in Line 29, you can just open it with a plain old filehandle and begin using it with just a single call to loadIndex() in Line 36. The interesting parts are where you set up an autopilot to traverse the XML input via an XPath created by selectXPath() in Line 34. Finally, you are able to call insertAttribute() each time evalXPath() gives you a hit and then, when you are done, you just call output2() to dump out the newly transformed XML file.
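Conceptually, an incremental modification like this one never rebuilds the document from a tree. When the original bytes are kept verbatim, the output is just the bytes before the insertion point, the new text, and the bytes after it. The following sketch shows that splice in plain C; splice_insert() is a hypothetical helper written for this illustration, and it makes no claim about how XMLModifier is implemented internally.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a new NUL-terminated buffer equal to doc with `insert` spliced in
   at byte `offset`. The caller frees the result; doc itself is untouched.
   Hypothetical helper, not part of VTD-XML. */
char *splice_insert(const char *doc, size_t offset, const char *insert) {
    size_t doc_len = strlen(doc);
    size_t ins_len = strlen(insert);
    char *out;
    assert(offset <= doc_len);
    out = malloc(doc_len + ins_len + 1);
    if (out == NULL) return NULL;
    memcpy(out, doc, offset);                      /* bytes before the cut */
    memcpy(out + offset, insert, ins_len);         /* the new text */
    memcpy(out + offset + ins_len, doc + offset,   /* the rest, plus the NUL */
           doc_len - offset + 1);
    return out;
}
```

Splicing " attr1='val'" at offset 5 of "<a><b>hello</b></a>" (just before the '>' that closes the <b> start tag) produces "<a><b attr1='val'>hello</b></a>". Everything outside the insertion point, including whitespace and formatting, is carried over byte for byte.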
Conclusion
My hope is that you are now sufficiently convinced of VTD-XML’s ease of use to try more. I confess that it has been one of the easier libraries to build and use of the many I have tried lately. I am indebted to Jimmy Zhang for his assistance in providing all the examples in this article.
About the Author
Victor Volkman has been writing for C/C++ Users Journal and other programming journals since the late 1980s. He is a graduate of Michigan Tech and a faculty advisory board member for the Washtenaw Community College CIS department. Volkman is the editor of numerous books, including C/C++ Treasure Chest, and is the owner of Loving Healing Press. He can help you in your quest for open source tools and libraries; just drop an e-mail to sysop@HAL9K.com.