http://www.developer.com/

Back to article

A Flexible, Compile Time, Configurable XML Parser


October 1, 2004

Who Needs Another Parser?

XML parsing is very standard business; there are now parsers written in every possible language, for every possible platform. Very often, these parsers are less suited for simple, repetitive tasks, when complexity usually takes its toll on performance—and cautious users tend to avoid monolithic, "catch-all" approaches when not required.

For a particular project, I needed a simple and highly configurable library, capable of parsing subsets of XML without paying the cost of extra features. And, as usually happens, I couldn't find an exact match; that's why another parser steps into the limelight.

It's All About State

XML parsing can be reduced to a small collection of states:

STATE REPRESENTATION OBSERVATIONS
START <  
END >  
SPECIAL ! Used for special behavior (comments, CDATA sections, and so forth)
END_FOUND / Used in conjunction wih START & END
EXIT_LOOP   Artificial

A simple finite state machine can be designed to take advantage of this observation, being practically encapsulated by a class like:

namespace XMLParserState
{
   enum STATE {EXIT_LOOP, END_FOUND, END, START, SPECIAL, LAST_STATE};
}
class NativeXMLParser
{
private:
   typedef XMLParserState::STATE (XMLParser::*FUNC_TYPE)() ;
   typedef std::map<int, FUNC_TYPE> States;
   States states_;
   //other private data
public:
   void parseSource()
   {
      XMLParserState::STATE state = XMLParserState::START;
      FUNC_TYPE f =0;
      while (state != XMLParserState::EXIT_LOOP)
      {
         f = states_[state];
         state = f ? (*this.*f)() : XMLParserState::START;
      }
   }
public:
   XMLParserState::STATE findStart();
   XMLParserState::STATE findEnd();
   XMLParserState::STATE endFound();
   XMLParserState::STATE specialFound();
private:
   //other private members
};

The state functions can be expressed like:

findStart
  1. if START is followed by '/', then endFound
  2. if START is followed by '!', then specialFound
  3. else findEnd
findEnd
  1. if END is preceded by '/', then findStart
  2. else specialFound
endFound
  1. process
  2. goto findStart
specialFound
  1. process
  2. goto findStart

This looks pretty straightforward and some readers might be inclined to yawn or express their boredom in different ways at this point. Please be assured that the interesting stuff follows!

SAX or DOM? Static or Dynamic Polymorphism?

As per design, NativeXMLParser is intended to be small, simple, and flexible. This means that:

  1. It has no ambition in supporting the whole XML standard (even if it is easy to extend it to add additional features)
  2. It is highly configurable; for example, supporting CDATA but not Comments, and so on. This point deserves some additional explanation. Suppose we have to parse a XML-based grammar centered on attributes, like this:
    <if i="1">
       <method name="some1"/>
    <else/>
       <method name="some"/>
    </endif>
    

    In this case, there is definitely no interest in more than the tags themselves. But what about:

    <if i="1">
       <method> someOne </method>
    <else/>
       <method> some </method>
    </endif>
    
    This time, the content has to be taken into account—and similar for CDATA and comments.
  3. It loosely supports both events-based and document model (not conformant).

Let's talk about implementing these features for one moment. The usual way of varying properties and behavior is runtime polymorphism. Many well-documented patterns and idioms can be used in such an endeavor [1]. Our perspective is a little bit different: NativeXMLParser may be used for highly repetitive tasks (see the XML grammar example mentioned earlier), where the virtual mechanism doesn't respond well in terms of performance [2]. As a result, NativeXMLParser is built using template-based techniques and static polymorphism.

Implementation

NativeXMLParser takes two template parameters:

template <class MODEL_TYPE, class SPECIAL_CONFIGURATION_TYPE>
class XMLParser;

MODEL_TYPE is the model (SAX, DOM, or something else) and the implementation simply takes advantage of the template Strategy pattern to vary the nature of the parser:

void parseXMLEvt(const std::string& s)
{
   typedef XMLParser<XMLEvtHandler, SpecialDataHandler> XI;
      //use XI
}
void parseXMLDOM(const std::string& s)
{
   typedef XMLParser<XMLDOM, SpecialDataHandler> XI;
      //use XI
}

XMLDOM and XMLEvtHandler are just specializations of an interface with the following signature:

struct I
{
   void setName(const std::string& name)
   {}
   void setContent(const std::string& content)
   {}
   XMLBit* getRoot()
   {}
   void addNewXMLBit()
   {}
   void setParent()
   {}
   void setParentToBit()
   {}
};

SPECIAL_CONFIGURATION_TYPE is by far the most interesting thing. As mentioned before, the functionality of this parser can be easily adjusted and SPECIAL_CONFIGURATION_TYPE does the trick.

The idea is simple: Having defined a set of properties to be adjusted at compile time, it is natural to look for a compile-time, type-safe mechanism to download them. Actually, the mechanism has been known for a long time [3] and what has been used in this implementation is a loop traversing the properties and downloading values, accordingly to the property specialization.

This is the set of properties:

namespace XMLPARSER_SPECIAL
{
   enum COMMANDS{
      CDATA=XML_DICTIONARY::CDATA_START/*'['*/,
      COMMENT=XML_DICTIONARY::COMMENT_BIT/* '-'*/, CONTENT
   };
};

and an index is used to quickly loop through them:

namespace XMLPARSER_SPECIAL_INDEX
{
   enum COMMANDS{
      CDATA, COMMENT, CONTENT, LAST_COMMAND
   };
};

Let's see how this works:

template < template <int Y> class E,  int T  >
struct XMLParserHandler
{
   template<class V>
   static int idx(V& v)
   {
      typedef  E<T> B;
      SpecialData sd;
      SpecialHandler<B::value>::getName(sd.name);
      SpecialHandler<B::value>::getEnd(sd.end);
      sd.offset = SpecialHandler<B::value>::getStartOffset();
      v[T] = sd;
      return XMLParserHandler< E, T+1>::idx(v);
   }
};
template <  template <int Y> class E>
struct XMLParserHandler<E, XMLPARSER_SPECIAL_INDEX::LAST_COMMAND  >
{
   template<class V>
   static int idx(V& v)
   {
      return 0;
   }
};

XMLParserHandler is a loop in disguise that's executed recursively until the exit condition is met (XMLParserHandler specialization returning zero). Interesting things happen during the looping process:

  • T is incremented.
  • There is a E<T> that changes at every iteration.
  • A SpecialHandler depending on E<T> collects information in a very simple structure of type SpecialData.
  • Finally, a container of type V stores a copy of the SpecialData instance.

The last step is to understand who E and SpecialHandler are expected to be:

template < int I>
struct SpecialHandler
{
   static void getName(std::string& name)
   { }
   static void getEnd(std::string& end)
   { }
   static int getStartOffset()
   {
      return 0;
   }
};
template < int Y>
struct Int2Type
{
   enum {
      value= Y
   };
};
template <int Y>
struct EXPER_ALL : public Int2Type<Y>
{};

SpecialHandler is specialized for all COMMANDS (actually XMLPARSER_SPECIAL_INDEX::COMMANDS). EXPER is a simple way to express whether or not a property should be included. For example:

As expected, EXPER_ALL considers all properties (COMMANDS) to be taken into account. But:

template <int Y>
struct EXPER_NO_CDATA: public EXPER_ALL<Y>
{};
template <>
struct EXPER_NO_CDATA<XMLPARSER_SPECIAL_INDEX::CDATA> :
       public EXPER_ALL<XMLPARSER_SPECIAL_INDEX::LAST_COMMAND>
{};

EXPER_NO_CDATA, as the name says, doesn't consider CDATA; that's why CDATA index is switched to LAST_COMMAND (a command that's never actually executed). Based on this extensible mechanism, we now can write code like this:

void parseXMLEvt(const std::string& s)
{
   typedef XMLParser<XMLEvtHandler, SpecialDataHandler> XI;
      XI::parse<XMLParserHandlerInit<EXPER_NO_CDATA> >(s);
}
void parseXMLDOM(const std::string& s)
{
   typedef XMLParser<XMLDOM, SpecialDataHandler> XI;
      XI::parse<XMLParserHandlerInit<EXPER_NO_CDATA> >(s);
}

EXPER family contains the following variations:

EXPER_ALL
EXPER_NO_CDATA
EXPER_CONTENT_ONLY
EXPER_BASIC (no CDATA, no comments, no content)
EXPER_NO_CONTENT
EXPER_NO_COMMENT

The code was tested with both VC++ 7.1 and g++ 3.3 compilers.

Download the Code

Download the code that accompanies this article here.

References

[1] Erich Gamma, Richard Helm, Ralph Johnson, and John M. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley 1994.

[2] Dov Bulka and David Mayhew. Efficient C++: Performance Programming Technique, Addison-Wesley 1999.

[3] T. Veldhuizen. "Using C++ template metaprograms," C++ Report Vol. 7 No. 4 (May 1995).

Sitemap | Contact Us

Thanks for your registration, follow us on our social networks to keep up-to-date