Processing XML With Perl Regular Expressions

XML is the hot item in the Internet world. There is movement in almost every language arena to incorporate XML into the language in some form. The reality is that you don’t have to wait for native XML support to be built into your favorite language to effectively use XML. This article describes how to use regular expressions to process XML. The code examples are all in Perl, but can be applied to any language that has regular expression support built in.

There are several ways to process XML using Perl. These include:

  • Processing the output of a parser, such as NSGMLS
  • Using the Perl XML::Parser module
  • Using regular expression searching to extract tags and data content

This article addresses using regular expression searching to process XML data.

Background

XML stands for eXtensible Markup Language. It is a metalanguage for defining markup languages; HTML is an example of a markup language. There are some general XML rules (discussed below) and there are many ways to process XML. XML has become popular because processing XML markup languages is generally straightforward. The key is the explicit delimiting of data. All of this adds up to a very good, system-neutral data format that is ideally suited for data exchange. Most vertical market XML initiatives are centered around data exchange. Example XML markup languages include:

  • Microsoft’s Channel Data Format (CDF),
  • Resource Description Framework (RDF),
  • Mathematical Markup Language (MathML),
  • Synchronized Multimedia Integration Language (SMIL), and
  • Scalable Vector Graphics (SVG)

This article presents a quick overview of XML, the minimum XML format rules you need to know to process XML data, Perl regular expression code to process XML files, and an example using the Perl code. The remainder of this section provides a quick XML overview.

What is XML

Let’s take a quick look at the explicit delimiting of data angle on XML. In pre-XML days, a popular data exchange format was called comma delimited. As the name implies, each comma implied the end of one field and the start of the next. This works reasonably well when the structure of the data is relatively simple, like a row in a database, and there are no errors. The more complicated the data structure, the more likely there will be an error. The problem is that the only generic error checking that can be done is checking that the right number of commas (fields) are present in the record. There is no way to check to see if the fields are present in the correct order and when a field is missing there is no way to tell which field is missing.

XML, on the other hand, explicitly marks each field. Each field, or element in XML terms, is made up of a start tag, a data content, and an end tag. The start tag is the field name in "<" and ">". The start tag for a field named AUTHOR would be <AUTHOR>. (I use field and element interchangeably in this article, but element is the correct terminology.) The end tag is marked with "</" and ">". So the AUTHOR end tag would be </AUTHOR>. The data content is everything in between. An example is:

<AUTHOR>Norm Smith</AUTHOR>

An element can also group like elements together. This is commonly referred to as a "wrapper" element. For example, breaking the <AUTHOR> element into first and last names would result in something like:

<AUTHOR>
<FIRST>Norm</FIRST>
<LAST>Smith</LAST>
</AUTHOR>

This explicit marking of each field makes it practical not only to detect that a field is missing, but also to identify exactly which field is missing. The generic tools that perform this checking, or validation, are called validating parsers.
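To make the point concrete, here is a minimal sketch of how a program might report which field is missing from a record. It is not a validating parser; the list of required fields and the sample record are assumptions made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical list of fields that every AUTHOR record must contain.
my @required = qw(FIRST LAST);

# Sample record in which the LAST element was left out.
my $record = "<AUTHOR><FIRST>Norm</FIRST></AUTHOR>";

# Keep the name of every required element that has no start/end tag pair.
my @missing = grep {
    my $name = $_;
    $record !~ /<$name>.*?<\/$name>/s;
} @required;

print "Missing fields: @missing\n";
```

With comma delimited data the same check could only say that the field count is wrong, not which field is absent.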

XML is also a subset of SGML. There are several differences between the two that I believe are significant:

  • The portions of SGML that make parsers hard to implement were left out of XML
  • The Document Type Definition is optional
  • XML tags are case sensitive
  • XML files must be well formed
  • Application emphasis is on data rather than documents

The Standard Generalized Markup Language (SGML) has been around for over 20 years. It became an ISO standard (ISO 8879) in 1986. Many SGML features, such as allowing end tags (and sometimes start tags) to be omitted, were included in the specification to make it easier to mark up documents manually, because there were no tools in the beginning. SGML tools have matured to the point that many of the features that support data entry shortcuts are no longer necessary. There are now a large number of SGML tools (see http://www.oasis-open.org/cover/sgml-xml.html for SGML and XML tools), so manual SGML data entry is rare. By eliminating rarely used features, the XML standard ends up very close to what most SGML developers used in real SGML applications. If you need a feature that was left out of XML, you simply fall back to SGML.

Well Formed XML

XML introduces the concept of well formed files. Well formed means that every element has both a start tag and an end tag and that elements are properly nested. A valid XML document has a DTD and passes validation using a validating parser. A well formed document may or may not be valid. XML applications that do not use a validating parser assume that the data is valid from an XML structural point of view.
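The start tag and end tag rule can be checked with a regular expression and a stack. The following is a rough sketch only: the subroutine name tags_balanced is my own, and the check ignores comments, CDATA sections, and processing instructions, so it is no substitute for a real parser.

```perl
use strict;
use warnings;

# Returns 1 when every start tag is closed by a matching end tag in the
# right order, 0 otherwise. Attributes and empty elements (<X/>) are allowed.
sub tags_balanced {
    my ($xml) = @_;
    my @open;    # stack of element names that are still open
    while ($xml =~ /<(\/?)([A-Za-z_][\w.-]*)[^<>]*?(\/?)>/g) {
        my ($is_end, $name, $is_empty) = ($1, $2, $3);
        if ($is_end) {
            # An end tag must match the most recently opened element.
            my $top = pop @open;
            return 0 unless defined $top && $top eq $name;
        } elsif (!$is_empty) {
            push @open, $name;
        }
    }
    # Anything left on the stack was never closed.
    return @open ? 0 : 1;
}

print tags_balanced("<A><B>text</B></A>") ? "balanced\n" : "not balanced\n";
print tags_balanced("<A><B>text</A></B>") ? "balanced\n" : "not balanced\n";
```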

One of the big differences between SGML and XML from an application standpoint is the concept of well formed documents. SGML does not require well formed documents. Start and end tags may be optional, depending on how an SGML DTD is written. SGML does require that SGML documents be valid. This implies that an SGML validating parser be part of every application.

XML Document Type Definitions

An XML markup language is defined by a Document Type Definition (DTD). The DTD is optional in XML and there is a proposal for representing the DTD as an XML file. We won’t go into the DTD in any detail here. A DTD defines:

  • The elements allowed in the XML document
  • The order of the elements in the document
  • Whether each element is optional or required
  • The minimum number of times an element can occur

XML applications that are "validating" read an XML DTD, then process the XML file against the DTD. A validating parser performs this processing. Both validating parsers and non-validating parsers generally have an interface that allows the application to simply process XML events, like the occurrence of a start tag, the data content, and the end tag. One popular parser interface is SAX. (See http://www.megginson.com/SAX/index.html for more details on SAX.)

As an old time SGML guy, I always write a DTD. However, I rarely use it in XML applications. I write the DTD for a number of reasons:

  • It forces me to carefully work out the relationship of elements.
  • I can parse XML files during development as a quick quality check to verify they are being generated correctly.
  • When an XML file does become corrupted, parsing the bad file usually speeds up tracking down data problems.
  • DTDs for data exchange generally process data from a database. Therefore, the DTD can be derived from the database schema. This makes writing the DTD a relatively easy task.

While I believe one of the leaps that XML makes from SGML is making the DTD optional, knowing what DTDs are and when to use them is still important.

XML Rules

The XML DTD is optional. This does not mean that there are no rules for XML file formats. An XML file is made up of:

  • The XML processing instruction
  • Optionally, a document type declaration
  • One or more elements
  • External references called entities

Each is described briefly below.

Processing Instructions

Processing instructions start with "<?" and end in "?>". They describe or pass information to the processing program. The XML processing instruction takes the form:

<?xml version="1.0" ?>

This must be the first thing in an XML file. There are several name="value" pairs of attributes defined. The processing instruction shown above is the minimum valid one.
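As a small illustration, a regular expression can confirm that the data starts with the XML processing instruction and pull out the version. This is a sketch under my own assumptions: the sample data is made up, and only the version attribute is examined.

```perl
use strict;
use warnings;

# Sample data; the root element name is only a placeholder.
my $xml = qq{<?xml version="1.0" ?>\n<BOOK-FILE></BOOK-FILE>\n};

my $version = "";
# \A anchors the match to the very start of the data.
# (["']) remembers which quote character opened the value, and \1
# requires the same character to close it.
if ($xml =~ /\A<\?xml\s+version\s*=\s*(["'])(.*?)\1/) {
    $version = $2;
}

print "XML version: $version\n";
```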

Document Type Declaration

The document type declaration is optional and necessary only when using a validating XML parser. It tells the parser where the DTD is located. The following is an example document type declaration:

<!DOCTYPE BOOKS SYSTEM "c:\dtds\book.dtd">

Where BOOKS is the root element and the DTD is contained in the file c:\dtds\book.dtd.
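A program that wants to know the root element name or the DTD location can pick both out of the declaration with one search. The sketch below handles only the simple SYSTEM form shown above, and the DTD file name in it is a placeholder.

```perl
use strict;
use warnings;

# Sample declaration; "book.dtd" is a placeholder file name.
my $decl = '<!DOCTYPE BOOKS SYSTEM "book.dtd">';

my ($root, $dtd) = ("", "");
# Group 1 is the root element name, group 2 is the DTD location.
if ($decl =~ /<!DOCTYPE\s+(\S+)\s+SYSTEM\s+"([^"]*)"/) {
    ($root, $dtd) = ($1, $2);
}

print "Root element: $root\n";
print "DTD file:     $dtd\n";
```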

Elements

Each element is made up of a start tag, data content, and an end tag. Both the start and end tags are required, but the data content may be empty. The rules for element tags are:

  • The start tag is the element name in "<" and ">".
  • The end tag is marked with "</" and ">".
  • The data content is everything in between the start and end tags.
  • Tag names are case sensitive.
  • Start tags may contain attributes, which describe something about the element content.
  • An attribute takes the form name="value". The quotation marks are required.
  • Elements that can contain no content are represented as the element name enclosed in "<" and "/>".
  • The first line of an XML file must contain the XML processing instruction.

The first element in an XML file is called the root element.
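The name="value" rule for attributes lends itself to a global search. Here is a minimal sketch that collects the attributes of a start tag into a hash. The tag is hypothetical (the BOOK element used later in this article has no attributes), and only double-quoted values are handled.

```perl
use strict;
use warnings;

# Hypothetical start tag with two attributes.
my $tag = '<BOOK id="b1" lang="en">';

my %attr;
# With the /g modifier, each pass through the loop picks up the next
# name="value" pair in the tag.
while ($tag =~ /([\w:.-]+)\s*=\s*"([^"]*)"/g) {
    $attr{$1} = $2;
}

foreach my $name (sort keys %attr) {
    print "$name = $attr{$name}\n";
}
```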

Comments

XML comments start with "<!--" and end with "-->". They may occur anywhere within an XML data file. An example is:

<!-- This is an XML comment -->
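Because comments may occur anywhere, it is often easiest to remove them before searching for elements. One way to do that, sketched here with made-up sample data, is a non-greedy substitution with the /s modifier so that comments spanning several lines are removed as well.

```perl
use strict;
use warnings;

# Sample data with a one-line comment and a comment that spans two lines.
my $xml = "<BOOK><!-- draft entry -->\n<TITLE>Example</TITLE><!-- multi\nline --></BOOK>";

# Copy the data, then delete every comment from the copy.
# .*? stops at the first "-->", and /s lets "." match line breaks.
(my $clean = $xml) =~ s/<!--.*?-->//gs;

print "$clean\n";
```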

Entities

Entities are data representation shortcuts. Entities may be used both in a DTD and in an element’s data content. Basically an internal entity is like #define in C and an external entity is like #include in C. Parsers usually make the data substitution/insertion automatically. Entities begin with an "&" and end in a ";" with the entity name (case sensitive) in between. For example:

&amp;
&lt;
&gt;

These three entities should be handled by all XML applications because they stand for characters that are reserved in XML. &amp; is the entity for the "&" character, &lt; should be used for "<", and &gt; represents ">". The XML standard predefines two more character entities, &apos; for the apostrophe and &quot; for the double quote.
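When no parser is doing the substitution for you, the reserved-character entities can be converted with a few search and replace lines. This is a sketch only: the subroutine name is my own, and it handles just the three entities discussed above.

```perl
use strict;
use warnings;

# Convert the three reserved-character entities back to plain characters.
sub decode_basic_entities {
    my ($text) = @_;
    $text =~ s/&lt;/</g;
    $text =~ s/&gt;/>/g;
    # The ampersand entity is converted last so that "&amp;lt;" ends up
    # as the text "&lt;" rather than "<".
    $text =~ s/&amp;/&/g;
    return $text;
}

print decode_basic_entities("AT&amp;T uses &lt;TITLE&gt; tags"), "\n";
```

The order of the substitutions matters; converting the ampersand entity first would decode doubly escaped text one level too far.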

XML Data

The previous sections describe the pieces of an XML data file or document. Here is a complete XML file. It is used in the example later in this article.

<?xml version="1.0"?>
<BOOK-FILE>
<BOOK>
<TITLE>Practical Guide To SGML/XML Filters</TITLE>
<AUTHOR><FIRST>Norman</FIRST><LAST>Smith</LAST>
</AUTHOR>
<ISBN>1-55622-587-3</ISBN>
<DESCRIPTION>This book covers writing filter programs for manipulating
XML data. There are 8 case studies, all of which are solved using
multiple programming languages, including Perl, OmniMark, Balise, and C.
This book will get you started if you have to use or manipulate XML data!
</DESCRIPTION>
</BOOK>
</BOOK-FILE>

Summary

The rules described in this section represent the minimum that you must know to effectively process XML data. There is a lot more to XML when you begin to dig a little deeper. Advanced areas include DTDs, the sister XSL (eXtensible Stylesheet Language) standard, and XML parsers.

The Perl Code

Given the basic knowledge that an XML element has the form:

<TAG>Data Content</TAG>

It is possible to access XML elements using regular expressions. The discussion here should be implementable in any programming language that supports the same features that Perl’s regular expression processing supports. This section describes the basic regular expression searching in Perl to extract tags and data content from XML data.

A regular expression search, isolating the start tag, the data content, and the end tag is the basic approach. At the simplest level you have something like:


$string = "<ANY><TITLE>Processing XML With Perl Regular Expressions</TITLE></ANY>";
if($string =~ /(<TITLE>)(.*)(<\/TITLE>)/){
    # Process element here
    $start = $1;   # the start tag
    $data = $2;    # the data content
    $end = $3;     # the end tag
}

Assume that $string has been loaded with XML data. The Perl =~ operator searches the string on the left side of the expression for the regular expression on the right side of the =~. The "/" in the end tag is escaped with a backslash because "/" also delimits the regular expression. The regular expression looks for three items using the "()" for grouping. Perl has special variables, $1 through $9, that contain the match for each () grouping in a regular expression. Thus $1 will contain the start tag for <TITLE>, $2 will contain the data content, and $3 will contain the end tag. The safe thing to do is save the special variables in normal variables to prevent accidental destruction. Then you can either process the data content in the if statement or call a subroutine. Regular expression processing at this level is very simple.
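One detail worth illustrating is what happens when the string holds more than one element with the same name. The sketch below, using made-up data, compares a greedy ".*" with a non-greedy ".*?" and uses a global search in list context to collect every title.

```perl
use strict;
use warnings;

# Sample data with two TITLE elements.
my $string = "<BOOK><TITLE>First</TITLE></BOOK><BOOK><TITLE>Second</TITLE></BOOK>";

# Greedy: .* runs on to the LAST </TITLE> in the string.
my ($greedy) = $string =~ /<TITLE>(.*)<\/TITLE>/s;

# Non-greedy: .*? stops at the FIRST </TITLE>. With /g in list context,
# the search returns the data content of every TITLE element.
my @titles = $string =~ /<TITLE>(.*?)<\/TITLE>/gs;

print "Greedy match: $greedy\n";
print "Titles:       @titles\n";
```

The greedy search is fine when the string is known to hold a single element; the non-greedy form is the safer habit when it may hold several.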

The trick is getting the right amount into a string to search. Perl has another feature that facilitates processing XML data. Perl uses an "end of line" string rather than simply an end of line character like most languages. The special variable $/ contains the end of line string. There are two ways to take advantage of $/:

  1. Read the entire XML file into memory in a single read
  2. Read a logical record at a time, assuming field oriented XML data

I frequently use small XML files for start-up and configuration data. When $/ is undefined, Perl reads an entire file in a single read. Then, it’s relatively simple to extract the contents of each field for processing. The code to read an entire file in one read is:



open(XML,"any.xml") or die "Can't open any.xml: $!\n";
undef $/;             # no end of line string, so one read returns the whole file
$whole_file = <XML>;
close (XML);

The XML string to process is in $whole_file in the above code segment.
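Undefining $/ for the whole program changes every later read as well. One common way to limit the effect, sketched below, is to use local inside a subroutine so the original value comes back automatically. The subroutine name is my own, and a temporary file stands in for any.xml so the example runs by itself.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read a whole file into one string.
sub slurp_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Can't open $path: $!\n";
    local $/;              # undefined only until this subroutine returns
    my $content = <$fh>;   # one read returns the entire file
    close($fh);
    return $content;
}

# Create a small temporary file to read back.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print $tmp_fh "<CONFIG>\n<NAME>demo</NAME>\n</CONFIG>\n";
close($tmp_fh);

my $whole_file = slurp_file($tmp_name);
print $whole_file;
```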

For longer XML files that are record oriented, use $/ to read each logical record with a single read. The following code segment illustrates this technique:


open(XML,"book.xml") or die "Can't open book.xml: $!\n";
$/ = "</BOOK>";       # each read now ends at the end tag of a record
while (<XML>){
    # The next record is in $_
    # process all of the fields in the record
}
close (XML);

I have found both of these file manipulation techniques useful in conjunction with regular expression processing.
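The record-at-a-time technique can be tried without a file on disk by opening a string as a file. The sketch below assumes records that end in </BOOK>, as in the example later in this article; the sample data is made up.

```perl
use strict;
use warnings;

# Sample data with two BOOK records inside a root element.
my $data = "<BOOK-FILE>\n"
         . "<BOOK><TITLE>One</TITLE></BOOK>\n"
         . "<BOOK><TITLE>Two</TITLE></BOOK>\n"
         . "</BOOK-FILE>\n";

# Open the string as if it were a file.
open(my $fh, '<', \$data) or die "Can't open in-memory file: $!\n";

my @titles;
{
    local $/ = "</BOOK>";    # each read now ends at a BOOK end tag
    while (my $rec = <$fh>) {
        # The text after the last </BOOK> has no TITLE and is skipped.
        push @titles, $1 if $rec =~ /<TITLE>(.*?)<\/TITLE>/s;
    }
}
close($fh);

print "Found: @titles\n";
```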

An Example

This section ties the bits and pieces of the previous sections together with a real example. We’ll take an XML file and load the information into a database. The data is a list of XML and SGML books. Each record is in the BOOK-FILE format shown in the XML Data section above. The following element outline illustrates the structure of the XML file:

BOOK-FILE
BOOK
TITLE
AUTHOR
FIRST
LAST
ISBN
DESCRIPTION

The database for this example is a simple Access database, but can be any ODBC compliant database. The example program reads the XML data, does some data manipulation to account for the slightly different database format, and loads each record into the database.

The program, book.pl, is listed below. The code in SECTION 1 of the listing turns on strict variable checking and includes the Windows ODBC module. The DBI module can be substituted for the ODBC module if you are running on a Unix system. Some code tweaks will be necessary to account for the difference between the two modules if you specify DBI instead of ODBC. The strict module requires that all variables be declared.

Section 2 of the code opens the database. The secret here is that a DSN of "earth" must be registered with the ODBC driver on your system before this connection will work. This is done by opening the ODBC entry in the Windows Control Panel and adding the entry. A skeleton Access MDB is included to experiment with. If the database open fails, the error message from the ODBC driver will be displayed and processing halted.

Section 3 opens the XML file and sets the end of line string to </BOOK>. This causes a complete logical record to be read each time a read operation happens.

Section 4 is the beginning of the main processing loop. Each pass through the loop reads a single record, which breaks processing up into nicely sized chunks. The line if($rec =~ /<\/BOOK-FILE/){ tests for the end of the file; it automatically stops processing when the root element end tag is reached. The two search and replace lines take out line breaks and quote the single quote characters to prevent problems in the SQL statements.
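The two clean-up steps can be pulled into a small subroutine to see what they do to a record. This is a sketch under one assumption of mine: that the database accepts the usual SQL convention of writing a single quote inside a string as two single quotes. The subroutine name is hypothetical.

```perl
use strict;
use warnings;

# Prepare a record for use in an SQL statement.
sub normalize_record {
    my ($rec) = @_;
    $rec =~ s/\n/ /g;    # replace line breaks with spaces
    $rec =~ s/'/''/g;    # double each single quote (assumed SQL convention)
    return $rec;
}

print normalize_record("A Programmer's\nGuide"), "\n";
```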

Section 5 code retrieves the data from the XML record. The regular expression searching is isolated in the subroutine &get_data_content(). This code illustrates how nested elements can be handled with the <AUTHOR> element, which contains <FIRST> and <LAST>. The approach to processing nested elements is to first extract the contents of the wrapper element; in this case, <AUTHOR> is extracted into $AUTHOR. Then $AUTHOR is processed to extract <FIRST> and <LAST>. The subroutine approach may not be quite as efficient as including all of the raw regular expression searching, but the code ends up very readable, which makes it more maintainable.

Section 6 code builds the SQL insert string and executes it. The title of each book added is printed to the console for reference. The XML file and database are closed and the program exits.

The final section of code is the subroutine &get_data_content(). It basically builds the same regular expression that was discussed in the previous section and searches the string passed to it. The data content is returned to the calling program. This subroutine does not work with elements that have attributes, but that can be done with a little work.

# SECTION 1
#
use strict;
use Win32::ODBC;

my $status = "";
my $success;
my $TITLE;
my $AUTHOR;
my $ISBN;
my $DESC;
my $trash;
my $INSERT;
my $rec;
my $first;
my $last;
#
# SECTION 2
#
my $src = "earth";
my $db = Win32::ODBC->new($src) or
die qq{Cannot open ODBC Connection to $src: }, Win32::ODBC::Error(), "\n";
#
# SECTION 3
#
open(XML,"books.xml") or
die "Can't open file!\n";
$/ = "</BOOK>"; # Set the end of line string
$success = 0;
#
# SECTION 4
#
while(<XML>){ # Read a record
$rec = $_; # Move record to buffer
if($rec =~ /<\/BOOK-FILE/){ # Root element end tag: no more records
last;
}else{
$rec =~ s/\n/ /g; # Replace line breaks with spaces
$rec =~ s/'/''/g; # Double single quotes so they are safe inside SQL strings

#
# SECTION 5
#
$TITLE = &get_data_content($rec,"TITLE");
$AUTHOR = &get_data_content($rec,"AUTHOR");
$first = &get_data_content($AUTHOR,"FIRST");
$last = &get_data_content($AUTHOR,"LAST");
$AUTHOR = "$first $last";
$ISBN = &get_data_content($rec,"ISBN");
$DESC = &get_data_content($rec,"DESCRIPTION");

if($TITLE eq ""){
exit;
}
#
# SECTION 6
#
$INSERT = "INSERT INTO BOOKS (TITLE, AUTHOR, ISBN) ";
$INSERT .= "VALUES ('$TITLE','$AUTHOR','$ISBN');";

$status = $db->Sql("$INSERT");
if($status){
die "Error from insert: ", $db->Error(), "\n";
}
print "$TITLE added...\n";
}
}
close(XML);
$db->Close();
exit;
#
# SECTION 7
#
# +------------------+
# | get_data_content |
# +------------------+
# Usage:
# $field = &get_data_content($string,"field");
#
sub get_data_content{
my $start;
my $end;
my $line;
my $fld;

($line,$fld) = @_;

$start = "<$fld>";
$end = "</$fld>";
if($line =~ /($start)(.+?)($end)/s){
$fld = $2;
}else{
$fld = "";
}
return $fld;
}
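The subroutine above does not match a start tag that carries attributes. One possible way to add that, shown here as a sketch rather than as code from the program, is to allow optional text between the element name and the closing ">". The name get_data_content_attr is hypothetical.

```perl
use strict;
use warnings;

# Like get_data_content, but the start tag may contain attributes.
sub get_data_content_attr {
    my ($line, $fld) = @_;
    # \Q...\E protects any regular expression metacharacters in the name.
    # (?:\s[^>]*)? allows optional attributes after the element name; the
    # required whitespace keeps TITLE from matching a longer name like TITLES.
    if ($line =~ /<\Q$fld\E(?:\s[^>]*)?>(.*?)<\/\Q$fld\E>/s) {
        return $1;
    }
    return "";
}

print get_data_content_attr('<BOOK><TITLE lang="en">XML Basics</TITLE></BOOK>', "TITLE"), "\n";
```

Unlike the original, this version uses ".*?" and so returns an empty string for an element that is present but has no content, the same value it returns when the element is missing.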

Summary

Processing XML using Perl regular expressions is relatively easy once you know the secret regular expression incantations to isolate the parts of an XML element. Like all programming tools, this technique has its place in any XML toolbox. The Perl regular expression code described in this article is a preview of an extensive library of Perl XML processing code that is in my next book, Practical Guide to XML.


Norman E. Smith has 25 years of professional programming experience and is currently Senior Systems Analyst for Science Applications International Corp. in Oak Ridge, Tennessee. He is the author of many popular books, including Practical Guide To SGML Filters and Practical Guide to Intranet Client-Server Applications Using the Web. Norman has been developing WWW and Intranet applications for SAI since 1993. He is considered an industry expert in both SGML and Perl, and hosts our XML and Intranet Management forums.

 
