HTML Parsing: The World is Your Database

We’ve all found useful information on the Web. Occasionally, it’s even necessary to retrieve that information in an automated fashion. It could be just for your own amusement, for a new Web service that hasn’t yet published an API, or for a critical business partner who only exposes a Web-based interface to you.

Of course, screen scraping Web pages is not the optimal solution to any problem, and I highly advise you to look into APIs or formal Web services that will provide a more consistent and intentional programming interface. Potential problems could arise for a number of reasons.

Step 1: Considerations

The most obvious and annoying problem is that you are not guaranteed any form of consistency in the presentation of your data. Web sites are constantly under construction. Even when they look the same, programmers and designers are behind the scenes tweaking little pieces to optimize, straighten, or update. This means that your data is likely to move or disappear entirely. As you can imagine, this can lead to erroneous data or to your program failing to complete.

A problem that you might not think of immediately is the impact of your screen scraping on the target’s Web server. During the development phase especially, you should give serious thought to mirroring the Web site, using any number of mirroring applications available on the Web. This will protect you against accidentally launching a Denial of Service attack on the target’s Web site. Once you move to production, out of common courtesy, you should limit the running of your program to as few runs as will provide the accuracy you require. Obviously, if this is a business-to-business transaction, you should keep the other guy in the loop. It won’t be good for your business relationships should you trip the other company’s Intrusion Detection System and then have to explain what you’re doing to a defensive security administrator.
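
If the job has to run against the live site, even a crude delay between requests goes a long way. Below is a minimal courtesy-throttling sketch using LWP::Simple (introduced in Step 3); the URL list and the five-second pause are arbitrary placeholders rather than anything from this article:

#!/usr/bin/perl

use strict;
use LWP::Simple;

#
# hypothetical list of pages to fetch; substitute your own
my @urls = (
    'http://www.example.com/page1',
    'http://www.example.com/page2',
);

foreach my $url (@urls) {
    my $content = get $url;

    # ... process $content here ...

    #
    # pause between requests out of courtesy to the target server
    sleep 5;
}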

Along the same lines, consider the legality of the screen scraping. To a Web server, your traffic could masquerade as 100% interactive, valid traffic, but upon closer inspection, a wise system administrator will likely put the pieces together. Search that company’s Web site for “Acceptable Use Policies” and “Terms of Service.” In some cases, they may not apply, but it’s likely that the privilege to access the data is granted only after agreeing to one of the two aforementioned documents.

Step 2: Research

At this point, it’s necessary to dive into the task at hand. Go through the motions manually in a Web browser that supports thorough debugging. My experience with Firefox has always been a positive one. Through the use of tools such as the DOM Inspector, the built-in JavaScript Debugger, and extensions such as Web Developer, View Source With .., and Venkman, it’s been one of the best platforms for Web development I’ve encountered. Incidentally, the way a page is designed has a direct bearing on how easily its data can be extracted automatically. There are two phases to debug when writing a good screen scraper: the request and the response.

The request

A Web server is not a mind reader; it has to know what you’re after. HTTP requests tell the Web server what document to serve and how to serve it. The request can be issued through the address bar, a form, or a link. As you navigate the site, take note of the parameters passed in the query string of the URL. If you need to log in, use the Web Developer Extension to “Display Form Details” and take note of the names of the login fields and of the form itself. It’s also important to note the “METHOD” the form uses, either “GET” or “POST.” As you go through, sketch out the process on a scrap piece of paper, with details on the parameters along the way. If you’re clicking on links to get where you need to go, use the right-click option of “View Link Properties” to get the details.
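
For “GET” requests, those name/value pairs end up right in the query string of the URL, so it helps to see how they map. Here is a small sketch using the URI module from CPAN; the path below is a made-up placeholder, while the parameter names mirror the weather.com form used later in this article:

#!/usr/bin/perl

use strict;
use URI;

#
# hypothetical endpoint; substitute the action URL you recorded while
# stepping through the site by hand
my $url = URI->new('http://www.example.com/search');
$url->query_form(
    where => '21224',
    what  => 'Weather36HourUndeclared',
);

print $url->as_string, "\n";
# prints: http://www.example.com/search?where=21224&what=Weather36HourUndeclared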

A key thing people often miss when doing Web automation is the effect of client-side scripting. You can use Venkman to step through the entire run of client-side code. Pay particular attention to hidden form fields that are often set in the submit button’s “onClick” handler, or through other types of normal user interaction. Without knowing about these hidden fields and setting them to the correct values, the next page may refuse to load or otherwise misbehave. Granted, this isn’t good practice on the site designer’s part, because a growing number of security-aware Web surfers are limiting or disabling client-side scripting entirely.
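
Once you’ve identified such a hidden field in the debugger, you can simply set it yourself when you replay the request. Here’s a sketch using WWW::Mechanize (covered in Step 3); the login URL and the “validated” field name are invented for illustration:

#!/usr/bin/perl

use strict;
use WWW::Mechanize;

my $bot = WWW::Mechanize->new();

#
# hypothetical page whose submit button sets a hidden field onClick
$bot->get('http://www.example.com/login');

#
# select the first form and fill in the hidden field that the
# JavaScript would normally have set for us
$bot->form_number(1);
$bot->field('validated', '1');
$bot->submit();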

The response

After sketching out the path to your data, you’ve finally arrived at the page that contains the data itself. You now need to map out the page in a way that lets your data be identified among the rest of the insignificant details, styling, and advertisements! I’ve always believed in syntax highlighting and have become accustomed to vim’s flavor of highlighting, so I’ve got the View Source With .. Extension configured to use gvim. I right-click and, with any luck, the page source is displayed in a gvim buffer with syntax highlighting enabled. If the page has a weird extension, or no extension, I might have to “set syntax=html” if the server isn’t sending the proper content-type headers. Search through the source file, correlating the visual representations in the browser with the source code that’s generating them. You’ll need to find landmarks in the HTML to use as a means to guide your parser through an obscure landscape of markup language. If you’re having problems, another indispensable tool provided by Firefox is “View Selection Source.” To use it, simply highlight some content and then right-click -> “View Selection Source.” A Mozilla source viewer opens with just the HTML that generated the selected content highlighted, with some surrounding HTML to provide context.

You’re going to have to start thinking like a machine. Think simple: 1’s and 0’s, true and false! I usually start at my data and work back, looking for a unique tag or pattern that I can use to locate the data moving forward. Look not only at the HTML elements (<b>, <td>, and so forth), but at their attributes (color="#FF0000", colspan="3") to profile the areas containing and surrounding your data.

The lay of the land is changing these days. It should be getting much easier to treat HTML as a data source, thanks to Web Standards and the growing number of Web designers pushing whole-heartedly for their adoption. The old table-based layouts, styled by font tags and animated GIFs, are giving way to “Document Object Model”-aware design and styling fueled mostly by Cascading Style Sheets (CSS). CSS works most effectively when the document layout emulates an object: “classes,” “ids,” and tags establish the relationships. CSS makes it trivial for Web designers with passion and experience in the Design Arts to cooperate with Web programmers whose passion is the Art of Programming and whose idea of “progressive design” is white text on a black background! The cues that programmers and designers specify to ensure interoperability of content and presentation give the screen scraper a legible road map by which to extract their data. If you see “div,” “span,” “tbody,” and “thead” elements bearing attributes such as “class” and “id,” favor using these elements as landmarks. Although nothing is guaranteed, it’s much more likely that these elements will maintain their relationships, because they’re more often the result of deliberate cooperation between designers and programmers than of entropy.

One of the simplest ways to keep your bearings is to print out the section of HTML you’re targeting and sketch out some simple logic to quickly identify it. I use a highlighter and a red pen to make notes on the printout that I can glance at as a sanity check.

Step 3: Automated Retrieval of Your Content

Depending on how complicated the path to your data is, there are a number of tools available. Basic “GET” method requests that don’t require cookies, session management, or form tracking can take advantage of the simple interface provided by the LWP::Simple package.

#!/usr/bin/perl

use strict;
use LWP::Simple;

my $url = q|http://www.weather.com/weather/local/21224|;

my $content = get $url;

print $content;

That’s it. Simple.
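
One caveat: get() returns undef when the request fails, so it’s worth checking before handing the content to a parser. The same script with a minimal defensive check added:

#!/usr/bin/perl

use strict;
use LWP::Simple;

my $url = q|http://www.weather.com/weather/local/21224|;

#
# get() returns undef on any failure (network error, 404, and so on)
my $content = get $url;
die "Failed to retrieve $url\n" unless defined $content;

print $content;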

More complex problems involving cookies and logins require a more sophisticated tool. WWW::Mechanize offers a simple solution to a complex path to your data, with the ability to store cookies and construct form objects that can intelligently initialize themselves. An example:

#!/usr/bin/perl

use strict;
use WWW::Mechanize;

my $authPage = q|http://www.weather.com|;
my $authForm = 'whatwhere';
my %formVars = (
    where    => '21224',
    what     => 'Weather36HourUndeclared'
);

#
# or optionally, set the fields in visible order
my @visible = qw(21224);

#
# Create a "bot"
my $bot = new WWW::Mechanize();

#
# Masquerade as Mozilla on a Mac
$bot->agent_alias('Mac Mozilla');

#
# Retrieve the page with our "login form"
$bot->get($authPage);

#
# fill out the form!
$bot->form_name($authForm);

while( my ($k,$v) = each %formVars ) {
    $bot->field($k,$v);
}
#
# OR
# $bot->set_visible(@visible);

#
# submit the form!
$bot->submit();

#
# Print the Content
print $bot->content();
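
One note of caution: depending on your version of WWW::Mechanize, a failed request may not be fatal on its own (see the constructor's autocheck option), so a quick sanity check before trusting the content is cheap insurance. Something along these lines, placed just before the final print:

#
# Bail out unless the last request actually succeeded
unless( $bot->success() ) {
    die "Request failed with status " . $bot->status() . "\n";
}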

Step 4: Data Processing

There are two main ways to parse markup languages such as HTML, XHTML, and XML. I’ve always preferred the “event-driven” methodology. Essentially, as the document is parsed, new tags trigger events in the code, calling functions you’ve defined, with the attributes of the tag included as arguments. The content between a start and end tag is handled through another callback function that you’ve defined. This method requires that you build your own data structures. The second method parses the entire document, building a tree-like object from it that is then handed back to the programmer. This second method is very useful when you have to process an entire document, modify its contents, and then transform it back into markup language. Usually, a screen scraping program cares very little for the “entire document” and more for the interesting tidbits; everything else can be ignored.
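
For comparison, the tree-based approach might look something like the following, using HTML::TreeBuilder from CPAN. It targets the same <b class="obsTempTextA"> landmark that the event-driven examples below rely on, which is of course specific to this particular page:

#!/usr/bin/perl

use strict;
use LWP::Simple;
use HTML::TreeBuilder;

my $content = get 'http://www.weather.com/weather/local/21224';
die "Failed to retrieve page\n" unless defined $content;

#
# Build a tree of the entire document
my $tree = HTML::TreeBuilder->new_from_content($content);

#
# Walk the tree looking for our landmark element
my $node = $tree->look_down(_tag => 'b', class => 'obsTempTextA');
print $node->as_text, "\n" if $node;

#
# Free the tree when we're done with it
$tree->delete;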

HTML::Parser

HTML::Parser is an event-driven HTML parser module available on CPAN. Using the above content retrieval code snippet, delete the “print $bot->content();” line, and insert this code, with “use” statements at the top for consistency.

use HTML::Parser;

#
# store the content;
my $content = $bot->content();

#
# variables for use in our parsing sub routines:
my $grabText = undef;
my $textStr = '';

#
# Parser Engine
my $parser = new HTML::Parser(
                start_h => [ \&tagStart, "tagname, attr" ],
                end_h   => [ \&tagStop, "tagname" ],
                text_h  => [ \&handleText, "dtext" ]
);

#
# Call the parser!
$parser->parse($content);

#
# Display the results between the tag
print $textStr;

#
# Handle the start tag
sub tagStart {
        my ($tagname,$attr) = @_;
        if((lc $tagname eq 'b') && $attr->{class} eq 'obsTempTextA') {
                $grabText = 1;
        }
}

#
# Handle the end tag
sub tagStop {   $grabText = undef; }

#
# check to see if we're grabbing the text;
sub handleText {
        $textStr .= shift if $grabText;
}

By using this, it’s simple to extract the temperature from the variable $textStr. If you wanted to extract more information, you could use a more complex data structure to hold all the variables. The important thing to remember about the event-based model is that everything happens linearly. It’s good practice to keep state, either through a simple scalar, like the $grabText var above, or in an array or hash. If you’re dealing with data that’s nested in several layers of tags, you might consider something like this:

my @nestedTags = ();
my $tagWeAreLookingFor = 'table';    # whichever landmark element you're tracking

sub tagStart {
   my ($tag,$attr) = @_;

   if($tag eq $tagWeAreLookingFor) {
      push @nestedTags,$tag;
   }
}

sub handleText {
   my $text = shift;

   #
   # In here, we can check where in the @nestedTag array we are,
   # and do different things based on location
   if(scalar @nestedTags == 4) {
      print "Four Tags deep, we found: $text!n";
   }
}

sub tagStop {
   my $tag = shift;
   pop @nestedTags if $tag eq $tagWeAreLookingFor;
}

This model works great for most screen scraping because we’re usually interested in key pieces of data on a page-by-page basis. However, this can quickly turn your program into a mess of handler subroutines and complex tracking variables that make managing your screen scraper closer to voodoo than programming. Thankfully, HTML::Parser is fully prepared to make our lives easier by supporting subclassing.

Step 5: SubClassing for Sanity

I usually like to have one subclassed HTML::Parser class per page. In that class, I’ll include accessors to the relevant data on that page. That way, I can just “use” my class where I’m processing the data for that one page and keep the main program relatively free of unnecessary clutter.

The following script uses a simple interface to pull down the current temperature in Fahrenheit. The accessor method allows the user to specify the units they’d like the temperature back in.

#!/usr/bin/perl

use strict;
use LWP::Simple;

use MyParsers::Weather::Current;

my $parser = new MyParsers::Weather::Current;

my $content = get 'http://www.weather.com/weather/local/21224';

$parser->parse($content);

print $parser->getTemperature, " degrees fahrenheit.\n";
print $parser->getTemperature('celsius'), " degrees celsius.\n";
print $parser->getTemperature('kelvin'), " degrees kelvin.\n";

The script uses a homemade module, “MyParsers::Weather::Current,” to handle all the parsing. The code for that module is provided below.

package MyParsers::Weather::Current;

use strict;
use HTML::Parser;

#
# Inherit
our @ISA = qw(HTML::Parser);

my %ExtraVariables = (
   _found              => undef,
   _grabText           => undef,
   temp_F              => undef,
   temp_C              => undef
);

#
# Class Functions
sub new {
   #
   # Call the Parent Constructor
   my $self = HTML::Parser::new(@_);
   #
   # Call our local initialization function
   $self->_init();
   return $self;
}

#
# Internal Init Function to Set up the Parser.
sub _init {
   my $self = shift;
   #
   # init() is provided by the parent class
   $self->init(
      start_h     =>  [ \&_handler_tagStart, 'self, tagname, attr' ],
      end_h       =>  [ \&_handler_tagStop, 'self, tagname' ],
      text_h      =>  [ \&_handler_text, 'self, dtext' ],
   );

   #
   # Set up the rest of the object
   foreach my $k (keys %ExtraVariables) {
      $self->{$k} = $ExtraVariables{$k};
   }
}

#
# Accessors
sub getTemperature {
   my ($self,$type) = @_;

   unless( $self->{_found} ) {
      print STDERR "either you forgot to call parse, or the temp
                    data was not found!n";
      return;
   }
   $type = 'fahrenheit' unless length $type;

   #
   # Build the hash key from the first letter of the requested unit
   my $t = 'temp_' . uc substr($type,0,1);

   return $self->{$t} if exists $self->{$t};

   print STDERR "Unknown Temperature Type ($type) !n";
   return undef;
}

#
# Parsing Functions
sub _handler_tagStart {
   my ($self,$tag,$attr) = @_;
   if((lc $tag eq 'b') && $attr->{class} eq 'obsTempTextA') {
      $self->{_grabText} = 1;
      $self->{_found} = 1;
   }
}

sub _handler_tagStop {
   my $self = shift;
   $self->{_grabText} = undef;
}

sub _handler_text {
   my ($self,$text) = @_;
   if($self->{_grabText}) {
      if(my($temp,$forc) = ($text =~ /(\d+).*([CF])/)) {
         if($forc eq 'C') {
            $self->{temp_C} = $temp;
            #
            # Fahrenheit doesn't really make decimal places useful
            $self->{temp_F} = int((9/5) * $temp + 32);
         }
         elsif($forc eq 'F') {
            $self->{temp_F} = $temp;
            #
            # Use precision to 2 decimal places
            $self->{temp_C} = sprintf("%.2f", (5/9) * ($temp-32));
         }
      }
   }
}

Wrapping Up

HTML can be an incredibly effective transport mechanism for data, even if the original author hadn’t intended it to be that way. With the advent of Web Services and standards-compliant designs utilizing Cascading Style Sheets, it’s becoming more and more interoperable and cooperative. Learning to use screen scraping techniques can provide a wealth of information for the programmer to analyze and format to their heart’s content.

As an exercise, you might want to expand on the “MyParsers::Weather::Current” object to pull additional information from weather.com’s page, and add a few more accessors! If you’d really like a challenge, it’d be kind of fun to write a parser for each of the major weather sites, pull the data for forecasting down, and use a weighted average based on the individual site’s accuracy in the past to get an “educated guess” at the weather conditions!
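
As a starting point for that last idea, the weighting itself is plain arithmetic; here’s a sketch with made-up forecasts and accuracy weights:

#!/usr/bin/perl

use strict;

#
# hypothetical forecast highs and historical accuracy weights
my %forecast = ( siteA => 74, siteB => 71, siteC => 76 );
my %weight   = ( siteA => 0.5, siteB => 0.3, siteC => 0.2 );

my ($sum, $total) = (0, 0);
foreach my $site (keys %forecast) {
   $sum   += $forecast{$site} * $weight{$site};
   $total += $weight{$site};
}

printf "Educated guess: %.1f degrees fahrenheit\n", $sum / $total;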

Feel free to contact me with questions or comments on this article!

About the Author

Brad Lhotsky is a Software Developer whose focus is primarily Web-based applications in Perl and PHP. He has over 5 years of experience developing systems for end users and for system and network administrators. Brad has been active on Perl beginners’ mailing lists and forums for years, attempting to give something back to the community.
