HTML Parsing: The World is Your Database

  • April 6, 2005
  • By Brad Lhotsky

Step 4: Data Processing

There are two main ways to parse markup languages such as HTML, XHTML, and XML. I've always preferred the event-driven approach. As the document is parsed, each new tag triggers an event in your code, calling a function you've defined with the tag's name and attributes passed as arguments. The content between a start and end tag is handled by another callback that you define. This method requires that you build your own data structures. The second method parses the entire document and builds a tree-like object from it, which is then returned to the programmer. This is very useful when you have to process an entire document, modify its contents, and then transform it back into markup. A screen scraping program usually cares very little for the "entire document" and far more for the interesting tidbits; everything else can be ignored.
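For contrast, here is a minimal sketch of the second, tree-building approach. It assumes the HTML::TreeBuilder module from CPAN, which this article does not otherwise use, and a made-up HTML fragment; the helper name temperature_from_tree is purely illustrative.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup standing in for a downloaded page
my $html = '<html><body><p>Now: '
         . '<b class="obsTempTextA">72&deg;F</b></p></body></html>';

# Build the whole document tree first, then query it for one element
sub temperature_from_tree {
    my ($markup) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($markup);

    # look_down returns the first matching element in scalar context
    my $node = $tree->look_down(_tag => 'b', class => 'obsTempTextA');
    my $text = $node ? $node->as_text : undef;

    $tree->delete;    # release the tree once we are done with it
    return $text;
}

my $found = temperature_from_tree($html);
print defined $found ? "Found a temperature reading\n" : "Not found\n";
```

The whole page is held in memory as objects before anything is extracted, which is the trade-off described above: convenient for editing a document, heavier than needed for picking out one value.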

HTML::Parser

HTML::Parser is an event-driven HTML parser module available on CPAN. Using the above content retrieval code snippet, delete the "print $bot->content();" line, and insert this code, with "use" statements at the top for consistency.

use HTML::Parser;

#
# store the content;
my $content = $bot->content();

#
# variables for use in our parsing sub routines:
my $grabText = undef;
my $textStr = '';

#
# Parser Engine
my $parser = HTML::Parser->new(
                api_version => 3,
                start_h => [ \&tagStart, "tagname, attr" ],
                end_h   => [ \&tagStop, "tagname" ],
                text_h  => [ \&handleText, "dtext" ]
);

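#
# Call the parser, then signal end of document so that any
# buffered text is flushed to the handlers
$parser->parse($content);
$parser->eof;

#
# Display the text found inside the matching tag
print $textStr;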

#
# Handle the start tag
sub tagStart {
        my ($tagname,$attr) = @_;
        # Check that a class attribute exists before comparing it
        if((lc $tagname eq 'b') && defined $attr->{class} && $attr->{class} eq 'obsTempTextA') {
                $grabText = 1;
        }
}

#
# Handle the end tag
sub tagStop {   $grabText = undef; }

#
# check to see if we're grabbing the text;
sub handleText {
        $textStr .= shift if $grabText;
}

By using this, it's simple to extract the temperature from the variable $textStr. If you wanted to extract more information, you could use a more complex data structure to hold all the variables. The important thing to remember about the event-based model is that everything happens linearly. It's good practice to keep state, either through a simple scalar, like the $grabText var above, or in an array or hash. If you're dealing with data that's nested in several layers of tags, you might consider something like this:

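# Hypothetical example value: the tag whose nesting depth we want to track
my $tagWeAreLookingFor = 'table';
my @nestedTags = ();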

sub tagStart {
   my ($tag,$attr) = @_;

   if($tag eq $tagWeAreLookingFor) {
      push @nestedTags,$tag;
   }
}

sub handleText {
   my $text = shift;

   #
   # In here, we can check how deep the @nestedTags array is,
   # and do different things based on location
   if(scalar @nestedTags == 4) {
      print "Four Tags deep, we found: $text!\n";
   }
}

sub tagStop {
   my $tag = shift;
   pop @nestedTags if $tag eq $tagWeAreLookingFor;
}
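To see the depth-tracking idea run end to end, here is a small self-contained sketch. The HTML fragment, the choice of 'div' as the tracked tag, and the @found array are assumptions made for illustration; the handlers follow the same shape as the ones above.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical input: nested <div> elements
my $html = '<div>outer<div>inner <b>bold</b></div>tail</div>';

my $tagWeAreLookingFor = 'div';
my @nestedTags;    # one entry per currently open tracked tag
my @found;         # text seen while exactly two tracked tags are open

sub tagStart {
    my ($tag) = @_;
    push @nestedTags, $tag if $tag eq $tagWeAreLookingFor;
}

sub handleText {
    my ($text) = @_;
    push @found, $text if @nestedTags == 2;
}

sub tagStop {
    my ($tag) = @_;
    pop @nestedTags if $tag eq $tagWeAreLookingFor;
}

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [ \&tagStart,   'tagname' ],
    end_h       => [ \&tagStop,    'tagname' ],
    text_h      => [ \&handleText, 'dtext' ],
);
$parser->parse($html);
$parser->eof;    # signal end of document so buffered text is flushed

print "Two divs deep: '", join('', @found), "'\n";
```

Text inside the <b> element is still collected because only the tracked tag changes the depth; other tags pass through without touching the stack.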

This model works great for most screen scraping because we're usually interested in key pieces of data on a page-by-page basis. However, this can quickly turn your program into a mess of handler subroutines and complex tracking variables that make managing your screen scraper closer to voodoo than programming. Thankfully, HTML::Parser is fully prepared to make our lives easier by supporting subclassing.

Step 5: Subclassing for Sanity

I usually like to have one subclassed HTML::Parser class per page. In that class, I'll include accessors to the relevant data on that page. That way, I can just "use" my class where I'm processing the data for that one page and I can keep the main program relatively clean from unnecessary clutter.

The following script uses a simple interface to pull down the current temperature in Fahrenheit. The accessor method allows the user to specify the units they'd like the temperature back in.

#!/usr/bin/perl

use strict;
use LWP::Simple;

use MyParsers::Weather::Current;

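my $parser = MyParsers::Weather::Current->new;

my $content = get 'http://www.weather.com/weather/local/21224';
die "Could not retrieve the page!\n" unless defined $content;

$parser->parse($content);
$parser->eof;

print $parser->getTemperature, " degrees fahrenheit.\n";
print $parser->getTemperature('celsius'), " degrees celsius.\n";
#
# The module below only stores Fahrenheit and Celsius, so this call
# prints an "Unknown Temperature Type" message to STDERR and returns undef
print $parser->getTemperature('kelvin'), " degrees kelvin.\n";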

The script uses a homemade module, "MyParsers::Weather::Current," to handle all the parsing. The code for that module is provided below.

package MyParsers::Weather::Current;

use strict;
use HTML::Parser;

#
# Inherit
our @ISA = qw(HTML::Parser);

my %ExtraVariables = (
   _found              => undef,
   _grabText           => undef,
   temp_F              => undef,
   temp_C              => undef
);

#
# Class Functions
sub new {
   my ($class, @args) = @_;
   #
   # Call the Parent Constructor; api_version 3 selects the
   # handler-based interface used below
   my $self = $class->SUPER::new(api_version => 3, @args);
   #
   # Call our local initialization function
   $self->_init();
   return $self;
}

#
# Internal Init Function to Set up the Parser.
sub _init {
   my $self = shift;
   #
   # handler() is provided by the parent class
   $self->handler(start => \&_handler_tagStart, 'self, tagname, attr');
   $self->handler(end   => \&_handler_tagStop,  'self, tagname');
   $self->handler(text  => \&_handler_text,     'self, dtext');

   #
   # Set up the rest of the object
   foreach my $k (keys %ExtraVariables) {
      $self->{$k} = $ExtraVariables{$k};
   }
}

#
# Accessors
sub getTemperature {
   my ($self,$type) = @_;

   unless( $self->{_found} ) {
      print STDERR "either you forgot to call parse, or the temp ",
                   "data was not found!\n";
      return;
   }
   $type = 'fahrenheit' unless defined $type && length $type;

   #
   # Build the hash key from the first letter of the unit name,
   # e.g. 'celsius' becomes 'temp_C'
   my $t = 'temp_' . uc substr($type,0,1);

   return $self->{$t} if exists $self->{$t};

   print STDERR "Unknown Temperature Type ($type) !\n";
   return undef;
}

#
# Parsing Functions
sub _handler_tagStart {
   my ($self,$tag,$attr) = @_;
   if((lc $tag eq 'b') && defined $attr->{class} && $attr->{class} eq 'obsTempTextA') {
      $self->{_grabText} = 1;
      $self->{_found} = 1;
   }
}

sub _handler_tagStop {
   my $self = shift;
   $self->{_grabText} = undef;
}

sub _handler_text {
   my ($self,$text) = @_;
   if($self->{_grabText}) {
      # Allow a leading minus sign so sub-zero readings are matched
      if(my($temp,$forc) = ($text =~ /(-?\d+).*([CF])/)) {
         if($forc eq 'C') {
            $self->{temp_C} = $temp;
            #
            # F = C * 9/5 + 32; whole degrees are precise enough here
            $self->{temp_F} = int((9/5) * $temp + 32);
         }
         elsif($forc eq 'F') {
            $self->{temp_F} = $temp;
            #
            # Use precision to 2 decimal places
            $self->{temp_C} = sprintf("%.2f", (5/9) * ($temp-32));
         }
      }
   }
}

#
# A module must end by returning a true value
1;
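The two conversions behind the text handler are F = C * 9/5 + 32 and C = (F - 32) * 5/9. As a quick sanity check, the sketch below tests them against well-known fixed points. The helper names c_to_f and f_to_c are illustrative and are not part of the module.

```perl
use strict;
use warnings;

# Celsius to Fahrenheit: scale first, then add the 32-degree offset
sub c_to_f {
    my ($c) = @_;
    return (9 / 5) * $c + 32;
}

# Fahrenheit to Celsius: remove the offset first, then scale
sub f_to_c {
    my ($f) = @_;
    return (5 / 9) * ($f - 32);
}

# Freezing point, boiling point, and the value where both scales meet
for my $c (0, 100, -40) {
    printf "%6.2f C = %6.2f F\n", $c, c_to_f($c);
}
for my $f (32, 212, -40) {
    printf "%6.2f F = %6.2f C\n", $f, f_to_c($f);
}
```

The order of operations is what matters: adding 32 before scaling, or scaling before subtracting 32, gives answers that look plausible but are wrong.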

Wrapping Up

HTML can be an incredibly effective transport mechanism for data, even if the original author hadn't intended it to be used that way. With the advent of Web Services and standards-compliant designs utilizing Cascading Style Sheets, it's becoming more and more interoperable and cooperative. Learning to use screen scraping techniques can provide a wealth of information for the programmer to analyze and format to their heart's content.

As an exercise, you might want to expand on the "MyParsers::Weather::Current" object to pull additional information from weather.com's page, and add a few more accessors! If you'd really like a challenge, it'd be kind of fun to write a parser for each of the major weather sites, pull the data for forecasting down, and use a weighted average based on the individual site's accuracy in the past to get an "educated guess" at the weather conditions!
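As a starting point for the weighted-average idea, here is a minimal sketch. The site names, temperatures, and weights are made-up numbers for illustration; in a real program each weight would come from your own record of how accurate that site has been.

```perl
use strict;
use warnings;

# Hypothetical forecasts: a temperature in Fahrenheit per site, plus a
# weight reflecting that site's past accuracy (higher is more trusted)
my @forecasts = (
    { site => 'site-a', temp_F => 70, weight => 5 },
    { site => 'site-b', temp_F => 74, weight => 3 },
    { site => 'site-c', temp_F => 80, weight => 2 },
);

# Weighted average: sum(temp * weight) / sum(weight)
sub weighted_average {
    my (@items) = @_;
    my ($sum, $total) = (0, 0);
    for my $f (@items) {
        $sum   += $f->{temp_F} * $f->{weight};
        $total += $f->{weight};
    }
    # Return undef instead of dividing by zero when there is no data
    return $total ? $sum / $total : undef;
}

printf "Educated guess: %.1f F\n", weighted_average(@forecasts);
```

Dividing by the total weight means the weights do not need to add up to any particular value, so a site can be added or dropped without rescaling the others.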

Feel free to contact me with questions or comments on this article!

About the Author

Brad Lhotsky is a Software Developer whose focus is primarily web-based applications in Perl and PHP. He has over 5 years' experience developing systems for end users and system and network administrators. Brad has been active on Perl beginners' mailing lists and forums for years, attempting to give something back to the community.


