
HTML Parsing: The World is Your Database

  • April 6, 2005
  • By Brad Lhotsky

We've all found useful information on the Web. Occasionally, it's even necessary to retrieve that information in an automated fashion. It could be just for your own amusement, possibly for a new Web service that hasn't yet published an API, or even for a critical business partner who only exposes a Web-based interface to you.

Of course, screen scraping Web pages is not the optimal solution to any problem, and I highly advise you to look into APIs or formal Web services that will provide a more consistent and intentional programming interface. Potential problems could arise for a number of reasons.

Step 1: Considerations

The most obvious and annoying problem is that you are not guaranteed any form of consistency in the presentation of your data. Web sites are under construction constantly. Even when they look the same, programmers and designers are behind the scenes tweaking little pieces to optimize, straighten, or update. This means that your data is likely to move or disappear entirely. As you can imagine, this can lead to erroneous data or your program failing to complete.

A problem that you might not think of immediately is the impact of your screen scraping on the target's Web server. During the development phase especially, you should give serious thought to mirroring the Web site, using any number of mirroring applications available on the Web. This protects against accidentally launching a Denial of Service against the target's Web site. Once you move to production, out of common courtesy, you should limit the running of your program to as few times as possible while still providing the accuracy you require. Obviously, if this is a business-to-business transaction, you should keep the other guy in the loop. It won't be good for your business relationships should you trip the other company's Intrusion Detection System and then have to explain what you're doing to a defensive security administrator.
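
As a concrete example of that courtesy, here is a minimal sketch of a polite fetch loop. The URLs, the identifying User-Agent string, and the five-second pause are placeholders; adjust them to whatever you and the site's administrators agree is reasonable.

#!/usr/bin/perl

use strict;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new();

#
# Identify yourself honestly and give the admins a way to reach you
$ua->agent('MyScraper/0.1 (admin@example.com)');

#
# Hypothetical list of pages to retrieve
my @pages = qw(
    http://www.example.com/page1.html
    http://www.example.com/page2.html
);

for my $url (@pages) {
    my $response = $ua->get($url);
    warn "Failed to fetch $url: ", $response->status_line, "\n"
        unless $response->is_success;

    #
    # Pause between requests so you don't hammer the server
    sleep 5;
}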

Along the same lines, consider the legality of the screen scraping. To a Web server, your traffic could masquerade as 100% interactive, valid traffic, but upon closer inspection, a wise system administrator will likely put the pieces together. Search that company's Web site for "Acceptable Use Policies" and "Terms of Service." In some cases, they may not apply, but it's likely that the privilege to access the data is granted only after agreeing to one of those two documents.

Step 2: Research

At this point, it's necessary to dive into the task at hand. Go through the motions manually in a Web browser that supports thorough debugging. My experience with Firefox has always been a positive one. Through the use of tools such as the DOM Inspector, the built-in JavaScript Debugger, and extensions such as Web Developer, View Source With .., and Venkman, it's been one of the best platforms for Web development I've encountered. Incidentally, the elements of Web design are critical to the automated extraction of that data. There are two phases to debug when writing a good screen scraper.

The request

A Web server is not a mind reader; it has to know what you're after. HTTP requests tell the Web server what document to serve and how to serve it. The request can be issued through the address bar, a form, or a link. As you navigate the site, take note of the parameters passed in the Query String of the URL. If you need to log in, use the Web Developer Extension to "Display Form Details" and take note of the names of the login prompt and the form objects themselves. Also, it's important to take note of the "METHOD" the form is going to use, either "GET" or "POST." As you go through, sketch out the process on a scrap piece of paper with details on the parameters along the way. If you're clicking on links to get where you need, use the right-click option of "View Link Properties" to get details.
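
Once you know which Query String parameters matter, reproducing a plain "GET" request is straightforward with LWP::UserAgent and URI. The host and parameter names below are hypothetical stand-ins for whatever you noted on your scrap of paper.

#!/usr/bin/perl

use strict;
use LWP::UserAgent;
use URI;

#
# Hypothetical search page and the parameters observed in the Query String
my $uri = URI->new('http://www.example.com/search');
$uri->query_form(
    q    => 'html parsing',
    page => 1,
);

my $ua       = LWP::UserAgent->new();
my $response = $ua->get($uri);

die "GET failed: ", $response->status_line, "\n"
    unless $response->is_success;

print $response->content;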

A key thing people often miss when doing Web automation is the effect of client-side scripting. You can use Venkman to step through the entire run of client-side code. Pay attention to hidden form fields that are often set "onClick" of the submit button, or through other types of normal user interaction. Without setting these hidden fields to the correct values, the next page may refuse to load or may misbehave. Granted, this isn't good practice on the site designer's part, because a growing number of security-aware Web surfers are limiting or disabling client-side scripting entirely.
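
If you do run into one of those script-populated hidden fields, the fix on the scraper's side is simply to set the field yourself before submitting. Here's a short sketch using WWW::Mechanize (covered in Step 3); the form and field names are made up for illustration.

#!/usr/bin/perl

use strict;
use WWW::Mechanize;

my $bot = WWW::Mechanize->new();
$bot->get('http://www.example.com/login');

#
# Select the form, then fill in both the visible fields and the hidden
# field that the page's JavaScript would normally set onClick.
# 'js_token' is a hypothetical name -- use whatever Venkman reveals.
$bot->form_name('login');
$bot->field( username => 'me' );
$bot->field( password => 'secret' );
$bot->field( js_token => '1' );

$bot->submit();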

The response

After sketching out the path to your data, you've finally arrived at the page that contains the data itself. You now need to map out the page so that your data can be distinguished from the rest of the insignificant details, styling, and advertisements! I've always believed in syntax highlighting and have become accustomed to vim's flavor of highlighting, so I've got the View Source With .. Extension configured to use gvim. I right-click and, with any luck, the page source is displayed in a gvim buffer with syntax highlighting enabled. If the page has a weird extension, or no extension, I might have to "set syntax=html" if the server isn't presenting the proper headers. Search through the source file, correlating the visual representations in the browser with the source code that's generating them. You'll need to find landmarks in the HTML to use as a means to guide your parser through an obscure landscape of markup language. If you're having problems, another indispensable tool provided by Firefox is "View Selection Source." To use it, simply highlight some content and then right-click -> "View Selection Source." A Mozilla source viewer opens with the HTML that generated the selected content highlighted, along with some surrounding HTML to provide context.

You're going to have to start thinking like a machine. Think simple: 1's and 0's, true and false! I usually start at my data and work back, looking for a unique tag or pattern that I can use to locate the data moving forward. Look not only at the HTML elements (<b>, <td>, and so forth), but at their attributes (color="#FF0000", colspan="3") to profile the areas containing and surrounding your data.
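
As a sketch of what that profiling looks like in code, HTML::TokeParser can walk the tag stream and stop only on elements whose attributes match the pattern you mapped out. The <td colspan="3"> profile below is just an example landmark.

#!/usr/bin/perl

use strict;
use HTML::TokeParser;

#
# Read the saved page source from STDIN
my $html   = do { local $/; <STDIN> };
my $parser = HTML::TokeParser->new(\$html);

#
# Stop on each <td> whose attributes fit the profile, then grab its text
while ( my $tag = $parser->get_tag('td') ) {
    my $attr = $tag->[1];
    next unless ( $attr->{colspan} || '' ) eq '3';
    print $parser->get_trimmed_text('/td'), "\n";
}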

The lay of the land is changing these days. It should be getting much easier to treat HTML as a data source thanks to Web Standards and the alarming number of Web designers pushing whole-heartedly for their adoption. The old table-based layouts, styled by font tags and animated GIFs, are giving way to "Document Object Model"-aware design and styling fueled mostly by Cascading Style Sheets (CSS). CSS works most effectively when the document layout emulates an object; "classes," "ids," and tags establish the relationships. CSS makes it trivial for Web designers with passion and experience in the Design Arts to cooperate with Web programmers whose passion is the Art of Programming and whose idea of "progressive design" is white text on a black background! The cues that programmers and designers specify to ensure interoperability of content and presentation give the screen scraper a legible road map by which to extract the data. If you see "div," "span," "tbody," and "thead" elements bearing attributes such as "class" and "id," favor these elements as landmarks. Although nothing is guaranteed, it's much more likely that these elements will maintain their relationships because they're often the result of divisional cooperation rather than entropy.
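
When those "class" and "id" landmarks are present, a tree-based parser can jump straight to them. Here's a minimal sketch with HTML::TreeBuilder, using a hypothetical <div id="forecast"> in place of whatever landmark your target actually provides.

#!/usr/bin/perl

use strict;
use HTML::TreeBuilder;

#
# Read the saved page source from STDIN and build a parse tree
my $html = do { local $/; <STDIN> };
my $tree = HTML::TreeBuilder->new_from_content($html);

#
# Use the designer's own landmark to find the data
my $div = $tree->look_down( _tag => 'div', id => 'forecast' );
print $div->as_trimmed_text, "\n" if $div;

$tree->delete;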

One of the simplest ways to keep your bearings is to print out the section of HTML you're targeting and sketch out some simple logic to quickly identify it. I use a highlighter and a red pen to make notes on the printout that I can glance at as a sanity check.

Step 3: Automated Retrieval of Your Content

Depending on how complicated the path to your data is, there are a number of tools available. Basic "GET" method requests that don't require cookies, session management, or form tracking can take advantage of the simple interface provided by the LWP::Simple package.

#!/usr/bin/perl

use strict;
use LWP::Simple;

my $url = q|http://www.weather.com/weather/local/21224|;

my $content = get $url;

print $content;

That's it. Simple.
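
One caveat: get returns undef when the request fails, so anything beyond a quick one-liner should probably check for that before parsing:

my $content = get $url;
die "Couldn't fetch $url\n" unless defined $content;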

More complex problems involving cookies and logins require a more sophisticated tool. WWW::Mechanize offers a simple solution to a complex path to your data, with the ability to store cookies and construct form objects that can intelligently initialize themselves. An example:

#!/usr/bin/perl

use strict;
use WWW::Mechanize;

my $authPage = q|http://www.weather.com|;
my $authForm = 'whatwhere';
my %formVars = (
    where    => '21224',
    what     => 'Weather36HourUndeclared'
);

#
# or optionally, set the fields in visible order
my @visible = qw(21224);

#
# Create a "bot"
my $bot = WWW::Mechanize->new();

#
# Masquerade as Mozilla on the Mac
$bot->agent_alias('Mac Mozilla');

#
# Retrieve the page with our "login form"
$bot->get($authPage);

#
# fill out the form!
$bot->form_name($authForm);

while( my ($k,$v) = each %formVars ) {
    $bot->field($k,$v);
}
#
# OR
# $bot->set_visible(@visible);

#
# submit the form!
$bot->submit();

#
# Print the Content
print $bot->content();
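
Before trusting that output, it's worth confirming the requests actually succeeded. WWW::Mechanize keeps the underlying HTTP response around, so, continuing from the example above, a minimal guard might look like this:

#
# Bail out if the form submission didn't succeed
die "Form submission failed with status ", $bot->status(), "\n"
    unless $bot->success();

#
# Hand the raw HTML off to whatever parsing approach you mapped out earlier
my $html = $bot->content();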



