http://www.developer.com/


Automating Web-based Data Retrieval with Perl


January 4, 2005

At Apress, one of my unofficial tasks, and indeed pastimes, is writing applications that aggregate and parse a variety of online data, including book reviews, sales rankings, and other items of interest that help us to better gauge reader interest in our books, our competitors' books, and technology in general. Sales through online retailers logically comprise a significant percentage of revenue for computer book publishers, given that the target audience tends to be quite comfortable with online purchases. Many of these retailers offer information pertinent to the popularity of the books by way of sales ranks and book reviews, and some have recently begun facilitating retrieval of such information by way of Web Services.

The maturation of Web Services signals a giant leap forward in terms of being able to effectively retrieve and conveniently format data residing on remote servers. With high-profile companies such as Amazon.com and eBay leading the charge in terms of making Web Services available to the general public, it's only a matter of time before others follow suit. Yet despite all the excitement, it's fair to say that widespread availability of Web Services for public consumption is still a few years away, meaning that in many cases the only solution is the heretical practice of screen-scraping.

On the surface this isn't such an issue; after all, many programming languages offer powerful text-parsing features, PHP, Perl, and Python among them. However, you must contend with a variety of issues in order to effectively automate this process, including:

  • Page retrieval: How will the script initiate the connection to the remote web server and retrieve the file for local storage and manipulation?
  • Authentication: Understandably some information is confidential, and therefore it may be password protected. How do you successfully authenticate without manually supplying the username and password?
  • Cookies: Some websites require a minimal set of user capabilities, and don't care whether the user is a human or robot. One common requirement is the ability to accept and manage cookies. How will the script overcome this seemingly complex issue?

Using the Perl language, it's surprisingly simple to accomplish all three of these tasks. In this tutorial, I'll show you how each is easily tackled with Perl using a popular module, namely WWW::Mechanize.

WWW::Mechanize

WWW::Mechanize is a Perl module capable of interacting with a website. Among other things, it can traverse links, download pages, and even complete and submit forms. It's suited to a wide variety of applications, such as testing web applications and automating page retrieval for subsequent parsing. WWW::Mechanize is maintained by Andy Lester; you can learn more about it at CPAN.
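To give you a feel for the form-handling capability mentioned above, here's a brief sketch of completing and submitting a search form with the module's submit_form() method. The URL and the "keyword" field name are hypothetical placeholders; substitute the page and field names of the form you're targeting.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize;

my $mechanize = WWW::Mechanize->new(autocheck => 1);

# Retrieve a page containing a search form
$mechanize->get("http://www.example.com/search.html");

# Complete and submit the first form found on the page,
# filling in its "keyword" field
$mechanize->submit_form(
    form_number => 1,
    fields      => { keyword => "perl" },
);

# Output the resulting page
print $mechanize->content;
```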

The Data Source

Suppose you are tasked with retrieving a list of best-selling computer books from a website that monitors national book sales. This page might look something like Figure 1-1. The corresponding HTML code is shown in Listing 1-1. In the following examples I'll show you how to retrieve, save, and parse this page in a variety of manners.

Figure 1-1. Bestselling Computer books

Listing 1-1. Bestselling computer books (bestselling.html)

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<head>
   <title>Computer Books</title>
</head>

<body>

<h3>Sales Data, Week of December 13, 2004</h3>
<table border="1">
     <tr>
          <th>ISBN</th>
          <th>Title</th>
          <th>MTD Sales</th>
          <th>YTD Sales</th>
     </tr>
     <tr>
          <td>1-23456-789-0</td>
          <td><a href="/details/1-23456-789-0/">Cooking with PHP</a></td>
          <td>548</td>
          <td>5,678</td>
     </tr>
     <tr>
          <td>0-98765-432-1</td>
          <td><a href="/details/0-98765-432-1/">Prolific Perl</a></td>
          <td>378</td>
          <td>3,456</td>
     </tr>
     <tr>
          <td>6-78901-234-5</td>
          <td><a href="/details/6-78901-234-5/">Managing Markets with MySQL</a></td>
          <td>973</td>
          <td>7,121</td>
     </tr>
     <tr>
          <td>1-24245-456-9</td>
          <td><a href="/details/1-24245-456-9/">A History of Keyboards</a></td>
          <td>787</td>
          <td>2,290</td>
     </tr>
</table>

</body>

</html>

Retrieving the Web page

For the first exercise, I'd like to demonstrate how simple it is to retrieve a webpage with WWW::Mechanize. Listing 1-2 offers the code for retrieving bestselling.html (the page shown in Listing 1-1), outputting its contents, and saving it locally. Note that if you copy the code below (please do), you'll need to update $url to point to the desired page.

Listing 1-2. Retrieving the bestselling.html web page

#!/usr/bin/perl

use strict;
use warnings;

# Include the WWW::Mechanize module
use WWW::Mechanize;

# What URL shall we retrieve?
my $url = "http://www.example.com/bestselling.html";

# Create a new instance of WWW::Mechanize.
# Enabling autocheck checks each request to ensure it was successful,
# producing an error if not.
my $mechanize = WWW::Mechanize->new(autocheck => 1);

# Retrieve the page
$mechanize->get($url);

# Assign the page content to $page
my $page = $mechanize->content;

# Output the page
print $page;

# Let's also save the page locally
open(my $fh, '>', 'bestsellers.txt') or die "Can't open bestsellers.txt: $!";

print $fh $page;

close($fh);

Executing Listing 1-2 outputs the source code of bestselling.html. It doesn't get any easier than that! Once you get() the page, it's possible to retrieve a number of other items, including the page title, content type, a list of all links found on the page, and more. In fact, Listing 1-3 shows you how to output all links found in the retrieved page.

Retrieving Page Links

Referring back to Listing 1-1, you'll see that each book title links to a URL that will presumably present detailed information about that book. It's likely that I would also want to retrieve that detailed data, and therefore would like to retrieve and save those pages. I could hardcode an array of URLs into the script, but what if a book is deleted or another is added? Maintaining the list would certainly be too tedious; therefore, I'd like to rely on the information found in the page represented by Listing 1-1, and retrieve only the pages listed there. To do so, I'll need to retrieve each link. Listing 1-3 demonstrates how this is accomplished.

Listing 1-3. Outputting page links

#!/usr/bin/perl

use strict;
use warnings;

# Include the WWW::Mechanize module
use WWW::Mechanize;

# What URL shall we retrieve?
my $url = "http://www.example.com/bestselling.html";

# Create a new instance of WWW::Mechanize
my $mechanize = WWW::Mechanize->new(autocheck => 1);

# Retrieve the page
$mechanize->get($url);

# Retrieve the page title
my $title = $mechanize->title;

print "<b>$title</b><br />\n";

# Place all of the links in an array
my @links = $mechanize->links;

# Loop through and output each link
foreach my $link (@links) {

   # Retrieve the link URL
   my $href = $link->url;

   # Retrieve the link text
   my $name = $link->text;
   
   print "<a href=\"$href\">$name</a>\n";

}

Executing Listing 1-3 produces the following output:

<b>Computer Books</b><br />
<a href="/details/1-23456-789-0/">Cooking with PHP</a>
<a href="/details/0-98765-432-1/">Prolific Perl</a>
<a href="/details/6-78901-234-5/">Managing Markets with MySQL</a>
<a href="/details/1-24245-456-9/">A History of Keyboards</a>

Of course, given this ability it would be trivial to modify Listing 1-3 to recurse into each link and retrieve those pages, building a spidering application.
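One way such a spider might be sketched: loop over the collected links, resolve each relative href against the base URL with url_abs(), retrieve the page, and save it under a filename derived from the link text. The URL below is the same hypothetical example address used throughout.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize;

my $url = "http://www.example.com/bestselling.html";

my $mechanize = WWW::Mechanize->new(autocheck => 1);
$mechanize->get($url);

# Collect the links before we start navigating away
my @links = $mechanize->links;

foreach my $link (@links) {

    # Skip links without any text
    next unless defined $link->text;

    # url_abs() resolves the relative href against the base URL
    $mechanize->get($link->url_abs);

    # Derive a simple filename from the link text
    (my $filename = $link->text) =~ s/\W+/_/g;

    open(my $fh, '>', "$filename.html")
        or die "Can't write $filename.html: $!";
    print $fh $mechanize->content;
    close($fh);
}
```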

Website Authentication

Logically, some websites will be password protected in order to limit access to a select group of individuals, service subscribers for instance. Many sites employ some sort of authentication method, the two most common being Basic and Digest authentication. You'll be pleased to know that presenting the necessary credentials is a breeze with WWW::Mechanize. Note that because WWW::Mechanize is a subclass of LWP::UserAgent, you can use any LWP::UserAgent method. Listing 1-4 demonstrates the authentication process.

Listing 1-4. Logging into a website protected by Basic/Digest authentication

#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize;

# What host and port are we connecting to?
my $netloc = "secret.example.com:80";

# Authentication realm, as announced by the server
my $realm = "Restricted Area";

# Username
my $username = "jason";

# Password
my $password = "secret";

# Create a new instance of WWW::Mechanize
my $mechanize = WWW::Mechanize->new(autocheck => 1);

# Supply the necessary credentials (inherited from LWP::UserAgent)
$mechanize->credentials($netloc, $realm, $username, $password);

# Retrieve the desired page
$mechanize->get("http://secret.example.com/bestselling.html");

Connecting to Secure websites (https://)

WWW::Mechanize is capable of connecting to secure websites in exactly the same fashion as has already been demonstrated; you'll just need to substitute https:// for http://. It depends upon IO::Socket::SSL for this feature, so if an error occurs, check whether this module is installed.
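For completeness, here's the retrieval script pared down to the one change that matters; the secure URL is, as before, a hypothetical placeholder.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize;

my $mechanize = WWW::Mechanize->new(autocheck => 1);

# Identical to the earlier examples, apart from the https:// scheme;
# this requires IO::Socket::SSL to be installed
$mechanize->get("https://secure.example.com/bestselling.html");

print $mechanize->content;
```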

Accepting Cookies

Many websites expect the user to possess a minimal set of capabilities, such as the ability to accept cookies. If such constraints aren't met, many sites will prevent the user from effectively visiting. That said, if we're going to retrieve information from such sites, we'll need to figure out a way to meet this requirement. This is easily done with WWW::Mechanize; as with the authentication feature shown previously, it has the features of LWP::UserAgent at its disposal, and that module supports cookie management. Listing 1-5 demonstrates just how easily this is accomplished.

Listing 1-5. Cookie Management

#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize;
use HTTP::Cookies;

# Create a new instance of WWW::Mechanize
my $mechanize = WWW::Mechanize->new(autocheck => 1);

# Manage cookies with an in-memory cookie jar
$mechanize->cookie_jar(HTTP::Cookies->new);

# Continue with other tasks...

Conclusion

In this brief tutorial, I've shown you only a fraction of WWW::Mechanize's capabilities. I invite you to peruse the WWW::Mechanize documentation for a complete description. Hopefully these examples will suffice to get you moving quickly with this great module.

About the Author

W. Jason Gilmore (http://www.wjgilmore.com/) is the Open Source Editorial Director for Apress (http://www.apress.com/). He's the author of Beginning PHP 5 and MySQL: Novice to Professional (Apress, 2004. 748pp.). His work has been featured within many of the computing industry's leading publications, including Linux Magazine, O'Reillynet, Devshed, Zend.com, and Webreview. Jason is also the author of A Programmer's Introduction to PHP 4.0 (453pp., Apress). Along with colleague Jon Shoberg, he's co-author of "Out in the Open," a monthly column published within Linux magazine.
