Automating Web-based Data Retrieval with Perl
At Apress, one of my unofficial tasks, and indeed pastimes, is writing applications that aggregate and parse a variety of online data, including book reviews, sales rankings, and other items of interest that help us better gauge reader interest in our books, our competitors' books, and technology in general. For instance, sales through online retailers logically comprise a significant percentage of revenue for computer book publishers, given that the target audience tends to be quite comfortable with online purchases. Many of these retailers offer information pertinent to the popularity of books by way of sales ranks and book reviews, and some have recently begun facilitating retrieval of such information by way of Web Services.
The maturation of Web Services signals a giant leap forward in terms of being able to effectively retrieve and conveniently format data residing on remote servers. With high-profile companies such as Amazon.com and eBay leading the charge in terms of making Web Services available to the general public, it's only a matter of time before others follow suit. Yet despite all the excitement, it's fair to say that widespread availability of Web Services for public consumption is still a few years away, meaning that in many cases the only solution is the heretical practice of screen-scraping.
On its surface this isn't such an issue; after all, many programming languages offer some pretty powerful text parsing features, PHP, Perl and Python among them. However, one must contend with a variety of issues in order to effectively automate this process, including:
- Page retrieval: How will the script initiate the connection to the remote web server and retrieve the file for local storage and manipulation?
- Authentication: Understandably some information is confidential, and therefore it may be password protected. How do you successfully authenticate without manually supplying the username and password?
- Cookies: Some websites require a minimal set of user capabilities, and don't care whether the user is a human or robot. One common requirement is the ability to accept and manage cookies. How will the script overcome this seemingly complex issue?
Using the Perl language, it's surprisingly simple to accomplish all three of these tasks. In this tutorial, I'll show you how each is easily tackled with Perl using a popular module, namely WWW::Mechanize.
WWW::Mechanize is a Perl module capable of interacting with a website. Among other things, it can traverse links, download pages, and even complete and submit forms. It's suitable for a wide variety of applications, such as testing web applications and automating page retrieval for subsequent parsing. The module is maintained by Andy Lester, and you can learn more about it at CPAN.
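To give a feel for the module, here is a minimal sketch that downloads a page and lists the links found on it. The helper name page_links and the URL in the comment are placeholders of my own, not part of the module. The sketch assumes WWW::Mechanize is already installed from CPAN.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# Fetch a page and return its title followed by every link on it,
# each link as a [text, url] pair. page_links is a hypothetical helper.
sub page_links {
    my ($url) = @_;

    # autocheck => 1 makes Mechanize die on any failed request,
    # so no explicit error checking is needed after get().
    my $mech = WWW::Mechanize->new(autocheck => 1);
    $mech->get($url);

    # links() returns WWW::Mechanize::Link objects; url() is the href
    # exactly as written in the page (it may be relative).
    my @links = map { [ $_->text, $_->url ] } $mech->links;
    return ($mech->title, @links);
}

# Run only when invoked as a script with a URL argument, for example:
#   perl links.pl http://www.example.com/bestselling.html   (placeholder URL)
if (!caller && @ARGV) {
    my ($title, @links) = page_links($ARGV[0]);
    print "Title: ", (defined $title ? $title : '(none)'), "\n";
    for my $link (@links) {
        my $text = defined $link->[0] ? $link->[0] : '';
        printf "%-30s %s\n", $text, $link->[1];
    }
}
```

Wrapping the request in a subroutine keeps the Mechanize object out of the calling code, which makes it easy to try the same routine against different pages.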
The Data Source
Suppose you are tasked with retrieving a list of best-selling computer books from a website that monitors national book sales. This page might look something like Figure 1-1. The corresponding HTML code is shown in Listing 1-1. In the following examples I'll show you how to retrieve, save, and parse this page in a variety of ways.
Figure 1-1. Bestselling computer books
Listing 1-1. Bestselling computer books (bestselling.html)
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>Computer Books</title>
</head>
<body>
<h3>Sales Data, Week of December 13, 2004</h3>
<table border="1">
    <tr>
        <th>ISBN</th>
        <th>Title</th>
        <th>MTD Sales</th>
        <th>YTD Sales</th>
    </tr>
    <tr>
        <td>1-23456-789-0</td>
        <td><a href="/details/1-23456-789-0/">Cooking with PHP</a></td>
        <td>548</td>
        <td>5,678</td>
    </tr>
    <tr>
        <td>0-98765-432-1</td>
        <td><a href="/details/0-98765-432-1/">Prolific Perl</a></td>
        <td>378</td>
        <td>3,456</td>
    </tr>
    <tr>
        <td>6-78901-234-5</td>
        <td><a href="/details/6-78901-234-5/">Managing Markets with MySQL</a></td>
        <td>973</td>
        <td>7,121</td>
    </tr>
    <tr>
        <td>1-24245-456-9</td>
        <td><a href="/details/1-24245-456-9/">A History of Keyboards</a></td>
        <td>787</td>
        <td>2,290</td>
    </tr>
</table>
</body>
</html>
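Once a page like Listing 1-1 is in hand, the table rows can be pulled apart with nothing more than Perl's regular expressions. The following is only a sketch under narrow assumptions: it expects exactly the markup shown in the listing (plain tr and td tags, four cells per book row), and the subroutine name parse_sales is my own. Markup that varies more than this would likely call for a dedicated HTML parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return one hash reference per book row found in HTML shaped like
# Listing 1-1. Assumes four <td> cells per row and a title cell whose
# text is wrapped in a single <a> element.
sub parse_sales {
    my ($html) = @_;
    my @books;

    while ($html =~ m{<tr>(.*?)</tr>}gs) {
        my $row   = $1;
        my @cells = $row =~ m{<td>(.*?)</td>}gs;
        next unless @cells == 4;              # the <th> header row has no <td> cells

        my ($isbn, $title, $mtd, $ytd) = @cells;
        $title =~ s{<[^>]+>}{}g;              # strip the surrounding <a> tags
        s/^\s+|\s+$//g for $isbn, $title, $mtd, $ytd;   # trim whitespace
        tr/,//d for $mtd, $ytd;               # "5,678" becomes "5678"

        push @books, {
            isbn  => $isbn,
            title => $title,
            mtd   => $mtd + 0,
            ytd   => $ytd + 0,
        };
    }
    return @books;
}

# Two rows copied from Listing 1-1 serve as sample input.
my $sample = <<'HTML';
<table border="1">
<tr> <th>ISBN</th> <th>Title</th> <th>MTD Sales</th> <th>YTD Sales</th> </tr>
<tr> <td>1-23456-789-0</td> <td><a href="/details/1-23456-789-0/">Cooking with PHP</a></td> <td>548</td> <td>5,678</td> </tr>
<tr> <td>0-98765-432-1</td> <td><a href="/details/0-98765-432-1/">Prolific Perl</a></td> <td>378</td> <td>3,456</td> </tr>
</table>
HTML

# Print the books, highest year-to-date sales first.
for my $book (sort { $b->{ytd} <=> $a->{ytd} } parse_sales($sample)) {
    printf "%-15s %-20s %5d %6d\n", @{$book}{qw(isbn title mtd ytd)};
}
```

Removing the thousands separators before storing the figures lets the sales columns be sorted and compared numerically.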