Automating Web-based Data Retrieval with Perl

  • January 4, 2005
  • By W. Jason Gilmore

Retrieving the Web Page

For the first exercise, I'd like to demonstrate how simple it is to retrieve a web page with WWW::Mechanize. Listing 1-2 offers the code for retrieving bestselling.html (the page shown in Listing 1-1), outputting its contents, and saving it locally. Note that if you copy the code below (please do), you'll need to update $url to point to the desired page.

Listing 1-2. Retrieving the bestselling.html web page

#!/usr/bin/perl

use strict;
use warnings;

# Include the WWW::Mechanize module
use WWW::Mechanize;

# What URL shall we retrieve?
my $url = "http://www.example.com/bestselling.html";

# Create a new instance of WWW::Mechanize.
# Enabling autocheck checks each request to ensure it was successful,
# producing an error if not.
my $mechanize = WWW::Mechanize->new(autocheck => 1);

# Retrieve the page
$mechanize->get($url);

# Assign the page content to $page
my $page = $mechanize->content;

# Output the page
print $page;

# Let's also save the page locally
open(my $fh, '>', 'bestsellers.txt') or die "Unable to open bestsellers.txt: $!";

print $fh $page;

close($fh);

Executing Listing 1-2 outputs the source code of bestselling.html. It doesn't get any easier than that! Once you get() the page, it's possible to retrieve a number of other items, including the page title, content type, a list of all links found on the page, and more. In fact, Listing 1-3 shows you how to output all links found in the retrieved page.
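As a quick illustration before turning to the links, here is a minimal sketch (reusing the same page fetched in Listing 1-2) showing a few of those accessors:

#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize;

my $mechanize = WWW::Mechanize->new(autocheck => 1);
$mechanize->get("http://www.example.com/bestselling.html");

# A few of the accessors available once the page has been retrieved
print "Title:        ", $mechanize->title, "\n";    # contents of the <title> tag
print "Content type: ", $mechanize->ct, "\n";       # e.g. text/html
print "Final URI:    ", $mechanize->uri, "\n";      # useful after redirects
print "Status:       ", $mechanize->status, "\n";   # HTTP response code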

Retrieving Page Links

Referring back to Listing 1-1, you'll see that each book title links to a URL that presumably presents detailed information about that book. It's likely that I would also want to retrieve and save those detail pages. I could hardcode an array of URLs into the script, but what if a book is deleted or another is added? Maintaining such a list would quickly become tedious, so instead I'd like to rely on the information found in the page shown in Listing 1-1 and retrieve only the pages listed there. To do so, I'll need to retrieve each link. Listing 1-3 demonstrates how this is accomplished.

Listing 1-3. Outputting page links

#!/usr/bin/perl

use strict;
use warnings;

# Include the WWW::Mechanize module
use WWW::Mechanize;

# What URL shall we retrieve?
my $url = "http://www.example.com/bestselling.html";

# Create a new instance of WWW::Mechanize
my $mechanize = WWW::Mechanize->new(autocheck => 1);

# Retrieve the page
$mechanize->get($url);

# Retrieve the page title
my $title = $mechanize->title;

print "<b>$title</b><br />";

# Place all of the links in an array
my @links = $mechanize->links;

# Loop through and output each link
foreach my $link (@links) {

   # Retrieve the link URL
   my $href = $link->url;

   # Retrieve the link text
   my $name = $link->text;
   
   print "<a href=\"$href\">$name</a>\n";

}

Executing Listing 1-3 produces the following output:

<a href="/details/0-98765-432-1/">Prolific Perl</a>
<a href="/details/6-78901-234-5/">Managing Markets with MySQL</a>
<a href="/details/1-24245-456-9/">A History of Keyboards</a>

Of course, given this ability it would be trivial to modify Listing 1-3 to recurse into each link and retrieve those pages, building a spidering application.
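As a rough sketch of that idea, the following reuses the bestselling.html page from the earlier listings and saves a local copy of every linked page; the way the local filename is derived from each URL is purely illustrative:

#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize;

my $url = "http://www.example.com/bestselling.html";

my $mechanize = WWW::Mechanize->new(autocheck => 1);
$mechanize->get($url);

# Gather the links up front, since each get() below replaces the current page
my @links = $mechanize->links;

foreach my $link (@links) {

   # url_abs() resolves a relative link against the page it was found on
   my $absolute = $link->url_abs;

   # Derive a crude local filename from the URL (illustrative only)
   (my $file = "$absolute") =~ s!^https?://!!;
   $file =~ s![^\w.-]+!_!g;

   # Fetch the linked page and save a local copy
   $mechanize->get($absolute);

   open(my $fh, '>', "$file.html") or die "Unable to open $file.html: $!";
   print $fh $mechanize->content;
   close($fh);

}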

Website Authentication

Naturally, some websites are password protected in order to limit access to a select group of individuals, service subscribers for instance. Many sites employ some sort of authentication scheme, two of the most common being Basic and Digest authentication. You'll be pleased to know that presenting the necessary credentials is a breeze with WWW::Mechanize. Note that because WWW::Mechanize is a subclass of LWP::UserAgent, you can use any LWP::UserAgent method. Listing 1-4 demonstrates the authentication process.

Listing 1-4. Logging into a website protected by Basic/Digest authentication

#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize;

# What site are we connecting to?
my $url = "http://secret.example.com";

# Username
my $username = "jason";

# Password
my $password = "secret";

# Create a new instance of WWW::Mechanize
my $mechanize = WWW::Mechanize->new(autocheck => 1);

# Supply the credentials to be used for HTTP authentication
$mechanize->credentials($username, $password);

# Retrieve the desired page
$mechanize->get("$url/bestselling.html");

Connecting to Secure Websites (https://)

WWW::Mechanize is capable of connecting to secure websites in exactly the same fashion as has already been demonstrated; you'll just need to use https:// in place of http://. It depends upon IO::Socket::SSL for this feature, so if an error occurs, check whether that module is installed.
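For example, here is a minimal sketch, assuming a hypothetical secure.example.com host and that the SSL support mentioned above is installed:

#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize;

my $mechanize = WWW::Mechanize->new(autocheck => 1);

# The only difference from the earlier examples is the https:// scheme;
# LWP handles the SSL negotiation behind the scenes.
$mechanize->get("https://secure.example.com/bestselling.html");

print $mechanize->content;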

Accepting Cookies

Many websites expect the user to possess a minimal set of capabilities, such as the ability to accept cookies. If such constraints aren't met, many sites will prevent the user from visiting effectively. Therefore, if we're going to retrieve information from such sites, we'll need a way to meet this requirement. This is easily done with WWW::Mechanize because, as with the authentication feature shown earlier, it has the features of LWP::UserAgent at its disposal, and that module supports cookie management. Listing 1-5 demonstrates just how easily this is accomplished.

Listing 1-5. Cookie management

#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize;
use HTTP::Cookies;

# Create a new instance of WWW::Mechanize
my $mechanize = WWW::Mechanize->new(autocheck => 1);

# Manage cookies
$mechanize->cookie_jar(HTTP::Cookies->new);

# Continue with other tasks...
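If you'd also like the cookies to persist between script runs, HTTP::Cookies can read and write them from a file; here is a brief sketch (the cookies.txt filename is simply an example):

# Store cookies in a file so they survive between script runs
$mechanize->cookie_jar(
   HTTP::Cookies->new(
      file     => "cookies.txt",   # illustrative filename
      autosave => 1,               # write the jar back to disk automatically
   )
);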

Conclusion

In this brief tutorial, I've shown you only a fraction of WWW::Mechanize's capabilities. I invite you to peruse the WWW::Mechanize documentation for a complete description. Hopefully these examples will suffice to get you moving quickly with this great module.

About the Author

W. Jason Gilmore (http://www.wjgilmore.com/) is the Open Source Editorial Director for Apress (http://www.apress.com/). He's the author of Beginning PHP 5 and MySQL: Novice to Professional (Apress, 2004. 748pp.). His work has been featured within many of the computing industry's leading publications, including Linux Magazine, O'Reillynet, Devshed, Zend.com, and Webreview. Jason is also the author of A Programmer's Introduction to PHP 4.0 (453pp., Apress). Along with colleague Jon Shoberg, he's co-author of "Out in the Open," a monthly column published within Linux magazine.




