HTML Parsing: The World is Your Database
By Brad Lhotsky

We've all found useful information on the Web. Occasionally, it's even necessary to retrieve that information in an automated fashion. It could be just for your own amusement, possibly a new Web service that hasn't yet published an API, or even a critical business partner who only exposes a Web-based interface to you.

Of course, screen scraping Web pages is not the optimal solution to any problem, and I highly advise you to look into APIs or formal Web services that will provide a more consistent and intentional programming interface. Potential problems could arise for a number of reasons.

Step 1: Considerations

The most obvious and annoying problem is that you are not guaranteed any form of consistency in the presentation of your data. Web sites are under construction constantly. Even when they look the same, programmers and designers are behind the scenes tweaking little pieces to optimize, straighten, or update. This means that your data is likely to move or disappear entirely. As you can imagine, this can lead to erroneous data or your program failing to complete.

A problem that you might not think of immediately is the impact of your screen scraping on the target's Web server. During the development phase especially, you should give serious thought to mirroring the Web site, using any number of mirroring applications available on the Web. This will protect you against accidentally launching a Denial of Service attack on the target's Web site. Once you move to production, out of common courtesy, you should run your program only as often as needed to give you the accuracy you require. Obviously, if this is a business-to-business transaction, you should keep the other guy in the loop. It won't be good for your business relationships should you trip the other company's Intrusion Detection System and then have to explain what you're doing to a defensive security administrator.
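One way to go easy on the target during development is to keep a local copy of each page and only touch the remote server when no copy exists. The following is a minimal sketch of that idea, not a replacement for a real mirroring tool. The cached_get name, the cache directory layout, and the idea of passing the fetching routine in as a code reference are all assumptions made for illustration.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Spec;

#
# Return the content for $url, calling $fetch (a code reference that
# takes a URL and returns the page, or undef on failure) only when no
# local copy exists in $cacheDir.
sub cached_get {
    my ($url, $cacheDir, $fetch) = @_;

    #
    # One cache file per URL, named after the MD5 of the URL
    my $file = File::Spec->catfile($cacheDir, md5_hex($url) . '.html');

    #
    # Serve the local copy if we already have one
    if ( -e $file ) {
        open my $in, '<:encoding(UTF-8)', $file
            or die "Cannot read $file: $!";
        local $/;    # slurp the whole file
        my $content = <$in>;
        close $in;
        return $content;
    }

    #
    # Otherwise hit the remote server once, and remember the answer
    my $content = $fetch->($url);
    return unless defined $content;

    open my $out, '>:encoding(UTF-8)', $file
        or die "Cannot write $file: $!";
    print {$out} $content;
    close $out;

    return $content;
}
```

With LWP::Simple loaded, a call might look like cached_get($url, $cacheDir, \&LWP::Simple::get), where $cacheDir is an existing directory of your choosing. Delete the cache files whenever you want a fresh copy.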

Along the same lines, consider the legality of the screen scraping. To a Web server, your traffic could masquerade as 100% interactive, valid traffic, but upon closer inspection, a wise system administrator will likely put the pieces together. Search that company's Web site for "Acceptable Use Policies" and "Terms of Service." In some cases, they may not apply but it's likely that the privilege to access the data is granted only after agreeing to one of the two aforementioned documents.

Step 2: Research

At this point, it's necessary to dive into the task at hand. Go through the motions manually in a Web browser that supports thorough debugging. My experience with Firefox has always been a positive one. Through the use of tools such as the DOM Inspector, the built-in JavaScript Debugger, and extensions such as Web Developer, View Source With .., and Venkman, it's been one of the best platforms for Web development I've encountered. Incidentally, the elements of Web design are critical to the automated extraction of that data. To write a good screen scraper, there are two phases to debug: the request and the response.

The request

A Web server is not a mind reader; it has to know what you're after. HTTP requests tell the Web server what document to serve and how to serve it. The request can be issued through the address bar, a form, or a link. As you navigate the site, take note of the parameters passed in the Query String of the URL. If you need to log in, use the Web Developer Extension to "Display Form Details" and take note of the names of the login prompt and the form objects themselves. Also, it's important to take note of the "METHOD" the form is going to use, either "GET" or "POST." As you go through, sketch out the process on a scrap piece of paper with details on the parameters along the way. If you're clicking on links to get where you need, use the right-click option of "View Link Properties" to get details.
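The notes you sketch on paper translate fairly directly into a data structure. Here is a minimal sketch of one recorded "GET" step and the URL it produces. The site, the parameter names, and the build_get_url and escape_component helpers are all hypothetical; in practice, a module such as URI handles the encoding more thoroughly.

```perl
#!/usr/bin/perl

use strict;
use warnings;

#
# Percent-encode one Query String component (assumes a byte string)
sub escape_component {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord $1)/ge;
    return $text;
}

#
# Build the URL for a "GET" request from a base URL and a list of
# name/value pairs, kept as a list so the parameter order is preserved
sub build_get_url {
    my ($base, @pairs) = @_;

    my @parts;
    while ( my ($name, $value) = splice(@pairs, 0, 2) ) {
        push @parts,
            escape_component($name) . '=' . escape_component($value);
    }

    return @parts ? $base . '?' . join('&', @parts) : $base;
}

#
# Hypothetical notes taken while clicking through a site
my %step = (
    method => 'GET',
    url    => 'http://www.example.com/search',
    params => [ q => 'baltimore weather', page => 2 ],
);

print build_get_url($step{url}, @{ $step{params} }), "\n";
```

Run as is, this prints http://www.example.com/search?q=baltimore%20weather&page=2. Browsers usually send a space in form data as "+" rather than "%20"; most servers accept either form.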

A key thing people often miss when doing Web automation is the effect of client-side scripting. You can use Venkman to step through the entire run of client-side code. You want to pay attention to hidden form fields that are often set "onClick" of the submit button, or through other types of normal user interaction. Without knowing and setting these hidden fields to the correct value, the page will refuse to load or cause problems. Granted, this isn't good practice on the site designer's part because a growing number of security-aware Web surfers are limiting or disabling client-side scripting entirely.

The response

After sketching out the path to your data, you've finally arrived at the page that contains the data itself. You now need to map out the page in a way that lets you pick your data out from the rest of the insignificant details, styling, and advertisements! I've always believed in syntax highlighting and have become accustomed to vim's flavor of highlighting. I've got the View Source With .. Extension configured to use gvim. So, I right-click and, with any luck, the page source is displayed in the gvim buffer with syntax highlighting enabled. If the page has a weird extension, or no extension, I might have to "set syntax=html" if it's not presenting the proper page headers. Search through the source file, correlating the visual representations in the browser with the source code that's generating them. You'll need to find landmarks in the HTML to use as a means to guide your parser through an obscure landscape of markup language. If you're having problems, another indispensable tool provided by Firefox is "View Selection Source." To use it, simply highlight some content and then right-click -> "View Selection Source." A Mozilla Source viewer opens with just the HTML that generated the selected content highlighted, with some surrounding HTML to provide context.

You're going to have to start thinking like a machine. Think simple: 1s and 0s, true and false! I usually start at my data and work back, looking for a unique tag or pattern that I can use to locate the data moving forward. Look not only at the HTML Elements (<b>, <td>, and so forth), but at their attributes (color="#FF0000", colspan="3") to profile the areas containing and surrounding your data.

The lay of the land is changing these days. It should be getting much easier to treat HTML as a data source thanks to Web Standards and the alarming number of Web designers pushing whole-heartedly for their adoption. The old table-based layouts, styled by font tags and animated GIFs, are giving way to "Document Object Model"-aware design and styling fueled mostly by Cascading Style Sheets (CSS). CSS works most effectively when the document layout emulates an object: there are "classes" and "ids," and tags establish relationships. CSS makes it trivial for Web designers with passion and experience in Design Arts to cooperate with Web programmers whose passion is the Art of Programming and whose idea of "progressive design" is white text on a black background! The cues that programmers and designers specify to ensure interoperability of content and presentation give the Screen Scraper a legible road map by which to extract the data. If you see "div," "span," "tbody," and "thead" elements bearing attributes such as "class" and "id," favor using these elements as landmarks. Although nothing is guaranteed, it's much more likely that these elements will maintain their relationships because they're more often the result of divisional cooperation than of entropy.
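To make the idea of a landmark concrete, here is a minimal sketch that uses the HTML::TokeParser module to pull the text out of every element carrying a particular "class" attribute. The markup, the class names, and the text_by_class helper are made up for illustration. The sketch assumes the class attribute holds a single value and that the target elements are not nested inside elements of the same type.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use HTML::TokeParser;

#
# Return the text of every <$tag> element whose "class" attribute
# equals $class, in document order
sub text_by_class {
    my ($html, $tag, $class) = @_;

    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot parse HTML";

    my @found;
    while ( my $token = $parser->get_tag($tag) ) {
        #
        # The second element of a start tag token is its attributes
        my $attr = $token->[1];
        next unless defined $attr->{class} && $attr->{class} eq $class;

        #
        # Collect the text up to the matching end tag
        push @found, $parser->get_trimmed_text("/$tag");
    }

    return @found;
}

#
# Made-up markup standing in for a real page
my $html = <<'HTML';
<div id="forecast">
  <span class="label">Today</span>
  <span class="temp">72</span>
  <span class="label">Tonight</span>
  <span class="temp">58</span>
</div>
HTML

my @temps = text_by_class($html, 'span', 'temp');

#
# Fail loudly if the landmark is gone instead of reporting bad data
die "Landmark not found; has the page changed?\n" unless @temps;

print join(', ', @temps), "\n";
```

Run against the sample markup, this prints "72, 58". Dying when the landmark goes missing is a cheap guard against a redesigned page quietly feeding you wrong data.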

One of the simplest ways to keep your bearings is to print out the section of HTML you're targeting, and sketch out some simple logic to be able to quickly identify it. I use a highlighter and a red pen to make notes on the printout that I can glance at as a sanity check.

Step 3: Automated Retrieval of Your Content

Depending on how complicated the path to your data is, there are a number of tools available. Basic "GET" method requests that don't require cookies, session management, or form tracking can take advantage of the simple interface provided by the LWP::Simple package.

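#!/usr/bin/perl

use strict;
use warnings;
use LWP::Simple;

my $url = q|http://www.weather.com/weather/local/21224|;

#
# get() returns undef if the request fails
my $content = get $url;
die "Could not retrieve $url\n" unless defined $content;

print $content;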

That's it. Simple.

More complex problems with cookies and logins will require a more sophisticated tool. WWW::Mechanize offers a simple solution to a complex path to your data, with the ability to store cookies and construct form objects that can intelligently initialize themselves. An example:

#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize;

my $authPage = q|http://www.weather.com|;
my $authForm = 'whatwhere';
my %formVars = (
    where    => '21224',
    what     => 'Weather36HourUndeclared'
);

#
# or optionally, set the fields in visible order
my @visible = qw(21224);

#
# Create a "bot"
my $bot = WWW::Mechanize->new();

#
# Masquerade as Mac Firefox
$bot->agent_alias('Mac Mozilla');

#
# Retrieve the page with our "login form"
$bot->get($authPage);

#
# fill out the form!
$bot->form_name($authForm);

while( my ($k,$v) = each %formVars ) {
    $bot->field($k,$v);
}
#
# OR
# $bot->set_visible(@visible);

#
# submit the form!
$bot->submit();

#
# Print the Content
print $bot->content();
