Tracking a FedEx Package with an HTML Scraper
This Perl script uses the LWP library, which lets us send queries to a remote Web server. LWP and the CGI library are loaded on lines 1 and 2. On line 10, we check for the existence of the track_num parameter. If it exists, we send it to the FedEx Web site by calling the &parse_html function on line 11. If an error occurs, meaning that the tracking number was invalid, we output VoiceXML content that plays an error message and re-runs the script. If the request was successful, the code on lines 26 to 54 is executed, which plays the tracking information that was retrieved.
If the track_num form field was not passed to the script, the VoiceXML content on lines 55 to 79 is sent back to the VoiceXML gateway. This is the dialog that asks the user for the tracking number. Because no number has been sent when a call begins, it is the dialog that gets executed first.
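The three branches described above can be sketched roughly as follows. This is an illustration, not the script from the listing. To keep it self-contained, the form fields arrive as a plain hash reference (the real script reads them through the CGI library), and the lookup is passed in as a code reference standing in for parse_html. The names handle_request and vxml, the prompt wording, and the tracker.cgi address are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sketch of the three branches. $params is a hash reference of
# form fields; $lookup is a code reference standing in for parse_html and
# returns a list of records (empty when the tracking number is invalid).
sub handle_request {
    my ($params, $lookup) = @_;
    my $track_num = $params->{track_num};

    # No tracking number yet: send the dialog that asks for one.
    unless (defined $track_num && length $track_num) {
        return vxml(
            '<field name="track_num" type="digits">'
          . '<prompt>Please say your tracking number.</prompt>'
          . '<filled><submit next="tracker.cgi" namelist="track_num"/></filled>'
          . '</field>'
        );
    }

    my @records = $lookup->($track_num);

    # Lookup failed: play an error message and start over.
    unless (@records) {
        return vxml(
            '<block><prompt>Sorry, that tracking number was not found.</prompt>'
          . '<goto next="tracker.cgi"/></block>'
        );
    }

    # Success: play the latest status, which is the first record.
    # (Assumes the message contains no XML special characters.)
    return vxml("<block><prompt>$records[0]{message}</prompt></block>");
}

# Wrap a form body in a minimal VoiceXML document.
sub vxml {
    my ($body) = @_;
    return qq{<?xml version="1.0"?>\n<vxml version="1.0"><form>$body</form></vxml>\n};
}

# With no parameters, the caller gets the dialog that asks for a number.
print handle_request({}, sub { () });
```

Passing the lookup in as a code reference is only a convenience for trying out the branches without touching the network.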
The guts of the script are in the parse_html subroutine starting on line 81. This is where we send the tracking number to the FedEx Web site and parse the results. This is also where we need to do a little regular expression fancy footwork. On line 83, we build the full URL that we will be submitting the information to (including the tracking number). Fortunately, the FedEx script allows us to submit data via the GET method (instead of POST), so we can simply call the get function of the LWP library, passing the URL we just built. The function returns the resulting HTML, which we store in the $page variable.
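Here is a minimal sketch of that step, under a few assumptions. The base address and the tracknumbers parameter name are placeholders, since the real tracking URL is not reproduced here, and the helper names build_tracking_url and fetch_page are made up for the example. The get function is the one exported by LWP::Simple, which returns undef when the request fails.

```perl
use strict;
use warnings;

# Placeholder address and parameter name, not the real tracking URL.
my $BASE_URL = 'http://tracking.example.com/track';

# Build the GET URL, percent-encoding anything that is not URL-safe.
sub build_tracking_url {
    my ($track_num) = @_;
    (my $escaped = $track_num) =~ s/([^A-Za-z0-9_.~-])/sprintf('%%%02X', ord $1)/ge;
    return "$BASE_URL?tracknumbers=$escaped";
}

# Fetch the page with LWP::Simple's get(), which returns undef on failure.
# The module is loaded only when this sub is actually called.
sub fetch_page {
    my ($url) = @_;
    require LWP::Simple;
    return LWP::Simple::get($url);
}

print build_tracking_url('1234 5678'), "\n";
```

Escaping the tracking number matters because it comes from the caller; anything other than digits would otherwise end up in the query string unchanged.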
So now we have one very large text string that contains an HTML page. How do we get the information out of all that text? By finding a pattern in the results and using it as a marker for where the data starts. To figure this out, I used the FedEx application from a Web browser and viewed the HTML source. I found that a unique block of HTML occurs just before the tracking data begins. The result is the regular expression on line 91, which says: find the text CLASS="resultstableheader", followed by zero or more characters .*?, followed by a </tr> tag, followed by a whole bunch of text (.*?) up to a </table> tag. This last bit of text up to the </table> tag, which the parentheses capture, is where our data is located. We grab it and save it on line 92.
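To see that pattern at work, here is a small sketch run against made-up HTML that only imitates the shape described above; the real page is not reproduced here. The helper name extract_results is hypothetical, and the /s and /i modifiers are my assumption about what a page with line breaks and mixed-case tags would need.

```perl
use strict;
use warnings;

# Made-up HTML that only imitates the shape described in the article.
my $page = <<'HTML';
<table>
<tr><td CLASS="resultstableheader">Activity</td><td>Date</td></tr>
<tr><td>Delivered</td><td>01/02/2001</td></tr>
</table>
<p>footer</p>
HTML

# Return everything between the header row and the closing </table>,
# or undef when the marker is missing.
sub extract_results {
    my ($html) = @_;
    # /s lets "." match newlines; /i tolerates differences in tag case.
    return $html =~ m{CLASS="resultstableheader".*?</tr>(.*?)</table>}si ? $1 : undef;
}

my $data = extract_results($page);
print defined $data ? $data : "no tracking data found\n";
```

Both .*? pieces are non-greedy, so the match stops at the first </tr> after the marker and the first </table> after that, rather than running on to the end of the page.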
Now that we have the glob of text containing the data, we have to parse it with more regular expressions. To accomplish this in one fell swoop, I created a compound regular expression on lines 94 and 95 that finds a row of data and splits it into pieces. Lines 97 to 112 split the rows into records and save the information in an array of hashes. Each array item contains the message, date, time, and notes that are related to the record. FedEx lists not only the current status, but the whole history of a package. While we're only playing back the first record, which is the latest tracking information, you can reuse this data structure and set of regular expressions to play the entire history.
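The array of hashes can be sketched like this. The row layout is an assumption on my part: four table cells per row in the order message, date, time, notes. The real page may well differ, so the pattern below is not the one from lines 94 and 95, and the name parse_rows is hypothetical.

```perl
use strict;
use warnings;

# Assumed layout: every row has four cells in the order
# message, date, time, notes. The real page may differ.
my $cell = qr{<td[^>]*>(.*?)</td>}si;
my $row  = qr{<tr[^>]*>\s*$cell\s*$cell\s*$cell\s*$cell\s*</tr>}si;

# Turn the block of rows into a list of hash references, newest first.
sub parse_rows {
    my ($block) = @_;
    my @records;
    while ($block =~ /$row/g) {
        my %record = (message => $1, date => $2, time => $3, notes => $4);
        for my $value (values %record) {
            $value =~ s/<[^>]+>//g;      # drop any nested markup
            $value =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        }
        push @records, \%record;
    }
    return @records;
}

my @history = parse_rows(
    '<tr><td>Delivered</td><td>01/02/2001</td><td>10:15</td><td>Signed for</td></tr>'
  . '<tr><td>On truck</td><td>01/02/2001</td><td>07:30</td><td></td></tr>'
);

# Playing the whole history is just a loop over the same structure.
print "$_->{date} $_->{time}: $_->{message}\n" for @history;
```

Because every record has the same four keys, reading back the full history instead of only the first entry needs no change to the parsing code.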
Once the data structure has been built, the subroutine returns the array to the caller. You can see where we reference the first record of this data structure on lines 27 to 31, where we create local copies of its fields, which are then included in the VoiceXML output.
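As a rough sketch of that last step, the code below copies the fields of the first record into local variables and drops them into a VoiceXML prompt. The markup is a bare-bones guess rather than the output on lines 26 to 54, the names xml_escape and status_vxml are hypothetical, and the escaping step is my own addition so that stray markup characters in the scraped text cannot break the document.

```perl
use strict;
use warnings;

# Escape the characters that are special in XML text.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Build the status dialog from the first (most recent) record.
sub status_vxml {
    my (@records) = @_;
    my $latest  = $records[0];
    my $message = xml_escape($latest->{message});
    my $date    = xml_escape($latest->{date});
    my $time    = xml_escape($latest->{time});
    my $notes   = xml_escape($latest->{notes});

    return <<"VXML";
<?xml version="1.0"?>
<vxml version="1.0">
  <form>
    <block>
      <prompt>$message on $date at $time. $notes</prompt>
    </block>
  </form>
</vxml>
VXML
}

print status_vxml(
    { message => 'Delivered', date => '01/02/2001', time => '10:15', notes => 'Signed for' }
);
```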
Well, we just created a pretty useful VoiceXML application in not much more than 100 lines of code. That's impressive, I think. The opportunity to voice-enable any number of Web applications is at hand, and here's some code you can use to get started.
About Jonathan Eisenzopf
Jonathan is a member of the Ferrum Group, LLC, a firm based in Reston, Virginia that specializes in Voice Web consulting and training. He has also written articles for other online and print publications, including WebReference.com and WDVL.com. Feel free to send an email to email@example.com with questions or comments about the VoiceXML Developer series, or for more information about training and consulting services.