Need to visit a competitor Web page and parse out the latest rival product prices? Looking to retrieve data from a company that hasn’t yet figured out Web services? Whatever your motives, if you’re looking to grab the HTML of a Web page, the following little function should be able to help.
Just call the following GetPageHTML function, passing in the URL of the page you want to retrieve. It’ll return a string containing the HTML:
    Public Function GetPageHTML( _
        ByVal URL As String) As String
        ' Retrieves the HTML from the specified URL
        Dim objWC As New System.Net.WebClient()
        Return New System.Text.UTF8Encoding().GetString( _
            objWC.DownloadData(URL))
    End Function
Here’s an example of its usage:
    strHTML = GetPageHTML("http://www.karlmoore.com/")
An extremely short function, but incredibly useful.
How to Snatch HTML, with a Timeout
The last function is great for many applications. You pass it a URL, and it’ll work on grabbing the page HTML. The problem is that it will keep trying until it either times out or retrieves the page.
Sometimes, you don’t have that luxury. Say you’re running a Web site that needs to retrieve the HTML, parse it, and display results to a user. You can’t wait two minutes for the server to respond, then download the page and feed it back to your visitor. You need a response within ten seconds—or not at all.
Unfortunately, despite numerous developer claims to the contrary, the WebClient class offers no timeout setting of its own. Rather, you need to use some of the more in-depth System.Net classes to handle the situation. Here’s my offering, wrapped into a handy little function:
    Public Function GetPageHTML(ByVal URL As String, _
        Optional ByVal TimeoutSeconds As Integer = 10) _
        As String
        ' Retrieves the HTML from the specified URL,
        ' using a default timeout of 10 seconds
        Dim objRequest As Net.WebRequest
        Dim objResponse As Net.WebResponse = Nothing
        Dim objStreamReceive As System.IO.Stream
        Dim objEncoding As System.Text.Encoding
        Dim objStreamRead As System.IO.StreamReader
        Try
            ' Set up our Web request (Timeout is in milliseconds)
            objRequest = Net.WebRequest.Create(URL)
            objRequest.Timeout = TimeoutSeconds * 1000
            ' Retrieve data from request
            objResponse = objRequest.GetResponse()
            objStreamReceive = objResponse.GetResponseStream()
            objEncoding = System.Text.Encoding.GetEncoding( _
                "utf-8")
            objStreamRead = New System.IO.StreamReader( _
                objStreamReceive, objEncoding)
            ' Return the page content
            Return objStreamRead.ReadToEnd()
        Catch
            ' Error occurred grabbing data, simply return an empty string
            Return ""
        Finally
            ' Close the response whether or not the request succeeded
            If Not objResponse Is Nothing Then
                objResponse.Close()
            End If
        End Try
    End Function
Here, the code creates a Web request for the URL and sets its timeout, converting the seconds you pass in to milliseconds. If the server responds within that timeframe, the response is read through a stream, decoded as UTF-8 text, and passed back as the result of the function. If anything goes wrong, including a timeout, the function returns an empty string. You can use it a little like this:
strHTML = GetPageHTML("http://www.karlmoore.com/", 5)
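Because the function returns an empty string whenever anything goes wrong, the calling code will usually want to check for that before parsing. The following is only a rough sketch: ShowPageSize is a hypothetical helper name, and it assumes the GetPageHTML function above is defined in the same class or module.

```vbnet
' Hypothetical helper; assumes the GetPageHTML function shown
' earlier is defined in the same class or module.
Public Sub ShowPageSize(ByVal URL As String)
    Dim strHTML As String = GetPageHTML(URL, 5)

    If strHTML.Length = 0 Then
        ' A timeout, a network error, and a genuinely empty
        ' page all look the same from here
        System.Console.WriteLine("No HTML retrieved from " & URL)
    Else
        System.Console.WriteLine("Retrieved " & strHTML.Length & _
            " characters from " & URL)
    End If
End Sub
```

If you need to tell a timeout apart from other failures, the Catch block in GetPageHTML would have to pass on more than an empty string.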
Admittedly, this all seems like a lot of work just to add a timeout. But it does its job—and well. Enjoy!
TOP TIP Remember, the timeout we’ve added covers the wait for the server to start responding to our request (the GetResponse call), rather than the time needed for the full HTML to be received.
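If the download itself also needs a limit, one option is the ReadWriteTimeout property of HttpWebRequest, which caps how long each individual read from the response stream may block. The sketch below is a variant under assumptions rather than a drop-in replacement: the module name PageFetcher and the function name GetPageHTMLBounded are made up for illustration, the same number of seconds is reused for both limits, and it targets the .NET Framework. Note that the limit is per read, so a server that trickles data slowly can still take longer in total.

```vbnet
Imports System.IO
Imports System.Net
Imports System.Text

Public Module PageFetcher

    ' Hypothetical variant of GetPageHTML that also limits how
    ' long each read of the response body may block.
    Public Function GetPageHTMLBounded(ByVal URL As String, _
        Optional ByVal TimeoutSeconds As Integer = 10) As String

        Dim objResponse As WebResponse = Nothing
        Try
            Dim objRequest As WebRequest = WebRequest.Create(URL)
            ' Limit on waiting for the server to start responding
            objRequest.Timeout = TimeoutSeconds * 1000

            ' ReadWriteTimeout exists only on HttpWebRequest, so
            ' apply it only when the URL produced an HTTP request
            If TypeOf objRequest Is HttpWebRequest Then
                DirectCast(objRequest, HttpWebRequest).ReadWriteTimeout = _
                    TimeoutSeconds * 1000
            End If

            objResponse = objRequest.GetResponse()
            Dim objStreamRead As New StreamReader( _
                objResponse.GetResponseStream(), Encoding.UTF8)
            Return objStreamRead.ReadToEnd()
        Catch
            ' Any failure, including either timeout, yields ""
            Return ""
        Finally
            ' Close the response whether or not the request succeeded
            If Not objResponse Is Nothing Then
                objResponse.Close()
            End If
        End Try
    End Function

End Module
```

You would call it just like the original, for example GetPageHTMLBounded("http://www.karlmoore.com/", 5).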
About the Author
Karl Moore (MCSD, MVP) is an experienced author living in Yorkshire, England. He is the author of numerous technology books, including the new Ultimate VB .NET and ASP.NET Code Book (ISBN 1-59059-106-2), and regularly features at industry conferences and on BBC radio. Moore also runs his own creative consultancy, White Cliff Computing Ltd. Visit his official Web site at www.karlmoore.com.