So, you’ve retrieved the HTML of that Web page (see How to Snatch HTML Using Visual Basic Code!) and now need to parse out all the links to use in your research database. Or maybe you’ve visited the page and want to make a note of all the image links, so you can download them at some later point.
Well, you have two options. You can write your own parsing algorithm, consisting of ten million InStr and Mid statements. They’re often slow and frequently buggy, but they’re a truly great challenge (always my favorite routines to write).
Alternatively, you can write a regular expression in VB.NET. This is where you provide an “expression” that describes how a link looks and what portion you want to retrieve (that is, the bit after <a href=" but before the next " for a hyperlink). Then, you run the expression and retrieve the matches. The problem with regular expressions is that they’re difficult to formulate.
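To make the idea concrete, here is a minimal sketch of a single match. It assumes the href value is always wrapped in double quotes. The FirstHref helper, the "url" group name and the example.com address are made up for illustration and are not part of the routines in this article:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RegexSketch

    ' Hypothetical helper: returns the first double-quoted href value
    ' found in the HTML, or an empty string if there is none
    Public Function FirstHref(ByVal html As String) As String
        ' Pattern: "href", optional whitespace, "=", optional whitespace,
        ' then a quoted value captured in the named group "url"
        Dim m As Match = Regex.Match(html, _
            "href\s*=\s*""(?<url>[^""]*)""", RegexOptions.IgnoreCase)
        If m.Success Then Return m.Groups("url").Value
        Return ""
    End Function

    Sub Main()
        ' Prints http://www.example.com/
        Console.WriteLine(FirstHref("<a href=""http://www.example.com/"">Example</a>"))
    End Sub

End Module
```

Even this small pattern needs doubled quotes and escaped whitespace classes, which gives a feel for how quickly a full expression gets fiddly.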
So, why not cheat? Below you’ll find two neat little functions I’ve already put together using regular expressions (plug: you’ll find dozens of ready-to-run expressions in my new book, The Ultimate VB .NET and ASP.NET Code Book). Just pass in the HTML from your Web page, and each function returns an ArrayList object containing the link or image matches:
Public Function ParseLinks(ByVal HTML As String) As ArrayList
    ' Remember to add the following at top of file:
    ' - Imports System.Text.RegularExpressions
    Dim objRegEx As System.Text.RegularExpressions.Regex
    Dim objMatch As System.Text.RegularExpressions.Match
    Dim arrLinks As New System.Collections.ArrayList()
    ' Create regular expression: "a", any text, "href", "=", then either
    ' a quoted value or a run of non-whitespace characters (group <1>).
    ' The lazy ".*?" lets several links on one line match separately.
    objRegEx = New System.Text.RegularExpressions.Regex( _
        "a.*?href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))", _
        System.Text.RegularExpressions.RegexOptions.IgnoreCase Or _
        System.Text.RegularExpressions.RegexOptions.Compiled)
    ' Match expression to HTML
    objMatch = objRegEx.Match(HTML)
    ' Loop through matches and add <1> to ArrayList
    While objMatch.Success
        Dim strMatch As String
        strMatch = objMatch.Groups(1).ToString
        arrLinks.Add(strMatch)
        objMatch = objMatch.NextMatch()
    End While
    ' Pass back results
    Return arrLinks
End Function

Public Function ParseImages(ByVal HTML As String) As ArrayList
    ' Remember to add the following at top of file:
    ' - Imports System.Text.RegularExpressions
    Dim objRegEx As System.Text.RegularExpressions.Regex
    Dim objMatch As System.Text.RegularExpressions.Match
    Dim arrLinks As New System.Collections.ArrayList()
    ' Create regular expression: same shape as in ParseLinks, but
    ' looking for "img" followed by a "src" attribute
    objRegEx = New System.Text.RegularExpressions.Regex( _
        "img.*?src\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))", _
        System.Text.RegularExpressions.RegexOptions.IgnoreCase Or _
        System.Text.RegularExpressions.RegexOptions.Compiled)
    ' Match expression to HTML
    objMatch = objRegEx.Match(HTML)
    ' Loop through matches and add <1> to ArrayList
    While objMatch.Success
        Dim strMatch As String
        strMatch = objMatch.Groups(1).ToString
        arrLinks.Add(strMatch)
        objMatch = objMatch.NextMatch()
    End While
    ' Pass back results
    Return arrLinks
End Function
Here’s a simplified example using the ParseLinks routine. The ParseImages routine works in exactly the same way:
Dim arrLinks As ArrayList = ParseLinks( _
    "<a href=""http://www.marksandler.com/"">" & _
    "Visit MarkSandler.com</a>")
' Loop through results
Dim shtCount As Integer
For shtCount = 0 To arrLinks.Count - 1
    MessageBox.Show(arrLinks(shtCount).ToString)
Next
One word of warning: many Web sites use relative links. In other words, an image may refer to /images/mypic.gif rather than http://www.mysite.com/images/mypic.gif. You may wish to check for this in code, perhaps by testing whether the link starts with “http”; if the prefix isn’t there, add the site address programmatically.
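One possible way to handle this, rather than checking for “http” by hand, is to let the System.Uri class combine each link with the address of the page it came from. The sketch below assumes .NET 2.0 or later (for Uri.TryCreate); the ResolveLinks name and its baseUrl parameter are hypothetical and not part of the routines above:

```vbnet
Imports System
Imports System.Collections

Module RelativeLinkSketch

    ' Hypothetical helper: converts every link in the list to an absolute
    ' URL, treating baseUrl as the address of the page the HTML came from
    Public Function ResolveLinks(ByVal links As ArrayList, _
                                 ByVal baseUrl As String) As ArrayList
        Dim baseUri As New Uri(baseUrl)
        Dim results As New ArrayList()
        For Each link As String In links
            Dim resolved As Uri = Nothing
            ' Combines relative links with the base address;
            ' links that are already absolute are kept as they are
            If Uri.TryCreate(baseUri, link, resolved) Then
                results.Add(resolved.AbsoluteUri)
            Else
                ' Could not be parsed: pass the original text through
                results.Add(link)
            End If
        Next
        Return results
    End Function

    Sub Main()
        Dim links As New ArrayList()
        links.Add("/images/mypic.gif")
        links.Add("http://www.marksandler.com/")
        For Each url As String In ResolveLinks(links, "http://www.mysite.com/index.html")
            Console.WriteLine(url)
        Next
        ' Prints:
        ' http://www.mysite.com/images/mypic.gif
        ' http://www.marksandler.com/
    End Sub

End Module
```

In practice you would pass in the ArrayList returned by ParseLinks or ParseImages, together with the URL you originally downloaded the HTML from.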
And that’s all you need to know to successfully strip links and images out of any HTML. Best wishes!
About the Author
Karl Moore (MCSD, MVP) is an experienced author living in Yorkshire, England. He is the author of numerous technology books, including the new Ultimate VB .NET and ASP.NET Code Book (ISBN 1-59059-106-2), and regularly features at industry conferences and on BBC radio. Moore also runs his own creative consultancy, White Cliff Computing Ltd. Visit his official Web site at www.karlmoore.com.