Tricks of Parsing a Page for Links and Images

So, you’ve retrieved the HTML of that Web page (see How to Snatch HTML Using Visual Basic Code!) and now need to parse out all the links to use in your research database. Or maybe you’ve visited the page and want to make a note of all the image links, so you can download them at some later point.

Well, you have two options. You can write your own parsing algorithm, consisting of ten million InStr and Mid statements. They’re often slow and frequently buggy, but they’re a truly great challenge (always my favorite routines to write).
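To give a feel for the hand-rolled route, here is a minimal sketch of it. ParseLinksManually and the ManualLinkParser module are hypothetical names, not part of the routines in this article, and the code assumes every link is written as href="..." with double quotes; single-quoted and unquoted attributes are silently missed.

```vbnet
Imports System.Collections
Imports Microsoft.VisualBasic

Module ManualLinkParser
    ' Hypothetical helper: collects double-quoted href values using only InStr and Mid.
    Public Function ParseLinksManually(ByVal HTML As String) As ArrayList
        Dim arrLinks As New ArrayList()
        Const Marker As String = "href="""
        ' InStr is 1-based and returns 0 when nothing is found
        Dim intStart As Integer = InStr(1, HTML, Marker, CompareMethod.Text)
        While intStart > 0
            ' The value begins right after the marker and runs to the next quote
            Dim intValue As Integer = intStart + Len(Marker)
            Dim intEnd As Integer = InStr(intValue, HTML, """")
            If intEnd = 0 Then Exit While
            arrLinks.Add(Mid(HTML, intValue, intEnd - intValue))
            ' Carry on searching after the closing quote
            intStart = InStr(intEnd + 1, HTML, Marker, CompareMethod.Text)
        End While
        Return arrLinks
    End Function
End Module
```

Even this short version has to juggle three positions by hand, and every extra case (single quotes, spaces around the equals sign) adds another branch.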

Alternatively, you can write a regular expression in VB.NET. This is where you provide an “expression” that describes how a link looks and what portion you want to retrieve (for a hyperlink, that is the bit after <a href=" but before the next "). Then, you run the expression and retrieve matches. The problem with these is that they’re difficult to formulate.

So, why not cheat? Below you’ll find two neat little functions I’ve already put together using regular expressions (plug: you’ll find dozens of ready-to-run expressions in my new book, The Ultimate VB .NET and ASP.NET Code Book). Just pass in the HTML from your Web page, and each function returns an ArrayList object containing the link or image matches:

Public Function ParseLinks(ByVal HTML As String) As ArrayList
    ' Remember to add the following at top of class:
    ' - Imports System.Text.RegularExpressions
    Dim objRegEx As System.Text.RegularExpressions.Regex
    Dim objMatch As System.Text.RegularExpressions.Match
    Dim arrLinks As New System.Collections.ArrayList()
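    ' Create regular expression: find href inside an <a ...> tag and
    ' capture its value as group 1, whether quoted or unquoted
    objRegEx = New System.Text.RegularExpressions.Regex( _
        "<a\b[^>]*?\bhref\s*=\s*(?:""(?<1>[^""]*)""|(?<1>[^\s>""]+))", _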
        System.Text.RegularExpressions.RegexOptions.IgnoreCase Or _
        System.Text.RegularExpressions.RegexOptions.Compiled)
    ' Match expression to HTML
    objMatch = objRegEx.Match(HTML)
    ' Loop through matches and add <1> to ArrayList
    While objMatch.Success
        Dim strMatch As String
        strMatch = objMatch.Groups(1).ToString
        arrLinks.Add(strMatch)
        objMatch = objMatch.NextMatch()
    End While
    ' Pass back results
    Return arrLinks
End Function

Public Function ParseImages(ByVal HTML As String) As ArrayList
    ' Remember to add the following at top of class:
    ' - Imports System.Text.RegularExpressions
    Dim objRegEx As System.Text.RegularExpressions.Regex
    Dim objMatch As System.Text.RegularExpressions.Match
    Dim arrLinks As New System.Collections.ArrayList()
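    ' Create regular expression: find src inside an <img ...> tag and
    ' capture its value as group 1, whether quoted or unquoted
    objRegEx = New System.Text.RegularExpressions.Regex( _
        "<img\b[^>]*?\bsrc\s*=\s*(?:""(?<1>[^""]*)""|(?<1>[^\s>""]+))", _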
        System.Text.RegularExpressions.RegexOptions.IgnoreCase Or _
        System.Text.RegularExpressions.RegexOptions.Compiled)
    ' Match expression to HTML
    objMatch = objRegEx.Match(HTML)
    ' Loop through matches and add <1> to ArrayList
    While objMatch.Success
        Dim strMatch As String
        strMatch = objMatch.Groups(1).ToString
        arrLinks.Add(strMatch)
        objMatch = objMatch.NextMatch()
    End While
    ' Pass back results
    Return arrLinks
End Function

Here’s a simplified example using the ParseLinks routine. The ParseImages routine works in exactly the same way:

Dim arrLinks As ArrayList = ParseLinks( _
    "<a href=""http://www.marksandler.com/"">" & _
    "Visit MarkSandler.com</a>")
' Loop through results
Dim shtCount As Integer
For shtCount = 0 To arrLinks.Count - 1
    MessageBox.Show(arrLinks(shtCount).ToString)
Next

One word of warning: Many Web sites use relative links. In other words, an image may refer to /images/mypic.gif rather than http://www.mysite.com/images/mypic.gif. You may wish to check for this in code (perhaps by looking for “http” at the start of the link). If the prefix isn’t there, add the address of the site the page came from programmatically.
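One way to handle this, sketched here rather than taken from the routines above, is to let System.Uri combine the two parts. ResolveLink, the LinkResolver module, and the BaseUrl parameter are hypothetical names; BaseUrl is assumed to be the full, valid address of the page the HTML came from (an invalid one makes New Uri throw).

```vbnet
Imports System

Module LinkResolver
    ' Hypothetical helper: BaseUrl is the address of the page the HTML came from
    Public Function ResolveLink(ByVal BaseUrl As String, ByVal Link As String) As String
        Dim objResolved As Uri = Nothing
        ' Uri.TryCreate returns absolute links as they are and resolves
        ' relative ones against the base address
        If Uri.TryCreate(New Uri(BaseUrl), Link, objResolved) Then
            Return objResolved.AbsoluteUri
        End If
        ' Could not be resolved; hand back the original text
        Return Link
    End Function
End Module
```

Called as ResolveLink("http://www.mysite.com/pages/index.html", "/images/mypic.gif"), this should give back http://www.mysite.com/images/mypic.gif, while a link that already starts with http:// passes through unchanged. Unlike a plain check for “http”, it also copes with links such as mypic.gif or ../images/mypic.gif that are relative to the current folder.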

And that’s all you need to know to successfully strip links and images out of any HTML. Best wishes!

About the Author

Karl Moore (MCSD, MVP) is an experienced author living in Yorkshire, England. He is the author of numerous technology books, including the new Ultimate VB .NET and ASP.NET Code Book (ISBN 1-59059-106-2), and regularly features at industry conferences and on BBC radio. Moore also runs his own creative consultancy, White Cliff Computing Ltd. Visit his official Web site at www.karlmoore.com.