Microsoft & .NETVisual C#Parsing HTML in Microsoft C#

Parsing HTML in Microsoft C#

Developer.com content and product recommendations are editorially independent. We may make money when you click on links to our partners. Learn More.

Most data on the Web is stored in the Hypertext Markup Language (HTML) format. There are many times that you might want to parse HTML in your C# application. However, the .NET framework does not provide an easy way to parse HTML. Evidence of this is the numerous questions posted by C# programmers looking for an easy way to parse HTML.

The Microsoft .NET framework includes extensive support for Extensible Markup Language (XML). However, although XML and HTML look very similar, they are not very compatible. Consider the following major differences between XML and HTML:

  • XML requires end tags.
  • All XML attribute values must be fully quoted with either single or double quotes.
  • XML tags must be properly nested.
  • XML tag names are case sensitive.
  • XML does not allow duplicate attributes.
  • Empty attributes are not allowed in XML.

Let’s look at one of these examples in code to illustrate the difference. In XML, every beginning tag must have an ending tag. The following HTML would cause problems for a XML parser.

<p>This is line 1<br>
This is line 2</p>

This is just one of many differences. Of course, you can require HTML to be written in such a way that it is compatible with XML. The preceding HTML could be rewritten as in the following example.

<p>This is line 1<br/>
This is line 2</p>

Both an XML parser and any modern browser could understand this. Unfortunately, this is not a viable solution because you do not control the sources of HTML. You will want your program to be able to process HTML from any source.

The Solution

Because of this, I found it necessary to write my own HTML parser. In this article, I will show you how my HTML parser was constructed, and how you can use this parser with your own applications. I will begin by showing you the main components that make up the HTML parser. I will conclude this article by showing a simple example that uses the HTML parser.

The HTML parser consists of the following four classes:

  • Attribute—The attribute class is used to hold an individual attribute inside an HTML tag.
  • AttributeList—The attribute list holds an individual HTML tag and all of its attributes.
  • Parse—Holds general text parsing routines.
  • ParseHTML—The main class that you will interface with; the ParseHTML class is fed the HTML that you would like to parse.

I will now show you how each of these classes functions, and how you will use them. I will begin with the Attribute class.

The Attribute Class

The Attribute class is used to store individual HTML attributes. The source code for the Attribute class can be seen in Listing 1. The following HTML tag demonstrates attributes:

<img src="picture.gif" alt="Some Picture">

The above HTML tag has two attributes named “src” and “alt”. The values of these two attributes are “picture.gif” and “Some Picture”, respectively.

The Attribute class consists of three properties named “name”, “value”, and “delim”. The “name” property stores the name of the attribute. The “value” property stores the value held by the property. And finally, the “delim” property holds the character that was used to delimit the value, if any. This property will either hold a quote (“), an apostrophe (‘), or nothing at all, depending on what was used to delineate the value.

The AttributeList Class

An HTML tag often consists of several attributes. The “AttributeList” class is used to hold a list of these attributes. The “AttributeList” class is shown in Listing 2. The “AttributeList” class consists of a name and a collection of attributes. The “AttributeList” name, stored in a property called “name”, holds the name of the tag. When tags are returned to you from the parser, they will be in the form of “AttributeList” objects.

The AttributeList class makes use of the C# indices. You can access individual attributes both by numeric and string indicies. For example, if an attribute “src” were stored in the “AttributeList” object “theTag”, you could access the “src” attribute in the following ways:

theTag[0]    // assuming "src" were the first attribute
theTag["src"]

Both of these methods could be used to access the attributes of the tag.

The Parse and ParseHTML Classes

If you only want to use the classes to parse HTML, you need not be concerned with the “Parse” class. The “Parse” class is used internally by the HTML parser to provide low-level support for attribute-value based files, such as HTML, SGML, XML, or even HTTP headers. The source code for the “Parse” class is shown in Listing 3. I will not cover the “Parse” class at length in this article. If you are interested in the Parse class, all of its methods are commented.

The “ParseHTML” class subclasses the “Parse” class. The “ParseHTML” class provides the HTML-specific code needed to make the parser work with HTML. The ParseHTML class is shown in Listing 4. The “ParseHTML” class will be your primary interface to the HTML parser. Actually, using the HTML parser is covered in the next section. The methods that you will use primarily are listed here.

    public char Parse()
    public AttributeList GetTag()

The Parse() method is called to retrieve the next character in the HTML file that you are parsing. If the next character is part of a tag, the value of zero is returned. When you see that Parse() has returned zero, you know that you must process an HTML tag. You can access the tag by calling the GetTag method. The GetTag method will return an ArrayList object that contains the tag and all of its attributes.

Using the HTML Parser

Now, I will show you an example of how to use the HTML parser. The example program will prompt you for a URL and then show you all of the hyperlinks contained within the HTML file located at the URL you specified. This example will only work on URLs that lead to HTML data; it will not work with images or other binary data. This example is shown in Listing 5.

To see the program work enter any URL address, such as
http://www.developer.com
. The program will then display all of the hyperlinks contained on the homepage of Developer.COM.

The loop that processes the page is shown here.

ParseHTML parse = new ParseHTML();
parse.Source = page;
while( !parse.Eof() )
{
  char ch = parse.Parse();
  if(ch==0)
  {
    AttributeList tag = parse.GetTag();
    if( tag["href"]!=null )
      System.Console.WriteLine( 
        "Found link: " + tag["href"].Value );
    }
}

As you can see, a ParseHTML object is instantiated, and the object’s Source property is set to the HTML page to be parsed. A loop continues so long as the end of the pages has not been reached. We ignore regular characters and look for each tag, noted by when ch is equal to zero. For each tag, we check to see if it has an HREF attribute. If an HREF attribute is present, the link is displayed.

Conclusions

As you can see, these classes provide an easy-to-use framework for HTML parsing that you can reuse in any Microsoft .NET application. The example program provided uses the parser only to display links. However, I have used this parser in a variety of complex HTML parsing applications.

Author Bio

Jeff Heaton (http://www.jeffheaton.com) is a leading authority on spiders and bots. Jeff is the author of Programming Spiders, Bots and Aggregators in Java (Sybex, 2001) and of the Heaton Spider, a collection of utilities for automated Web access.

Listing 1: The Attribute Class

using System;

namespace HTML
{
  /// <summary>
  /// Attribute holds one attribute, as is normally stored in an
  /// HTML or XML file. This includes a name, value and delimiter.
  /// This source code may be used freely under the
  /// Limited GNU Public License(LGPL).
  ///
  /// Written by Jeff Heaton (http://www.jeffheaton.com)
  /// </summary>
  public class Attribute: ICloneable
  {
    /// <summary>
    /// The name of this attribute
    /// </summary>
    private string m_name;

    /// <summary>
    /// The value of this attribute
    /// </summary>
    private string m_value;

    /// <summary>
    /// The delimiter for the value of this attribute(i.e. " or ').
    /// </summary>
    private char m_delim;

    /// <summary>
    /// Construct a new Attribute.  The name, delim, and value
    /// properties can be specified here.
    /// </summary>
    /// <param name="name">The name of this attribute.</param>
    /// <param name="value">The value of this attribute.</param>
    /// <param name="delim">The delimiter character for the value.
    /// </param>
    public Attribute(string name,string value,char delim)
    {
      m_name  = name;
      m_value = value;
      m_delim = delim;
    }


    /// <summary>
    /// The default constructor.  Construct a blank attribute.
    /// </summary>
    public Attribute():this("","",(char)0)
    {
    }

    /// <summary>
    /// Construct an attribute without a delimiter.
    /// </summary>
    /// <param name="name">The name of this attribute.</param>
    /// <param name="value">The value of this attribute.</param>
    public Attribute(String name,String value):this(name,value,
                                                   (char)0)
    {
    }

    /// <summary>
    /// The delimiter for this attribute.
    /// </summary>
    public char Delim
    {
      get
      {
        return m_delim;
      }

      set
      {
        m_delim = value;
      }
    }

    /// <summary>
    /// The name for this attribute.
    /// </summary>
    public string Name
    {
      get
      {
        return m_name;
      }

      set
      {
        m_name = value;
      }
    }

    /// <summary>
    /// The value for this attribute.
    /// </summary>
    public string Value
    {
      get
      {
        return m_value;
      }

      set
      {
        m_value = value;
      }
    }

    #region ICloneable Members
    public virtual object Clone()
    {
      return new Attribute(m_name,m_value,m_delim);
    }
    #endregion
  }
}

Listing 2: The AttributeList Class

using System;
using System.Collections;

namespace HTML
{
  /// <summary>
  /// The AttributeList class is used to store list of
  /// Attribute classes.
  /// This source code may be used freely under the
  /// Limited GNU Public License(LGPL).
  ///
  /// Written by Jeff Heaton (http://www.jeffheaton.com)
  /// </summary>
  ///
  public class AttributeList:Attribute
  {
    /// <summary>
    /// An internally used Vector.  This vector contains
    /// the entire list of attributes.
    /// </summary>
    protected ArrayList m_list;
    /// <summary>
    /// Make an exact copy of this object using the cloneable
    /// interface.
    /// </summary>
    /// <returns>A new object that is a clone of the specified
    /// object.</returns>
    public override Object Clone()
    {
      AttributeList rtn = new AttributeList();

      for ( int i=0;i<m_list.Count;i++ )
        rtn.Add( (Attribute)this[i].Clone() );

      return rtn;
    }

    /// <summary>
    /// Create a new, empty, attribute list.
    /// </summary>
    public AttributeList():base("","")
    {
      m_list = new ArrayList();
    }


    /// <summary>
    /// Add the specified attribute to the list of attributes.
    /// </summary>
    /// <param name="a">An attribute to add to this
    /// AttributeList.</paramv
    public void Add(Attribute a)
    {
      m_list.Add(a);
    }


    /// <summary>
    /// Clear all attributes from this AttributeList and return
    /// it to a empty state.
    /// </summary>
    public void Clear()
    {
      m_list.Clear();
    }

    /// <summary>
    /// Returns true of this AttributeList is empty, with no
    /// attributes.
    /// </summary>
    /// <returns>True if this AttributeList is empty, false
    /// otherwise.</returns>
    public bool IsEmpty()
    {
      return( m_list.Count<=0);
    }

    /// <summary>
    /// If there is already an attribute with the specified name,
    /// it will have its value changed to match the specified
    /// value. If there is no Attribute with the specified name,
    /// one will be created. This method is case-insensitive.
    /// </summary>
    /// <param name="name">The name of the Attribute to edit or
    /// create.  Case-insensitive.</param>
    /// <param name="value">The value to be held in this
    /// attribute.</param>
    public void Set(string name,string value)
    {
      if ( name==null )
        return;
      if ( value==null )
        value="";

      Attribute a = this[name];

      if ( a==null )

      {
        a = new Attribute(name,value);
        Add(a);
      }

      else
        a.Value = value;
    }

    /// <summary>
    /// How many attributes are in this AttributeList?
    /// </summary>
    public int Count
    {
      get
      {
        return m_list.Count;
      }
    }

    /// <summary>
    /// A list of the attributes in this AttributeList
    /// </summary>
    public ArrayList List
    {
      get
      {
        return m_list;
      }
    }

    /// <summary>
    /// Access the individual attributes
    /// </summary>
    public Attribute this[int index]
    {
      get
      {
        if ( index<m_list.Count )
          return(Attribute)m_list[index];
        else
          return null;
      }
    }

    /// <summary>
    /// Access the individual attributes by name.
    /// </summary>
    public Attribute this[string index]
    {
      get
      {
        int i=0;

        while ( this[i]!=null )
        {
          if ( this[i].Name.ToLower().Equals( (index.ToLower()) ))
            return this[i];
          i++;
        }

        return null;

      }
    }
  }
}

Listing 3: The Parse Class

using System;

namespace HTML
{
  /// <summary>
  /// Base class for parsing tag based files, such as HTML,
  /// HTTP headers, or XML.
  ///
  /// This source code may be used freely under the
  /// Limited GNU Public License(LGPL).
  ///
  /// Written by Jeff Heaton (http://www.jeffheaton.com)
  /// </summary>
  public class Parse:AttributeList
  {
    /// <summary>
    /// The source text that is being parsed.
    /// </summary>
    private string m_source;

    /// <summary>
    /// The current position inside of the text that
    /// is being parsed.
    /// </summary>
    private int m_idx;

    /// <summary>
    /// The most recently parsed attribute delimiter.
    /// </summary>
    private char m_parseDelim;

    /// <summary>
    /// This most recently parsed attribute name.
    /// </summary>
    private string m_parseName;

    /// <summary>
    /// The most recently parsed attribute value.
    /// </summary>
    private string m_parseValue;

    /// <summary>
    /// The most recently parsed tag.
    /// </summary>
    public string m_tag;

    /// <summary>
    /// Determine if the specified character is whitespace or not.
    /// </summary>
    /// <param name="ch">A character to check</param>
    /// <returns>true if the character is whitespace</returns>
    public static bool IsWhiteSpace(char ch)
    {
      return( "tnr ".IndexOf(ch) != -1 );
    }


    /// <summary>
    /// Advance the index until past any whitespace.
    /// </summary>
    public void EatWhiteSpace()
    {
      while ( !Eof() )
      {
        if ( !IsWhiteSpace(GetCurrentChar()) )
          return;
        m_idx++;
      }
    }

    /// <summary>
    /// Determine if the end of the source text has been reached.
    /// </summary>
    /// <returns>True if the end of the source text has been
    /// reached.</returns>
    public bool Eof()
    {
      return(m_idx>=m_source.Length );
    }

    /// <summary>
    /// Parse the attribute name.
    /// </summary>
    public void ParseAttributeName()
    {
      EatWhiteSpace();
      // get attribute name
      while ( !Eof() )
      {
        if ( IsWhiteSpace(GetCurrentChar()) ||
          (GetCurrentChar()=='=') ||
          (GetCurrentChar()=='>') )
          break;
        m_parseName+=GetCurrentChar();
        m_idx++;
      }

      EatWhiteSpace();
    }


    /// <summary>
    /// Parse the attribute value
    /// </summary>
    public void ParseAttributeValue()
    {
      if ( m_parseDelim!=0 )
        return;

      if ( GetCurrentChar()=='=' )
      {
        m_idx++;
        EatWhiteSpace();
        if ( (GetCurrentChar()==''') ||
          (GetCurrentChar()=='"') ) 
        {
          m_parseDelim = GetCurrentChar();
          m_idx++;
          while ( GetCurrentChar()!=m_parseDelim )
          {
            m_parseValue+=GetCurrentChar();
            m_idx++;
          }
          m_idx++;
        }
        else
        {
          while ( !Eof() &&
            !IsWhiteSpace(GetCurrentChar()) &&
            (GetCurrentChar()!='>') )

          {
            m_parseValue+=GetCurrentChar();
            m_idx++;
          }
        }
        EatWhiteSpace();
      }
    }

    /// <summary>
    /// Add a parsed attribute to the collection.
    /// </summary>
    public void AddAttribute()
    {
      Attribute a = new Attribute(m_parseName,
        m_parseValue,m_parseDelim);
      Add(a);
    }


    /// <summary>
    /// Get the current character that is being parsed.
    /// </summary>
    /// <returns></returns>
    public char GetCurrentChar()

    {

      return GetCurrentChar(0);

    }



    /// <summary>
    /// Get a few characters ahead of the current character.
    /// </summary>
    /// <param name="peek">How many characters to peek ahead
    /// for.</param>
    /// <returns>The character that was retrieved.</returns>
    public char GetCurrentChar(int peek)

    {
      if( (m_idx+peek)<m_source.Length )
        return m_source[m_idx+peek];
      else
        return (char)0;
    }



    /// <summary>
    /// Obtain the next character and advance the index by one.
    /// </summary>
    /// <returns>The next character</returns>
    public char AdvanceCurrentChar()

    {
      return m_source[m_idx++];
    }



    /// <summary>
    /// Move the index forward by one.
    /// </summary>
    public void Advance()
    {
      m_idx++;
    }


    /// <summary>
    /// The last attribute name that was encountered.
    /// <summary>
    public string ParseName
    {
      get
      {
        return m_parseName;
      }

      set
      {
        m_parseName = value;
      }
    }

    /// <summary>
    /// The last attribute value that was encountered.
    /// <summary>
    public string ParseValue
    {
      get
      {
        return m_parseValue;
      }

      set
      {
        m_parseValue = value;
      }
    }

    /// <summary>
    /// The last attribute delimeter that was encountered.
    /// <summary>
    public char ParseDelim
    {
      get
      {
        return m_parseDelim;
      }

      set
      {
        m_parseDelim = value;
      }
    }

    /// <summary>
    /// The text that is to be parsed.
    /// <summary>
    public string Source
    {
      get
      {
        return m_source;
      }

      set
      {
        m_source = value;
      }
    }
  }
}

Listing 4: The ParseHTML Class

using System;

namespace HTML
{
  /// <summary>
  /// Summary description for ParseHTML.
  /// </summary>

  public class ParseHTML:Parse
  {
    public AttributeList GetTag()
    {
      AttributeList tag = new AttributeList();
      tag.Name = m_tag;

      foreach(Attribute x in List)
      {
        tag.Add((Attribute)x.Clone());
      }

      return tag;
    }

    public String BuildTag()
    {
      String buffer="<";
      buffer+=m_tag;
      int i=0;
      while ( this[i]!=null )

      {// has attributes
        buffer+=" ";
        if ( this[i].Value == null )
        {
          if ( this[i].Delim!=0 )
            buffer+=this[i].Delim;
          buffer+=this[i].Name;
          if ( this[i].Delim!=0 )
            buffer+=this[i].Delim;
        }
        else
        {
          buffer+=this[i].Name;
          if ( this[i].Value!=null )
          {
            buffer+="=";
            if ( this[i].Delim!=0 )
              buffer+=this[i].Delim;
            buffer+=this[i].Value;
            if ( this[i].Delim!=0 )
              buffer+=this[i].Delim;
          }
        }
        i++;
      }
      buffer+=">";
      return buffer;
    }

    protected void ParseTag()
    {
      m_tag="";
      Clear();

      // Is it a comment?
      if ( (GetCurrentChar()=='!') &&
        (GetCurrentChar(1)=='-')&&
        (GetCurrentChar(2)=='-') )
      {
        while ( !Eof() )
        {
          if ( (GetCurrentChar()=='-') &&
            (GetCurrentChar(1)=='-')&&
            (GetCurrentChar(2)=='>') )
            break;
          if ( GetCurrentChar()!='r' )
            m_tag+=GetCurrentChar();
          Advance();
        }
        m_tag+="--";
        Advance();
        Advance();
        Advance();
        ParseDelim = (char)0;
        return;
      }

      // Find the tag name
      while ( !Eof() )
      {
        if ( IsWhiteSpace(GetCurrentChar()) ||
                         (GetCurrentChar()=='>') )
          break;
        m_tag+=GetCurrentChar();
        Advance();
      }

      EatWhiteSpace();

      // Get the attributes
      while ( GetCurrentChar()!='>' )
      {
        ParseName  = "";
        ParseValue = "";
        ParseDelim = (char)0;

        ParseAttributeName();

        if ( GetCurrentChar()=='>' ) 

        {
          AddAttribute();
          break;
        }

        // Get the value(if any)
        ParseAttributeValue();
        AddAttribute();
      }
      Advance();
    }


    public char Parse()
    {
      if( GetCurrentChar()=='<' )
      {
        Advance();

        char ch=char.ToUpper(GetCurrentChar());
        if ( (ch>='A') && (ch<='Z') || (ch=='!') || (ch=='/') ) 
        {
          ParseTag();
          return (char)0;
        }

        else return(AdvanceCurrentChar());
      }
      else return(AdvanceCurrentChar());
    }
  }
}

Listing 5: The FindLinks Class

using System;
using System.Net;
using System.IO;


namespace HTML
{
  /// <summary>
  /// FindLinks is a class that will test the HTML parser.
  /// This short example will prompt for a URL and then
  /// scan that URL for links.
  /// This source code may be used freely under the
  /// Limited GNU Public License(LGPL).
  ///
  /// Written by Jeff Heaton (http://www.jeffheaton.com)
  /// </summary>
//class FindLinks
//{
    /// <summary>
    /// The main entry point for the application.
    /// </summary>
    [STAThread]
    static void Main(string[] args)
    {
      System.Console.Write("Enter a URL address:");
      string url = System.Console.ReadLine();
      System.Console.WriteLine("Scanning hyperlinks at: " + url );
      string page = GetPage(url);
      if(page==null)
      {
        System.Console.WriteLine("Can't process that type of file,"
                                  +
                                  "please specify an HTML file URL."
                                  );
        return;
      }

      ParseHTML parse = new ParseHTML();
      parse.Source = page;
      while( !parse.Eof() )
      {
        char ch = parse.Parse();
        if(ch==0)
        {
          AttributeList tag = parse.GetTag();
          if( tag["href"]!=null )
            System.Console.WriteLine( "Found link: " +
                                       tag["href"].Value );
        }
      }
    }


    public static string GetPage(string url)
    {
      WebResponse response = null;
      Stream stream = null;
      StreamReader
        reader = null;

      try
      {
        HttpWebRequest request =
                       (HttpWebRequest)WebRequest.Create(url);

        response = request.GetResponse();
        stream = response.GetResponseStream();

        if( !response.ContentType.ToLower().StartsWith("text/") )
          return null;

        string buffer = "",line;

        reader = new StreamReader(stream);

        while( (line = reader.ReadLine())!=null )
        {
          buffer+=line+"rn";
        }

        return buffer;
      }
      catch(WebException e)
      {
        System.Console.WriteLine("Can't download:" + e);
        return null;
      }
      catch(IOException e)
      {
        System.Console.WriteLine("Can't download:" + e);
        return null;
      }
      finally
      {
        if( reader!=null )
          reader.Close();

        if( stream!=null )
          stream.Close();

        if( response!=null )
          response.Close();
      }
    }
  }
}

Get the Free Newsletter!

Subscribe to Developer Insider for top news, trends & analysis

Latest Posts

Related Stories