developer.com
Parsing HTML in Microsoft C#
By Jeff Heaton


Most data on the Web is stored in the Hypertext Markup Language (HTML) format, and there are many occasions when you might want to parse HTML in a C# application. However, the .NET Framework does not provide an easy way to do so, as the numerous questions posted by C# programmers looking for one attest.

The Microsoft .NET Framework includes extensive support for Extensible Markup Language (XML). However, although XML and HTML look very similar, an XML parser cannot be relied on to read HTML. Consider the following major differences between XML and HTML:

  • XML requires end tags.
  • All XML attribute values must be fully quoted with either single or double quotes.
  • XML tags must be properly nested.
  • XML tag names are case sensitive.
  • XML does not allow duplicate attributes.
  • Empty attributes are not allowed in XML.

Let's look at one of these differences in code. In XML, every beginning tag must have an ending tag. The following HTML would cause problems for an XML parser.

<p>This is line 1<br>
This is line 2</p>

This is just one of many differences. Of course, you can require HTML to be written in such a way that it is compatible with XML. The preceding HTML could be rewritten as in the following example.

<p>This is line 1<br/>
This is line 2</p>

Both an XML parser and any modern browser could understand this. Unfortunately, this is not a viable solution because you do not control the sources of HTML. You will want your program to be able to process HTML from any source.
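The two snippets above can be checked against the .NET XML parser directly. The following is a minimal sketch, not part of the article's parser; the class and method names (XmlVersusHtml, IsWellFormedXml) are hypothetical and chosen only for illustration. It relies on XmlDocument.LoadXml throwing an XmlException when the markup is not well-formed XML.

```csharp
using System;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class XmlVersusHtml
{
    // Returns true if the markup loads as well-formed XML, false otherwise.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            // LoadXml throws XmlException on markup that is not well-formed.
            new XmlDocument().LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

Calling IsWellFormedXml with the first snippet (the unclosed <br>) should return false, because the XML parser meets </p> while <br> is still open. The second snippet, with <br/>, should return true.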

The Solution

Because of this, I found it necessary to write my own HTML parser. In this article, I will show you how my HTML parser was constructed, and how you can use this parser with your own applications. I will begin by showing you the main components that make up the HTML parser. I will conclude this article by showing a simple example that uses the HTML parser.

The HTML parser consists of the following four classes:

  • Attribute—The attribute class is used to hold an individual attribute inside an HTML tag.
  • AttributeList—The attribute list holds an individual HTML tag and all of its attributes.
  • Parse—Holds general text parsing routines.
  • ParseHTML—The main class that you will interface with; the ParseHTML class is fed the HTML that you would like to parse.

I will now show you how each of these classes functions, and how you will use them. I will begin with the Attribute class.

The Attribute Class

The Attribute class is used to store individual HTML attributes. The source code for the Attribute class can be seen in Listing 1. The following HTML tag demonstrates attributes:

<img src="picture.gif" alt="Some Picture">

The above HTML tag has two attributes named "src" and "alt". The values of these two attributes are "picture.gif" and "Some Picture", respectively.

The Attribute class consists of three properties named "name", "value", and "delim". The "name" property stores the name of the attribute. The "value" property stores the value held by the attribute. Finally, the "delim" property holds the character that was used to delimit the value, if any. This property will hold a quote ("), an apostrophe ('), or nothing at all, depending on what was used to delimit the value.
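To make the three properties concrete, here is a minimal sketch of a class with that shape. It is not the article's Listing 1: the class name HtmlAttributeSketch, the PascalCase property names, the use of '\0' to mean "no delimiter", and the ToString method are all assumptions made for illustration.

```csharp
using System;

// Hypothetical sketch of an attribute holder; not the article's Listing 1.
public class HtmlAttributeSketch
{
    public string Name { get; set; }   // attribute name, e.g. "src"
    public string Value { get; set; }  // attribute value, e.g. "picture.gif"
    public char Delim { get; set; }    // '"', '\'', or '\0' when the value was unquoted

    public HtmlAttributeSketch(string name, string value, char delim)
    {
        Name = name;
        Value = value;
        Delim = delim;
    }

    // Rebuilds the attribute as it would appear inside a tag.
    public override string ToString()
    {
        if (Delim == '\0')
        {
            return Name + "=" + Value;
        }
        return Name + "=" + Delim + Value + Delim;
    }
}
```

Under these assumptions, new HtmlAttributeSketch("src", "picture.gif", '"').ToString() rebuilds the text src="picture.gif". Keeping the delimiter lets the original markup be reproduced without guessing how it was quoted.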

The AttributeList Class

An HTML tag often has several attributes. The "AttributeList" class is used to hold a list of these attributes, and is shown in Listing 2. It consists of a name and a collection of attributes. The name, stored in a property called "name", holds the name of the tag. When tags are returned to you from the parser, they will be in the form of "AttributeList" objects.

The AttributeList class makes use of C# indexers. You can access individual attributes by either a numeric or a string index. For example, if an attribute "src" were stored in the "AttributeList" object "theTag", you could access the "src" attribute in the following ways:

theTag[0]    // assuming "src" were the first attribute
theTag["src"]

Either form can be used to access the attributes of the tag.
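A class can support both forms of access by declaring two indexers, one taking an int and one taking a string. The sketch below illustrates the technique only; it is not the article's Listing 2. The names AttributeSketch and AttributeListSketch, the Add method, the case-insensitive name comparison, and returning null for a missing attribute are all assumptions.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a single attribute: a name/value pair.
public class AttributeSketch
{
    public string Name;
    public string Value;

    public AttributeSketch(string name, string value)
    {
        Name = name;
        Value = value;
    }
}

// Hypothetical sketch of a tag that exposes its attributes through indexers.
public class AttributeListSketch
{
    private readonly List<AttributeSketch> items = new List<AttributeSketch>();

    public string Name { get; set; }  // the tag name, e.g. "img"

    public int Count { get { return items.Count; } }

    public void Add(AttributeSketch attribute)
    {
        items.Add(attribute);
    }

    // Numeric indexer: attributes in the order they were added.
    public AttributeSketch this[int index]
    {
        get { return items[index]; }
    }

    // String indexer: look up by name, ignoring case (an assumption,
    // made because HTML attribute names are not case sensitive).
    public AttributeSketch this[string name]
    {
        get
        {
            foreach (AttributeSketch a in items)
            {
                if (string.Equals(a.Name, name, StringComparison.OrdinalIgnoreCase))
                {
                    return a;
                }
            }
            return null;  // not found
        }
    }
}
```

With a tag built from the earlier <img> example, theTag[0] and theTag["src"] would return the same object, and theTag["SRC"] would find it as well because of the case-insensitive comparison.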
