Using HTML Agility Pack to Parse a Web Page

Parsing information from a web page is not a trivial task. Fortunately, HTML has a defined structure and libraries exist to help us navigate that structure.

One such library for C# is HTML Agility Pack or HAP.

You can add this library to a project via NuGet. Simply right-click your project in the Visual Studio Solution Explorer, select "Manage NuGet Packages" (Fig. 1); then search for "HTML Agility Pack" and install the package (Fig. 2).

Fig. 1

Fig. 2

Once the package is installed, you can load your document into an HtmlAgilityPack.HtmlDocument and begin working with it.

There are 3 ways to load a web page into an HtmlDocument: from a file on disk; from a string of HTML, from a URL, and from whatever document is loaded in a browser.

Below are examples of each (taken from the HAP web site).

// From File
var doc = new HtmlDocument();
doc.Load(filePath);

// From String
var doc = new HtmlDocument();
doc.LoadHtml(html);

// From Web
var url = "http://html-agility-pack.net/";
var web = new HtmlWeb();
var doc = web.Load(url);

// From Browser
var web1 = new HtmlWeb();
var doc1 = web1.LoadFromBrowser(url, o =>
{
	var webBrowser = (WebBrowser) o;

	// WAIT until the dynamic text is set
	return !string.IsNullOrEmpty(webBrowser.Document.GetElementById("uiDynamicText").InnerText);
});

Listing 1

I am interested in this project because I have a web page that lists all my public presentations and I want to make this page data driven, so I won't have to update a text file every time I schedule a new presentation.

To populate a database with this information, I could either type in every presentation or grab it from my web page and parse out the relevant information in an HTML table. Since I have delivered nearly 500 presentations, re-typing each of these individually seemed like way too much work.

Here is a screenshot of my "Public Presentations" page:

Fig. 3

and here is a partial listing of the source HTML for that page.

 id="talklist">
  style="height:15.0pt">
   style="height:15.0pt;width:62pt" width="83">
   Date
   width="458">
   Topic
   width="311">
   Event
   width="64">
	Location
 
        
  style="height:15.0pt">
 
  Sep 20, 2018
 
 
  Adding Image and Voice Intelligence to Your Apps with Microsoft Cognitive Services
 
 
  VSLive
 
 
  Chicago, IL
 
 

  style="height:15.0pt">
 
  Sep 20, 2018
 
 
  Effective Data Visualization
 
 
  VSLive
 
 
  Chicago, IL
 
 

  style="height:15.0pt">
 
  Jun 22, 2018
 
 
  Effective Data Visualization
 
 
  Beer City Code
 
 
  Grand Rapids, MI

Listing 2

I used the following code to load the data into an HtmlDocument:

var url = "/Schedule.aspx";
Console.WriteLine("Getting data from {0}...", url);
var web = new HtmlWeb();
var doc = web.Load(url);

Listing 3

The HtmlDocument has a DocumentNode property, which returns the root element of the HTML document as a node. Most of the time, I found myself working with nodes and collections of nodes. Each node has a SelectSingleNode and a SelectNodes method which returns a node and a collection of nodes, respectively. These take an XPATH argument, with which I was familiar from my days working with XML documents.

The following code retrieves a nodelist of all the row nodes within the "talklist" table, shown in Listing 2.

var documentNode = doc.DocumentNode;
var tableNode = documentNode
			.SelectSingleNode("//table[@id='talklist']");
var rowsNodesList = tableNode.SelectNodes("tr");

Listing 4

Finally, because each node contains 4 nodes of cells, I can iterate through each node, find all the cells, and the innertext of each cell. For good measure, I stripped off any non-printing characters. Because the title row contains cells, instead of cells, I want to check for this before extracting information.

The code for this is in Listing 5.

var rowCount = 1;
foreach (var row in rowsNodesList)
{
	var cells = row.SelectNodes("td");
	if (cells != null && cells.Count > 0)
	{
		var date = cells[0].InnerText;
		date = date.Replace("\r\n", "").Trim();
		var topic = cells[1].InnerText;
		topic = topic.Replace("\r\n", "").Trim();
		var eventName = cells[2].InnerText;
		eventName = eventName.Replace("\r\n", "").Trim();
		var location = cells[3].InnerText;
		location = location.Replace("\r\n", "").Trim();

		Console.WriteLine("Row: {0}", rowCount);
		Console.WriteLine("Date: {0}", date);
		Console.WriteLine("Topic: {0}", topic);
		Console.WriteLine("Event: {0}", eventName);
		Console.WriteLine("Location: {0}", location);
		Console.WriteLine("--------------------");
		rowCount++;
	}
}

Listing 5

Here is the full code listing for my console app that retrieves the text of each cell.

using HtmlAgilityPack;
using System;

namespace TestHAP
{
    class Program
    {
        static void Main(string[] args)
        {
            var url = "/Schedule.aspx";
            Console.WriteLine("Getting data from {0}...", url);
            var web = new HtmlWeb();
            var doc = web.Load(url);

            var documentNode = doc.DocumentNode;
            var tableNode = documentNode
                        .SelectSingleNode("//table[@id='talklist']");
            var rowsNodesList = tableNode.SelectNodes("tr");

            var rowCount = 1;
            foreach (var row in rowsNodesList)
            {
                var cells = row.SelectNodes("td");
                if (cells != null && cells.Count > 0)
                {
                    var date = cells[0].InnerText;
                    date = date.Replace("\r\n", "").Trim();
                    var topic = cells[1].InnerText;
                    topic = topic.Replace("\r\n", "").Trim();
                    var eventName = cells[2].InnerText;
                    eventName = eventName.Replace("\r\n", "").Trim();
                    var location = cells[3].InnerText;
                    location = location.Replace("\r\n", "").Trim();

                    Console.WriteLine("Row: {0}", rowCount);
                    Console.WriteLine("Date: {0}", date);
                    Console.WriteLine("Topic: {0}", topic);
                    Console.WriteLine("Event: {0}", eventName);
                    Console.WriteLine("Location: {0}", location);
                    Console.WriteLine("--------------------");
                    rowCount++;
                }
            }

            Console.ReadLine();
        }
    }
}

Listing 6

You can download this solution from my GitHub repository.

Although I don’t intend to use it for this, HAP also supports modifying the HTML you select with node methods like AppendChild(), InsertAfter(), and RemoveChild().

This tool will help me to retrieve and parse the hundreds of rows of data from my web page and insert them into a database.