Friday, August 14, 2015

Crawling HTML with CsQuery

Recently I was asked to build a crawler for a webpage.

The crawler was supposed to extract some values from the main page, such as the TV show name, its season, etc.

So how do you do it?

First you have to download the page’s HTML to your server:
// Requires: using System.Net;
string htmlContent;
using (var client = new WebClient())
    htmlContent = client.DownloadString(link);
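The download step can fail for reasons outside your control (timeouts, 404s and so on), so here is a slightly more defensive sketch of it. The class and method names (PageDownloader, TryDownload) are made up for illustration, and returning null on failure is just one possible choice. Note that on newer .NET versions HttpClient is the recommended replacement for WebClient.

```csharp
using System.Net;
using System.Text;

public static class PageDownloader
{
    // Hypothetical helper: returns the page's HTML, or null if the request fails.
    public static string TryDownload(string link)
    {
        try
        {
            using (var client = new WebClient())
            {
                // Assumption: the page is served as UTF-8.
                client.Encoding = Encoding.UTF8;
                return client.DownloadString(link);
            }
        }
        catch (WebException)
        {
            // Request-level failures (unreachable host, HTTP error status, timeout).
            return null;
        }
    }
}
```

With this in place, the caller can check for null before trying to parse anything.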
Now you have the whole document as a string. To get the relevant values, you have to identify the path to the element that holds the desired value. Look for an ancestor element with an “id” attribute, since ids are meant to be unique within a page, and use it as the root of your path. From that element, walk down the DOM until you reach the wanted element and its value.

For example, to get the TV show title from the website above, I’ll inspect the element in the browser:

The TV show name is the inner text of the “a” element. The first unique ancestor with an “id” attribute is <div id="main">. (There’s another element, <div class="subpage_title_block">, that has no id attribute but still seems unique enough; either can be used as the root.)
Once we have identified the root element, we need to spell out the whole path explicitly:

div (id=main) > div > div > div (notice that the preceding element has an “a” element as its first child) > h3 > a. The last element has the TV show title as its inner text.

The fun part

So how can we follow such a path when the DOM is represented as a single string?
There are several libraries for parsing an HTML string. The most common is HtmlAgilityPack, which lets you perform LINQ-style operations on the DOM. It’s nice, but not as simple to use as it could be.

There is another option: CsQuery (“Install-Package CsQuery” from NuGet). Its API is neat and straightforward, just the way you would want it:
CQ dom = "<div>Hello world! <b>I am feeling bold!</b> What about <b>you?</b></div>";

To create a new CQ instance, all you have to do is assign it the HTML string.
The selection is really simple too:
var boldElements = dom["b"].Select(x => x.InnerText).ToList();

Here you select the text of all the bold elements in the DOM (Select and ToList come from System.Linq).
So how does CsQuery help us in our example?
var tvShowTitle = dom["div#main > div > div > div > h3 > a"].Text();

Yes! That simple!
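To see the whole thing working without hitting a real site, here is a small self-contained sketch. The markup in SampleHtml is made up to mirror the path described above (the real page will certainly look different), and the TvShowTitleExtractor class and its ExtractTitle method are hypothetical names used only for this illustration.

```csharp
using CsQuery;

public static class TvShowTitleExtractor
{
    // Made-up markup that mirrors the path from the post; not the real page.
    public const string SampleHtml =
        "<div id='main'><div><div>" +
        "<a href='#'>poster</a>" +
        "<div><h3><a href='/show'>Some Show</a></h3></div>" +
        "</div></div></div>";

    // Hypothetical helper: returns the text of the link matched by the path.
    public static string ExtractTitle(string html)
    {
        CQ dom = CQ.Create(html);

        // ">" is the CSS child combinator, so every step must be a direct child.
        return dom["div#main > div > div > div > h3 > a"].Text().Trim();
    }
}
```

Calling TvShowTitleExtractor.ExtractTitle(htmlContent) on the downloaded string would then give you the title, provided the page really has this structure. If the site changes its layout, the selector matches nothing and you get an empty string back, so it’s worth checking for that.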

