Scraping Dynamic Websites Using JRuby and HtmlUnit

Scraping static websites to verify functionality or to access data has been around for as long as there has been a web.  But with the advent of AJAX and other techniques that use JavaScript to insert HTML into a page dynamically, scraping has become more challenging.  Most scraping tools do fine when downloading a single HTML page, but cannot easily handle dynamic content.

With the 1.12 release of HtmlUnit, this headless web browser can parse and execute JavaScript.  This lets a scraper access dynamic content as simply as it accesses static content, without having to fire up a heavy execution engine like Gecko.

JRuby is a great technology for easily constructing a script that calls into HtmlUnit without having to deal with all the boilerplate that Java requires.

Step 0: About The Example Code

The tar ball, scraper.tgz, contains:

  • scraper.rb – the JRuby script we will be executing.  All code discussed in this example comes from there
  • lib/*.jar – all of the JAR files needed to run the example
  • run.sh – a simple shell script that points JRuby at the lib directory and silences some warning messages

Step 1: Enabling JRuby to Use the Java JAR files

# Require Java so we can use the Java libraries
require 'java'

# Load HtmlUnit and all of the libraries it depends on
require 'htmlunit-1.13.jar'
require 'commons-httpclient-3.1.jar'
require 'commons-io-1.3.1.jar'
require 'commons-logging-1.1.jar'
require 'commons-lang-2.3.jar'
require 'commons-codec-1.3.jar'
require 'xercesImpl-2.6.2.jar'
require 'xmlParserAPIs-2.6.2.jar'
require 'jaxen-1.1.1.jar'
require 'commons-collections-3.2.jar'
require 'js-1.6R7.jar'
require 'nekohtml-0.9.5.jar'

# Include the WebClient class
include_class 'com.gargoylesoftware.htmlunit.WebClient'

In this block, we are telling JRuby to load the JAR files required by HtmlUnit.  Some notes:

  • You have to require every JAR file that HtmlUnit depends on, even if your script never calls into it directly
  • All JAR files must be on JRuby's load path ($LOAD_PATH).  This is done with -I<DIR_NAME> arguments passed to JRuby on the command line.
  • include_class is similar to an import statement in Java and puts the WebClient class in scope.

At this point, we can instantiate and use the WebClient class.
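
If listing every JAR by hand becomes tedious, one option is to require everything found in the lib directory.  This is only a minimal sketch and not what scraper.rb does: the helper name jar_files_in and the location of lib relative to the script are assumptions, and the require step only runs under JRuby.

```ruby
# Hypothetical helper (not part of scraper.rb): list every JAR in a directory.
def jar_files_in(dir)
  Dir.glob(File.join(dir, '*.jar')).sort
end

# Assumption: the JARs live in a lib directory next to this script.
lib_dir = File.expand_path('lib', File.dirname(__FILE__))

# Only JRuby can require a JAR, so skip the loading step on other Rubies.
if RUBY_PLATFORM == 'java'
  require 'java'
  jar_files_in(lib_dir).each { |jar| require jar }
end
```

Because the helper returns full paths, this variant should not depend on the -I arguments, although the explicit list above has the advantage of documenting exactly which versions the script was written against.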

Step 2: Parsing a Basic HTML Page

Before we get into parsing a dynamic page, let’s take a look at how to parse a simple page.  In this example, I am going to parse out information from the Maven 2 Archive for HtmlUnit found at http://repo1.maven.org/maven2/htmlunit/htmlunit.

# Function for getting a list of all directories
def get_htmlunit_maven_pages
  wc = WebClient.new

  page = wc.getPage("http://repo1.maven.org/maven2/htmlunit/htmlunit")

  # List the directories...
  page.getByXPath('//img[@alt="[DIR]"]').each do |img|
    a = img.getNextSibling.getNextSibling
    puts 'DIR: ' + a.getHrefAttribute
  end

  # List the files...
  page.getByXPath('//img[@alt="[TXT]"]').each do |img|
    a = img.getNextSibling.getNextSibling
    puts 'FILE: ' + a.getHrefAttribute
  end
end

The first step in the method is to create a new client by calling WebClient.new and then download the page using wc.getPage.

When the response has a content type of text/html, the getPage call returns an instance of HtmlPage, and we can use XPath expressions and DOM calls to get the URLs for the directories and files.  HtmlUnit has a number of other methods that you can use to navigate the page; check out the documentation for the HtmlPage class.
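
The two loops in get_htmlunit_maven_pages differ only in the alt text they match, so they could share one helper.  The sketch below is one possible refactoring; the name hrefs_for_icon is hypothetical, and it uses only the HtmlUnit calls already shown (getByXPath, getNextSibling, getHrefAttribute).  The double getNextSibling presumably reflects a whitespace text node sitting between each icon and its link in the directory listing.

```ruby
# Hypothetical helper: return the href of the link that follows each icon
# whose alt text matches, e.g. "[DIR]" or "[TXT]".
def hrefs_for_icon(page, alt_text)
  page.getByXPath(%(//img[@alt="#{alt_text}"])).map do |img|
    # The first sibling is presumably the whitespace text node after the
    # icon; the second is the anchor element itself.
    anchor = img.getNextSibling.getNextSibling
    anchor.getHrefAttribute
  end
end

# Usage under JRuby, with WebClient loaded as in Step 1:
#   page = WebClient.new.getPage("http://repo1.maven.org/maven2/htmlunit/htmlunit")
#   hrefs_for_icon(page, "[DIR]").each { |href| puts 'DIR: ' + href }
#   hrefs_for_icon(page, "[TXT]").each { |href| puts 'FILE: ' + href }
```

Returning the values instead of printing them makes it easier to feed the URLs into a later step, such as downloading each file.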

Step 3: Parsing Data Written By JavaScript Functions

The code for parsing an HTML page that uses JavaScript to create content dynamically is no harder than the previous example.  HtmlUnit detects the script tags in the page you are downloading and executes the scripts inline.  As an example, I will use my blog home page, which includes a JavaScript widget from MyBlogLog.  This script calls MyBlogLog to find out which registered users have visited my site most recently.  In our example, we will parse out these users' names and URLs.

# Function for seeing who the most recent MyBlogLog visitors were
def search_iotr
  wc = WebClient.new

  page = wc.getPage("http://www.innovationontherun.com")
  my_blog_log_info = page.getHtmlElementById("MBL_COMM")
  # Use ".//" so each search stays inside the current element; a bare "//"
  # would search the whole document again
  my_blog_log_info.getByXPath('.//td[@class="mbl_mem"]').each do |td|
    td.getByXPath('.//a').each do |a|
      puts a.asText + ":" + a.getHrefAttribute
    end
  end
end

If you look at the source for this page, you will see a script tag that downloads a JavaScript file from MyBlogLog.com.  The downloaded JavaScript makes calls to document.write that insert an HTML table into the page.  The id of the table is MBL_COMM, so our first step is to find that HTML element.  Once we have the element, it takes a couple of simple XPath expressions to find the anchor tags that contain each recent visitor's name and URL.  HtmlUnit hides all of the work of downloading the data and inserting it into the HTML page, so we can simply use the DOM to get at the information we are interested in.
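
One detail worth spelling out: an XPath expression that starts with // searches from the document root even when it is evaluated on an element, so a search meant to stay inside the widget needs to start with .// instead.  The sketch below collects the visitors as name and URL pairs in a single query; the helper name recent_visitors is hypothetical, and it relies only on the calls used above (getHtmlElementById, getByXPath, asText, getHrefAttribute).

```ruby
# Hypothetical helper: collect [name, url] pairs for the links inside the
# element with the given id (MBL_COMM for the MyBlogLog table).
def recent_visitors(page, widget_id = 'MBL_COMM')
  widget = page.getHtmlElementById(widget_id)
  # ".//" keeps the search inside the widget; a bare "//" would start again
  # from the document root.
  widget.getByXPath('.//td[@class="mbl_mem"]//a').map do |a|
    [a.asText, a.getHrefAttribute]
  end
end

# Usage under JRuby, with WebClient loaded as in Step 1:
#   page = WebClient.new.getPage("http://www.innovationontherun.com")
#   recent_visitors(page).each { |name, url| puts name + ":" + url }
```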

Other Situations Where HtmlUnit Rocks!

Anytime JavaScript is being used to either enable navigation or modify the HTML document, HtmlUnit can be a great asset in your parsing.  This includes:

  • Content from AJAX requests
  • Situations where JavaScript events are being used to impact behavior.  An example would be a page using an onChange handler on a select list to modify form values and/or submit the form.  HtmlUnit is very handy for simplifying this interaction.
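
As a rough illustration of the select list case, the sketch below changes the selected option and returns whatever page results.  It assumes the HtmlUnit 1.x method HtmlSelect#setSelectedAttribute(value, selected), which is documented as returning the page that occupies the window after the change; the helper name, the element id, and the option value are hypothetical, so check the Javadoc for your HtmlUnit version before relying on it.

```ruby
# Hypothetical helper: pick an option in a select list and return the page
# that results once any onChange handler has run.
def choose_option(page, select_id, option_value)
  select = page.getHtmlElementById(select_id)
  # Assumption: setSelectedAttribute(String, boolean) triggers the onChange
  # handler and returns the resulting page (possibly the same page).
  select.setSelectedAttribute(option_value, true)
end

# Usage under JRuby (the id "country" and the value "CA" are made up):
#   new_page = choose_option(page, "country", "CA")
```

If the handler submits the form, the returned page should be the response to that submission, and it can be queried with getByXPath just like the pages in the earlier steps.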

A word of caution: the JavaScript implementation in HtmlUnit is not fully featured, so some sites still may not work.  However, the HtmlUnit team is validating the browser against a fair number of popular libraries, so this will hopefully be less of an issue in future HtmlUnit releases.
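
When a site trips over an unsupported JavaScript feature, one thing to try is telling the client not to abort on script errors.  This is a minimal sketch that assumes the WebClient#setThrowExceptionOnScriptError setter from the HtmlUnit API; the helper name is hypothetical, and whether the rest of the page is still usable after a failed script depends entirely on the site.

```ruby
# Hypothetical helper: make a WebClient keep going when a script fails
# instead of raising an exception from getPage.
def tolerate_script_errors(client)
  client.setThrowExceptionOnScriptError(false)
  client
end

# Usage under JRuby, with WebClient loaded as in Step 1:
#   wc = tolerate_script_errors(WebClient.new)
#   page = wc.getPage("http://www.innovationontherun.com")
```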

Appendix

Prerequisite Information To Run the Example

Make sure that you have Java installed.  I am using Java 1.6, but HtmlUnit and JRuby should support older versions.

Download JRuby from http://dist.codehaus.org/jruby/ and put the jruby executable (found in the bin directory of the downloaded file) in your path.

To verify that Java and jruby are set up correctly, just run jruby from the command line and ask for the version:

  > jruby -version 
ruby 1.8.5 (2007-08-23 rev 4201) [x86-jruby1.0.1]

My Environment Details

  • JRuby version 1.0.1
  • HtmlUnit 1.13
  • Java version 1.6.0_02-b05
  • Ubuntu 7.04

Reference