JRuby is a great technology for easily constructing a script that calls into HtmlUnit without having to deal with all the boilerplate that Java requires.
Step 0: About The Example Code
The tarball, scraper.tgz, contains:
- scraper.rb – the JRuby script we will be executing. All code discussed in this example comes from there
- lib/*.jar – all of the JAR files needed to run the example
- run.sh – a simple shell script that points JRuby at the lib directory and silences some warning messages
Step 1: Enabling JRuby to Use the Java JAR files
    # Require Java so we can use the Java libraries
    require 'java'

    # Get HtmlUnit and all of its required libraries
    require 'htmlunit-1.13.jar'
    require 'commons-httpclient-3.1.jar'
    require 'commons-io-1.3.1.jar'
    require 'commons-logging-1.1.jar'
    require 'commons-lang-2.3.jar'
    require 'commons-codec-1.3.jar'
    require 'xercesImpl-2.6.2.jar'
    require 'xmlParserAPIs-2.6.2.jar'
    require 'jaxen-1.1.1.jar'
    require 'commons-collections-3.2.jar'
    require 'js-1.6R7'
    require 'nekohtml-0.9.5.jar'

    # Include the WebClient class
    include_class 'com.gargoylesoftware.htmlunit.WebClient'
In this block, we first load JRuby's Java support and then tell JRuby to load the JAR files required by HtmlUnit. Some notes:
- You have to specify every JAR file that HtmlUnit depends upon, even if your script never calls into that library directly.
- All JAR files must be on JRuby's load path ($LOAD_PATH). This is done with -I<DIR_NAME> arguments passed to JRuby on the command line.
- The include_class call is similar to an import statement in Java and puts the WebClient class in scope.
At this point, we can instantiate and use the WebClient class.
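As an alternative to listing each JAR by hand and passing -I on the command line, the load path can also be extended from inside the script. The following is only a minimal sketch, not what scraper.rb does: the helper names jar_names_in and require_jars are hypothetical, and it assumes that every JAR in the directory should be loaded. The actual require of each JAR only happens when the code runs under JRuby.

```ruby
# Sketch only: jar_names_in and require_jars are hypothetical helpers,
# not part of scraper.rb.

# Return the sorted file names of all JARs directly inside lib_dir.
def jar_names_in(lib_dir)
  Dir.glob(File.join(lib_dir, '*.jar')).map { |path| File.basename(path) }.sort
end

# Put lib_dir on the load path and require each JAR found there.
# Requiring a JAR only makes sense on JRuby, so other interpreters
# just get the list of names back without loading anything.
def require_jars(lib_dir)
  $LOAD_PATH.unshift(lib_dir) unless $LOAD_PATH.include?(lib_dir)
  names = jar_names_in(lib_dir)
  if RUBY_PLATFORM == 'java'
    require 'java'
    names.each { |name| require name }
  end
  names
end
```

Calling require_jars('lib') at the top of a script would then stand in for the individual require lines, at the cost of loading JARs the script may not need.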
Step 2: Parsing a Basic HTML Page
Before we get into parsing a dynamic page, let’s take a look at how to parse a simple page. In this example, I am going to parse out information from the Maven 2 Archive for HtmlUnit found at http://repo1.maven.org/maven2/htmlunit/htmlunit.
    # Function for getting a list of all directories
    def get_htmlunit_maven_pages
      wc = WebClient.new
      page = wc.getPage("http://repo1.maven.org/maven2/htmlunit/htmlunit")

      # List the directories...
      page.getByXPath('//img[@alt="[DIR]"]').each do |img|
        a = img.getNextSibling.getNextSibling
        puts 'DIR: ' + a.getHrefAttribute
      end

      # List the files...
      page.getByXPath('//img[@alt="[TXT]"]').each do |img|
        a = img.getNextSibling.getNextSibling
        puts 'FILE: ' + a.getHrefAttribute
      end
    end
The first step in the method is to create a new client by calling WebClient.new and then download the page using wc.getPage.
When the requested page has a content type of text/html, the getPage call returns an instance of HtmlPage, and we can use XPath expressions and DOM calls to get the URLs for the directories and files. Getting at the appropriate data is very simple. HtmlUnit has a number of other methods that you can use to navigate the page; check out the documentation for the HtmlPage class.
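The hrefs printed by the method above are whatever the page contains. Assuming they are relative links, as is typical for a directory index, they need to be resolved against the page URL before they can be fetched. Here is a small sketch of one way to do that with Ruby's standard URI library; the helper name absolute_url is hypothetical and not part of scraper.rb.

```ruby
require 'uri'

# Hypothetical helper: turn an href taken from the listing page into an
# absolute URL. Assumes the hrefs are relative to the listing directory.
def absolute_url(base, href)
  # Without a trailing slash, URI.join would replace the last path
  # segment of the base instead of appending to it, so normalize first.
  base += '/' unless base.end_with?('/')
  URI.join(base, href).to_s
end

puts absolute_url('http://repo1.maven.org/maven2/htmlunit/htmlunit', '1.13/')
# prints http://repo1.maven.org/maven2/htmlunit/htmlunit/1.13/
```

An href that is already absolute passes through unchanged, so the helper is safe to apply to every link.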
    # Function for seeing who the most recent MyBlogLog users were
    def search_iotr
      wc = WebClient.new
      page = wc.getPage("http://www.innovationontherun.com")
      my_blog_log_info = page.getHtmlElementById("MBL_COMM")

      # The leading "." keeps each XPath relative to the current node;
      # a bare "//" would search the whole document again.
      my_blog_log_info.getByXPath('.//td[@class="mbl_mem"]').each do |td|
        td.getByXPath('.//a').each do |a|
          puts a.asText + ":" + a.getHrefAttribute
        end
      end
    end
Other Situations Where HtmlUnit Rocks!
- Content from AJAX requests
Prerequisite Information To Run the Example
Make sure that you have Java installed. I am using Java 1.6, but HtmlUnit and JRuby should support older versions.
Download JRuby from http://dist.codehaus.org/jruby/ and put the jruby executable (found in the bin directory of the downloaded file) in your path.
To verify that Java and jruby are set up correctly, just run jruby from the command line and ask for the version:
    > jruby -version
    ruby 1.8.5 (2007-08-23 rev 4201) [x86-jruby1.0.1]
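The same check can be made from inside a script, which helps when the file is accidentally run with a plain ruby interpreter. This is a rough sketch rather than something scraper.rb contains: the method name jruby_platform? is hypothetical, and it relies on JRuby reporting its platform as "java", which is how current JRuby releases behave.

```ruby
# Hypothetical guard: report whether a platform string identifies JRuby.
# The default argument checks the interpreter that is running this code.
def jruby_platform?(platform = RUBY_PLATFORM)
  platform.to_s.include?('java')
end

# Warn instead of failing later on the first Java-specific require.
warn 'This script needs JRuby; run it with the jruby executable.' unless jruby_platform?
```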
My Environment Details
- JRuby version 1.0.1
- HtmlUnit 1.13
- Java version 1.6.0_02-b05
- Ubuntu 7.04
- JRuby Home Page
- HtmlUnit Home Page
- HtmlUnit Java Doc
- http://www.innovationontherun.com/jruby-scraper/scraper.tgz - The source code for this example