KhalsaGuru

Trick to Navigating w/ Mechanize and Nokogiri

March 20, 2014

I first scratched the surface of coding by creating some simple e-commerce stores. If you've ever had an online store of any kind, you know that there are a lot of mundane tasks that are ripe for automation. Now that I'm finally starting to dive into real programming, I'm excited to revisit some of the boring tasks I spent many monotonous hours on. I wanted to start with some screen scraping with the help of Mechanize and Nokogiri, which will ultimately be used to create products in a Spree shopping cart (in another post).

Navigating with Mechanize is easy when every link carries a logical class of its own: you can just use links_with(:dom_class => 'foo'). That never seems to be the case for me, though. When the links you want to target don't have classes directly on them, you need to identify them through Mechanize's Nokogiri parser:

page.parser.css('.fooclass a')

But then you quickly run into a problem. The moment you use Nokogiri to identify a link, it returns that link element to you as a Nokogiri object, which can no longer be clicked or navigated like a Mechanize object. To fix this you just have to manually create the Mechanize link object from the Nokogiri object you have. Here is how:

page = Mechanize::Page::Link.new(noko_obj, agent, page).click

noko_obj is the Nokogiri object containing the link you need to click, agent is the Mechanize agent, and page is the current Mechanize page object. You create the new Mechanize link object, click it, and then capture the resulting page (Mechanize page object) in a variable (page) so you can keep navigating and/or scrape what you need on that resulting page.

Here's how you'd use it in a basic scraping loop that clicks on each link, visits the page, and displays the page title (obviously you can use Nokogiri to get whatever information you want once you're on the page):

require 'rubygems'
require 'mechanize'
require 'nokogiri'


agent = Mechanize.new
page = agent.get('http://foobar.com')


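# Keep the listing page in `page` and store each visited page separately,
# so every link is still resolved against the page it was found on.
page.parser.css('.fooclass a').each do |link|
  linked_page = Mechanize::Page::Link.new(link, agent, page).click
  puts linked_page.title
end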

With the ability to move seamlessly between Mechanize and Nokogiri you can single out elements with complicated identifiers and navigate through them with ease, creating complicated scraping loops that get you all the data you need. Nowhere is safe from your scraping endeavors!

I highly recommend Ryan Bates' Railscasts on Nokogiri and Mechanize to get started with the tools, and SelectorGadget to identify DOM elements without obvious class selectors (which Ryan mentions in his videos).

For me the next step was turning this script into a Rake task and using this knowledge to scrape product information and automatically generate Spree Commerce products to avoid the time and hassle of doing this by hand.