KhalsaGuru

More on Mechanize and Nokogiri

May 23, 2014

Previously I mentioned a tip that has proven essential for me in navigating pages with Mechanize and Nokogiri. Now I want to share another useful thing I've learned.

When a website is structured without uniquely identifiable structural elements or CSS attributes (this often happens when a lot of content is added inside one HTML tag without classes or IDs), you may need to look at the content of a tag to find what you need. For example, you may have a series of otherwise indistinguishable tags that look like:

<p>Name: Foo Bar</p>
<p>Description: A delicious energy bar made of the highest quality foo.</p>
<p>Ingredients: foo, sugar, water</p>

Suppose you need to extract the ingredients from each page you're scraping. Every page follows this format, but you can't be sure that the same number of tags appears in the same order on each page. In that case you can use the :contains content selector that Nokogiri supports, which matches a tag by the text inside it:

require 'mechanize'

agent = Mechanize.new

page = agent.get('http://foobar.com')

ingredients = page.parser.at_css('p:contains("Ingredients")')

This captures the first matching tag in the ingredients variable. Call the ".text" method to get the text content of that tag; you can of course also tack on a regex or a slice to return only the content you need. So if you wanted the ingredients without the "Ingredients: " text before them, you might type:

ingredients = page.parser.at_css('p:contains("Ingredients")').text[13..-1]
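The hard-coded 13 is the length of "Ingredients: ", so the slice is only correct while the label stays exactly that long. As a small sketch of two alternatives, assuming the text has already been pulled out of the tag (the sample string is copied from the example markup, and label is a variable name made up for illustration), you could derive the offset from the label itself or, on Ruby 2.5 and later, use String#delete_prefix:

```ruby
# Sample text, as .text would return it for the example <p> tag.
text = 'Ingredients: foo, sugar, water'

# Hypothetical variable: the label we want to strip from the front.
label = 'Ingredients: '

# Same result as text[13..-1], but the offset follows the label's length.
by_slice = text[label.length..-1]

# String#delete_prefix (Ruby 2.5+) removes the label only if the text starts with it.
by_prefix = text.delete_prefix(label)

puts by_slice
puts by_prefix
```

Both variants avoid counting characters by hand, which matters if the label ever changes.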

Or if you prefer to use regex:

ingredients = page.parser.at_css('p:contains("Ingredients")').text.match(/:\s(.+)/)[1]

The (.+) in this case captures all characters in the tag after the colon and whitespace character matched by ":\s", stopping at the first line break it meets (which it won't meet in this situation). The [1] after .match returns the first capture group of the regex, in other words all of the characters captured by (.+), which here is the full list of ingredients.