
Top 13 features you need to know about


Nokogiri is a pretty simple and straightforward gem. Over the next few pages we will take a more in-depth look at its most important methods, along with a few useful Ruby methods that will take your parsing skills to the next level.

The css method

css(rules) —> NodeSet

The css method searches self for nodes matching CSS rules and returns an iterable NodeSet. Self is a Nokogiri::HTML::Document. In contrast to the at_css method used in the quick start project, the css method returns all nodes that match the CSS rules. The at_css method only returns the first node.

The following is a css method example:

# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# create a new Nokogiri HTML document from the scraped URL
doc = Nokogiri::HTML(open('http://nytimes.com'))

# get all the h3 headings
doc.css('h3')

# get all the paragraphs
doc.css('p')

# get all the unordered lists
doc.css('ul')

# get all the section/category list items
doc.css('.navigationHomeLede li')

There is no explicit output from this code.
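
Each of these calls returns a NodeSet. To see what actually matched, print a node from the set; this quick check is an addition to the original example, handy in an IRB session:

# print the first matching h3 to verify the selector works
puts doc.css('h3').first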

For more information refer to the site:

http://nokogiri.org/Nokogiri/XML/Node.html#method-i-css

The length method

length —> int

The length method returns the number of objects in self. Self is an array. length is one of the standard methods included with Ruby and is not Nokogiri-specific. It is very useful when playing with Nokogiri NodeSets because they behave like Ruby arrays, meaning you can call length on them. For example, you can use length to see how many nodes match your CSS rule when using the css method.

An example of the length method is as follows:

# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# create a new Nokogiri HTML document from the scraped URL
doc = Nokogiri::HTML(open('http://nytimes.com'))

# get all the h3 headings
h3_count = doc.css('h3').length
puts "h3 count #{h3_count}"

# get all the paragraphs
p_count = doc.css('p').length
puts "p count #{p_count}"

# get all the unordered lists
ul_count = doc.css('ul').length
puts "ul count #{ul_count}"

# get all the section/category list items
# size is an alias for length and may be used interchangeably
section_count = doc.css('.navigationHomeLede li').size
puts "section count #{section_count}"

Run the preceding code to see the counts:

$ ruby length.rb
h3 count 7
p count 47
ul count 64
section count 13

Your counts will be different as this code is running against the live New York Times website.

I use this method most in the IRB shell during the exploration phase. Once you know how large the array is, you can also access individual nodes using the standard array selector:

> doc.css('h3')[2]
 => #<Nokogiri::XML::Element:0x3fd31ec696f8 name="h3" children=[#<Nokogiri::XML::Element:0x3fd31ec693b0 name="a" attributes=[#<Nokogiri::XML::Attr:0x3fd31ec692c0 name="href" value="http://www.nytimes.com/2013/06/25/world/europe/snowden-case-carries-a-cold-war-aftertaste.html?hp">] children=[#<Nokogiri::XML::Text:0x3fd31ec68924 "\nSnowden Case Has Cold War Aftertaste">]>]>

For more information refer to the site:

http://ruby-doc.org/core-2.0/Array.html#method-i-length

The each method 1

each { |item| block } —> ary

The each method calls the block once for each element in self. Self is a Ruby enumerable object. This method is part of the Ruby standard library and not specific to Nokogiri.

The each method is useful to iterate over Nokogiri NodeSets:

# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# create a new Nokogiri HTML document from the scraped URL
doc = Nokogiri::HTML(open('http://nytimes.com'))

# iterate through the h3 headings
doc.css('h3').each{ |h3|
  puts h3
}

Run the preceding code to see the following iteration:

$ ruby each.rb
<h3><a href="http://dealbook.nytimes.com/2013/06/24/u-s-civil-charges-against-corzine-are-seen-as-near/?hp">
Regulators Are Said
to Plan a Civil Suit
Against Corzine</a></h3>
<h3><a href="http://www.nytimes.com/2013/06/25/business/global/credit-warnings-give-world-a-peek-into-chinas-secretive-banks.html?hp">
Credit Warnings
Expose China's
Secretive Banks</a></h3>

Your output will differ as this is run against the live New York Times website.

For more information refer to the site:

http://ruby-doc.org/core-2.0/Array.html#method-i-each

The each method 2

each { |key,value| block } —> ary

There is also a native Nokogiri each method, called on a single node, which iterates over the attribute name/value pairs of that node. This isn't particularly useful, but we will take a look at an example to help avoid confusion.

The example is as follows:

# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# create a new Nokogiri HTML document from the scraped URL
doc = Nokogiri::HTML(open('http://nytimes.com'))

# iterate through key value pairs of an individual node
# as we know, the css method returns an enumerable object
# so we can access a specific node using standard array syntax
doc.css('a')[4].each{ |node_name, node_value|
  puts "#{node_name}: #{node_value}"
}

This shows us the available attributes for the fifth link on the page:

$ ruby nokogiri_each.rb
style: display: none;
id: clickThru4Nyt4bar1_xwing2
href: http://www.nytimes.com/adx/bin/adx_click.html?type=goto&opzn&page=homepage.nytimes.com/index.html&pos=Bar1&sn2=5b35bc29/49f095e7&sn1=ab317851/c628eac9&camp=nyt2013_abTest_multiTest_anchoredAd_bar1_part2&ad=bar1_abTest_hover&goto=https%3A%2F%2Fwww%2Enytimesathome%2Ecom%2Fhd%2F205%3Fadxc%3D218268%26adxa%3D340400%26page%3Dhomepage.nytimes.com/index.html%26pos%3DBar1%26campaignId%3D3JW4F%26MediaCode%3DWB7AA

Your output will differ as this is run against the live New York Times website.
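
If you only need a single attribute rather than all of them, you can also index the node with square brackets. This snippet is an addition to the original example, but Node#[] is standard Nokogiri:

# fetch just the href attribute of the fifth link
puts doc.css('a')[4]['href']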

For more information refer to the site:

http://nokogiri.org/Nokogiri/XML/Node.html#method-i-each

The content method

content —> string

The content method returns the text content of a node. This is how you parse content from a CSS selector. If you used the css method and have a NodeSet, you will need to iterate with the each method to extract the content of each node.

The example for the content method is as follows:

# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# create a new Nokogiri HTML document from the scraped URL
doc = Nokogiri::HTML(open('http://nytimes.com'))

# iterate through the h3 headings
doc.css('h3').each{ |h3|
  # extract the content from the h3
  puts h3.content
}

Run the preceding code to see the h3 tags content:

$ ruby content.rb
Regulators Are Said to Plan a Civil Suit Against Corzine
Credit Warnings Expose China's Secretive Banks
Affirmative Action Case Has Both Sides Claiming Victory
Back in the News, but Never Off U.S. Radar When Exercise Becomes an Addiction
Lee Bollinger: A Long, Slow Drift From Racial Justice

Your output will differ as this is run against the live New York Times website.
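
As an aside, Nokogiri aliases content as text (and inner_text), so the loop body could just as well read puts h3.text; which name you use is a matter of taste:

# text is an alias for content
doc.css('h3').each { |h3| puts h3.text }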

For more information refer to the site:

http://nokogiri.org/Nokogiri/XML/Node.html#method-i-content

The at_css method

at_css(rules) —> node

The at_css method searches the document and returns the first node matching the CSS selector. This is useful when you know there is only one match in the DOM or the first match is fine. Because it is able to stop at the first match, at_css is faster than the naked css method. Additionally, you don't have to iterate over the object to access its properties.

The example is as follows:

# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# create a new Nokogiri HTML document from the scraped URL
doc = Nokogiri::HTML(open('http://nytimes.com'))

# get the content of the title of the page
# because there is only one title, we can use at_css
puts doc.at_css('title').content

Run the preceding code to parse the title:

$ ruby at_css.rb
The New York Times - Breaking News, World News & Multimedia

Your output will likely be the same, since The New York Times rarely changes its title tag, though they may have updated it.

For more information refer to the site:

http://nokogiri.org/Nokogiri/XML/Node.html#method-i-at_css

The xpath method

xpath(paths) —> NodeSet

The xpath method searches self for nodes matching XPath rules and returns an iterable NodeSet. Self is a Nokogiri::HTML::Document or Nokogiri::XML::Document. The xpath method returns all nodes that match the XPath rules. The at_xpath method only returns the first node.

An example use of the xpath method is as follows:

# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# create a new Nokogiri HTML document from the scraped URL
doc = Nokogiri::HTML(open('http://nytimes.com'))

# get all the h3 headings
h3_count = doc.xpath('//h3').length
puts "h3 count #{h3_count}"

# get all the paragraphs
p_count = doc.xpath('//p').length
puts "p count #{p_count}"

# get all the unordered lists
ul_count = doc.xpath('//ul').length
puts "ul count #{ul_count}"

# get all the section/category list items
# note this rule is substantially different from the CSS version.
# *[@class="navigationHomeLede"] says to find any node
# with the class attribute = navigationHomeLede.  We then
# have to explicitly search for an unordered list before
# searching for list elements.
section_count = doc.xpath('//*[@class="navigationHomeLede"]/ul/li').size
puts "section count #{section_count}"

Run the preceding code to see the counts:

$ ruby xpath.rb
h3 count 7
p count 47
ul count 64
section count 13

Your counts will differ as this code runs against the live New York Times website, but they should match the counts you got with the css method.

For more information refer to the site:

http://nokogiri.org/Nokogiri/XML/Node.html#method-i-xpath

The at_xpath method

at_xpath(paths) —> node

The at_xpath method searches the document and returns the first node matching the XPath selector. This is useful when you know there is only one match in the DOM or the first match is fine. Because it is able to stop at the first match, at_xpath is faster than the naked xpath method. Additionally, you don't have to iterate over the object to access its properties.

An example for the at_xpath method is as follows:

# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# create a new Nokogiri HTML document from the scraped URL
doc = Nokogiri::HTML(open('http://nytimes.com'))

# get the content of the title of the page
# because there is only one title, we can use at_xpath
puts doc.at_xpath('//title').content

Run the preceding code to parse the title:

$ ruby at_xpath.rb
The New York Times - Breaking News, World News & Multimedia

Your output will likely be the same, since The New York Times rarely changes its title tag, though they may have updated it. Either way, your output should match the at_css example.

For more information refer to the site:

http://nokogiri.org/Nokogiri/XML/Node.html#method-i-at_xpath

The to_s method

to_s —> string

The to_s method turns self into a string. If self is an HTML document, to_s returns HTML. If self is an XML document, to_s returns XML. This is useful in an IRB session where you want to examine the source of a node to determine how to craft your selector or need the raw HTML for your project.

An example of to_s is as follows:

# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# create a new Nokogiri HTML document from the scraped URL
doc = Nokogiri::HTML(open('http://nytimes.com'))

# get the HTML for the top story link
# if you remember from the quick start, there is only one
# of these on the page, so we can use at_css to target it.
puts doc.at_css('h2 a').to_s

Run the preceding code to see the HTML:

$ ruby to_s.rb
<a href="http://www.nytimes.com/2013/06/26/us/politics/obama-plan-to-cut-greenhouse-gases.html?hp">President to Outline Plan on Greenhouse Gas Emissions</a>

Your output will differ as this is run against the live New York Times website, but you should be able to confirm this is indeed the top headline by visiting http://www.nytimes.com in your browser.

For more information refer to the site:

http://nokogiri.org/Nokogiri/XML/Node.html#method-i-to_s

This concludes the base methods you will need to interact with Nokogiri for your scraping and parsing projects. You now know how to target specific content with CSS or XPath selectors, iterate through NodeSets, and extract their content. Next, we will go over a few tips and tricks that will help you should you get into a bind with your Nokogiri project.

Spoofing browser agents

When you request a web page, you send meta information along with your request in the form of headers. One of these headers, User-Agent, informs the web server which web browser you are using. By default, open-uri, the library we are using to scrape, will report your browser as Ruby.

There are two issues with this. First, it makes it very easy for an administrator to look through their server logs and see that someone has been scraping the server; Ruby is not a standard web browser. Second, some web servers will deny requests made by a non-standard user agent.

We are going to spoof our browser agent so that the server thinks we are just another Mac using Safari.

An example is as follows:

# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# this string is the browser agent for Safari running on a Mac
browser = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1'

# create a new Nokogiri HTML document from the scraped URL, passing
# the browser agent as the User-Agent request header
doc = Nokogiri::HTML(open('http://nytimes.com', 'User-Agent' => browser))

# you can now go along with your request as normal
# you will show up as just another safari user in the logs
puts doc.at_css('h2 a').to_s
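
open-uri treats every string-keyed entry in that options hash as a request header, so you can layer on more context the same way. The Referer value below is a hypothetical example, not from the original text:

# additional string keys become request headers
doc = Nokogiri::HTML(open('http://nytimes.com',
  'User-Agent' => browser,
  'Referer' => 'http://www.google.com/'))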

Caching

It's important to remember that every time we scrape content, we are using resources on someone else's server. While it is true that we are not using any more resources than a standard web browser request, the automated nature of our requests leaves the potential for abuse.

In the previous examples we searched for the top headline on The New York Times website. What if we took this code and put it in a loop because we always want to know the latest top headline? The code would work, but we would be launching a mini denial-of-service (DoS) attack on the server by hitting their page potentially thousands of times every minute.

Many servers, Google being one example, have automatic blocking set up to prevent these rapid requests. They ban IP addresses that access their resources too quickly. This is known as rate limiting.
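
The bluntest way to respect a rate limit is simply to pause between requests. A caching layer, which we build next, is smarter, but here is a minimal sketch; the one-request-per-minute pace is an assumption, not a documented limit:

# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# poll the headline at most once per minute
loop do
  doc = Nokogiri::HTML(open('http://nytimes.com'))
  puts doc.at_css('h2 a').content
  sleep 60 # pause so we never exceed one request per minute
end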

To avoid being rate limited and in general be a good netizen, we need to implement a caching layer. Traditionally in a large app this would be implemented with a database. That's a little out of scope for this book, so we're going to build our own caching layer with a simple TXT file. We will store the headline in the file and then check the file modification date to see if enough time has passed before checking for new headlines.

Start by creating the cache.txt file in the same directory as your code:

$ touch cache.txt
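
If you would rather have the script create the cache file itself, a one-line guard works too. This guard is an addition, not part of the original recipe:

# create an empty cache file on first run
File.write("cache.txt", "") unless File.exist?("cache.txt")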

We're now ready to craft our caching solution:

# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# set how long in minutes until our data is expired
# multiplied by 60 to convert to seconds
expiration = 1 * 60

# file to store our cache in
cache = "cache.txt"

# Calculate how old our cache is by subtracting its modification time
# from the current time.

# Time.new gets the current time
# The mtime method gets the modification time of a file
cache_age = Time.new - File.new(cache).mtime

# if the cache age is greater than our expiration time
if cache_age > expiration
  # our cache has expired
  puts "cache has expired. fetching new headline"

  # we will now use our code from the quick start to 
  # snag a new headline

  # scrape the web page
  data = open('http://nytimes.com')

  # create a Nokogiri HTML Document from our data
  doc = Nokogiri::HTML(data)

  # parse the top headline and clean it up
  headline = doc.at_css('h2 a').content.gsub(/\n/," ").strip

  # we now need to save our new headline
  # the second File.open parameter "w" tells Ruby to overwrite
  # the old file
  File.open(cache, "w") do |file|
    # we then simply puts our text into the file
    file.puts headline
  end

  puts "cache updated"

else
  # we should use our cached copy
  puts "using cached copy"
  # read cache into a string using the read method
  headline = IO.read("cache.txt")
end

puts "The top headline on The New York Times is ..."
puts headline

Our cache is set to expire in one minute, so assuming at least a minute has passed since you created your cache.txt file, let's fire up our Ruby script:

$ ruby cache.rb
cache has expired. fetching new headline
cache updated
The top headline on The New York Times is ...
Supreme Court Invalidates Key Part of Voting Rights Act

If we run our script again before another minute passes, it should use the cached copy:

$ ruby cache.rb
using cached copy
The top headline on The New York Times is ...
Supreme Court Invalidates Key Part of Voting Rights Act

SSL

By default, open-uri verifies SSL certificates, and depending on your Ruby installation this verification can fail. This means a URL that starts with https may give you an error. We can get around this by adding one line below our require statements:

# import nokogiri to parse and open-uri to scrape
require 'nokogiri'
require 'open-uri'

# disable SSL checking to allow scraping
OpenSSL::SSL::VERIFY_PEER = OpenSSL::SSL::VERIFY_NONE
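
Note that reassigning the constant disables verification globally (and Ruby prints an "already initialized constant" warning). If you only need this for a single request, open-uri also accepts a per-call option; a sketch of that alternative:

# import nokogiri to parse, open-uri to scrape, openssl for the constant
require 'nokogiri'
require 'open-uri'
require 'openssl'

# skip certificate verification for this request only
doc = Nokogiri::HTML(open('https://nytimes.com',
  ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE))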

Mechanize

Sometimes you need to interact with a page before you can scrape it. The most common examples are logging in or submitting a form. Nokogiri is not set up to interact with pages. Nokogiri doesn't even scrape or download the page. That duty falls on open-uri. If you need to interact with a page, there is another gem you will have to use: Mechanize.

Mechanize is built by the same team as Nokogiri and is used for automating interactions with websites. It uses Nokogiri under the hood for parsing, so everything you have learned so far still applies.

To get started, install the mechanize gem:

$ gem install mechanize
Successfully installed mechanize-2.7.1

We're going to recreate the code sample from the installation section, where we parsed the top Google results for "packt", except this time we will start at the Google home page and submit the search form:

# mechanize takes the place of Nokogiri and open-uri
require 'mechanize'

# create a new mechanize agent
# think of this as launching your web browser
agent = Mechanize.new

# open a URL in your agent / web browser
page = agent.get('http://google.com/')

# the google homepage has one big search box
# if you inspect the HTML, you will find a form with the name 'f'
# inside of the form you will find a text input with the name 'q'
google_form = page.form('f')

# tell the page to set the q input inside the f form to 'packt'
google_form.q = 'packt'

# submit the form
page = agent.submit(google_form)

# loop through the nodes matching a CSS selector.
# mechanize pages use the search method, which accepts
# both XPath and CSS rules. Nokogiri documents support
# search too, if you prefer it
page.search('h3.r').each do |link|
  # print the link text
  puts link.content
end

Now execute the Ruby script and you should see the titles for the top results:

$ ruby mechanize.rb
Packt Publishing: Home
Books
Latest Books
Login/register
PacktLib
Support
Contact
Packt - Wikipedia, the free encyclopedia
Packt Open Source (PacktOpenSource) on Twitter
Packt Publishing (packtpub) on Twitter
Packt Publishing | LinkedIn
Packt Publishing | Facebook
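
Form submission is also how you handle the other common case mentioned earlier: logging in. The following sketch uses a hypothetical site; the URL, form name, and field names are placeholders, not a real API:

require 'mechanize'

agent = Mechanize.new

# fetch the (hypothetical) login page
page = agent.get('http://example.com/login')

# fill in the form -- 'login', 'username', and 'password' are placeholder names
login_form = page.form('login')
login_form.username = 'your_username'
login_form.password = 'your_password'

# submit and land on the logged-in page, ready for parsing
page = agent.submit(login_form)
puts page.title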

For more information refer to the site:

http://mechanize.rubyforge.org/