Book Image

Instant Nokogiri

By : S. Hunter Powers
Book Image

Instant Nokogiri

By: S. Hunter Powers

Overview of this book

A wealth of information sits waiting on the Internet. Instant Nokogiri helps you access this information today with Nokogiri, a slick and fast HTML and XML parsing engine. Bundled in an easy-to-use Ruby gem, Nokogiri empowers you to combine disparate data sources and gain an unprecedented insight into your Ruby applications. "Instant Nokogiri" is a hands-on guide to extracting information from the sources available on the Internet, sources that are not traditionally accessible to developers. You will learn the secrets of identifying content, extracting just the right parts, and incorporating the new data in your Ruby applications. "Instant Nokogiri" provides step-by-step instructions on how to incorporate the power of the Nokogiri gem and data parsing into your Ruby projects. You will learn all the basics of designing a project around data parsing, exploring disparate data sources, and refining strategies and theories. You will also combine your thoughts in a real-world, real-data sample application. This book will examine common Nokogiri and Ruby methods useful in scraping and parsing complete with practical code samples. You will also learn the secrets behind effective caching, rate limiting, and masking your identity. Instant Nokogiri will teach you how to get targeted data out of HTML and into Ruby, as well as tons of tips, tricks, code snippets, and expert advice.
Table of Contents (7 chapters)

So, what is Nokogiri?


Nokogiri (htpp://nokogiri.org/) is the most popular open source Ruby gem for HTML and XML parsing. It parses HTML and XML documents into node sets and allows for searching with CSS3 and XPath selectors. It may also be used to construct new HTML and XML objects.

The Nokogiri homepage is shown in the following screenshot:

Nokogiri is fast and efficient. It combines the raw power of the native C parser Libxml2 (http://www.xmlsoft.org/) with the intuitive parsing API of Hpricot (https://github.com/hpricot/hpricot).

The primary use case for a parsing library is data scraping. Data scraping is the process of extracting data intended for humans and structuring it for input into another program. Data by itself is meaningless without structure. Software imposes rigid structure over data referred to as format.

The same can be said of spoken language. We do not yell out random sounds and expect them to have meaning. We use words to form sentences to form meaning. This is our format. It is a loose structure. You could learn ten words in a foreign language, combine those with a few hand symbols, and add in a little amateur acting to convey fairly advanced concepts to people who don't speak your native tongue. This interpretive prowess is not shared by computers. Computer communication must follow protocols; fail to follow the protocol and no communication will be made.

The goal here is to bridge the two. Take the data intended for humans, get rid of the superfluous, and parse it into a structured data format for a computer. Data intended for humans is inherently fickle as the structure frequently changes. Data scraping should be used as a last effort and is generally appropriate in two scenarios: interfacing systems with incompatible data formats, and third-party sources lacking an API. If you aren't solving one of these two problems, you probably shouldn't be scraping.

An example of this is the most common scrape and parse use case in tutorials on the Internet: Amazon price searching. The scenario is: you have a database of products and you want up-to-date pricing information. The tutorials inevitably lead you through the process of scraping and parsing Amazon's search results to extract prices. The problem is, Amazon provides an API with all of this information and more on the Amazon Product Advertising API.

It is important to remember that you are using someone else's server resources when scraping. This is why the preferred method of accessing information should always be a developer approved API. An API in general will provide faster, cleaner, and more direct access to data while not expressing undue toll on the provider's servers.

A wealth of information sits waiting on the Internet. A small fraction is made easily accessible to developers via APIs. Nokogiri bridges that gap with its slick, fast, HTML and XML parsing engine bundled in an easy to use Ruby gem.