There are many ways to index web pages. We could download them, parse them, and index them with the use of Lucene and Solr. The indexing part is not a problem, at least in most cases. But there is another problem – how to fetch them? We could possibly create our own software to do that, but that takes time and resources. That's why this recipe will cover how to fetch and index web pages using Apache Nutch.
For the purpose of this task we will be using Version 1.5.1 of Apache Nutch. To download the binary package of Apache Nutch, please go to the download section of http://nutch.apache.org.
Let's assume that the website we want to fetch and index is http://lucene.apache.org.
First of all we need to install Apache Nutch. To do that we just need to extract the downloaded archive to the directory of our choice; for example, I installed it in the /usr/share/nutch directory. Of course this is a single-server installation and it doesn't include the Hadoop filesystem, but for the purpose of this recipe it will be enough. This directory will be referred to as $NUTCH_HOME.

Then we open the $NUTCH_HOME/conf/nutch-default.xml file and set the value of http.agent.name to the desired name of our crawler (we've taken SolrCookbookCrawler as the name). It should look like the following code:

<property>
 <name>http.agent.name</name>
 <value>SolrCookbookCrawler</value>
 <description>HTTP 'User-Agent' request header.</description>
</property>
Now let's create empty directories called crawl and urls in the $NUTCH_HOME directory. After that we need to create the seed.txt file inside the created urls directory with the following contents:

http://lucene.apache.org
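The directory and seed file setup can be sketched as shell commands. As an assumption for the sketch, a temporary directory stands in for $NUTCH_HOME so the commands run anywhere; in a real setup you would work inside your actual Nutch installation directory:

```shell
# $NUTCH_HOME simulated with a temp dir for this sketch;
# in practice it would be e.g. /usr/share/nutch.
NUTCH_HOME=$(mktemp -d)

# One directory for crawl data, one for the seed list
mkdir "$NUTCH_HOME/crawl" "$NUTCH_HOME/urls"

# seed.txt lists the start URLs, one per line
echo 'http://lucene.apache.org' > "$NUTCH_HOME/urls/seed.txt"

cat "$NUTCH_HOME/urls/seed.txt"   # prints http://lucene.apache.org
```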
Now we need to edit the $NUTCH_HOME/conf/crawl-urlfilter.txt file. Replace the +. at the bottom of the file with +^http://([a-z0-9]*\.)*lucene.apache.org/, so the appropriate entry looks like the following:

+^http://([a-z0-9]*\.)*lucene.apache.org/
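To see what this filter accepts, we can test the same regular expression against a few candidate URLs with grep (the wiki subdomain URL below is purely illustrative):

```shell
# Only URLs matching the filter pattern pass through grep.
printf '%s\n' \
  'http://lucene.apache.org/' \
  'http://wiki.lucene.apache.org/core/' \
  'http://example.com/' \
  | grep -E '^http://([a-z0-9]*\.)*lucene.apache.org/'
# prints the two lucene.apache.org URLs; http://example.com/ is filtered out
```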
One last thing before fetching the data is Solr configuration.
We start with copying the index structure definition file (called schema-solr4.xml) from the $NUTCH_HOME/conf/ directory to your Solr installation configuration directory (which in my case was /usr/share/solr/collection1/conf/). We also rename the copied file to schema.xml.

We also create an empty stopwords_en.txt file in the same directory, or use the one provided with Solr if we want stop words removal.

Now we need to make two corrections to the schema.xml file we've copied:
The first one is the correction of the version attribute in the schema tag. We need to change its value from 1.5.1 to 1.5, so the final schema tag looks like this:

<schema name="nutch" version="1.5">
Then we change the boost field type (in the same schema.xml file) from string to float, so the boost field definition looks like this:

<field name="boost" type="float" stored="true" indexed="false"/>
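Both schema corrections can be scripted with sed. The snippet below demonstrates them on a minimal mock schema.xml rather than the real file (which lives in your Solr configuration directory); GNU sed's -i flag is assumed:

```shell
# Minimal mock of the buggy schema.xml shipped with Nutch 1.5.1
cat > schema.xml <<'EOF'
<schema name="nutch" version="1.5.1">
 <field name="boost" type="string" stored="true" indexed="false"/>
</schema>
EOF

# Correction 1: fix the schema version attribute
sed -i 's/version="1.5.1"/version="1.5"/' schema.xml
# Correction 2: change the boost field type from string to float
sed -i 's/name="boost" type="string"/name="boost" type="float"/' schema.xml

cat schema.xml
```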
Now we can start crawling and indexing by running the following command from the $NUTCH_HOME directory:
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 50
Depending on your Internet connection and your machine configuration, after some time you should see a message similar to the following one:
crawl finished: crawl-20120830171434
This means that the crawl is completed and the data was indexed to Solr.
After installing Nutch and Solr, the first thing we did was set our crawler's name. Nutch does not allow empty names, so we must choose one. The nutch-default.xml file defines many more properties than the one mentioned, but at this point that is the only one we need to change.
In the next step, we created two directories: one (crawl) that will hold the crawl data, and the second (urls) to store the addresses we want to crawl. The seed.txt file we created contains the addresses we want to crawl, one address per line.

The crawl-urlfilter.txt file contains the filters that will be used to check the URLs that Nutch crawls. In the example, we told Nutch to accept every URL that begins with http://lucene.apache.org.
The schema.xml file we copied from the Nutch configuration directory is prepared to be used when Solr handles the indexing. However, the one for Solr 4.0 is a bit buggy, at least in the Nutch 1.5.1 distribution, which is why we needed to make the changes previously mentioned.
We finally came to the point where we ran the Nutch crawl command. The first parameter (urls) points to the directory holding the addresses we want to crawl; because we didn't pass the -dir switch, Nutch stores the crawled data in an automatically named crawl-<timestamp> directory, as the finishing message shows. The -solr switch lets you specify the address of the Solr server that will be responsible for indexing the crawled data, and it is mandatory if you want the data indexed with Solr; we decided to index the data to Solr installed on the same server. The -depth parameter specifies how deep to go following the links; in our example, we go at most three links away from the main page. The -topN parameter specifies how many documents will be retrieved from each level, which we set to 50.
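As a back-of-the-envelope check, these two parameters bound the size of the crawl: each of the -depth rounds selects at most -topN URLs to fetch, so:

```shell
# Upper bound on pages fetched with -depth 3 -topN 50:
# each generate/fetch round picks at most topN URLs.
depth=3
topN=50
echo "$((depth * topN))"   # prints 150
```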
There is one more thing worth knowing when you start your journey in the land of Apache Nutch. The crawl command of the Nutch command-line utility has another option: it can be configured to run crawling with multiple threads. To achieve that, you add the following parameter:

-threads N

So if you would like to crawl with 20 threads, you should run the crawl command like so:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 50 -threads 20
If you are seeking more information about Apache Nutch, please visit http://nutch.apache.org and go to the Wiki section.