Solr 1.4 Enterprise Search Server

By : David Smiley, Eric Pugh

Overview of this book

If you are a developer building a high-traffic web site, you need to have a terrific search engine. Sites like Netflix.com and Zappos.com employ Solr, an open source enterprise search server, which uses and extends the Lucene search library. This is the first book in the market on Solr and it will show you how to optimize your web site for high volume web traffic with full-text search capabilities along with loads of customization options. So, let your users gain a terrific search experience.

This book is a comprehensive reference guide for every feature Solr has to offer. It serves the reader right from initiation to development to deployment. It also comes with complete running examples to demonstrate its use and show how to integrate it with other languages and frameworks.

This book first gives you a quick overview of Solr, and then gradually takes you from basic to advanced features that enhance your search. It starts off by discussing Solr and helping you understand how it fits into your architecture: where all databases and document/web crawlers fall short, and Solr shines. The main part of the book is a thorough exploration of nearly every feature that Solr offers. To keep this interesting and realistic, we use a large open source set of metadata about artists, releases, and tracks courtesy of the MusicBrainz.org project. Using this data as a testing ground for Solr, you will learn how to import this data in various ways from CSV to XML to database access. You will then learn how to search this data in a myriad of ways, including Solr's rich query syntax, "boosting" match scores based on record data and other means, searching across multiple fields with different boosts, getting facets on the results, auto-completing user queries, spell-correcting searches, highlighting queried text in search results, and so on.

After this thorough tour, we'll demonstrate working examples of integrating a variety of technologies with Solr such as Java, JavaScript, Drupal, Ruby, XSLT, PHP, and Python.

Finally, we'll cover various deployment considerations to include indexing strategies and performance-oriented configuration that will enable you to scale Solr to meet the needs of a high-volume site.

Getting started


Solr is a Java-based web application, but you don't need to be particularly familiar with Java in order to use it. With most topics, this book assumes little to no such knowledge on your part. However, if you wish to extend Solr, then you will definitely need to know Java. I also assume a basic familiarity with the command line, whether it is DOS or any Unix shell.

Before truly getting started with Solr, let's get the prerequisites out of the way. Note that if you are using Mac OS X, then you should have the needed pieces already (though you may need the developer tools add-on). If any of the version-check commands mentioned below fails, then you don't have that piece installed. URLs are provided for convenience, but it is up to you to install the software according to the instructions provided at the relevant sites.

  • A Java Development Kit (JDK) v1.5 or later: You can download the JDK from http://java.sun.com/javase/. Typing java -version will tell you which version of Java you are using, if any, and you should type javac -version to ensure that you have the development kit too. You only need the JRE to run Solr, but you will need the JDK to compile it from source and to extend it.

  • Apache Ant: Any recent version should do and is available at http://ant.apache.org/. If you never modify Solr and just stick to a recent official release, then you can skip this. Note that the software provided with this book uses Ant as well. Therefore, you'll want Ant if you wish to follow along. Typing ant -version should demonstrate that you have it installed.

  • Subversion or Git for source control of Solr: http://subversion.tigris.org/getting.html or http://git-scm.com/. This isn't strictly necessary, but it's recommended for working with Solr's source code. If you choose to use a command line based distribution of either, then svn --version or git --version should work. Further instructions in this book are based on the command line, because it is a universal access method.

  • Any Java EE servlet engine app-server: This is a Java web server. Solr includes one already, Jetty, and we'll be using this throughout the book. In a later chapter, "Solr in the real world", deploying to an alternative is discussed.
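
The version checks listed above can be bundled into one small script. The following is a minimal sketch, assuming a POSIX shell; check_tool is a hypothetical helper name and not part of Solr, Ant, or the JDK. It only reports whether each command is on the PATH, so you should still run the individual version commands to confirm the versions.

```shell
#!/bin/sh
# check_tool is a hypothetical helper: it reports whether a command
# can be found on the PATH and returns non-zero when it cannot.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# java and javac come from the JDK, ant is needed for building,
# and svn or git is recommended but not strictly necessary.
for tool in java javac ant svn git; do
    check_tool "$tool" || true
done
```

A line reading "missing" for svn or git is not a problem as long as you have the one you intend to use.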

The last official release or fresh code from source control

Let's finally get started and get Solr running. The official site for Solr is at http://lucene.apache.org/solr, where you can download the latest official release. Solr 1.3 was released on September 15th, 2008. Solr 1.4 is expected around the same time a year later and thus is probably available as you read this. This book was written in-between these releases and so it contains many but not all of 1.4's features. An alternative to downloading an official release is getting the latest code from source control (that is version control). In either case, the directory structure is conveniently identical and both include the source code. For many open source projects, the choice is almost always the last official release and not the latest source.

However, Solr's committers have made unit and integration testing a priority, evident by the testing infrastructure and test code-coverage of over 70 percent (http://hudson.zones.apache.org/hudson/view/Solr/job/Solr-trunk/clover/), which is very good. Many projects have none at all. As a result, the latest source code is very stable, and it also makes changes to Solr easier, given that so many tests are in place to give confidence that Solr is working properly (so far as the tests test it, of course). And unlike a database, which is almost never modified to suit the needs of a project, Solr is modified often. Also note that there are a good many feature additions provided as source code patches within Solr's JIRA (its issue tracking system). The decision is of course up to you. If you are satisfied with the feature-set in the latest release and/or you don't think you'll be modifying Solr at all, then the latest release is fine. One way to gauge what (completed) features are not yet in the latest official release is to visit Solr's JIRA at http://issues.apache.org/jira/browse/SOLR, and then click on Roadmap. Also, the Wiki at http://wiki.apache.org/solr/ should have features that are not yet in the latest release version marked as such.

Tip

Choose to get Solr through source control even if you are going to stick with the last official release. When/if you make changes to Solr, it will then be easier to see what those differences are. Switching to a different release becomes much easier too.

We're going to get the code through Subversion and check out the trunk (a source control term for the latest code). If you are using an IDE or some GUI tool for Subversion, then feel free to use that. The command line will suffice too. You should be able to successfully execute the following:

svn co http://svn.apache.org/repos/asf/lucene/solr/trunk/ solr_svn

That will result in Solr being checked out into the solr_svn directory. If you prefer one of the official releases, then use one of the URLs listed under http://svn.apache.org/repos/asf/lucene/solr/tags/ instead of the one above (put that address into your web browser to see the choices). So-called nightlies are also available if you don't want to use Subversion but want recent code.

Testing and building Solr

If you prefer a downloadable pre-built Solr instead of using Subversion, then you can skip this section.

Tip

Ant basics

Apache Ant is a cross-platform build scripting tool whose scripts are specified in XML. It is largely Java oriented. An Ant script is assumed to be named build.xml in the root of a project. It contains a set of named Ant targets that you can run. To list them along with their descriptions, type ant -p. To run a target, simply supply it to ant as the first argument, such as ant compile. Targets often internally invoke other targets, and you'll see this in the output. In the end, Ant should report BUILD SUCCESSFUL if successful and BUILD FAILED if not. Note that the term 'build' is used universally in Ant, even if 'build' is not an apt description of what a target performed.

Testing and building Solr is easy. Before we build Solr, we're going to test it to ensure that there are no failing tests. Simply execute the test target from Solr's installation directory by typing ant test. It should execute without any errors. On my old machine, it took about ten minutes to run. If there are errors (extremely rare), then you'll have to switch to a different version or wait a short while for the problem to be fixed. Now, to build a ready-to-install Solr, just type ant dist. This is going to fill the dist directory with some JAR files and a WAR file. If you are not familiar with Java, these files are a packaging mechanism for compiled code and related resources. These files are technically ZIP files but with a different file extension, and so you can use any ZIP file tools to view their contents. The most important one is the WAR file, which we'll be using next.
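
Because JAR and WAR files are ZIP files, an ordinary ZIP tool can list what went into them. The following is a minimal sketch, assuming the unzip command is installed; list_war is a hypothetical helper name, and the WAR filename in the usage comment is the one referred to later in this chapter, which may differ for your build.

```shell
#!/bin/sh
# list_war is a hypothetical helper: it lists the entries of a WAR
# (or JAR) file with unzip, after checking that the file exists.
list_war() {
    if [ ! -f "$1" ]; then
        echo "no such file: $1" >&2
        return 1
    fi
    unzip -l "$1"
}

# Usage, after "ant dist" has filled the dist directory:
# list_war dist/apache-solr-1.4.war
```

The JDK's own jar tool offers an equivalent listing with jar tf followed by the filename.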

Solr's installation directory structure

In this section, we'll orient you to Solr's directory structure. This is not Solr's home directory, but a different place that we'll mention after this.

  • build: Only appears after Solr is built to house compiled code before being packaged. You won't need to look in here.

  • client: Contains convenient language-specific APIs for talking to Solr as an alternative to using your own code to send XML over HTTP. As of this writing, this only contains a couple of Ruby choices. The Java client called SolrJ is actually in src/solrj. More information on using clients to communicate with Solr is in Chapter 8.

  • dist: The built Solr JAR files and WAR file are here, as well as the dependencies. This directory is created and filled when Solr is built.

  • example: This is an installation of the Jetty servlet engine (a Java web server) including some sample data and Solr configuration. The interesting child directories are:

    • example/etc: Jetty's configuration. Among other things, here you can change the web port used from the pre-supplied 8983 to 80 (HTTP default).

    • example/multicore: Houses multiple Solr home directories in a Solr multicore setup. This will be discussed in Chapter 7.

    • example/solr: A Solr home directory for the default setup that we'll be using.

    • example/webapps: Solr's WAR file is deployed here.

  • lib: All of Solr's API dependencies. The larger pieces are Lucene, some Apache commons utilities, and Stax for efficient XML processing.

  • site: This is for managing what is published on the Solr web site. You won't need to go in here.

  • src: Various source code. It's broken down into a few notable directories:

    • src/java: Solr's source code, written in Java.

    • src/scripts: Unix bash shell scripts, particularly useful in larger production deployments employing multiple Solr servers.

    • src/solrj: Solr's Java client.

    • src/test: Solr's test source code and test files.

    • src/webapp: Solr's web administration interface, including Java Servlets (source code form) and JSPs. This is mostly what constitutes the WAR file. The JSPs for the admin interface are under here in web/admin/, if you care to tweak any to your needs.

If you are a Java developer, you may have noticed that the Java source in Solr is not located in one place. It's in src/java for the majority of Solr, src/common for the parts of Solr that are common to both the server side and the SolrJ client side, src/test for the test code, and src/webapp/src for the servlet-specific code. I am merely pointing this out to help you find code, not to be critical. Solr's files are well organized.

Solr's home directory

A Solr home directory contains Solr's configuration and data (a Lucene index) for a running Solr instance. Solr includes a sample one at example/solr, which we'll be using in-place throughout most of the book. Technically, example/multicore is also a valid Solr home but for a multi-core setup, which will be discussed much later. You know you're looking at a Solr home directory when it contains either a solr.xml file (formerly multicore.xml in Solr 1.3), or both a conf and a data directory, though strictly speaking these might not be the actual requirements.

Note

The data directory might not be present yet if you haven't started Solr. Solr creates it on startup if it's missing, assuming it's not configured to be named differently.
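
The rule of thumb above for recognizing a Solr home directory can be written as a small shell test. This is a minimal sketch of that heuristic only, not of what Solr itself checks; looks_like_solr_home is a hypothetical helper name. As noted, a fresh home whose data directory has not been created yet will not pass the second check.

```shell
#!/bin/sh
# looks_like_solr_home is a hypothetical helper implementing the
# rule of thumb from the text: the directory contains solr.xml,
# or it contains both a conf and a data directory.
looks_like_solr_home() {
    [ -f "$1/solr.xml" ] && return 0
    [ -d "$1/conf" ] && [ -d "$1/data" ]
}

# Usage from Solr's installation directory:
# looks_like_solr_home example/solr && echo "looks like a Solr home"
```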

Solr's home directory is laid out like this:

  • bin: Suggested directory to place Solr replication scripts, if you have a more advanced setup.

  • conf: Configuration files. The two I mention below are very important, but it will also contain some other .txt and .xml files, which are referenced by these two files for different things such as special text analysis steps.

  • conf/schema.xml: This is the schema for the index including field type definitions with associated analyzer chains.

  • conf/solrconfig.xml: This is the primary Solr configuration file.

  • conf/xslt: This directory contains various XSLT files that can be used to transform Solr's XML query responses into formats such as Atom/RSS.

  • data: Contains the actual Lucene index data. It's binary data, so you won't be doing anything with it except perhaps deleting it occasionally.

  • lib: Optional placement of extra Java JAR files that Solr will load on startup, allowing you to externalize plugins from the Solr distribution (the WAR file) for convenience. If you extend Solr without modifying Solr itself, then those modifications can be deployed in a JAR file here.

It's really important to know how Solr finds its home directory. This is covered next.

How Solr finds its home

In the next section, you'll start Solr. When Solr starts up, about the first thing it does is load its configuration from its home directory. Where that is exactly can be specified in several different ways.

Solr first checks for a Java system property named solr.solr.home. There are a few ways to set a Java system property, but a universal one, no matter which servlet engine you use, is through the command line where Java is invoked. You could explicitly set Solr's home like so when you start Jetty: java -Dsolr.solr.home=solr/ -jar start.jar

Alternatively, you could use Java Naming and Directory Interface (JNDI) to bind the directory path to java:comp/env/solr/home. As with Java system properties, there are multiple ways to do this. Some are app-server dependent, but a universal one is to add the following to the WAR file's web.xml located in src/webapp/web/WEB-INF (you'll find this there already, but commented out):

<env-entry>
  <env-entry-name>solr/home</env-entry-name>
  <env-entry-value>solr/</env-entry-value>
  <env-entry-type>java.lang.String</env-entry-type>
</env-entry>

As this is a change to web.xml, you'll need to re-run ant dist-war to repackage the WAR file and then redeploy it. With the Jetty supplied with Solr, this alone is insufficient, because JNDI itself isn't set up there. I'm not going to get into this further, because if you know what JNDI is and want to use it, then you'll surely figure out how to do it for your particular app-server.

Finally, if Solr's home isn't configured as a Java system property or through JNDI, then it defaults to solr/. In the examples above, I used that particular path too. We're going to simply stick with this path for the rest of this book, because this is a development, not production, setting.

Tip

In a production environment, you will almost certainly configure Solr's home rather than let it fall back to the default solr/. You will also probably use an absolute path instead of a relative one, which wouldn't work if you accidentally start your app-server from a different directory.
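
One way to follow the advice about absolute paths is to resolve the directory before handing it to Java. The following is a minimal sketch, assuming a POSIX shell; abs_dir is a hypothetical helper name, while the solr.solr.home property and the start.jar invocation are the ones shown earlier in this section.

```shell
#!/bin/sh
# abs_dir is a hypothetical helper: it prints the absolute path of an
# existing directory, and fails if the directory does not exist.
# The body runs in a subshell, so the caller's directory is unchanged.
abs_dir() (
    cd "$1" 2>/dev/null && pwd
)

# Usage from the example directory (left commented out so that this
# sketch does not actually start Jetty):
# java -Dsolr.solr.home="$(abs_dir solr)" -jar start.jar
```

With the path resolved up front, the configured home no longer depends on the directory the app-server happens to be started from.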

When troubleshooting setting Solr's home, be sure to look at the very first Solr log messages when Solr starts:

Aug 7, 2008 4:59:35 PM org.apache.solr.core.Config getInstanceDir

INFO: Solr home defaulted to 'null' (could not find system property or JNDI)

Aug 7, 2008 4:59:35 PM org.apache.solr.core.Config setInstanceDir

INFO: Solr home set to 'solr/'

This shows that Solr was left to default to solr/. You'll see this output when you start Solr, as described in the next section.
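
When troubleshooting, it can help to filter the startup output down to just these messages. The following is a minimal sketch, assuming the log messages go to the console as in the output above; solr_home_log_lines is a hypothetical helper name.

```shell
#!/bin/sh
# solr_home_log_lines is a hypothetical helper: it reads log output on
# standard input and keeps only the lines that mention "Solr home".
solr_home_log_lines() {
    grep 'Solr home'
}

# Usage from the example directory (commented out so that this sketch
# does not start Jetty). 2>&1 is included in case the messages are
# written to standard error:
# java -jar start.jar 2>&1 | solr_home_log_lines
```

Keep in mind that this hides the rest of Jetty's output, so it is only useful for a quick check of the home directory.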

Deploying and running Solr

The file we're going to deploy is the file ending in .war in the dist directory (dist/apache-solr-1.4.war). The WAR file in particular is important, because this single file represents an entire Java web application. It includes Solr's JAR file, all of Solr's dependencies (which amount to other JAR files), Java Server Pages (JSPs) (which are rendered to a web browser when the WAR is deployed), and various configuration files and other web resources. It does not include Solr's home directory, however.

How one deploys a WAR file to a Java servlet engine depends on that servlet engine, but it is common for there to be a directory named something like webapps, which contains WAR files optionally in an expanded form. By expanded, I mean that the WAR file may be uncompressed and thus a directory by the same name. This can be a convenient deployed form in order to make changes in-place (such as to JSP files and static web files) without requiring rebuilding a WAR file and replacing an existing one. The disadvantage is that changes are not directly tracked by source control (for example, Subversion). Another thing to note about the WAR file is that by convention, its name (without the .war extension, if present) is the path portion of the URL where the web server mounts the web application. For example, if you have an apache-solr-1.4.war file, then you would access it at http://localhost:8983/apache-solr-1.4/, assuming it's on the local machine and running at that default port.
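
The naming convention described above amounts to a simple string transformation. The following is a minimal sketch of it, assuming a POSIX shell; war_url is a hypothetical helper name, and the host and port default to the localhost and 8983 values used in this chapter.

```shell
#!/bin/sh
# war_url is a hypothetical helper: given a WAR file name (and
# optionally a host and port), it prints the URL at which a servlet
# engine would conventionally mount that web application.
war_url() (
    name=$(basename "$1" .war)   # drop any directory and the .war extension
    echo "http://${2:-localhost}:${3:-8983}/${name}/"
)

war_url dist/apache-solr-1.4.war   # prints http://localhost:8983/apache-solr-1.4/
war_url solr.war                   # prints http://localhost:8983/solr/
```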

We're going to deploy this WAR file into the Jetty servlet engine included with Solr. If you are using a pre-built downloaded Solr distribution, then Solr is already deployed into Jetty as solr.war. Otherwise, Solr has an ant target called example that does this (and some other things we don't care about), so you can simply run ant example. This target doesn't keep the original WAR filename when copying it; it abbreviates it to simply solr.war, which means that the URL path is just solr. By the way, because ant targets generally call other necessary ant targets, it was technically not necessary to run ant dist earlier in order for this step to work. Running ant example alone would not have run the tests, however.

Now we're going to start up Jetty and finally see Solr running (albeit without any data to query yet). First go to the example directory, and then run Jetty's start.jar file by typing the following command:

cd example
java -jar start.jar

You'll see about a page of output including references to Solr. When it is finished, you should see this output at the very end of the command prompt:

2008-08-07 14:10:50.516::INFO: Started SocketConnector @ 0.0.0.0:8983

The 0.0.0.0 means it's listening for connections from any host (not just localhost, notwithstanding potential firewalls) and 8983 is the port. Seeing this message doesn't necessarily mean that Solr was deployed successfully. You might see an error such as a stack trace in the output if something went wrong. Even if it did go wrong, you should be able to access the web server at this address: http://localhost:8983. It will show you a list of links to web applications, which will just be Solr for this setup. Solr should have this link: http://localhost:8983/solr, and if you go there, then you should either see details about an error if Solr wasn't loaded correctly, or a simple page with a link to Solr's admin page, which should be http://localhost:8983/solr/admin/. You'll be visiting that link often.

Tip

To quit Jetty (and many other command line programs for that matter), hit Ctrl-C on the keyboard.