Apache Solr Enterprise Search Server

Apache Solr Enterprise Search Server - Third Edition

By: David Smiley, Eric Pugh, Kranti Parisa, Matt Mitchell

Overview of this book

Apache Solr is a widely popular open source enterprise search server that delivers powerful search and faceted navigation features, which are elusive with databases. Solr supports complex search criteria, faceting, result highlighting, query-completion, query spell-checking, relevancy tuning, geospatial searches, and much more.

This book is a comprehensive resource for just about everything Solr has to offer, and it will take you from first exposure to development and deployment in no time. Even if you wish to use Solr 5, you should find the information to be just as applicable due to Solr's high regard for backward compatibility. The book includes some useful information specific to Solr 5.

Preface

If you are a developer building an application today, then you know how important a good search experience is. Apache Solr, built on Apache Lucene, is a wildly popular open source enterprise search server that easily delivers the powerful search and faceted navigation features that are elusive with databases. Solr supports complex search criteria, faceting, result highlighting, query-completion, query spellcheck, relevancy tuning, and more.

Apache Solr Enterprise Search Server, Third Edition is a comprehensive resource for almost everything Solr has to offer. It takes the reader from first exposure through development to deployment. It also comes with complete running examples that demonstrate its use and show how to integrate Solr with other languages and frameworks, even Hadoop.

By using a large set of metadata, including artists, releases, and tracks, courtesy of the MusicBrainz.org project, you will have a testing ground for Solr and will learn how to import this data in various ways. You will then learn how to search this data in different ways, including with Solr's rich query syntax and by boosting match scores based on record data. Finally, we'll cover various deployment considerations, including indexing strategies and performance-oriented configuration, that will enable you to scale Solr to meet the needs of a high-volume site.

Note

Solr 4 or Solr 5?

See the What you need for this book section further below.

What this book covers

Chapter 1, Quick Starting Solr, introduces Solr to you so that you understand its unique role in your application stack. You'll get started quickly by indexing example data and searching it with Solr's sample /browse UI. This chapter is oriented to Solr 5, but the majority of content applies to Solr 4 too.

Chapter 2, Schema Design, guides you through an approach to modeling your data within Solr into one or more Solr indices and schemas. It covers the schema thoroughly and explores some of Solr's field types.

Chapter 3, Text Analysis, covers how to customize text tokenization, stemming, synonyms, and related matters to have fine control over keyword search matching. It also covers multilingual strategies.

Chapter 4, Indexing Data, explores all of the options Solr offers for importing data, such as XML, CSV, databases (SQL), and text extraction from common documents. This includes important information on commits, atomic updates, and real-time search.

Chapter 5, Searching, covers the query syntax, from the basics to Boolean options to more advanced wildcard and fuzzy searches, join queries, and geospatial search.

Chapter 6, Search Relevancy, explains how Solr scores documents for relevancy ranking. We'll review different options to influence the score, called boosting, and apply it to common examples such as boosting recent documents and boosting by a user vote.

Chapter 7, Faceting, shows you how to use Solr's killer feature—faceting. You'll learn about the different types of facets and how to build filter queries for a faceted navigation interface.

Chapter 8, Search Components, explores how to use a variety of valuable search features implemented as Solr search components. This includes result highlighting, query spellcheck, query suggest/complete, result grouping/collapsing, and more.

Chapter 9, Integrating Solr, explores some external integration options to interface with Solr. This includes some language-specific frameworks for Java, JavaScript, Ruby, and PHP, as well as a web crawler, Hadoop, a quick prototyping option, and more.

Chapter 10, Scaling Solr, covers how to tune Solr to get the most out of it. Then we'll introduce how to scale beyond one instance with SolrCloud.

Chapter 11, Deployment, guides you through deployment considerations, including deploying Solr to Apache Tomcat, configuring logging, securing Solr, and setting up Apache ZooKeeper.

Appendix, Quick Reference, serves as a small parameter quick-reference guide you can print to have within reach when you need it.

What you need for this book

The Getting started section in Chapter 1, Quick Starting Solr, explains what you need in detail. In summary, you should obtain:

Java 8, a JDK release. Java 7 is fine too. Support for Java 6 was last available in Solr 4.7. More information on this is in Chapter 1, Quick Starting Solr.
Apache Solr 4.8.1 is officially the version of Solr this book was written for. Nonetheless, some features from later versions of Solr, as far as 5.0, are discussed or referenced. In fact, Chapter 1, Quick Starting Solr, orients you to Solr 5, which has a different first-impression experience than its predecessor. Once you get Solr running, you should be able to follow along easily with Solr 5. In Chapter 10, Scaling Solr, there are some SolrCloud startup commands that are a little different, and we've pointed out how they change. The only substantial topics that evolved through the Solr 4 point releases and are not covered in this book are data-driven schemaless mode and the HTTP API calls for making schema changes.
The code supplement to the book. It's not essential, but you'll want it to try some of the examples or to experiment with a sizable amount of real data. See the Downloading the example code section.
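
The Java requirement in the list above can be checked with a short script. The following is a minimal sketch, not something taken from the book: the function names are hypothetical, and it assumes that `java -version` prints a line containing `version "..."`, as common JDK builds do. It treats Java 7 as the minimum, since support for Java 6 was last available in Solr 4.7.

```python
import re
import subprocess


def parse_java_major(version_output: str) -> int:
    """Extract the major Java version from `java -version` output.

    Handles the legacy "1.x" scheme (1.7 -> 7, 1.8 -> 8) as well as
    newer strings such as "11.0.2". (Hypothetical helper for illustration.)
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no Java version string found")
    first = int(match.group(1))
    second = int(match.group(2) or 0)
    # Legacy version strings start with "1.", so the second number is the major.
    return second if first == 1 else first


def installed_java_major() -> int:
    """Run `java -version` and return the major version it reports."""
    # `java -version` writes its report to stderr, not stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr)


def is_new_enough(major: int, minimum: int = 7) -> bool:
    """Java 7 or later is assumed sufficient for the Solr version used here."""
    return major >= minimum
```

Calling `is_new_enough(installed_java_major())` on a machine with a JDK on the PATH reports whether the installed Java meets the assumed minimum.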

Who this book is for

This book is primarily for developers who want to learn how to use Apache Solr in their applications. Only basic programming skills are assumed, although the vast majority of content should be useful to those with a solid technical foundation who have not yet programmed.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Typing java -version at a command line will tell you exactly which version of Java you are using, if any."

A block of code is set as follows:

"responseHeader": {
    "status": 0,
    "QTime": 1,
    "params": {
      "q": "lcd",
      "indent": "true",
      "wt": "json"
    }
  }
…
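
For context, the "params" object in a response header like the one above echoes the parameters of the request. The sketch below shows one way such a request URL might be built. It is an illustration under assumptions: the host, port, and core name are borrowed from the command-line example in this preface, and `build_select_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed base URL: host, port, and core name match the command-line
# example in this preface; adjust them for your own installation.
SOLR_CORE_URL = "http://localhost:8983/solr/techproducts"


def build_select_url(query: str, **extra_params: str) -> str:
    """Return a /select URL whose parameters Solr echoes back under "params"."""
    params = {"q": query, "indent": "true", "wt": "json"}
    params.update(extra_params)
    # urlencode escapes the values and joins them as key=value pairs.
    return SOLR_CORE_URL + "/select?" + urlencode(params)


print(build_select_url("lcd"))
```

Opening the printed URL in a browser, with Solr running and the techproducts core loaded, should return a JSON response that begins with a header of this shape.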

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

{
        "id": "9885A004",
        "name": "Canon PowerShot SD500",
        "manu": "Canon Inc.",
        "manu_id_s": "canon",
        "cat": [
          "electronics",
          "camera"
        ],
        "features": [
          "3x zoop, 7.1 megapixel Digital ELPH",
          "movie clips up to 640x480 @30 fps",
          "2.0\" TFT LCD, 118,000 pixels",
          "built in flash, red-eye reduction"
        ],
        "includes": "32MB SD card, USB cable, AV cable, battery",
        "weight": 6.4,
        "price": 329.95,
        "price_c": "329.95,USD",
        "popularity": 7,
        "inStock": true,
        "manufacturedate_dt": "2006-02-13T15:26:37Z",
        "store": "45.19614,-93.90341",
        "_version_": 1500358264225792000
      },
...

Any command-line input or output is written as follows:

>> cd example/exampledocs
>> java -Dc=techproducts -jar post.jar *.xml
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/techproducts/update using content-type application/xml...
POSTing file gb18030-example.xml
POSTing file hd.xml
etc.
14 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/techproducts/update...
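
The post.jar run above boils down to one HTTP POST per file followed by a commit. As a rough sketch only, the same steps could be written in Python as follows. The function names are hypothetical, the update URL is taken from the output above, and this is not how post.jar itself is implemented.

```python
from pathlib import Path
from urllib.request import Request, urlopen

# Assumed endpoint, taken from the post.jar output above.
UPDATE_URL = "http://localhost:8983/solr/techproducts/update"


def build_post_request(xml_path: Path) -> Request:
    """Build (but do not send) a POST of one XML file to the update handler."""
    return Request(
        UPDATE_URL,
        data=xml_path.read_bytes(),
        headers={"Content-Type": "application/xml"},
        method="POST",
    )


def post_all(directory: Path) -> int:
    """POST every *.xml file in `directory`, then commit; return the file count."""
    count = 0
    for xml_path in sorted(directory.glob("*.xml")):
        with urlopen(build_post_request(xml_path)):
            count += 1
    # A commit makes the newly indexed documents visible to searches.
    with urlopen(UPDATE_URL + "?commit=true"):
        pass
    return count
```

With Solr running, `post_all(Path("example/exampledocs"))` would send the same files as the command above. Building the request separately from sending it keeps the request logic easy to inspect without a live server.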

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Click on the Core Selector drop-down menu and select the techproducts link."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

A copy of the code bundle and possibly other information will also be available at http://www.solrenterprisesearchserver.com.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.
