Apache Solr Beginner's Guide

By: Alfredo Serafini

Overview of this book

With over 40 billion web pages, optimizing a search engine's performance is essential. Solr is an open source enterprise search platform from the Apache Lucene project. Full-text search, faceted search, hit highlighting, dynamic clustering, database integration, and rich document handling are just some of its many features. Solr is highly scalable thanks to its distributed search and index replication. Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Apache Tomcat or Jetty. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it usable with most popular programming languages. Solr's powerful external configuration allows it to be tailored to many types of application without Java coding, and it has a plugin architecture to support more advanced customization.

With Apache Solr Beginner's Guide you will learn how to configure your own search engine experience. Using real data as an example, you will have the chance to start writing step-by-step, simple, real-world configurations and understand when and where to adopt this technology.

Apache Solr Beginner's Guide will start by letting you explore a simple search over real data. You will then go through a step-by-step description that gives you the chance to explore several practical features. At the end of the book you will see how Solr is used in different real-world contexts. Using data from public sources such as DBpedia, you will define several different configurations, exploring some of the most interesting Solr features, such as faceted search and navigation, auto-suggestion, and rich document indexing. You will see how to configure different analysers for handling different data types, without programming. You will learn the basics of Solr, focusing on real-world examples and practical configurations.

Learning the powerful aspects of Solr

Solr is a powerful, flexible, and mature technology. It offers not only full-text search capabilities but also autosuggestion, advanced filtering, geocoded search, highlighting in text, faceted search, and much more. The following are the most interesting ones from our perspective:

Advanced full-text search: This is the most obvious option. If we need to create some kind of internal search engine on our site or application, or if we want more flexibility than the internal search capabilities of our database provide, Solr is a strong choice. Solr is designed to perform fast searches and also to give us some flexibility on terms, which is useful for matching the way users naturally search, as we will see later. We can also combine our search with out-of-the-box functionality to perform searches over value intervals (imagine a search for a certain period in time), or by using geocoding functions.
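To make the idea of combining a text search with a value interval concrete, here is a minimal sketch that only builds the request URL; it does not contact a server. The host, port, core name (paintings), and the year field are assumptions made for illustration, while q, fq, and wt are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Assumed endpoint: host, port, and core name are hypothetical.
SOLR_SELECT = "http://localhost:8983/solr/paintings/select"


def build_search_url(text, year_from, year_to):
    """Combine a full-text query with an inclusive range filter.

    The 'year' field is a hypothetical field name used for illustration.
    """
    params = {
        "q": text,                                 # full-text part of the search
        "fq": f"year:[{year_from} TO {year_to}]",  # inclusive range filter
        "wt": "json",                              # ask for a JSON response
    }
    return SOLR_SELECT + "?" + urlencode(params)


url = build_search_url("portrait", 1500, 1600)
print(url)
```

The range is expressed with the `[a TO b]` syntax, and urlencode takes care of escaping the brackets, colon, and spaces.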
Suggestions: Solr has components for creating autosuggestion results using internal similarity algorithms. This is useful because autosuggestion is one of the most intuitive user interface patterns; for example, think about the well-known Google search box that is shown in the following screenshot:
This simple Google search box performs queries on a remote server while we are typing, and automatically shows us some alternative term sequences that can be used for a query and have a chance of being relevant for us; it uses recurring terms and similarity algorithms over the data for this purpose. In the example, the tutorial keyword is suggested before the drupal one as it is judged more relevant by the system. With Solr, we can provide the backend service for developing our own autosuggestion component, inspired by this example.
Language analysis: Solr permits us to configure different types of language analysis, even on a per-field basis, and to tune them specifically for a certain language. Moreover, integrations with tools such as Apache UIMA for metadata extraction already exist; in general, more components that plug in to the architecture may become available in the future, covering advanced language processing, information extraction capabilities, and other specific tasks.
Faceted search: This is a particular type of search based on classification. With Solr, we can perform faceted search automatically over our fields to gain information such as how many documents have the value London for the city field. This is useful for constructing some kind of faceted navigation, another very familiar pattern in user experience that you probably know from having used it on e-commerce sites such as Amazon. To see an example of faceted navigation, imagine a search on the Amazon site where we are typing apache s, as shown in the following screenshot:
In the previous screenshot you can clearly recognize some facets in the top-left corner, suggesting that we will find a certain number of items under a specific "book category". For example, we know in advance that we will find 11 items for the facet "Books: Java Programming". Then, we can decide from this information whether to narrow our search or not. If we click on the facet, a new query will be performed, adding a filter based on the choice we implicitly made. This is exactly how a Solr faceted search performs a similar query. The term category here is somewhat misleading, as it seems to suggest a predefined taxonomy. But with Solr we can also obtain facets on our fields without explicitly classifying the document under a certain category. It is Solr itself that automatically returns the faceted result using the current search keywords and criteria, and shows us how many documents have the same value for a certain field. You may note that we have used an example of a user interface to give an introductory explanation of the service behind it. This is true, and we can use faceted results in many different ways, as we will see later in the book. But I feel the example should help you to grasp the basic idea; we will explore this in Chapter 6, Using Faceted Search – from Searching to Finding.
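The facet workflow described above (ask for counts, then narrow with a filter when a facet is clicked) can be sketched as follows. This is an illustration under stated assumptions, not the book's code: the city field and the sample response are invented, and no server is contacted. The flat list of alternating values and counts mirrors Solr's default JSON layout for field facets.

```python
import json
from urllib.parse import urlencode


def facet_params(query, field):
    """Query string asking only for facet counts on one field (rows=0)."""
    return urlencode({"q": query, "facet": "true", "facet.field": field,
                      "rows": 0, "wt": "json"})


def facet_counts(response, field):
    """Turn the flat [value, count, value, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


def narrow(query, field, value):
    """Query string for the follow-up search after a facet is clicked."""
    return urlencode({"q": query, "fq": f'{field}:"{value}"', "wt": "json"})


# Hand-written sample standing in for a real response (values are invented).
sample = json.loads(
    '{"facet_counts": {"facet_fields": {"city": ["London", 12, "Paris", 7]}}}'
)
counts = facet_counts(sample, "city")
print(counts)
print(narrow("museum", "city", "London"))
```

Keeping the original q and adding the choice as an fq filter matches the behaviour described in the text: the keywords stay the same and the facet only restricts the result set.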
It's easy to index data using Solr: for example, we can send data using a POST over HTTP, or we can index the text and metadata of a collection of rich documents (such as PDF, Word, or HTML) without too much effort, using the Apache Tika component. We can also read data from a database or another external data source, and configure an internal workflow to index it directly if needed, using the DataImportHandler components.
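As a rough sketch of indexing through a POST over HTTP, the following builds a request that carries documents as JSON. The endpoint, core name, and document fields are assumptions, and whether the update handler is exposed at this exact path depends on the Solr version and configuration. The request is only constructed here, not sent.

```python
import json
from urllib.request import Request

# Assumed endpoint: host, port, core name, and handler path are hypothetical.
UPDATE_URL = "http://localhost:8983/solr/paintings/update?commit=true"


def build_update_request(docs):
    """Wrap a list of documents (dicts) in a JSON POST request."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        UPDATE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Field names 'id' and 'title' are illustrative, not taken from a real schema.
req = build_update_request([{"id": "1", "title": "Mona Lisa"}])
print(req.get_method(), req.full_url)
# Passing req to urllib.request.urlopen would send it to a running instance.
```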
Solr also exposes its search services through a REST-like interface using standard open formats such as JSON and XML, so it is very simple to consume the data from JavaScript over HTTP.
Note
Representational State Transfer (REST) is a software architecture style that is widely used nowadays for exposing web services.
Refer to: http://en.wikipedia.org/wiki/Representational_state_transfer
The services are designed to be paginated and to expose parameters for sorting and filtering results, so it's easy to consume the results from the frontend.
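Pagination and sorting come down to a few request parameters. The sketch below uses the standard start, rows, and sort parameters; the page size and sort order chosen as defaults are assumptions for illustration.

```python
from urllib.parse import urlencode


def page_params(query, page, page_size=10, sort="score desc"):
    """Query string for one page of results; pages are numbered from 1."""
    start = (page - 1) * page_size  # offset of the first result on this page
    return urlencode({"q": query, "start": start, "rows": page_size,
                      "sort": sort, "wt": "json"})


# Third page of a match-all query with the assumed defaults.
print(page_params("*:*", page=3))
```

A frontend would typically keep q and sort fixed and change only the page number as the user moves through the results.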
Note that other serialization formats can be used, designed for specific languages such as Ruby or PHP, or to directly return the serialization of a Java object. There are already some third-party wrappers developed over these services to provide integration with existing applications, from Content Management Systems (CMS) such as Drupal and WordPress, to e-commerce platforms such as Magento. In a similar way, there are integrations, such as Alfresco and Broadleaf, that use the Java APIs directly (if you prefer, you can see this as a type of "embedded" example).

It's possible to start Solr with a very small configuration, adopting an almost schemaless approach; when we do define a schema, it is written in XML and is simple to read and write. The Solr application gives us a default web interface for administration, simple monitoring of resource usage, and direct testing of our queries.

This list is far from exhaustive; its only purpose is to introduce you to some of the topics that we will see in the next chapters. If you visit the official site at http://lucene.apache.org/solr/features.html, you will find the complete list of features.
