Scaling Apache Solr

By: Hrishikesh Vijay Karambelkar

Features of Apache Solr


Apache Solr comes with a rich set of features that enterprises can use to make the search experience unique and effective. Let's take an overview of some of these key features. The next chapter covers how to configure them in more depth.

Solr for end users

A search is effective when the searched information can be seen in different dimensions. For example, suppose a visitor who is interested in buying a camera visits an online shopping website and searches for a particular model. When the query is executed, the search ranks and returns a huge number of results. It would be helpful if the visitor could filter the results by the resolution or the make of the camera. These are the dimensions that help the user refine the query. Apache Solr offers a unique user experience that enables users to retrieve information faster.

Powerful full text search

Apache Solr provides a powerful full text search capability. Besides normal search, Solr users can search within specific fields, for example, error_id:severe. Apache Solr supports wildcards in queries. A search pattern consisting only of one or more asterisks will match all terms of the field in which it is used, for example, book_title:*. A question mark can be used where a single character may vary. For example, a search for ?ar will match car, bar, and jar, and a search for c?t will match cat, cot, and cut. Overall, Apache Solr supports the following expressions to help users find information in many different ways:

  • Wildcards

  • Phrase queries

  • Regular expressions

  • Conditional logic (AND, OR, NOT)

  • Range queries (date/integer)
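As an illustration of the expressions listed above, the following minimal sketch builds URL-encoded query strings for each kind of expression. The host, the core name (collection1), and the name and price fields are assumptions made for this example; it only constructs URLs and does not contact a Solr server.

```python
from urllib.parse import urlencode

# Assumed endpoint of a local Solr core; adjust the host and core name as needed.
BASE_URL = "http://localhost:8983/solr/collection1/select"

# One example per expression type. The name and price fields are hypothetical.
EXAMPLE_QUERIES = {
    "field search": "error_id:severe",
    "wildcard (any terms)": "book_title:*",
    "wildcard (one character)": "name:c?t",
    "phrase query": 'book_title:"apache solr"',
    "conditional logic": "error_id:severe AND NOT book_title:draft",
    "range query": "price:[100 TO 500]",
}


def select_url(query, base=BASE_URL):
    """Return a request URL with the query string URL-encoded in the q parameter."""
    return base + "?" + urlencode({"q": query})


for label, query in EXAMPLE_QUERIES.items():
    print(label, "->", select_url(query))
```

Encoding the query matters because characters such as the colon, brackets, and spaces are not safe to place in a URL as they are.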

Search through rich information

Apache Solr can generate indexes from many file types, including rich documents such as HTML, Word, Excel, presentations, PDF, RTF, e-mail, ePub, .zip files, and many more. It achieves this by integrating packages such as Lucene and Apache Tika. When these documents are uploaded to Apache Solr, they are parsed and Solr generates an index for search. Additionally, Solr can be extended to work with specific formats by creating custom handlers/adapters for them. This feature makes Apache Solr a good fit for enterprises dealing with different types of data.

Results ranking, pagination, and sorting

When searching for information, Apache Solr returns results page by page, starting with the top K results. Each result row carries a score, and by default the results are sorted by that score. The result ranking in Solr can be customized as per the application's requirements. This gives users the flexibility to search more specifically for relevant content. The page size can be set in the Apache Solr configuration. Because pagination limits how many results have to be computed and returned per request, Solr can respond faster than it would if it returned the full result set. Sorting enables Solr users to order the results on certain terms or attributes. For example, a user of an online shopping portal might sort the search results by increasing price.
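Pagination and sorting are typically controlled per request through the start, rows, and sort parameters. The sketch below shows how a 1-based page number could be translated into those parameters. The helper name page_params and the price field are assumptions made for illustration.

```python
from urllib.parse import urlencode


def page_params(query, page, page_size=10, sort=None):
    """Build request parameters for one page of results (page is 1-based)."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    params = {
        "q": query,
        "start": (page - 1) * page_size,  # offset of the first result on the page
        "rows": page_size,                # number of results per page
    }
    if sort:
        # For example "price asc"; when omitted, results are ordered by score.
        params["sort"] = sort
    return params


params = page_params("camera", page=3, page_size=20, sort="price asc")
print(urlencode(params))
```

Here the third page of 20 results starts at offset 40, so the request skips the first 40 matches.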

Facets for better browsing experience

Apache Solr facets not only help users refine their results using various attributes, but also provide a better browsing experience alongside the search interface. Apache Solr provides schema-driven, context-specific facets that help users discover more information quickly. Solr facets can be created based on the attributes of the schema that is designed before setting up the instance. Although Apache Solr works on a schema defined by the user, it allows flexibility in the schema by means of dynamic fields, enabling users to work with content of a dynamic nature.

Note

Based on the schema attributes, Apache Solr generates facet information at the time of indexing rather than from the stored values. This means that if we introduce new attributes in the schema after our information has been indexed, Solr will not be able to identify them. This can be solved by re-indexing the information.

Each of these facet elements contains a filter value, which carries a count of the results that match it among the searched results. For newly introduced schema attributes, users need to recreate the previously created indexes. There are different types of facets supported by Solr. The following screenshot depicts the different types of facets that are discussed:

Facets give you an aggregated view of your text data. These aggregations can be based on different criteria such as count (number of appearances), time, and so on. The following table describes the facets supported by Apache Solr:

| Facet | Description |
|---|---|
| Field-value | You can use your schema fields as facet components here. It shows the counts of the top field values. For example, if a document has tags, a field-value facet on the tag Solr field will show the top N tags found in the matched results, as shown in the image. |
| Range | Range faceting is mostly used on date/numeric fields, and it supports range queries. You can specify the start and end of the range, the gap within the range, and so on. There is a facet called date facet for managing dates, but it has been deprecated since Solr 3.2, and dates are now handled by range faceting itself. For example, if an indexed Solr document has a creation date and time, a range facet will provide filtering based on the time range. |
| Pivot | A pivot gives Solr users the ability to perform simple math on the data. With this facet, they can summarize results, and then sort them, take an average, and so on. This gives you hierarchical results (also sometimes called hierarchical faceting). |
| Multi-select | Using this facet, the results can be refined with multiple selections on attribute values. Users can apply multiple criteria to the search results with these facets. |
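The facet types in the table above are requested through query parameters. The following sketch assembles a parameter list covering a field-value facet, a range facet, a pivot facet, and a multi-select facet that uses the tag and ex local parameters. The field names (tags, created, make, resolution) are hypothetical, and the code only builds the parameters without sending a request.

```python
from urllib.parse import urlencode


def facet_params(query):
    """Return facet request parameters as (name, value) pairs; names may repeat."""
    return [
        ("q", query),
        ("facet", "true"),
        # Field-value facet: the top 5 values of the hypothetical tags field.
        ("facet.field", "tags"),
        ("facet.limit", "5"),
        # Range facet: one bucket per day over the last 30 days.
        ("facet.range", "created"),
        ("facet.range.start", "NOW/DAY-30DAYS"),
        ("facet.range.end", "NOW/DAY"),
        ("facet.range.gap", "+1DAY"),
        # Pivot facet: counts of resolution nested under each make.
        ("facet.pivot", "make,resolution"),
        # Multi-select: tag the filter, then exclude it when counting make values,
        # so the other makes stay visible after one has been selected.
        ("fq", "{!tag=mk}make:canon"),
        ("facet.field", "{!ex=mk}make"),
    ]


encoded = urlencode(facet_params("camera"))
print(encoded)
```

A list of pairs is used instead of a dictionary because facet.field appears more than once in the same request.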

Advanced search capabilities

Apache Solr provides various advanced search capabilities. Solr comes with a MoreLikeThis feature, which lets you find documents similar to one or more seed documents. The similarity is calculated from one or more fields of your choice. When the user asks for similar results, Solr takes the current search result and looks for similar documents across its complete index.
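A request for similar documents can be sketched as a set of MoreLikeThis parameters added to a normal query. In the example below, the seed document ID and the title and description fields are assumptions chosen for illustration.

```python
from urllib.parse import urlencode


def more_like_this_params(seed_query, fields, count=5):
    """Build parameters asking for documents similar to those matched by seed_query."""
    return {
        "q": seed_query,              # selects the seed document(s)
        "mlt": "true",                # enable MoreLikeThis for this request
        "mlt.fl": ",".join(fields),   # fields used to compute similarity
        "mlt.count": count,           # similar documents returned per seed
        "mlt.mintf": 1,               # minimum term frequency in the seed document
        "mlt.mindf": 1,               # minimum number of documents containing a term
    }


mlt = more_like_this_params("id:42", ["title", "description"])
print(urlencode(mlt))
```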

When the user passes a query, the search results can show snippets with the searched keywords highlighted. Highlighting in Solr allows fragments of documents that match the user's query to be included with the query response. Highlighting takes place only for the fields that are searched by the user. Solr provides a collection of highlighting utilities, which allow a great deal of control over which fields are used for fragments, the size of the fragments, and how they are formatted.
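The controls mentioned above (fields, fragment size, and formatting) correspond to highlighting request parameters. The following sketch builds such a parameter set. The title and body field names and the chosen sizes are assumptions for this example.

```python
from urllib.parse import urlencode


def highlight_params(query, fields, snippets=2, fragsize=120):
    """Build parameters that ask Solr to return highlighted fragments."""
    return {
        "q": query,
        "hl": "true",                 # enable highlighting
        "hl.fl": ",".join(fields),    # fields to generate fragments from
        "hl.snippets": snippets,      # maximum fragments per field
        "hl.fragsize": fragsize,      # approximate fragment size in characters
        "hl.simple.pre": "<b>",       # markup placed before a matched term
        "hl.simple.post": "</b>",     # markup placed after a matched term
    }


hl = highlight_params("body:solr", ["title", "body"])
print(urlencode(hl))
```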

When a search query is passed to Solr, it is matched against a number of results. By default, the order in which the results are displayed on the UI is based on the relevance of each result to the searched keyword(s). Relevance is a measure of how closely each result returned by Solr matches the searched keywords. This closeness can be measured in various ways. The relevance of a response depends upon the context in which the query was performed. A single search application may be used in different contexts by users with different needs and expectations. Apache Solr calculates relevance scores based on various factors, such as the number of occurrences of the searched keyword in the document, or the coordination factor, which relates to how many of the searched keywords are matched in the document. Solr not only gives users the flexibility to choose the scoring, but also allows them to customize the relevance ranking as per the enterprise search expectations.
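To make the two factors mentioned above concrete, here is a toy scoring sketch. It is a deliberate simplification and not Solr's actual scoring: it multiplies a coordination factor (the fraction of query terms found in the document) by the sum of the square roots of the term frequencies, and it ignores inverse document frequency, field norms, and boosts. The function name toy_score is hypothetical.

```python
import math


def toy_score(query_terms, doc_tokens):
    """Toy relevance: coordination factor times the sum of sqrt(term frequency)."""
    unique_terms = sorted(set(query_terms))
    if not unique_terms:
        return 0.0
    matched = 0
    tf_sum = 0.0
    for term in unique_terms:
        freq = doc_tokens.count(term)    # occurrences of the term in the document
        if freq:
            matched += 1
            tf_sum += math.sqrt(freq)    # dampen the effect of repeated terms
    coord = matched / len(unique_terms)  # fraction of query terms that matched
    return coord * tf_sum


docs = {
    "doc_a": "solr search solr solr solr".split(),
    "doc_b": "solr".split(),
}
query = ["solr", "search"]
ranked = sorted(docs, key=lambda name: toy_score(query, docs[name]), reverse=True)
print(ranked)
```

A document that matches both query terms ranks above one that matches only a single term, which is the effect the coordination factor is meant to capture.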

Apache Solr supports spell checking based on index proximity. There are multiple options available under this label: in one case, Solr provides suggestions for the misspelled word when searched; in the other case, Solr returns a suggestion to the user with a "Did you mean" prompt. The following screenshot shows an example of how these features would look on Apache Solr's client side:

Additionally, Apache Solr has a suggest feature that suggests query terms or phrases based on incomplete user input. As users start typing a few characters, they can choose from the list of suggestions. These completions come from the Solr index generated at the time of data indexing, taken from the top-k matches ranked by relevance, popularity, or alphabetical order. Consider the following screenshot:

In many enterprises, location-based information along with text data brings value in terms of visual representation. Apache Solr supports geospatial search, which lets users sort the results by geographical distance (computed from longitude and latitude) or rank the results by proximity. This capability comes from the Lucene spatial module.
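A distance-based search can be sketched with the geofilt filter and the geodist function. In the example below, the location field name and the coordinates are assumptions, and the radius is interpreted in kilometers. Note that the pt parameter is written as latitude followed by longitude.

```python
from urllib.parse import urlencode


def geo_params(lat, lon, radius_km, field="location"):
    """Build parameters that filter by distance and sort nearest first."""
    return {
        "q": "*:*",
        "fq": "{!geofilt}",          # keep documents within the given radius
        "sfield": field,             # hypothetical spatial field in the schema
        "pt": f"{lat},{lon}",        # center point as latitude,longitude
        "d": str(radius_km),         # radius in kilometers
        "sort": "geodist() asc",     # nearest documents first
    }


geo = geo_params(45.15, -93.85, 5)
print(urlencode(geo))
```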

Enterprises are not limited to a single language and often have a landscape of non-English applications used daily by their employees. Sometimes, the documentation is written in local languages. In such cases, an enterprise search needs to work across various languages instead of limiting itself to one. Apache Solr has built-in language detection and provides language-specific text analysis for many languages. Often, implementers still need to customize the Solr instance to meet their specific requirements for multilingual support.

Administration

Like any other enterprise search platform, Apache Solr provides system administrators with various capabilities. This section discusses the features supported at the administration level in Apache Solr.

Apache Solr has a built-in administration user interface for administrators and Solr developers. The administration screen has evolved over successive releases, and Version 4.6 contains many advanced features. The administration screen in Solr looks like the following screenshot:

The Admin UI provides a dashboard that shows information about the instance and the system. The logging section shows the output of the Apache logging service (log4j) at various log levels such as warning, severe, and error. The core admin UI presents management information about the different cores. The thread dump screen shows all threads with their CPU time and thread time. Administrators can also see the stack trace for each thread.

A collection represents a complete logical index, whereas a Solr core represents an index within a Solr instance, together with its configuration and runtime. Typically, the configuration of a Solr core is kept in the /conf directory. Once users select a core, they get access to various core-specific functions such as the current configuration view, a test UI for testing the various Solr handlers, and the schema browser. Consider the following features:

  • JMX monitoring: The Java Management Extensions (JMX) technology provides tools for managing and monitoring web-based, distributed systems. Since Version 3.1, Apache Solr can expose the statistics of runtime activities as dynamic Managed Beans (MBeans). The beans can be viewed in any JMX client (for example, JConsole). MBeans are added with every release, and administrators can see the collective list of these MBeans using the administration interface (typically by accessing http://localhost:8983/solr/admin/mbeans/).

  • Near real time search: Unlike Google's lazy index updates, which depend on when a crawler happens to visit a page, enterprise search at times requires fast index updates as the underlying content changes. In other words, users want to search the data in near real time. Apache Solr supports this through soft commits.

    Note

    Whenever users upload documents to the Solr server, they must run a commit operation to ensure that the uploaded documents are stored in the Solr repository. A soft commit, a feature introduced in Solr 4.0, allows users to commit quickly by bypassing the costly parts of the commit procedure, making the data available for near real-time search.

    With a soft commit, the information is available immediately for searching; however, a normal commit is still required to ensure that the document is written to a persistent store. Solr administrators can also enable automatic soft commits (autoSoftCommit) through the Apache Solr configuration.

  • Flexible query parsing: In Apache Solr, query parsers play an important role in parsing the query and allowing the search to apply the outcome to the indexes to identify whether the search keywords match. A parser may let Solr users customize how search keywords are handled, for example, by supporting regular expressions or by enabling complex querying through the search interface. Apache Solr supports several query parsers by default, giving enterprise architects flexibility in controlling how queries are parsed. We will look at them in detail in the upcoming chapters.

  • Caching: Apache Solr is capable of searching large datasets. When such searches are performed, response time and resource cost become important factors for scalability. Apache Solr caches at various levels to ensure that users get optimal performance out of the running instance. Caching can be performed at the filter level (mainly used for filtering), on field values (mainly used in facets), on query results (top-k results are cached in a certain order), and at the document level. Each cache implementation follows a different caching strategy, such as least recently used or least frequently used. Administrators can choose one of the available cache mechanisms for their search application.

  • Integration: Typically, enterprise search user interfaces appear as a part of the end user applications, as they only occupy limited screens. The open source Apache Solr community provides client libraries for integrating Solr with various technologies in the client-server model. Solr supports integration through different languages such as Ruby, PHP, .NET, Java, Scala, Perl, and JavaScript. Besides programming languages, Solr also integrates with applications, such as Drupal, WordPress, and Alfresco CMS.
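Returning to the soft commit described under near real time search above, a client can request either kind of commit through a parameter on the update request. The sketch below builds the corresponding update URLs. The host and the core name (collection1) are assumptions, and no request is actually sent.

```python
from urllib.parse import urlencode


def update_url(core="collection1", soft=True, host="http://localhost:8983"):
    """Return an update URL that asks for a soft commit or a normal (hard) commit."""
    # softCommit=true makes documents searchable quickly;
    # commit=true also flushes them to persistent storage.
    flag = "softCommit" if soft else "commit"
    return f"{host}/solr/{core}/update?" + urlencode({flag: "true"})


print(update_url())            # soft commit for near real time visibility
print(update_url(soft=False))  # normal commit for durability
```

A common pattern is to use soft commits frequently for visibility and normal commits less often for durability, which matches the trade-off described in the note.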