Mastering Apache Solr 7.x

By: Sandeep Nair, Chintan Mehta, Dharmesh Vasoya

Overview of this book

Apache Solr is a standalone enterprise search server with a REST-like application interface, providing highly scalable, distributed search and index replication for many of the world's largest internet sites. This book begins by introducing full-text search, multiple filter search, dynamic clustering, and other basics of Apache Solr. You will then explore the new features and advanced options released in Apache Solr 7.x, which improve performance and make data investigation simpler, easier, and more powerful. You will learn to build complex queries and extensive filters, and see how they are compiled in your system to bring relevance to your search tools. You will learn how Solr scoring works, which elements affect the document score, and how you can tune the score for the application at hand. You will learn to extract features of documents and write complex queries for re-ranking documents. You will also learn advanced options that help you understand what content is indexed and how the extracted content is indexed. Throughout the book, you will work through complex problems with solutions, along with varied approaches to tackle your business needs. By the end of this book, you will have gained advanced proficiency to build out-of-the-box smart search solutions for your enterprise demands.

What's new in Solr 7?

Solr 7 is a major release that introduces many new features: overall, 51 new features ranging from small to major, along with many bug fixes, optimizations, and updates. Let us go through some of the major changes introduced in Solr 7.

Replication for SolrCloud

Before we understand the new replication methods introduced in Solr 7, let's go through what was available for replication before Solr 7.

Until Solr 7, Solr had two options for replication purposes:

Master-slave replication or index replication
SolrCloud

In master-slave replication, also known as index replication, the master shares a copy of indexed data with one or more slave servers. The master server's job is to index the data that is being added into Solr and share it with the slave servers while all read operations are performed in the slaves.

SolrCloud is a clustered environment of Solr that provides high availability and failover capability, so that the content indexed using Solr can be distributed among multiple servers for scaling. In SolrCloud, one of the servers acts as the leader and the rest of the servers in the cluster work as replicas. Until Solr 7, if the leader server had any issue, any of the replica servers could take over as leader. To make that possible, data had to be shared with every node in the cluster, because the leader and its replicas must remain in sync at all times, and each replica node performed the same operations as the leader. This replication method, the only one available in SolrCloud before Solr 7, is known as NRT (near real-time) replicas.

In Solr 7, two new replication methods have been introduced:

TLOG replicas
PULL replicas

TLOG replicas

TLOG replica means transaction log replica. Instead of indexing the data again locally, a TLOG replica maintains a transaction log of incoming updates and copies the segments (the indexed data) from the leader shard using the replication handler. If the leader shard fails, one of the TLOG replicas can become the leader; it first processes its transaction log and then performs real-time indexing, as an NRT replica would. When it is no longer the leader, it goes back to replica mode and performs only binary replication of segments. Replication done using the TLOG method is not as real-time as replication done using NRT replicas.

PULL replicas

A PULL replica pulls the index data from the leader shard instead of indexing data locally as NRT replicas do, or maintaining a transaction log as TLOG replicas do. If the leader shard fails, a PULL replica cannot become the new leader; only a TLOG or NRT replica can. PULL replicas provide faster data replication from leader shards to replica shards.
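
The replica types described above are chosen when a collection is created. The following is a minimal sketch, assuming the Collections API CREATE action with its nrtReplicas, tlogReplicas, and pullReplicas parameters; the host, port, and collection name are hypothetical, and the code only builds the request URL rather than calling a server.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards, nrt=0, tlog=0, pull=0):
    """Build a Collections API CREATE URL with per-type replica counts."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "nrtReplicas": nrt,    # index locally, leader-eligible
        "tlogReplicas": tlog,  # keep a transaction log, leader-eligible
        "pullReplicas": pull,  # only pull segments, never leader-eligible
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Hypothetical host and collection name: two TLOG replicas and one PULL
# replica per shard, so a leader can still be elected if one node fails.
url = create_collection_url("http://localhost:8983/solr", "products", 2, tlog=2, pull=1)
print(url)
```

Because a PULL replica can never become leader, a layout like this one keeps at least one TLOG or NRT replica per shard.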

Schemaless improvements

Solr has improved its schemaless mode functionality. Incoming fields are now detected as text by default and indexed as text_general, which can be modified if required. The name of the field will be the one defined in the document. A copy field rule is also added to the schema when a collection is created without a config set being specified. This rule copies the first 256 characters of the text field into a new strings field named <name>_str.

The relevant schemaless behavior can be customized to remove a copy field rule or to update the number of characters added into the strings field or type of field used.

Copy field rules can increase the index size as well as slow down the indexing process, so use a copy field rule only when it is required. If there is no need to sort or facet on a field, you should ideally remove the copy field rule that is generated automatically.

The field creation rule can be disabled via the update.autoCreateFields property. You can also use the configuration API with the following command to disable it:

curl http://hostname:8983/solr/collection/config -d '{"set-user-property": {"update.autoCreateFields":"false"}}'
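
The same request can be prepared from a script. This is a minimal sketch using only the Python standard library; it mirrors the curl command above, the host and collection name are placeholders, and the request is built but not sent.

```python
import json
from urllib.request import Request


def disable_auto_create_fields(base_url, collection):
    """Prepare (but do not send) the Config API request from the curl command."""
    body = {"set-user-property": {"update.autoCreateFields": "false"}}
    return Request(
        url=f"{base_url}/{collection}/config",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder host and collection name.
req = disable_auto_create_fields("http://localhost:8983/solr", "mycollection")
print(req.full_url)
print(req.data.decode("utf-8"))
# To send it to a running Solr instance: urllib.request.urlopen(req)
```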

Autoscaling

As stated in the documentation at http://lucene.apache.org, the goal of autoscaling is to make SolrCloud cluster management easier by providing a way for changes to the cluster to be more automatic and more intelligent.

To that end, Solr 7 adds APIs that monitor predefined preferences and policies. If any of the rules in the policies are violated, Solr changes the cluster automatically, guided by the preferences. With the autoscaling feature, we can now have Solr spin up new replicas depending on monitored metrics, such as disk space.
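
To make preferences and policies more concrete, here is a minimal sketch of the JSON payloads involved. It assumes the set-cluster-preferences and set-cluster-policy commands of the autoscaling API and a v1 endpoint at /solr/admin/autoscaling; the specific preference and rule values are illustrative choices, not recommendations, and nothing is sent to a server.

```python
import json

# Assumed v1 endpoint path, relative to http://<host>:8983/solr
AUTOSCALING_PATH = "/admin/autoscaling"


def cluster_preferences():
    """Prefer nodes with fewer cores first, then nodes with more free disk."""
    return {"set-cluster-preferences": [{"minimize": "cores"}, {"maximize": "freedisk"}]}


def cluster_policy(max_replicas_per_node=1):
    """Rule: each shard may place at most N of its replicas on any one node."""
    rule = {"replica": f"<{max_replicas_per_node + 1}", "shard": "#EACH", "node": "#ANY"}
    return {"set-cluster-policy": [rule]}


# Each payload would be POSTed as JSON to the autoscaling endpoint.
for payload in (cluster_preferences(), cluster_policy(1)):
    print(AUTOSCALING_PATH, json.dumps(payload))
```

Preferences express how Solr should rank candidate nodes, while policy rules are the constraints whose violation triggers a change.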

Default numeric types

Trie*-based numeric fields are replaced by *PointField types from Solr 7 onwards. The Trie* types are now deprecated and are slated for removal from Solr 8 onwards. You need to work towards moving from Trie* fields to the new *PointField types in your schema. After changing to the new *PointField types, data will need to be reindexed in Solr.
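
The move from Trie* to *PointField types can be sketched as a simple class mapping plus a Schema API command. This is an illustrative sketch, assuming the Schema API's replace-field-type command posted to /solr/<collection>/schema; the field type name tint is a hypothetical example, and reindexing is still required afterwards.

```python
import json

# Deprecated Trie* classes and their *PointField counterparts.
TRIE_TO_POINT = {
    "solr.TrieIntField": "solr.IntPointField",
    "solr.TrieLongField": "solr.LongPointField",
    "solr.TrieFloatField": "solr.FloatPointField",
    "solr.TrieDoubleField": "solr.DoublePointField",
    "solr.TrieDateField": "solr.DatePointField",
}


def replace_field_type_command(name, trie_class, doc_values=True):
    """Build a Schema API command that swaps a Trie* type for its Point type."""
    return {
        "replace-field-type": {
            "name": name,
            "class": TRIE_TO_POINT[trie_class],
            "docValues": doc_values,
        }
    }


# Hypothetical field type name "tint" that currently uses TrieIntField.
command = replace_field_type_command("tint", "solr.TrieIntField")
print(json.dumps(command))
```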

Spatial fields

Here is the list of spatial fields that have been deprecated:

SpatialVectorFieldType
SpatialTermQueryPrefixTreeFieldType
LatLonType
GeoHashField

The following is the list of spatial fields that can be used moving forward:

SpatialRecursivePrefixTreeFieldType
RptWithGeometrySpatialField
LatLonPointSpatialField
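
As an illustration of one of the recommended types, the sketch below builds a Schema API add-field-type command for LatLonPointSpatialField and the parameters for a geofilt distance filter on such a field. The type name, field name, coordinates, and distance are all hypothetical, and the code only constructs payloads.

```python
from urllib.parse import urlencode


def add_latlon_field_type(name="location"):
    """Schema API command defining a LatLonPointSpatialField-based type."""
    return {
        "add-field-type": {
            "name": name,
            "class": "solr.LatLonPointSpatialField",
            "docValues": True,
        }
    }


def geofilt_params(field, lat, lon, distance_km):
    """Query parameters that keep documents within distance_km of a point."""
    return {
        "q": "*:*",
        "fq": f"{{!geofilt sfield={field} pt={lat},{lon} d={distance_km}}}",
    }


# Hypothetical field "store" and search point.
params = geofilt_params("store", 45.15, -93.85, 5)
print(add_latlon_field_type())
print(urlencode(params))
```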

SolrJ

Here are the changes made in SolrJ:

HttpClientInterceptorPlugin is replaced with HttpClientBuilderPlugin, which works with a SolrHttpClientBuilder rather than the HttpClientConfigurer used earlier.
HttpClientUtil now configures HttpClient instances with the help of SolrHttpClientBuilder rather than the earlier HttpClientConfigurer.
The SOLR_AUTHENTICATION_CLIENT_BUILDER environment variable is now used instead of SOLR_AUTHENTICATION_CLIENT_CONFIGURER.
HttpSolrClient#setMaxTotalConnections and HttpSolrClient#setDefaultMaxConnectionsPerHost have been removed. These limits now default to high values and can be changed only through parameters passed when an HttpClient instance is created.

JMX and MBeans

Here are the changes made in Java Management Extensions (JMX) and MBeans:

Metric names in MBean attributes now follow a hierarchical format; for reference, have a look at the output of /admin/plugins and /admin/mbeans. The Plugins tab in the UI uses a similar approach, as all of these APIs now fetch their data from the metrics API. The earlier flat JMX view has been removed.
The <jmx> element in solrconfig.xml has been replaced by <metrics><reporter> elements, which need to be defined in the solr.xml configuration file. For limited backward compatibility, a default instance of SolrJmxReporter is added automatically when a local MBean server is discovered. To enable a local MBean server, we can use ENABLE_REMOTE_JMX_OPTS in the solr.in.sh configuration file or system properties such as -Dcom.sun.management.jmxremote. The default instance exports all Solr metrics from all registries.
To disable this behavior, we can specify a SolrJmxReporter configuration with a Boolean argument set to false. Backward compatibility support for SolrJmxReporter might be removed in Solr 8.
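
The metrics API mentioned above can also be queried directly. The following is a minimal sketch, assuming the /admin/metrics endpoint with its group and prefix filter parameters; the host and the chosen filter values are illustrative, and the code only builds the URL.

```python
from urllib.parse import urlencode


def metrics_url(base_url, group=None, prefix=None):
    """Build a metrics API URL, optionally filtered by registry group and name prefix."""
    params = {"wt": "json"}
    if group:
        params["group"] = group
    if prefix:
        params["prefix"] = prefix
    return f"{base_url}/admin/metrics?{urlencode(params)}"


# Illustrative filter: heap memory metrics from the JVM registry.
url = metrics_url("http://localhost:8983/solr", group="jvm", prefix="memory.heap")
print(url)
```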

Other changes

Apart from these changes, there are many other features and improvements that have been made in Solr 7:

In Solr 7, the default response format is JSON; previously it was XML. If you want a response in XML, you need to define wt=xml in the request parameters.
The default value of the legacyCloud parameter is now false. That means that if an entry for a replica is not found in state.json, the replica will not be registered in the cluster.
By default, a new incoming field is indexed as text_general. The name of the field is the same as the one defined in the incoming document.
The _default config set is introduced to replace data_driven_schema_configs and basic_configs. When a new collection is created without specifying a config set, the _default config set is used. In SolrCloud, the _default config set stored in ZooKeeper is used if no config set is specified, while in standalone mode, instanceDir is created using the _default config set.
SolrClient implementations now have their own timeout configuration. Socket and connect timeouts no longer depend on the HttpClient defaults and can be defined specifically for a SolrClient.
In SolrJ, HttpSolrClient#setAllowCompression, which was earlier used to enable compression, has been removed. Compression must now be enabled through a constructor parameter.
The new V2 Application Program Interface (API) is available at /api/ as the preferred method, while the old API at /solr/ continues to be available.
The standard query parser now defaults to sow=false, which means that query text is not split on whitespace before being handed to the analyzer. This helps the analyzer match multi-word synonyms.
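
Two of the changed defaults, the response format and sow, can be overridden per request. This is a minimal sketch that builds a select URL with explicit wt and sow parameters; the host, collection name, and query are hypothetical.

```python
from urllib.parse import urlencode


def select_url(base_url, collection, query, wt=None, sow=None):
    """Build a /select URL, overriding the response writer and sow if given."""
    params = {"q": query}
    if wt is not None:
        params["wt"] = wt  # for example "xml" to get the pre-Solr 7 format
    if sow is not None:
        params["sow"] = "true" if sow else "false"
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Hypothetical collection and query: XML output with whitespace splitting on.
url = select_url("http://localhost:8983/solr", "techproducts", "name:ipod", wt="xml", sow=True)
print(url)
```

Omitting both arguments leaves the Solr 7 defaults in effect: a JSON response and sow=false.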
