Mastering Apache Solr 7.x

By: Sandeep Nair, Chintan Mehta, Dharmesh Vasoya

Overview of this book

Apache Solr is a standalone enterprise search server with a REST-like application interface, providing highly scalable, distributed search and index replication for many of the world's largest internet sites. To begin with, you will be introduced to full-text search, multiple-filter search, dynamic clustering, and so on, helping you brush up on the basics of Apache Solr. You will also explore the new features and advanced options released in Apache Solr 7.x, which bring numerous performance improvements and make data investigation simpler, easier, and more powerful. You will learn to build complex queries and extensive filters, and how they are compiled in your system to bring relevance to your search tools. You will learn how Solr scoring works, which elements affect the document score, and how you can optimize or tune the score for the application at hand. You will learn to extract features of documents and write complex queries for re-ranking documents. You will also learn advanced options that help you know what content is indexed and how the extracted content is indexed. Throughout the book, you will go through complex problems with solutions, along with varied approaches to tackle your business needs. By the end of this book, you will have gained advanced proficiency to build out-of-the-box smart search solutions for your enterprise demands.

What's new in Solr 7?


A major release of Solr brings many new features. Overall, Solr 7 introduces 51 new features, ranging from small to major, along with many bug fixes, optimizations, and updates. Let us go through some of the major changes introduced in Solr 7.

Replication for SolrCloud

Before we understand the new replication methods introduced in Solr 7, let's go through what was available for replication before Solr 7.

Until Solr 7, Solr had two options for replication purposes:

  • Master-slave replication or index replication
  • SolrCloud

In master-slave replication, also known as index replication, the master shares a copy of the indexed data with one or more slave servers. The master server's job is to index the data that is being added to Solr and share it with the slave servers, while all read operations are performed on the slaves.

SolrCloud is a clustered environment of Solr that provides high availability and failover capability, so that the content indexed using Solr can be distributed among multiple servers for scaling. In SolrCloud, one of the servers acts as the leader for a shard and the rest of the servers holding that shard work as replica shards. Until Solr 7, if there was any issue on the leader server, any of the replica servers could take over as leader and re-form the leader-replica cluster. For that to work, data had to be shared with each of the nodes in the cluster, as leader shards and replica shards must remain in sync at all times, so each replica node performed the same operations as the leader. This replication method, the one available in SolrCloud before Solr 7, is known as NRT (near real-time) replicas.

In Solr 7, two new replication methods have been introduced:

  • TLOG replicas
  • PULL replicas

TLOG replicas

TLOG stands for transaction log. Instead of indexing the data again locally, a TLOG replica records incoming updates in its own transaction log and copies the already-indexed segments from the leader shard using the replication handler. If the leader shard fails, one of the TLOG replicas can take over as leader: it replays its transaction log and then performs real-time indexing. When it is no longer the leader, it goes back to replica mode and performs only binary replication of segments. Replication done using TLOG replicas is not as real-time as replication done using NRT replicas.

PULL replicas

A PULL replica only pulls the index data from the leader shard; it neither indexes data locally, as NRT replicas do, nor maintains a transaction log, as TLOG replicas do. If the leader shard fails, a PULL replica cannot become the new leader; only a TLOG or NRT replica can. PULL replicas provide faster data replication from leader shards to replica shards.
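As a rough illustration of how the three replica types can be requested together, the following Python sketch builds (but does not send) a Collections API CREATE URL. The nrtReplicas, tlogReplicas, and pullReplicas parameters are the ones the Collections API accepts in Solr 7; the host, the collection name books, and the replica counts are assumptions chosen for the example.

```python
from urllib.parse import urlencode


def build_create_collection_url(base_url, name, num_shards, nrt=0, tlog=0, pull=0):
    """Return a Collections API CREATE URL that asks for a mix of replica types.

    base_url and name are placeholders; nothing is sent over the network.
    """
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "nrtReplicas": nrt,    # replicas that index locally (pre-Solr 7 behavior)
        "tlogReplicas": tlog,  # replicas that keep a transaction log and copy segments
        "pullReplicas": pull,  # replicas that only copy segments; never become leader
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Hypothetical cluster: 2 shards, each with 2 TLOG replicas and 1 PULL replica.
url = build_create_collection_url(
    "http://localhost:8983/solr", "books", num_shards=2, tlog=2, pull=1
)
print(url)
```

A layout like this keeps leader-eligible TLOG replicas for failover while PULL replicas serve only queries, which is one way to read the trade-offs described above.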

Schemaless improvements

Solr 7 improves schemaless mode. When the value of an incoming field is detected as text, the field is now indexed as text_general by default, which can be modified if required. The name of the field will be the one defined in the document. When a collection is created without specifying a config set, schemaless mode is in effect and a copy field rule is also added to the schema. This rule inserts the first 256 characters of the text field into a new strings field named <name>_str.

The relevant schemaless behavior can be customized to remove a copy field rule or to update the number of characters added into the strings field or type of field used.

Copy field rules can increase the index size as well as slow down the indexing process, so it is recommended to use the copy field rule only when it is required. If there is no need to sort or facet on a field, you should ideally disable the copy field rule that is generated automatically.
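One possible way to drop an automatically generated rule for a single field is the Schema API's delete-copy-field command. The Python sketch below only builds the request, it does not send it; the host, the collection name books, and the field name title are assumptions for illustration. It removes the rule for one existing field and does not stop rules from being generated for fields created later.

```python
import json
from urllib.request import Request


def build_delete_copy_field_request(base_url, collection, field):
    """Build a Schema API request that deletes the <field> -> <field>_str rule.

    Only the request object is constructed; pass it to urllib.request.urlopen
    against a running Solr to actually apply it.
    """
    payload = {"delete-copy-field": {"source": field, "dest": f"{field}_str"}}
    return Request(
        f"{base_url}/{collection}/schema",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical collection and field names.
req = build_delete_copy_field_request("http://localhost:8983/solr", "books", "title")
print(req.full_url)
print(req.data.decode("utf-8"))
```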

The field creation rule can be disabled via the update.autoCreateFields property. You can also use the configuration API with the following command to disable it:

curl http://hostname:8983/solr/collection/config -d '{"set-user-property": {"update.autoCreateFields":"false"}}'

Autoscaling

As stated in the documentation at http://lucene.apache.org, the goal of autoscaling is to make SolrCloud cluster management easier by providing a way for changes to the cluster to be more automatic and more intelligent.

To that end, Solr 7 provides APIs that monitor the cluster against predefined preferences and policies. If any of the rules provided in the policies are violated, Solr changes its configuration automatically, guided by the preferences. With the updated autoscaling feature, we can now have Solr spin up new replicas depending on monitored metrics, such as disk space.
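To make the idea of preferences and policies more concrete, here is a hedged Python sketch that builds two JSON command bodies in the style of the Solr 7 autoscaling API (set-cluster-preferences and set-cluster-policy). The specific rule values are assumptions for illustration, and the endpoint named in the comment should be checked against the reference guide for your exact Solr version before use.

```python
import json


def cluster_preferences_command():
    """Preferences: favor nodes with fewer cores, then nodes with more free disk."""
    return {
        "set-cluster-preferences": [
            {"minimize": "cores"},
            {"maximize": "freedisk", "precision": 10},
        ]
    }


def cluster_policy_command(max_replicas_per_node=1):
    """Policy: no node may hold more than max_replicas_per_node replicas of a shard."""
    return {
        "set-cluster-policy": [
            {
                "replica": f"<{max_replicas_per_node + 1}",
                "shard": "#EACH",
                "node": "#ANY",
            }
        ]
    }


# Assumption: these bodies would be POSTed as JSON to the autoscaling endpoint
# (for example /solr/admin/autoscaling); nothing is sent here.
preferences_body = json.dumps(cluster_preferences_command())
policy_body = json.dumps(cluster_policy_command())
print(preferences_body)
print(policy_body)
```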

Default numeric types

From Solr 7 onwards, Trie*-based numeric fields are replaced by *PointField types. The Trie* field types are now deprecated and are expected to be removed in Solr 8, so you need to work towards moving your schema from Trie* fields to the new *PointField types. After changing a field to a *PointField type, the data will need to be reindexed in Solr.
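The move from Trie* classes to their *PointField counterparts can be sketched as a simple lookup. The class names below are the standard Solr field type classes; the helper function itself is hypothetical and only rewrites a class name string, it does not touch a schema or reindex anything.

```python
# Deprecated Trie* field type classes and their *PointField replacements.
TRIE_TO_POINT = {
    "solr.TrieIntField": "solr.IntPointField",
    "solr.TrieLongField": "solr.LongPointField",
    "solr.TrieFloatField": "solr.FloatPointField",
    "solr.TrieDoubleField": "solr.DoublePointField",
    "solr.TrieDateField": "solr.DatePointField",
}


def migrate_field_type_class(class_name):
    """Return the *PointField class for a Trie* class, or the input unchanged."""
    return TRIE_TO_POINT.get(class_name, class_name)


for old in ("solr.TrieIntField", "solr.TrieDateField", "solr.StrField"):
    print(old, "->", migrate_field_type_class(old))
```

After the schema is updated with the new classes, the affected data still has to be reindexed.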

Spatial fields

Here is the list of spatial fields that have been deprecated:

  • SpatialVectorFieldType
  • SpatialTermQueryPrefixTreeFieldType
  • LatLonType
  • GeoHashField

The following is the list of spatial fields that can be used moving forward:

  • SpatialRecursivePrefixTreeField
  • RptWithGeometrySpatialField
  • LatLonPointSpatialField

SolrJ

Here are the changes made in SolrJ:

  • HttpClientInterceptorPlugin is replaced with HttpClientBuilderPlugin, which works with a SolrHttpClientBuilder rather than the HttpClientConfigurer used earlier.
  • HttpClientUtil now configures HttpClient instances through SolrHttpClientBuilder rather than the earlier HttpClientConfigurer.
  • The SOLR_AUTHENTICATION_CLIENT_BUILDER environment variable is now used instead of SOLR_AUTHENTICATION_CLIENT_CONFIGURER.
  • HttpSolrClient#setMaxTotalConnections and HttpSolrClient#setDefaultMaxConnectionsPerHost have been removed. These limits now default to high values and can be changed only through parameters supplied when an HttpClient instance is created.

JMX and MBeans

Here are the changes made in Java Management Extensions (JMX) and MBeans:

  • Metric names in MBean attributes now follow a hierarchical format. For reference, have a look at /admin/plugins and /admin/mbeans. The Plugins tab in the UI follows the same approach, as all of these APIs now fetch their data from the metrics API. The earlier flat JMX view has been removed.
  • <metrics><reporter> elements, which need to be defined in the solr.xml configuration file, have replaced the <jmx> element in solrconfig.xml. Limited backward compatibility is provided: a default instance of SolrJmxReporter is added automatically when a local MBean server is discovered. To enable a local MBean server, we can use ENABLE_REMOTE_JMX_OPTS in the solr.in.sh configuration file or system properties such as -Dcom.sun.management.jmxremote. This default instance exports all Solr metrics from all registries.
  • To disable this default behavior, specify a SolrJmxReporter configuration with a Boolean argument set to false. Backward compatibility support for SolrJmxReporter might be removed in Solr 8.

Other changes

Apart from these changes, there are many other features and improvements that have been made in Solr 7:

  • In Solr 7, the default response format is JSON; previously it was XML. If you want a response in XML, you will need to define wt=xml in the request parameters.
  • The default value of the legacyCloud parameter is now false. This means that if no entry is found for a replica in state.json, the replica will not be registered in the cluster shard.
  • By default, a new incoming field will be indexed as text_general. The name of the field will be the same as defined in the incoming document.
  • The _default config set is introduced to replace data_driven_schema_configs and basic_configs. When a new collection is created and no configuration is specified, the _default configuration is used. In SolrCloud mode, the _default configuration is taken from ZooKeeper if no configuration parameter is defined, while in standalone mode, instanceDir is created using the _default configuration.
  • SolrClient implementations now have their own configuration for timeouts. Socket and connect timeouts no longer depend on HttpClient and can be defined specifically for the SolrClient.
  • In SolrJ, HttpSolrClient#setAllowCompression, which was earlier used to enable compression, has been removed. Compression must now be enabled through a constructor parameter only.
  • A new V2 Application Programming Interface (API) is available at /api/ and is the preferred method; the old API at /solr/ continues to be available.
  • The standard query parser now defaults to sow=false, which means that the query text is not split on whitespace before it is handed to the analyzer. This helps the analyzer match multi-word synonyms.
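As a small illustration of two of the request-level defaults listed above, the following Python sketch builds select URLs that restore the older behavior: wt=xml for XML responses and sow=true to split the query on whitespace. The host, the collection name books, and the query text are assumptions; the URLs are only constructed, not requested.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, query, wt=None, sow=None):
    """Build a /select URL, adding wt and sow only when explicitly requested."""
    params = {"q": query}
    if wt is not None:
        params["wt"] = wt  # response writer, e.g. "xml" instead of the JSON default
    if sow is not None:
        params["sow"] = "true" if sow else "false"  # split on whitespace
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Hypothetical collection and query.
default_url = build_select_url("http://localhost:8983/solr", "books", "title:solr cloud")
legacy_url = build_select_url(
    "http://localhost:8983/solr", "books", "title:solr cloud", wt="xml", sow=True
)
print(default_url)
print(legacy_url)
```

Leaving both parameters unset relies on the Solr 7 defaults (JSON responses, sow=false).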