Book Image

Solr Cookbook - Third Edition - Third Edition

By : Rafal Kuc
Book Image

Solr Cookbook - Third Edition - Third Edition

By: Rafal Kuc

Overview of this book

This book is for intermediate Solr Developers who are willing to learn and implement Pro-level practices, techniques, and solutions. This edition will specifically appeal to developers who wish to quickly get to grips with the changes and new features of Apache Solr 5.
Table of Contents (18 chapters)
Solr Cookbook Third Edition
Credits
About the Author
Acknowledgments
About the Reviewers
www.PacktPub.com
Preface
Index

Using Solr in a schemaless mode


Many use cases allow us to define our index structure upfront. We can look at the data, see which parts are important, which we want to search, how we want to do it, and finally, we can create the schema.xml file that we will use. However, this is not always possible. Sometimes, you don't know the data structure before you go into production, or you know very little about it. Of course, we can use dynamic fields, but such an approach is limited. This is why the newest versions of Solr allow us to use the so-called schemaless mode in which Solr is able to guess the type of data and create a field for it.

How to do it...

Let's assume that we don't know anything about the data and we want to fully rely on Solr when it comes to it.

  1. To do this, we start with the schema.xml file—the fields section of it. We need to include two fields, so our schema.xml file looks as follows:

    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="_version_" type="long" indexed="true" stored="true"/>
  2. In addition to this, we need to specify the unique identifier. We do this by including the following section in the schema.xml file:

    <uniqueKey>id</uniqueKey>
  3. In addition, we need to have the field types defined. To do this we add a section that looks as follows:

    <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="booleans" class="solr.BoolField" sortMissingLast="true" multiValued="true"/>
    <fieldType name="tlongs" class="solr.TrieLongField" precisionStep="8" positionIncrementGap="0" multiValued="true"/>
    <fieldType name="tdoubles" class="solr.TrieDoubleField" precisionStep="8" positionIncrementGap="0" multiValued="true"/>
    <fieldType name="tdates" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0" multiValued="true"/>
    
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100" multiValued="true">
     <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
     <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
    </fieldType>
  4. Now, we can switch to the solrconfig.xml file to add the so-called managed index schema. We do this by adding the following configuration snippet to the root section of the solrconfig.xml file:

    <schemaFactory class="ManagedIndexSchemaFactory">
     <bool name="mutable">true</bool>
     <str name="managedSchemaResourceName">managed-schema</str>
    </schemaFactory>
  5. We alter our update request handler to include additional update chains (we can just alter the same section in the solrconfig.xml file we already have):

    <requestHandler name="/update" class="solr.UpdateRequestHandler">
     <lst name="defaults">
      <str name="update.chain">add-unknown-fields</str>
     </lst>
    </requestHandler>
  6. Finally, we define the used update request processor chain by adding the following section to the solrconfig.xml file:

    <updateRequestProcessorChain name="add-unknown-fields">
     <processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/>
     <processor class="solr.ParseBooleanFieldUpdateProcessorFactory"/>
     <processor  class="solr.ParseLongFieldUpdateProcessorFactory"/>
     <processor class="solr.ParseDoubleFieldUpdateProcessorFactory"/>
     <processor class="solr.ParseDateFieldUpdateProcessorFactory">
      <arr name="format">
       <str>yyyy-MM-dd</str>
      </arr>
     </processor>
     <processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
      <str name="defaultFieldType">text</str>
      <lst name="typeMapping">
       <str name="valueClass">java.lang.Boolean</str>
       <str name="fieldType">booleans</str>
      </lst>
      <lst name="typeMapping">
       <str name="valueClass">java.util.Date</str>
       <str name="fieldType">tdates</str>
      </lst>
      <lst name="typeMapping">
       <str name="valueClass">java.lang.Long</str>
       <str name="valueClass">java.lang.Integer</str>
       <str name="fieldType">tlongs</str>
      </lst>
      <lst name="typeMapping">
       <str name="valueClass">java.lang.Number</str>
       <str name="fieldType">tdoubles</str>
      </lst>
     </processor>
     <processor class="solr.LogUpdateProcessorFactory"/>
     <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

    Now, if we index a document, it looks like this:

    <add>
     <doc>
      <field name="id">1</field>
      <field name="title">Test document</field>
      <field name="published">2014-04-21</field>
      <field name="likes">12</field>
     </doc>
    </add>

    Solr will index it without any problem, creating fields such as titles, likes, or published, with a proper format. We can check them by running a q=*:* query, which will result in the following response:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
     <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">1</int>
      <lst name="params">
       <str name="q">*:*</str>
      </lst>
     </lst>
    <result name="response" numFound="1" start="0">
     <doc>
      <str name="id">1</str>
      <arr name="title">
       <str>Test document</str>
      </arr>
      <arr name="published">
       <date>2014-04-21T00:00:00Z</date>
      </arr>
      <arr name="likes">
       <long>12</long>
      </arr>
      <long name="_version_">1466477993631154176</long></doc>
     </result>
    </response>

How it works...

We start with our index having two fields, id and _version_. The id field is used as the unique identifier; we informed Solr about this by adding the unqiueKey section in schema.xml. We will need it for functionalities such as document updates, deletes by identifiers, and so forth. The _version_ field is used by Solr internally, and is required by some Solr functionalities (such as optimistic locking); this is why we include it. The rest of the fields will be added automatically.

We also need to define the field types that we will use. Apart from the string type used by the id field, and the long type used by the _version_ field, it contains types our documents will use. We will also define these types in our custom processor chain in the solrconfig.xml file.

The next thing is very important; the managed schema factory that we defined in solrconfig.xml, which is a ManagedIndexSchemaFactory type (the class property set to this value). By adding this section, we say that we want Solr to manage our schema.xml file. This means that Solr will load the schema.xml file during startup, change its name to schema.xml.bak, and will then create a file called managed-schema (the value of the managedSchemaResourceName property). From this point, we shouldn't modify our index structure manually—we should either let Solr do it during indexation or add and alter fields using the schema API (we will talk about this in the Altering the index structure on a live collection recipe in Chapter 8, Using Additional Functionalities). Since I assume that we will use the schema API, I've set the mutable property to true. If we want to disallow using the schema API, we should set the mutable property to false.

Note

Note that you need to have a single schemaFactory defined, and it needs to be set to the ManagedIndexSchemaFactory type. If it is not set to this type, field discovery will not work and the indexation will result in an error.

We also need to include an update request processor chain. Since we want all index requests to use our custom request chain, we add the update.chain property and set it to add-unknown-fields in the defaults section of our update request handler configuration.

Finally, the second most important thing in this recipe is our update request processor chain called add-unknown-fields (the same as we used in the update processor configuration). It defines several update processors that allow us to get the functionality of fields and their types' discoveries. The solr.RemoveBlankFieldUpdateProcessorFactory processor factory removes empty fields from the documents we send to indexation. The solr.ParseBooleanFieldUpdateProcessorFactory processor factory is responsible for parsing Boolean fields; solr.ParseLongFieldUpdateProcessorFactory parses fields that have data that uses the long type; solr.ParseDoubleFieldUpdateProcessorFactory parses fields with data of double type; and solr.ParseDateFieldUpdateProcessorFactory parses the date-based fields. We specify the format we want Solr to recognize (we will discuss this in more detail in the Using parsing update processors to parse data recipe in Chapter 2, Indexing Your Data).

Finally, we include the solr.AddSchemaFieldsUpdateProcessorFactory processor factory that adds the actual fields to our managed schema. We specify the default field type to text by adding the defaultFieldType property. This type will be used when no other type will match the field. After the default field type definition, we see four lists called typeMapping. These sections define the field type mappings Solr will use. Each list contains at least one valueClass property and one fieldType property. The valueClass property defines the type of data Solr will assign to the field type defined by the fieldType property.

In our case, if Solr finds a date (<str name="valueClass">java.util.Date</str>) value in a field, it will create a new field using the tdates field type (<str name="fieldType">tdates</str>). If Solr finds a long or an integer value, it creates a new field using the tlongs field type. Of course, a field won't be created if it already exists in our managed schema. The name of the field created in our managed schema will be the same as the name of the field in the indexed document.

Finally, the solr.LogUpdateProcessorFactory processor factory tells Solr to write information about the update to log, and the solr.RunUpdateProcessorFactory processor factory tells Solr to run the update itself.

As we can see, our data includes fields that we didn't specify in the schema.xml file, and the document was indexed properly, which allows us to assume that the functionality works. If you want to check how our index structure looks like after indexation, use the schema API; you can do it yourself after reading the Retrieving information about the index structure recipe in Chapter 8, Using Additional Functionalities.

One thing to remember is that by default, Solr is able to automatically detect field types such as Boolean, integer, float, long, double, and date.

Note

Take a look at https://cwiki.apache.org/confluence/display/solr/Schemaless+Mode for further information regarding the Solr schemaless mode.