Book Image

Apache Solr Search Patterns

By : Jayant Kumar
Book Image

Apache Solr Search Patterns

By: Jayant Kumar

Overview of this book

Table of Contents (17 chapters)
Apache Solr Search Patterns
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Index

Implementing the information gain model


The problem with the information gain model is that, for each term in the index, we will have to evaluate the occurrence of every other term. The complexity of the algorithm will be of the order of square of the two terms, square(xy). It is not possible to compute this using a simple machine. What is recommended is that we create a map-reduce job and use a distributed Hadoop cluster to compute the information gain for each term in the index.

Our distributed Hadoop cluster would do the following:

  • Count all occurrences of each term in the index

  • Count all occurrences of each co-occurring term in the index

  • Construct a hash table or a map of co-occurring terms

  • Calculate the information gain for each term and store it in a file in the Hadoop cluster

In order to implement this in our scoring algorithm, we will need to build a custom scorer where the IDF calculation is overwritten by the algorithm for deriving the information gain for the term from the Hadoop cluster...