
Practical use cases for Apache Solr


There are plenty of public sites known to use Apache Solr as their search server. A few are listed here, along with how Solr is used:

  • Instagram: Instagram (a Facebook company) is one of the famous sites, and it uses Solr to power its geosearch API

  • WhiteHouse.gov: The Obama administration's website is built on Drupal and Solr

  • Netflix: Solr powers basic movie searching on this extremely busy site

  • Internet Archive: Search this vast repository of music, documents, and videos using Solr

  • StubHub.com: This ticket reseller uses Solr to help visitors search for concerts and sporting events

  • The Smithsonian Institution: Search the Smithsonian's collection of over 4 million items

You can find a more complete (although somewhat outdated) list of sites using Solr at http://wiki.apache.org/solr/PublicServers. You may also look up the interesting case study Contextual Search for Volkswagen and the Automotive Industry; its scope goes beyond Apache Solr, and it talks about semantic (RDF-based) search to empower the enterprise as a whole.

Now that we have understood the Apache Solr architecture and its use cases, let's look at how Apache Solr can be used as an enterprise search in two different industries. We will study the first case in detail, and briefly understand the role Solr can play in the second.

Enterprise search for a job search agency

In this case, we will go through a case study of a job search agency and see how it can benefit from using Apache Solr as an enterprise search platform.

Problem statement

In a job search agency, enterprise search helps reduce the overall time employees spend matching customer expectations with candidate resumes. Typically, for each vacancy, the customer provides a job description. These descriptions are often lengthy, and given the limited time available, an employee has to bridge the gap between the description and the resumes. A job search agency has to deal with various applications, as follows:

  • Internal CMS containing past information, resumes of candidates, and so on

  • Access to market analysis to align the business with expectations

  • Employer vacancies, which may come through e-mails or an online vacancy portal

  • Online job agencies, which are a major source of new resumes

  • An external public site of the agency where many applicants upload their resumes

Since a job agency deals with multiple systems that have different interaction patterns, the objective is to build a unified enterprise search on top of these systems to speed up the overall business.

Approach

Here, we have taken a fictitious job search agency that would like to reduce its candidate identification time using enterprise search. Given the system landscape, Apache Solr can play a major role in helping speed up the process. The following figure depicts the interaction between a unified enterprise search powered by Apache Solr and the other systems:

The figure demonstrates how an enterprise search powered by Apache Solr can interact with different data sources. The job search agency interacts with various internal as well as third-party applications, and these serve as input for the Apache Solr-based enterprise search. This requires Solr to talk to these systems by means of different technology-based interaction patterns, such as web services, database access, crawlers, and customized adapters, as shown on the right-hand side. Apache Solr provides out-of-the-box support for databases; for the rest, the agency has to build event-based or scheduled agents that pull information from these sources and feed it into Solr. Often, this information is raw, and the adapter should be able to extract field information from it, for example, technology expertise, role, salary, or domain expertise. This can be done in various ways. One way is to apply simple regular expression-based patterns to each resume and extract the matching information. Alternatively, one can run the text through a dictionary of verticals and try matching it. A tag-based mechanism can also be used to tag resumes directly from information contained in the text. A minimal configuration along these lines is sketched below.
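The database support mentioned above comes through Solr's DataImportHandler (DIH), whose RegexTransformer can also cover the regular expression-based extraction. The following is a minimal sketch of a DIH data-config.xml; the table, column, and field names (resumes, resume_text, salary), the regular expression, and the connection details are illustrative assumptions, and the handler itself must be registered in solrconfig.xml:

    <dataConfig>
      <!-- Illustrative JDBC connection; adjust the driver, URL, and credentials -->
      <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://localhost/agency" user="solr" password="solr"/>
      <document>
        <!-- Hypothetical "resumes" table holding raw resume text -->
        <entity name="resume" query="SELECT id, resume_text FROM resumes"
                transformer="RegexTransformer">
          <field column="id"/>
          <field column="resume_text"/>
          <!-- Extract a salary figure such as "CTC: 1,200,000" from the raw text -->
          <field column="salary" sourceColName="resume_text"
                 regex="CTC:\s*([\d,]+)"/>
        </entity>
      </document>
    </dataConfig>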

Based on the requirements, Apache Solr must now provide rich facets for candidate searches as well as job searches, which would include the following facets (a sample faceted request is sketched after this list):

  • Technology-based dimension

  • Vertical- or domain-based dimension

  • Financials for candidates

  • Timeline of candidates' resumes (upload date)

  • Role-based dimension
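As a sketch of what such a faceted request could look like over HTTP, assuming hypothetical field names (technology, domain, role, upload_date) that are not part of any fixed schema:

    # Field names below are illustrative; %2B is the URL-encoded "+" in the gap
    curl "http://localhost:8983/solr/select?q=java+architect&facet=true&facet.field=technology&facet.field=domain&facet.field=role&facet.range=upload_date&facet.range.start=NOW/MONTH-6MONTHS&facet.range.end=NOW&facet.range.gap=%2B1MONTH"

Each facet.field parameter adds one dimension to the navigation, and the facet.range parameters bucket resumes by upload date into monthly slices.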

Additionally, mapping similar words (J2EE, Java Enterprise Edition, Java2 Enterprise Edition) through the Apache Solr synonym feature really eases the job of the agency's employees by automatically establishing proximity among words that have the same meaning. We are going to look at how this can be done in the upcoming chapters; a possible configuration is sketched below.
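As a minimal sketch, the synonym groups live in a synonyms.txt file, and a SynonymFilterFactory is added to a field type's analyzer in schema.xml; the field type name text_resume here is an illustrative assumption:

    # synonyms.txt: one comma-separated synonym group per line
    J2EE, Java Enterprise Edition, Java2 Enterprise Edition

    <!-- schema.xml: the field type name "text_resume" is illustrative -->
    <fieldType name="text_resume" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

With expand="true", each term in a group is expanded to all of its equivalents, so a search for J2EE also matches resumes that spell out Java Enterprise Edition.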

Enterprise search for energy industry

In this case study, we will learn how enterprise search can be used within the energy industry.

Problem statement

In large cities, the energy distribution network is managed by companies that are responsible for laying underground cables and setting up power grids and transformers at different places. Overall, it is a huge chunk of work for any company serving a city. Although there are many bigger problems in this industry where Apache Solr can play a major role, we will focus on one specific problem.

Land charts show how the assets (for example, pipes and cables) are placed under the roads; these charts, along with information about lamps, used to be drawn on paper and kept in a safe. This work was paper-based for a long time and has now been computerized. The field workers who carry out repairs or maintenance often need access to this information, such as asset and pipe locations.

The demand is to locate this information geographically. Additionally, the MIS information is spread across documents lying in the CMS, and it is difficult to locate it and link it with a geospatial search. This, in turn, drives the need for an enterprise search. There is also a requirement to identify the field workers closest to a problem area to ensure quick resolution.

Approach

For this problem, we are dealing with information coming in totally different forms. The real challenge is to link this information together and then apply a search that provides unified access to it through rich queries. We have the following information:

  • Land charts: These are PDFs, paper-based documents, and so on, which contain fixed information

  • GIS information: These are coordinates, which are fixed for assets such as transformers and cables

  • Field engineers' information: This gives their current location and flows in continuously

  • Problems/complaints: These arrive continuously, either through a portal or fed directly through the web interface

The challenges that we might face with this approach include:

  • Loading and linking data in various formats

  • Identifying assets on a map

  • Identifying the proximity between field workers and assets

  • Providing a better browsing experience over all this information

Apache Solr supports geospatial search. By linking asset information with the geospatial world, it creates a confluence that enables users to access this information at their fingertips.

However, Solr has its own limitations in terms of geospatial capabilities; for example, it directly supports only point data (latitude, longitude), while other shape types are supported through JTS.

Note

Java Topology Suite (JTS) is a Java-based API toolkit for GIS. JTS provides a foundation for building further spatial applications, such as viewers, spatial query processors, and tools for performing data validation, cleaning, and integration.
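As a minimal sketch of what this looks like in a Solr 4.x schema.xml, assuming illustrative field and type names (asset_location, current_location), point data uses the built-in LatLonType, while lines (cables) and polygons (grid areas) use the RPT field type backed by JTS on the classpath:

    <!-- Point data works out of the box; *_coordinate subfields back LatLonType -->
    <fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
    <dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>
    <field name="asset_location" type="location" indexed="true" stored="true"/>

    <!-- Non-point shapes need JTS available to Solr -->
    <fieldType name="location_rpt"
               class="solr.SpatialRecursivePrefixTreeFieldType"
               spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
               distErrPct="0.025" maxDistErr="0.000009" units="degrees"/>

    <!-- Finding the field engineers closest to a fault location (illustrative):
         /select?q=*:*&sfield=current_location&pt=18.5204,73.8567
                 &fq={!geofilt d=5}&sort=geodist() asc -->

The geofilt filter restricts results to a 5 km radius around the reported fault, and sorting by geodist() ranks the nearest engineers first.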

For the given problem, the GIS and land chart data will be fed into the Solr server once; this includes linking all the assets with GIS information through a custom adapter. The complaint history as well as the field engineers' data will flow in continuously, overwriting the old data (see the sketch after the following list); this can be a scheduled event or a custom event based on the new inputs received by the system. To meet the expectations, the following application components will be required at a minimum:

  • Custom adapter with a scheduler/event for field engineers' data and complaint register information, providing integration with gateways (for tapping the GIS information of field engineers) and portals (for the complaint register)

  • Lightweight client to scan the existing systems (history, other documentation) and load them into Solr

  • Client application providing the end user interface for enterprise search, with URL integration for maps

  • Apache Solr with a superset schema definition and a configuration that supports spatial data types
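The overwriting behavior mentioned above falls out of Solr's uniqueKey handling: re-adding a document with the same key replaces the previously indexed copy. A minimal sketch using Solr's XML update message, with illustrative field names, is as follows:

    <!-- POSTed to http://localhost:8983/solr/update?commit=true -->
    <add>
      <doc>
        <!-- Same id as the previous location report, so it overwrites it -->
        <field name="id">engineer-42</field>
        <field name="name">A. Kulkarni</field>
        <field name="current_location">18.5204,73.8567</field>
      </doc>
    </add>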

The following figure provides one of the possible visualizations of this system. The system can be extended to provide more advanced capabilities, such as integration with Optical Character Recognition (OCR) software to search across paper-based information, or even to generate dynamic reports based on filters using Solr. Apache Solr also supports output in XML form, to which any styling can be applied, and the same can be used to develop nice reporting systems; a sample request is sketched below.
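One way to apply such styling is Solr's XSLT response writer, which transforms the XML response with a stylesheet placed under the core's conf/xslt/ directory; example.xsl ships with the Solr example configuration:

    # Transform the XML search results with conf/xslt/example.xsl
    curl "http://localhost:8983/solr/select?q=*:*&wt=xslt&tr=example.xsl"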