Scaling Apache Solr

A good enterprise search solution is an important part of information availability in any organization. Without one, an organization risks poor decision making, duplicated effort, and lost productivity because people cannot find the right information when they need it. Any search engine typically comprises the following components:

Crawlers or data collectors focus mainly on gathering the information on which a search should run.
Once the data is collected, it needs to be parsed and indexed, so parsing and indexing are another important component of any enterprise search.
The search component is responsible for runtime search on a user-chosen dataset.
Additionally, many search engine vendors provide a plethora of components around search engines, such as administration and monitoring, log management, and customizations.
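
The collect, parse and index, and search components listed above can be illustrated with a minimal sketch. It is not how Apache Solr is implemented; it assumes a tiny in-memory inverted index, and the sample documents and the function names (tokenize, build_index, search) are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Parse step: split raw text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Index step: map each token to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Search step: return the IDs of documents containing every query token."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Stand-in for what a crawler would collect (hypothetical sample data).
collected = {
    "doc1": "Quarterly sales report for the retail division",
    "doc2": "Design document for the search service",
    "doc3": "Sales forecast and search trends",
}

index = build_index(collected)
matches = search(index, "sales search")  # only doc3 contains both words
```

A real engine adds much more at each stage (format-specific parsers, stemming, ranking), but the division of work between the three components stays the same.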

Today, public web search engines are mature. More than 90 percent of online activities begin with search engines (http://searchengineland.com/top-internet-activities-search-email-once-again-88964) and more than 100 billion searches are made globally every month (http://searchenginewatch.com/article/2051895/Global-Search-Market-Tops-Over-100-Billion-Searches-a-Month). While web search focuses on finding content on the Web, enterprise search focuses on helping employees find relevant information stored in any form on their corporate network. Web search has access to HTML pages that carry a lot of useful metadata for producing the best results, whereas corporate information often lacks metadata that an enterprise search can use to relate documents. As a result, building an enterprise search engine is a big challenge.

Many enterprise web portals provide search over their own data; however, they do not really solve the problem of unified data access, because most of the enterprise data outside the purview of these portals remains invisible to their search. This data typically resides in sources such as external data sources, other departmental data, individual desktops, secured data, proprietary format data, and media files. Let's look at the challenges faced in the industry for enterprise search as shown in the following figure:

Let's go through each challenge in the following list and try to understand what they mean:

Diverse repositories: The repositories that hold the information range from a simple web server to a complex content management system. The enterprise search engine must be capable of dealing with diverse repositories.
Security: Security, along with fine-grained access control, is one of the primary concerns when dealing with enterprise search. Enterprises expect their search solutions to preserve data privacy. This means two users running the same search may get two different sets of results, depending on their document-level access.
Variety of information: The information in any enterprise is diverse and has different dimensions, such as different document types (including PDF, DOC, and proprietary formats) and different locales (such as English, French, and Hindi). An enterprise search is required to index this information and provide a search on top of it. This is one of the challenging areas of enterprise search.
Scalability: The information in any enterprise is always growing and enterprise search has to support its growth without impacting its search speed. This means the enterprise search has to be scalable to address the growth of an enterprise.
Relevance: Relevance is about how closely the search results match the user's expectations. Public web search can infer relevance from mechanisms such as links across web pages, whereas enterprise search has to determine relevance quite differently. Relevance in enterprise search involves understanding the current business functions and how they contribute to the relevance ranking calculation. For example, a research paper publication would carry more weight in an academic institution's search engine than in a recruitment search engine.
Federation: Any large organization has a plethora of applications. Some of them carry technical limitations, such as proprietary formats or an inability to share their data for indexing. Often, enterprise applications such as content management systems provide inbuilt search capabilities over their own data. Enterprise search has to consume these services and provide a unified search mechanism for all applications in the enterprise. A federated search plays an important role when searching across such varied resources.
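
The document-level access described under Security in the preceding list can be sketched as a filter applied to search hits. This is only an illustration under stated assumptions: the access control list, the group names, and the trim_by_access function are hypothetical, and documents with no ACL entry are assumed to be hidden.

```python
def trim_by_access(hits, user_groups, acl):
    """Keep only the hits that at least one of the user's groups may read.

    Assumption: a document with no ACL entry is hidden (deny by default).
    """
    return [doc_id for doc_id in hits if acl.get(doc_id, set()) & user_groups]


# Hypothetical access control list: document ID -> groups allowed to read it.
acl = {
    "salary-bands.xlsx": {"hr"},
    "holiday-calendar.pdf": {"hr", "engineering"},
    "build-guide.md": {"engineering"},
}

# The same raw hits for the same query...
hits = ["salary-bands.xlsx", "holiday-calendar.pdf", "build-guide.md"]

# ...produce different result sets for different users.
hr_view = trim_by_access(hits, {"hr"}, acl)
eng_view = trim_by_access(hits, {"engineering"}, acl)
```

Filtering after the search is the simplest approach to show; an engine could equally apply the same check as part of the query itself.
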
Tip
A federated search enables users to run their search queries on various applications simultaneously in a delegated manner. Participating applications in a federated search perform the search operation using their own mechanism. The results are then combined and ranked together and presented as a single search result (unified search solution) to the user.
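
The delegate, combine, and rank flow described in this tip might look like the following sketch. The two backends are hypothetical stand-ins for applications with their own search mechanisms, and dividing each score by its source's top score is an assumed way to make scores from different applications comparable, not a prescribed method.

```python
def federated_search(query, backends):
    """Delegate the query to each backend, then merge and rank the results."""
    merged = []
    for source, search_fn in backends.items():
        hits = search_fn(query)  # each application searches in its own way
        if not hits:
            continue
        top = max(score for _, score in hits)
        for title, score in hits:
            # Assumption: scaling by the source's top score makes
            # scores from different applications comparable.
            normalized = score / top if top > 0 else 0.0
            merged.append((normalized, source, title))
    # Highest normalized score first; ties broken by source, then title.
    merged.sort(key=lambda hit: (-hit[0], hit[1], hit[2]))
    return [(title, source, round(score, 2)) for score, source, title in merged]


# Hypothetical backends, each returning (title, raw score) pairs on its own scale.
def wiki_search(query):
    return [("Solr setup guide", 8.0), ("Solr FAQ", 4.0)]


def cms_search(query):
    return [("Solr rollout plan", 0.9), ("Budget memo", 0.3)]


results = federated_search("solr", {"wiki": wiki_search, "cms": cms_search})
```

The user sees one ranked list and does not need to know which application produced each hit.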

Let's take a look at a fictitious enterprise search implementation for a software product development company called ITWorks Corporation. The following screenshot depicts how a possible search interface would look:

A search should support basic keyword searching as well as advanced searching across various data sources, either directly or through a federated search. In this case, the search covers the source code, development documentation, and resource capabilities, all at once. Given such diverse content, a search should provide a unified browsing experience where the results show up together, hiding the underlying sources. To enable rich browsing, it may provide refinements based on certain facets, as shown in the screenshot. It may also provide features such as sorting, spell checking, pagination, and search result highlighting. These features enhance the user experience while searching for information.
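
Facet refinement and pagination, two of the features mentioned above, can be sketched over a unified result list as follows. The result records, the field name source, and the helper functions are hypothetical and only illustrate the idea.

```python
from collections import Counter


def facet_counts(results, field):
    """Count how many results fall under each value of a facet field."""
    return Counter(doc[field] for doc in results)


def refine(results, field, value):
    """Narrow the results to those matching the selected facet value."""
    return [doc for doc in results if doc[field] == value]


def paginate(results, page, page_size):
    """Return one page of results; pages are numbered from 1."""
    start = (page - 1) * page_size
    return results[start:start + page_size]


# Hypothetical unified results drawn from several underlying sources.
results = [
    {"title": "OrderService.java", "source": "source code"},
    {"title": "Order flow design notes", "source": "documentation"},
    {"title": "OrderRepository.java", "source": "source code"},
    {"title": "Team skills matrix", "source": "resources"},
    {"title": "Release checklist", "source": "documentation"},
]

counts = facet_counts(results, "source")              # shown beside the results
code_only = refine(results, "source", "source code")  # user clicks a facet
first_page = paginate(results, page=1, page_size=2)
```

The facet counts tell the user how many hits each refinement would leave before they click it.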

By: Hrishikesh Vijay Karambelkar
