Apache Solr Beginner's Guide

By: Alfredo Serafini

Overview of this book

With over 40 billion web pages, optimizing a search engine's performance is essential.

Solr is an open source enterprise search platform from the Apache Lucene project. Full-text search, faceted search, hit highlighting, dynamic clustering, database integration, and rich document handling are just some of its many features. Solr is highly scalable thanks to its distributed search and index replication.

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Apache Tomcat or Jetty. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it usable with most popular programming languages. Solr's powerful external configuration allows it to be tailored to many types of application without Java coding, and it has a plugin architecture to support more advanced customization.

With Apache Solr Beginner's Guide you will learn how to configure your own search engine experience. Using real data as an example, you will have the chance to start writing step-by-step, simple, real-world configurations and understand when and where to adopt this technology.

Apache Solr Beginner's Guide will start by letting you explore a simple search over real data. You will then go through a step-by-step description that gives you the chance to explore several practical features. At the end of the book you will see how Solr is used in different real-world contexts.

Using data from public domains like DBpedia, you will define several different configurations, exploring some of the most interesting Solr features, such as faceted search and navigation, auto-suggestion, and rich document indexing. You will see how to configure different analysers for handling different data types, without programming.

You will learn the basics of Solr, focusing on real-world examples and practical configurations.

Time for action – using Bean Scripting Framework and JavaScript


At this point, it should be clear that indexing data with the Java API is simple (we only have to call the .add() or .addBean() methods on a server instance), that querying is flexible, and that we can use several functionally equivalent wrappers of a SolrServer instance.

This is not the end of the story, however, as it's possible to use the SolrJ library's APIs in JavaScript by using the Bean Scripting Framework (BSF), which is an Apache library designed to integrate scripting languages' code into a Java application. Note that there are other languages supported by BSF, such as Groovy or Clojure, and the same approach used here with JavaScript can be adopted for one of those. For a complete list of the languages supported by BSF, please refer to: http://commons.apache.org/proper/commons-bsf/index.html. If you are curious about the current support for an external scripting language in the standard Java distribution, please refer to: http://download.java.net/jdk8/docs/technotes/guides/scripting/prog_guide/api.html.

  1. For example, to index a document from JavaScript, you will need to write a snippet of code similar to the following (I left only the essential parts):

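    // Rhino-style imports of the SolrJ classes used below
    importClass(org.apache.solr.client.solrj.impl.HttpSolrServer);
    importClass(org.apache.solr.common.SolrInputDocument);

    // URL of the Solr core that receives the document
    var url = "http://localhost:8983/solr/arts";
    var server = new HttpSolrServer(url);

    var doc = new SolrInputDocument();
    ...
    doc.addField("note", "TEST document added to the index by javascript");

    // Send the document, then commit so that it becomes searchable
    server.add(doc);
    server.commit();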
  2. If we save this code in a file named example.js under the scripts/javascript/ folder of our client's example project, we only need the following few lines of Java to execute it:

    import java.io.FileInputStream;
    import java.util.Scanner;
    import org.apache.bsf.BSFManager;
    String myScript = new Scanner(new FileInputStream("scripts/javascript/example.js")).useDelimiter("\\Z").next();
    BSFManager manager = new BSFManager();
    manager.eval("javascript", "example.js", 0, 0, myScript);
  3. All you have to do is prepare code similar to the previous example, and you can write scripts that communicate with a remote Solr server. You will need an appropriate interpreter library available for the BSF; for example, I added Rhino to my dependencies.

In a very similar way, we can also handle Groovy, JRuby, Jython, or other languages.
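
The standard Java scripting API (javax.script) mentioned earlier follows the same pattern of evaluating a script held in a string. The following is a minimal sketch, not code from the book: the class and method names are hypothetical, and whether a JavaScript engine is installed depends on the JDK in use, so the code checks for a missing engine instead of assuming one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineFactory;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

// Hypothetical helper class for illustration only.
final class ScriptingSketch {

    // Lists the language names of all script engines visible to this JVM.
    static List<String> availableLanguages() {
        List<String> names = new ArrayList<>();
        for (ScriptEngineFactory factory : new ScriptEngineManager().getEngineFactories()) {
            names.add(factory.getLanguageName());
        }
        return names;
    }

    // Evaluates the script with the named engine.
    // Returns Optional.empty() when no such engine is installed
    // (or when the script itself evaluates to null).
    static Optional<Object> evalIfAvailable(String engineName, String script) {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName(engineName);
        if (engine == null) {
            return Optional.empty();
        }
        try {
            return Optional.ofNullable(engine.eval(script));
        } catch (ScriptException e) {
            // Wrap the checked exception so callers can stay simple.
            throw new IllegalStateException("Script failed: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Installed languages: " + availableLanguages());
        System.out.println("Result: " + evalIfAvailable("javascript", "1 + 1"));
    }
}
```

If the list of installed languages is empty, an engine library has to be added to the classpath, in the same way that Rhino was added for the BSF.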

What just happened?

Starting from the end, you can easily recognize the declaration of the interpreter to be used with the BSF manager. We loaded the content of the file as a plain string, and then let the BSF transparently execute it.
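
Loading the script as a string can be sketched with the standard library alone. The example below is an illustration under assumptions, not the author's code: the ScriptLoader class and its method names are hypothetical, and UTF-8 is assumed as the file encoding. It shows the Scanner idiom from step 2 with the stream closed after reading, next to an equivalent based on java.nio.file.Files.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Scanner;

// Hypothetical helper class for illustration only.
final class ScriptLoader {

    // Reads the file with the Scanner idiom: "\\Z" marks the end of the input,
    // so the first token is the file content. try-with-resources closes
    // the scanner and the underlying stream.
    static String readWithScanner(Path file) {
        try (Scanner scanner = new Scanner(Files.newInputStream(file), "UTF-8")) {
            scanner.useDelimiter("\\Z");
            return scanner.hasNext() ? scanner.next() : "";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the file in one call and decodes it explicitly as UTF-8.
    static String readWithFiles(Path file) {
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a tiny script to a temporary file and load it both ways.
        Path file = Files.createTempFile("example", ".js");
        Files.write(file, "var greeting = 'hello';".getBytes(StandardCharsets.UTF_8));
        System.out.println(readWithScanner(file).equals(readWithFiles(file)));
        Files.delete(file);
    }
}
```

Either string could then be passed to manager.eval(...) as in step 2. Because \Z matches before a final line terminator, the Scanner variant may drop a trailing newline, which is normally harmless for a script.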

If you look at the JavaScript code, it's almost identical to what we would have written in Java. There are very minor syntactical differences, such as the use of a specific function (importClass) for the imports and the var keyword for declaring variables, but the APIs called are the same.

Jahia CMS

Jahia (http://www.jahia.com) is an open source Content Management System (CMS) that exposes a RESTful API and uses Solr as its internal search framework. The Jahia content platform (http://www.jahia.com/tech/jahia-content-platform) supports development of apps and includes a rule-based workflow engine with Drools support. The platform also has multiscripting support to enable the development of templates using different languages or frameworks such as PHP, Freemarker, JavaScript, and others.

Magnolia CMS

Another good open source CMS on the Java platform is Magnolia (http://www.magnolia-cms.com/). This CMS supports management of different configurations and revisions, inline editing, development of apps, user-generated content, and concurrent collaborative editing. It also has facility classes for JBPM workflow integration and for writing code with Groovy.

In this case, the Solr integration is enabled by a module (http://wiki.magnolia-cms.com/display/WIKI/Magnolia+Apache+Solr+integration) that adds not only full-text search but also spellchecking and access-controlled search based on metadata. Magnolia also extends these features with article categorization, support for Dublin Core metadata, Digital Asset Management (DAM) capabilities (http://www.magnolia-cms.com/product/features/digital-asset-management.html), and a standard Content Management Interoperability Services (CMIS) interface.

Alfresco DMS and CMS

Alfresco is an open source Document Management System (DMS) that also offers modules for CMS, CMIS integration, team collaboration, design of document workflows with activities, Office or Google Docs integration, multiscripting, and much more. For a complete list of features, you can refer to the official site: http://www.alfresco.com/.

The Solr integration module is based on an internal local Solr web application (http://wiki.alfresco.com/wiki/Alfresco_And_SOLR#Configuring_the_Solr_web_app), which is fully functional and integrated with Alfresco's advanced structured content handling with metadata Catalog and Archive Support (CAS).

Liferay

Liferay is an open source web portal (http://www.liferay.com/products/liferay-portal/features/portal) based on a service-oriented architecture. It is possible to write new components, modules, and portlets on top of the existing services. There is already a CMS module, and the Solr integration is provided as an app (http://www.liferay.com/it/marketplace/-/mp/application/15193648) which, once installed, is accessible as a Liferay service.

Broadleaf

Broadleaf is an open source e-commerce CMS built on a technology stack that includes Spring, Maven, Google Web Toolkit, and Thymeleaf for writing templates. Solr is integrated to provide not only full-text search over article descriptions but also easy customization of per-category facets.

You can find more information on the official site: http://www.broadleafcommerce.org/.

Apache Jena

Apache Jena (http://jena.apache.org/) is an open source Java framework for handling RDF data and building linked data and semantic web applications. It provides integrations with other frameworks and triple stores, and it is internally divided into modules for storing data (TDB triple store), exposing the SPARQL endpoint (ARQ and Fuseki), and parsing RDF or OWL. There is also a Solr integration that brings full-text search into SPARQL queries: http://jena.apache.org/documentation/query/text-query.html.

Solr Groovy or the Grails plugin

Solr can be easily used from Groovy, and there is also a Grails plugin (http://www.grails.org/plugin/solr), built on top of the SolrJ library, that can be installed directly from the Grails command line.

Solr Scala

With the Scala language, it is possible to use the SolrJ library directly, and there are also several third-party implementations, such as the solr-scala-client (https://github.com/takezoe/solr-scala-client) or the excellent Solr DSL for the spray.io web framework (http://bathtor.github.io/spray-solr/api/index.html#spray.solr.package), which exposes the Solr service as an actor-based asynchronous service with the Akka framework (http://akka.io/). Note that this actor-based, full-text service can be accessed from any web framework based on Akka.

Spring Data

If you use the popular Spring framework (or a project built on top of Spring, like some of the previous ones), you should look at the spring-data module. A good place to start could be the excellent tutorial by Petri Kainulainen at http://www.petrikainulainen.net/programming/solr/spring-data-solr-tutorial-configuration/. This library greatly simplifies the interaction with NoSQL databases and MapReduce frameworks, and offers simple object-oriented access to relational databases, integrating Solr capabilities into the mapped objects.

You can find a more detailed description of all the possible integrations supported on this site: http://projects.spring.io/spring-data/.