OpenStack crossing big data


OpenStack is a very promising open source cloud computing solution that does not stop embracing and integrating new projects related to the cloud environment. OpenStack keeps growing its ecosystem thanks to the many projects that together make it a very rich cloud platform. OpenStack exposes several infrastructure management services that work in tandem to provide a complete suite of infrastructure management software. Most of its modules were refined and became more mature with the Havana release. It might be essential first to itemize the most basic ones briefly:

  • Keystone: The identity management service. Connecting to and using any OpenStack service requires authenticating against Keystone in the first place (a minimal authentication sketch follows this list).

  • Glance: The image management service. Instances are launched from disk images that Glance stores in its image catalogue.

  • Nova: The instance management service. Once authenticated, a user can create an instance by defining basic resources such as image and network.

  • Cinder: The block storage management service. It allows creating and attaching volumes to instances. It also handles snapshots, which can be used as a boot source.

  • Neutron: The network management service. It allows creating and managing an isolated virtual network for each tenant in an OpenStack deployment.

  • Swift: The object storage management service. Any form of data in Swift is stored redundantly in a scalable, distributed object store running on a cluster of servers.

  • Heat: The orchestration service. It provides a fast way to launch a complete stack from one single template file.

  • Ceilometer: The telemetry service. It monitors and meters the resources used in an OpenStack deployment.

  • Horizon: The OpenStack Dashboard. It provides a web-based interface to different OpenStack services such as Keystone, Glance, Nova, Cinder, Neutron, Swift, Heat, and so on.

  • Trove: The Database as a Service (DBaaS) component in OpenStack. It enables users to consume relational and non-relational database engines on top of OpenStack.
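
All of these services are consumed in the same way: a client authenticates against Keystone first and then calls the public API of the service it needs. The following is a minimal sketch of that flow using the keystoneauth1 Python library, assuming a Keystone v3 endpoint; the URL and credentials are placeholders for an arbitrary deployment:

    from keystoneauth1.identity import v3
    from keystoneauth1 import session

    # Placeholder endpoint and credentials; adjust to your own deployment.
    auth = v3.Password(
        auth_url='http://controller:5000/v3',
        username='admin',
        password='secret',
        project_name='admin',
        user_domain_name='Default',
        project_domain_name='Default',
    )
    sess = session.Session(auth=auth)

    # Every other service client can reuse this authenticated session; the
    # service catalogue returned by Keystone tells it where each API lives.
    print(sess.get_token())
    print(sess.get_endpoint(service_type='image'))    # Glance
    print(sess.get_endpoint(service_type='compute'))  # Nova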

Note

At the time of writing, more incubated projects are being integrated into the OpenStack ecosystem with the Liberty release, such as Ironic, Zaqar, Manila, Designate, Barbican, Murano, Magnum, Kolla, and Congress. To read more about those projects, refer to the official OpenStack website at: https://www.openstack.org/software/project-navigator/

The awesomeness of OpenStack comes not only from its modular architecture but also from the contribution of its large community, which develops and integrates new projects in nearly every OpenStack release. With the Icehouse release, OpenStack contributors turned their attention to the big data world with the Elastic Data Processing service. It is even more amazing to see a cloud service similar to Amazon EMR running on OpenStack.

Well, it is time to open the curtains and explore the marriage of one of the most popular big data programs, Hadoop, with one of the most successful cloud operating systems, OpenStack: Sahara. As shown in the next diagram of the OpenStack IaaS (short for Infrastructure as a Service) layering schema, Sahara can be seen as an optional service that sits on top of the base components of OpenStack. It can be enabled or activated when running a private cloud based on OpenStack.

Note

More details on Sahara integration in a running OpenStack environment will be discussed in Chapter 2, Integrating OpenStack Sahara.

Sahara: bringing big data to the cloud

Sahara was incubated as a big data processing project in the OpenStack Icehouse release and has been fully integrated since the OpenStack Juno release. The Sahara project was a joint effort and contribution between Mirantis, a major OpenStack integration company, Red Hat, and Hortonworks. The Sahara project enables users to run Hadoop/Spark big data applications on top of OpenStack.

Note

The Sahara project was originally named Savanna and was renamed due to trademark issues.

Sahara in OpenStack

The main reason the Sahara project was born is the need for agile access to big data. By moving big data to the cloud, several benefits to the user experience can be captured:

  • Unlimited scalability: Sahara sits on top of the OpenStack cloud management platform. By their nature, OpenStack services scale very well. As we will see, Sahara lets Hadoop clusters scale on OpenStack.

  • Elasticity: Growing or shrinking a Hadoop cluster on demand is obviously a major advantage of using Sahara.

  • Data availability: Sahara is tightly integrated with core OpenStack services, as we will see later. Swift presents a real cloud storage solution and can be used by Hadoop clusters for data source storage. It is a highly durable and available option for the input and output of a data processing workflow.

Note

Swift can be used for input and output data source access in a Hadoop cluster for all job types except Hive.

For a closer understanding of the benefits cited previously, it might be essential to go through a concise architectural overview of Sahara in OpenStack. As depicted in the next diagram, a user can access and manage big data resources from the Horizon web UI or the OpenStack command-line interface. To use any service in OpenStack, it is required to authenticate against the Keystone service. This also applies to Sahara, which needs to be registered with the Keystone service catalogue.
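
Assuming the session created in the earlier sketch, and assuming Sahara is registered in the catalogue under the data-processing service type (the exact type name can vary between releases), the python-saharaclient can be instantiated directly from that session. This is a sketch, not a definitive recipe:

    from saharaclient import client as sahara_client

    # Reuse the Keystone session built earlier; '1.1' is the Sahara API version.
    sahara = sahara_client.Client('1.1', session=sess)

    # List the provisioning plugins exposed by this Sahara endpoint.
    for plugin in sahara.plugins.list():
        print(plugin.name, plugin.versions)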

To be able to create a Hadoop cluster, Sahara needs to retrieve and register virtual machine images in its own image registry by contacting Glance. Nova is another essential OpenStack core component, used to provision and launch the virtual machines for the Hadoop cluster. Additionally, Heat can be used by Sahara to automate the deployment of a Hadoop cluster, which will be covered in a later chapter.
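
As an illustration of the image registry step, the sketch below registers an existing Glance image with Sahara and tags it for a plugin. The image ID, in-instance user name, and tags are placeholder values, and the method names follow the python-saharaclient image manager, so verify them against your client version:

    # Placeholder ID of a Glance image prepared with Hadoop pre-installed.
    image_id = 'replace-with-a-glance-image-id'

    # Register the image with the user name Sahara should use to log in,
    # then tag it so the matching plugin and version can discover it.
    sahara.images.update_image(image_id, user_name='ubuntu',
                               desc='Ubuntu image with Hadoop 2.7.1')
    sahara.images.update_tags(image_id, new_tags=['vanilla', '2.7.1'])

    for image in sahara.images.list():
        print(image.name, image.tags)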

Note

Since the OpenStack Juno release, it is possible to instruct Sahara to use block storage as the backend for cluster nodes.
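
As a sketch of that option, a node group template can request Cinder-backed volumes for its instances through the volumes_per_node and volumes_size attributes; the names, flavor, and sizes below are placeholders:

    # Worker node group backed by two 50 GB Cinder volumes per instance.
    worker_ngt = sahara.node_group_templates.create(
        name='vanilla-worker-cinder',
        plugin_name='vanilla',
        hadoop_version='2.7.1',
        flavor_id='2',                      # Nova flavor for the instances
        node_processes=['datanode', 'nodemanager'],
        volumes_per_node=2,
        volumes_size=50,                    # size in GB per volume
    )
    print(worker_ngt.id)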

The Sahara OpenStack mission

In addition to the aforementioned generic benefits of running big data in OpenStack, OpenStack Sahara has some unique characteristics that can be itemized as follows:

  • Fast provisioning: Deploying a Hadoop/Spark cluster becomes an easy task, performed with a few push-button clicks or via the command-line interface.

  • Centralized management: A Hadoop/Spark cluster can be controlled and monitored efficiently from one single management interface.

  • Cluster management: Sahara offers an amazing templating mechanism. Starting, stopping, scaling, shaping, and resizing actions form the life cycle of a Hadoop/Spark cluster ecosystem. Performing this life cycle in a repeatable way is simplified by using templates in which the Hadoop configuration is defined, so all the low-level cluster node setup details stay out of the user's way (see the templating sketch after this list).

  • Workload management: This is another key feature of Sahara. It basically defines Elastic Data Processing: how jobs are run and queued, and how they should work on the cluster. Several types of data processing jobs, such as MapReduce jobs, Pig scripts, Oozie workflows, JAR files, and many others, can run across a defined cluster. Sahara can provision a new ephemeral cluster and terminate it on demand, for example, running a job for some specific analysis and shutting the cluster down when the job is finished. Workload management also encompasses data sources, which define where a job reads its data from and where it writes its results to.

    Note

    Data source URLs for Swift and for HDFS will be covered in more detail in Chapter 5, Discovering Advanced Features with Sahara.

  • No deep expertise: Administrators and operators no longer have to worry about managing the infrastructure running underneath the Hadoop/Spark cluster. With Sahara, managing that infrastructure does not require real big data operational expertise.

  • Multi-framework support: Sahara exposes the possibility of integrating diverse data processing frameworks using provisioning plugins. A user can choose to deploy a specific Hadoop/Spark distribution, such as the Hortonworks Data Platform (HDP) plugin via Ambari, or the Spark, Vanilla, MapR Distribution, and Cloudera plugins.

  • Analytics as a Service: Bursty analytics workloads can utilize free computing infrastructure capacity for a limited period of time.
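
To make the templating mechanism mentioned under cluster management concrete, the sketch below combines node group templates into a cluster template and launches a cluster from it. Every name, ID, and count is a placeholder, and the keyword arguments should be checked against your python-saharaclient release:

    # Placeholder IDs: a master node group template created beforehand and
    # the worker template from the earlier Cinder-backed sketch.
    master_ngt_id = 'replace-with-a-node-group-template-id'
    worker_ngt_id = worker_ngt.id

    # Cluster template: the reusable description of the whole cluster shape.
    ct = sahara.cluster_templates.create(
        name='vanilla-small',
        plugin_name='vanilla',
        hadoop_version='2.7.1',
        node_groups=[
            {'name': 'master', 'node_group_template_id': master_ngt_id, 'count': 1},
            {'name': 'worker', 'node_group_template_id': worker_ngt_id, 'count': 3},
        ],
    )

    # Launch a cluster from the template; scaling later only changes the counts.
    cluster = sahara.clusters.create(
        name='analytics-cluster',
        plugin_name='vanilla',
        hadoop_version='2.7.1',
        cluster_template_id=ct.id,
        default_image_id=image_id,          # the image registered earlier
        user_keypair_id='mykeypair',        # placeholder Nova keypair name
        net_id='replace-with-a-neutron-network-id',
    )
    print(cluster.status)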

The Sahara architecture

We have seen in the previous diagram how Sahara is integrated into the OpenStack ecosystem from a high-level perspective. As a newer OpenStack service, Sahara exposes different components that interact as clients of other OpenStack services such as Keystone, Swift, Nova, Neutron, Glance, and Cinder. Every request initiated from the Sahara endpoint is performed against the public APIs of those OpenStack services. For this reason, it is essential to put the Sahara architecture under scope, as shown in the following diagram:

The OpenStack Sahara architecture consists essentially of the following components:

  • REST API: Every client request initiated from the dashboard will be translated to a REST API call.

  • Auth: Like any other OpenStack service, Sahara must authenticate against the authentication service, Keystone. This also includes the authorization of clients and users to use the Sahara service.

  • Vendor Plugins: The vendor plugins sit in the middle of the Sahara architecture and expose the type of cluster to be launched. Vendors such as Cloudera and Apache Ambari provide their distributions in Sahara so that users can configure and launch a Hadoop cluster based on their plugin mechanism.

  • Elastic Data Processing (EDP): Enables jobs to run on an existing, launched Hadoop or Spark cluster in Sahara. EDP makes sure that jobs are scheduled to the clusters and maintains the status of jobs and of their data sources: where the input data should be read from and where the output of the processed data should be written to (see the EDP sketch after this list).

  • Orchestration Manager/Provisioning Engine: The core component of Sahara cluster provisioning and management. It instructs the Heat engine (the OpenStack orchestration service) to provision a cluster by communicating with the rest of the OpenStack services, including the compute, network, block storage, and image services.

  • Data Access Layer (DAL): Persistent internal Sahara data store.
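
As a rough sketch of that EDP workflow, and again with placeholder names, URLs, and credentials, a job can be bound to Swift-backed input and output data sources and launched on the cluster created earlier. The calls follow the python-saharaclient EDP managers and should be double-checked against your client version:

    # Placeholder ID of a job binary (for example, a Pig script) uploaded beforehand.
    pig_script_binary_id = 'replace-with-a-job-binary-id'

    # Input and output data sources stored in Swift (the swift:// URLs are placeholders).
    src_in = sahara.data_sources.create(
        name='input-logs', description='', data_source_type='swift',
        url='swift://demo-container/input',
        credential_user='swift_user', credential_pass='swift_pass')
    src_out = sahara.data_sources.create(
        name='output-results', description='', data_source_type='swift',
        url='swift://demo-container/output',
        credential_user='swift_user', credential_pass='swift_pass')

    # Define the job and run it on the cluster; EDP schedules it and tracks its status.
    job = sahara.jobs.create(name='wordcount', type='Pig',
                             mains=[pig_script_binary_id], libs=[],
                             description='Word count example')
    job_ex = sahara.job_executions.create(
        job_id=job.id, cluster_id=cluster.id,
        input_id=src_in.id, output_id=src_out.id, configs={})
    print(job_ex.info['status'])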

Note

It is important to note that Sahara could originally be configured to use a direct engine to create the cluster instances, initiating calls to the required OpenStack services itself to provision them. It is also important to note that the direct engine in Sahara is deprecated from the OpenStack Liberty release onwards, where Heat becomes the default Sahara provisioning engine.