OpenStack Sahara Essentials

By: Omar Khedher

Overview of this book

The Sahara project is a module that aims to simplify the building of data processing capabilities on OpenStack. The goal of this book is to provide a focused, fast-paced guide to installing, configuring, and getting started with integrating Hadoop with OpenStack, using Sahara. The book explains how to deploy data-intensive Hadoop and Spark clusters on top of OpenStack. It also covers how to use the Sahara REST API, how to develop applications for Elastic Data Processing on OpenStack, and how to set up Hadoop or Spark clusters on OpenStack.

It is all about data


A world of information, sitting everywhere, in different formats and locations, generates a crucial question: where is my data?

During the last decade, most companies and organizations have realized the increasing rate at which data is generated and have begun to switch to more sophisticated ways of handling the growing amount of information. Nearly every customer-business relationship in an organization depends on answers found in the documents and files sitting on its drives. The scope keeps widening: data generates more data, and particular elements need to be extracted from it. The filtered elements are then stored separately for better information management and in turn join the data space. We are talking about terabytes and petabytes of structured and unstructured data: that is the essence of big data.

The dimensions of big data

Big data refers to data whose scale exceeds the capacity of traditional data tools to manage and manipulate it.

Gartner analyst Doug Laney described big data in a 2001 research publication in terms of what are now known as the 3Vs:

  • Volume: The overall amount of data

  • Velocity: The processing speed of data and the rate at which data arrives

  • Variety: The different types of structured and unstructured data

Note

To read more about the 3Vs concept introduced by Doug Laney, check the following link: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf

The big challenge of big data

Another important question is how the data will be manipulated and managed in such a large space. Traditional tools will almost certainly need to be revisited to cope with the large volume of data; loading and analyzing it in a traditional database means the database may become overwhelmed by the unstoppable, massive surge of data.

Additionally, it is not only the volume of data that presents a challenge, but also time and cost. Merging big data with traditional tools might be too expensive, and the time taken to access the data can become prohibitively long. From a latency perspective, users need to run a query and get a response in a reasonable time. A different approach exists to meet these challenges: Hadoop.

The revolution of big data

Hadoop tools come to the rescue and answer a few challenging questions raised by big data: how can a mixture of structured and unstructured data sitting across a vast storage network be stored and managed? How can given information be accessed quickly? How can the big data system be controlled in a scalable and flexible fashion?

The Hadoop framework lets data volumes increase while controlling the processing time. Without diving into the Hadoop technology stack, which is out of the scope of this book, it might be important to examine a few tools available under the umbrella of the Hadoop project and within its ecosystem:

  • Ambari: Hadoop management and monitoring

  • HDFS: Hadoop distributed file system, the storage layer of the platform

  • HBase: Hadoop NoSQL non-relational database

  • Hive: Hadoop data warehouse

  • Hue: Hadoop web interface for analyzing data

  • MapReduce: Hadoop's distributed data processing model and engine (a word count sketch follows this list)

  • Pig: Data analysis high-level language

  • Storm: Distributed real-time computation system

  • YARN: Resource management layer introduced in Hadoop version 2, on which MapReduce jobs run

  • ZooKeeper: Centralized configuration and coordination service

  • Flume: Service mechanism for data collection and streaming

  • Mahout: Scalable machine learning platform

  • Avro: Data serialization platform
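
To make the MapReduce item above concrete, here is a minimal word count sketch using Hadoop Streaming, which lets plain executables act as the mapper and the reducer. The file names and paths are purely illustrative and not tied to any particular Hadoop distribution:

    #!/usr/bin/env python
    # mapper.py -- read raw text from standard input and emit one
    # tab-separated "word<TAB>1" pair per word.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(word + "\t1")

    #!/usr/bin/env python
    # reducer.py -- Hadoop sorts the mapper output by key, so identical words
    # arrive on consecutive lines; sum their counts and emit one total per word.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(current_word + "\t" + str(current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(current_word + "\t" + str(current_count))

Both scripts would normally be submitted through the hadoop-streaming JAR, which takes care of splitting the input, shuffling, and sorting between the two phases.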

Apache Spark is another compelling engine for processing large amounts of data, offering the kind of in-memory performance that a typical MapReduce job cannot provide. Typically, Spark can run on top of Hadoop or standalone. Hadoop uses HDFS as its default file system; it is designed as a distributed file system that provides high-throughput access to application data.
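
For comparison, the same word count can be expressed with Spark's RDD API in a few lines. The sketch below uses PySpark and assumes a Spark installation is available; the HDFS input and output paths are hypothetical:

    # wordcount_spark.py -- count the words of a text file stored on HDFS.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("wordcount")
    sc = SparkContext(conf=conf)

    counts = (sc.textFile("hdfs:///data/input.txt")    # hypothetical input path
                .flatMap(lambda line: line.split())    # split each line into words
                .map(lambda word: (word, 1))           # pair every word with a count of 1
                .reduceByKey(lambda a, b: a + b))      # sum the counts per word

    counts.saveAsTextFile("hdfs:///data/wordcount-output")
    sc.stop()

Because the intermediate datasets can stay in memory between the flatMap, map, and reduceByKey stages, iterative workloads in particular tend to run much faster than an equivalent chain of MapReduce jobs.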

The big data tools (Hadoop/Spark) sound very promising. On the other hand, a project launched at terabyte scale might quickly grow to petabyte scale. The traditional solution is to add more clusters, but operational teams then face more difficulties with manual deployment, change management and, most importantly, performance scaling. Ideally, when actively working on a live production setup, users should not experience any sort of service disruption. Adding elasticity to the Hadoop infrastructure in a scalable way is therefore imperative. How can you achieve this? An innovative idea is to use the cloud.

Note

Scala and R are two languages widely used alongside Hadoop. Scala can be used to develop applications that interact with Hadoop and Spark, while R has become very popular for data analysis, data processing, and descriptive statistics. Integration of Hadoop with R is ongoing; RHadoop is one of the R open source projects that exposes a rich collection of packages to help analyze data with Hadoop. To read more about RHadoop, visit the official GitHub project page at https://github.com/RevolutionAnalytics/RHadoop/wiki

A key of big data success

Cloud computing technology might be a satisfactory solution, as it eliminates large upfront IT investments. A scalable approach is essential to let businesses easily scale out their infrastructure. This can be as simple as putting the application in the cloud and letting the provider support and resolve the big data management scalability problem.

Use case: Elastic MapReduce

One shining example is the popular Amazon service named Elastic MapReduce (EMR), which can be found at https://aws.amazon.com/elasticmapreduce/. Amazon EMR, in a nutshell, is Hadoop in the cloud. Before taking a step further and briefly seeing how such a technology works, it is worth checking where EMR sits in Amazon's architecture.

Basically, Amazon offers the famous EC2 service (which stands for Elastic Compute Cloud), which can be found at https://aws.amazon.com/ec2/. It is a way to request a certain amount of compute resources: servers, load balancers, and much more. Moreover, Amazon exposes a simple key/value storage model named Simple Storage Service (S3), which can be found at https://aws.amazon.com/s3/.

With S3, storing any type of data is simple and straightforward through web or command-line interfaces. It is Amazon's responsibility to take care of scaling, data availability, and the reliability of the storage service.
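
As a small illustration of how simple that key/value model is from code, the following sketch uses the boto3 Python SDK. It assumes AWS credentials are already configured in the environment, and the bucket and key names are made up for the example:

    # Store and retrieve an object in S3 with boto3.
    import boto3

    s3 = boto3.client("s3")

    # Upload a local file under a bucket and key of our choosing.
    s3.upload_file("sales-2015.csv", "my-demo-bucket", "raw/sales-2015.csv")

    # Download it again; S3 objects are addressed purely by bucket + key.
    s3.download_file("my-demo-bucket", "raw/sales-2015.csv", "sales-copy.csv")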

We have used a few acronyms: EC2, S3, and EMR. From a high-level architecture point of view, EMR sits on top of EC2 and S3: it uses EC2 for processing and S3 for storage. The main purpose of EMR is to process data in the cloud without managing your own infrastructure. As described briefly in the following diagram, data is pulled from S3, an EC2 cluster of the requested size is spun up automatically to process it, and the results are piped back to S3. The hallmark of Hadoop in the cloud is zero-touch infrastructure: you just specify what kind of job you intend to run, where the data lives, and where to pick up the results.
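
To give a feel for that zero-touch workflow, here is a hedged sketch of launching a small EMR cluster with boto3 that runs the earlier streaming word count. The bucket names, script locations, instance types, and release label are placeholders rather than recommendations:

    # run_emr_wordcount.py -- spin up an EMR cluster, run one streaming step,
    # and let the cluster terminate itself when the step finishes.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="wordcount-demo",
        ReleaseLabel="emr-5.30.0",                 # placeholder EMR release
        LogUri="s3://my-demo-bucket/emr-logs/",
        Instances={
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": False,  # shut down after the step
        },
        Steps=[{
            "Name": "streaming-wordcount",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "hadoop-streaming",
                    "-files", "s3://my-demo-bucket/scripts/mapper.py,s3://my-demo-bucket/scripts/reducer.py",
                    "-mapper", "mapper.py",
                    "-reducer", "reducer.py",
                    "-input", "s3://my-demo-bucket/raw/",
                    "-output", "s3://my-demo-bucket/results/",
                ],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print("Launched cluster:", response["JobFlowId"])

Once the step completes, the cluster terminates itself and the word counts wait in the S3 output prefix, which is exactly the pull, process, and push-back cycle shown in the diagram.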