OpenStack Sahara Essentials

By: Omar Khedher

Overview of this book

The Sahara project is a module that aims to simplify the building of data processing capabilities on OpenStack. The goal of this book is to provide a focused, fast-paced guide to installing, configuring, and getting started with integrating Hadoop with OpenStack using Sahara. The book explains how to deploy data-intensive Hadoop and Spark clusters on top of OpenStack. It also covers how to use the Sahara REST API, how to develop applications for Elastic Data Processing on OpenStack, and how to set up Hadoop or Spark clusters on OpenStack.

Credits

About the Author

About the Reviewer

www.PacktPub.com

Preface

The Essence of Big Data in the Cloud

It is all about data

OpenStack crossing big data

Summary

Integrating OpenStack Sahara

Preparing the test infrastructure environment

Installing OpenStack

Integrating Sahara

Summary

Using OpenStack Sahara

Planning a Hadoop deployment

Creating a Hadoop cluster

Summary

Executing Jobs with Sahara

Job glossary in Sahara

Running jobs in Sahara

Summary

Discovering Advanced Features with Sahara

Sahara plugins

Boosting Elastic Data Processing performance

Defining the network

Increasing data reliability

Summary

Hadoop High Availability Using Sahara

HDP high-availability support

CDH high-availability support

Summary

Troubleshooting

Troubleshooting OpenStack

Troubleshooting data processing

Summary

Index

Preface

OpenStack, the ultimate cloud computing operating system, keeps growing and gaining popularity around the globe. One of the main reasons for OpenStack's success is the collaboration of several large enterprises worldwide. With every new release, the OpenStack community brings a newly incubated project to the open source cloud computing world. Lately, big data has also taken on an important role in the OpenStack journey. Broadly defined by the complexity of managing data and extracting value from it, the big-data business faces several challenges that need to be tackled. With the growth of the cloud paradigm over the last decade, big data can also be offered as a service. Specifically, the OpenStack community has taken on this challenge and turned it into a unique opportunity: Big Data as a Service. The Sahara project makes provisioning a complete elastic Hadoop cluster a seamless operation, with no need to touch the underlying infrastructure. Running on OpenStack, Sahara has become a mature project that supports Hadoop and Spark, the open source in-memory computing framework. Sahara thus brings big data and data processing into OpenStack under the name Elastic Data Processing. Formerly known as Savanna, Sahara has become an attractive project that supports several big data providers.

In this book, we will explore the main motivation for using Sahara and how it interacts with other OpenStack services. That motivation is the set of facilities exposed from a central dashboard to manage big-data infrastructure and simplify data-processing tasks. We will walk through the installation and integration of Sahara with OpenStack, launch clusters, execute sample jobs, explore more functions, and troubleshoot some common errors. By the end of this book, you should not only understand how Sahara operates within the OpenStack ecosystem but also recognize its major use cases: cluster and workload management.

What this book covers

Chapter 1, The Essence of Big Data in the Cloud, introduces the motivation for using the cloud computing paradigm in big-data management. The chapter focuses on the need for a different way to resolve the complexity of big-data analysis by looking at the Sahara project and its internal architectural design.

Chapter 2, Integrating OpenStack Sahara, walks through all the necessary steps for installing a multi-node OpenStack environment and integrating Sahara, and it shows you how to run it successfully along with the existing OpenStack environment.

Chapter 3, Using OpenStack Sahara, describes the workflow of Hadoop cluster creation using Sahara. The chapter shows you how to speed up launching clusters using templates through Horizon and via the command line in OpenStack.

Chapter 4, Executing Jobs with Sahara, focuses on executing sample jobs for elastic data processing, building on the example from the previous chapter. It also gives you the opportunity to execute jobs using the Sahara REST API and shows what is going on under the hood at the API call level in OpenStack.

Chapter 5, Discovering Advanced Features with Sahara, dives into more advanced Sahara functionality, such as the anti-affinity and data-locality concepts. This chapter also covers the plugins that Sahara supports and tells you why you need each of them. In addition, you will learn how to customize the Sahara setup based on several storage and network configurations in the OpenStack environment.

Chapter 6, Hadoop High Availability Using Sahara, discusses building a highly available Hadoop cluster using Sahara. At the time of writing, this option is available only for HDP and CDH clusters, which the chapter focuses on. For each plugin, it provides a sample setup and highlights its requirements.

Chapter 7, Troubleshooting, provides best practices for troubleshooting Sahara when it generates errors during setup and use. It starts by tackling major OpenStack issues that affect many other components and then shows how to resolve problems using debugging tools and practical tips.

What you need for this book

This book assumes medium-level knowledge of Linux operating systems, basic knowledge of cloud computing and big data, and moderate experience with OpenStack software. The book goes through a simple multi-node setup of an OpenStack environment, which may require a basic understanding of networking and virtualization concepts. Experience with Hadoop and Spark is a big plus. Although the book uses VirtualBox, feel free to use any other lab environment, such as VMware Workstation.

OpenStack can be installed and run either on bare metal or in virtual machines. However, this book requires that you have enough resources for the whole setup. The minimum hardware or virtual requirements are as follows:

CPU: 4 cores
Memory: 8 GB of RAM
Disk space: 80 GB
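
As a quick sanity check before starting the lab, the minimums above can be compared against a host's actual resources. The following Python sketch is illustrative only: the function names `shortfalls` and `probe_host` are hypothetical, the probe assumes a Linux host, and treating "disk space" as free space on the root filesystem is an assumption rather than something the book states.

```python
import os
import shutil

# Minimums quoted from the list above.
MIN_CORES = 4
MIN_MEMORY_GB = 8
MIN_DISK_GB = 80


def shortfalls(cores, memory_gb, disk_gb):
    """Return one message per requirement that the given figures do not meet."""
    problems = []
    if cores < MIN_CORES:
        problems.append(f"CPU: {cores} cores, need {MIN_CORES}")
    if memory_gb < MIN_MEMORY_GB:
        problems.append(f"Memory: {memory_gb:.1f} GB, need {MIN_MEMORY_GB} GB")
    if disk_gb < MIN_DISK_GB:
        problems.append(f"Disk: {disk_gb:.1f} GB free, need {MIN_DISK_GB} GB")
    return problems


def probe_host(path="/"):
    """Read core count, total RAM, and free disk space (assumes Linux)."""
    gib = 1024 ** 3
    cores = os.cpu_count() or 0
    # Total physical memory = page size * number of physical pages.
    memory_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    # Assumption: the requirement refers to free space on this filesystem.
    disk_bytes = shutil.disk_usage(path).free
    return cores, memory_bytes / gib, disk_bytes / gib


if __name__ == "__main__":
    report = shortfalls(*probe_host())
    for line in report or ["Host meets the listed minimums."]:
        print(line)
```

A machine sold as having 8 GB of RAM usually reports slightly less than that to the operating system, so a small memory shortfall in the output is better read as borderline than as a hard failure.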

You will need the following software:

Linux operating system: CentOS 7.x.
VirtualBox.
The OpenStack RDO distribution, preferably the Liberty release. If you intend to use the Juno or Kilo releases, make sure to change the plugin versions when launching clusters so that they match what your OpenStack release supports.

Internet connectivity is required to install the necessary OpenStack packages, Sahara images, and Sahara image packages for specific plugins.

Who this book is for

To use the content of this book, basic prior knowledge of OpenStack is expected. If you don't have that knowledge, you can catch up on the basics by quickly reading about the major components in the OpenStack community documentation (http://docs.openstack.org/admin-guide-cloud). That documentation covers previous OpenStack releases, including Juno, Kilo, and Liberty. This book is essentially intended for data scientists, big data architects, cloud developers, and DevOps engineers. If you want to run your Hadoop and/or Spark clusters on top of OpenStack, then this book is ideal for you. If you already have a running OpenStack infrastructure, this book can help you quickly extend it with Sahara.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Optionally, make sure to disable NetworkManager in your CentOS boxes."

A block of code is set as follows:

# nano worker_template_pp.json
{
  "name": "PP-Worker-Template",
  "flavor_id": "2",
  "plugin_name": "vanilla",
  "hadoop_version": "2.7.1",
  "node_processes": ["nodemanager", "datanode"],
  "auto_security_group": true
}
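
A template file like the one above is easy to get subtly wrong, for example with a stray space inside a key. The Python sketch below shows one way to catch such slips before handing the file to Sahara. The function name `check_template` is hypothetical, and treating every key of the sample as required is an assumption made for illustration, not a rule taken from Sahara.

```python
import json

# Keys copied from the sample template above. Treating all of them as
# required is an assumption for this sketch, not a documented Sahara rule.
EXPECTED_KEYS = {
    "name",
    "flavor_id",
    "plugin_name",
    "hadoop_version",
    "node_processes",
    "auto_security_group",
}


def check_template(text):
    """Parse template JSON and return a list of problems found."""
    template = json.loads(text)
    problems = []
    # Flag keys with leading or trailing whitespace, such as " name".
    for key in template:
        if key != key.strip():
            problems.append(f"key {key!r} has stray whitespace")
    # Flag expected keys that are absent.
    missing = EXPECTED_KEYS - set(template)
    problems.extend(f"missing key {key!r}" for key in sorted(missing))
    return problems


SAMPLE = """
{
  "name": "PP-Worker-Template",
  "flavor_id": "2",
  "plugin_name": "vanilla",
  "hadoop_version": "2.7.1",
  "node_processes": ["nodemanager", "datanode"],
  "auto_security_group": true
}
"""

if __name__ == "__main__":
    print(check_template(SAMPLE) or "template looks consistent")
```

Only the JSON body belongs in the text passed to `check_template`; the `# nano ...` line in the sample is the shell command that opens the file, not part of its content.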

Any command-line input or output is written as follows:

# export IMAGE_ID=49fa54c0-18c0-4292-aa61-fa1a56dbfd24
# sahara image-register --id $IMAGE_ID --username centos \
  --description 'Sahara image CentOS 7 Hadoop Vanilla 2.7.1'

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "The first window in the wizard exposes the Plugin name and its version."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.
