
Changing the data center operations


Here is the point where data processing changes data center operations.

From scale-up to scale-out

Across the industry, businesses are moving from specialized, proprietary, and typically expensive supercomputers to clusters of commodity machines connected by a low-cost network.

The Total Cost of Ownership (TCO) determines the fate, quality, and size of a data center. If the business is small, the data center should be small; as business demand changes, the data center grows or shrinks with it.

Currently, one common practice is to create a dedicated cluster for each technology: a Spark cluster, a Kafka cluster, a Storm cluster, a Cassandra cluster, and so on. As a result, the overall TCO tends to increase, as the sketch below illustrates.
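
As a rough, back-of-the-envelope illustration (every node count and price below is an invented assumption, chosen only to show the shape of the comparison), compare sizing each dedicated cluster for its own peak against one shared cluster that overlaps those peaks:

    // Hypothetical TCO comparison; all figures here are assumptions.
    // Each technology gets its own cluster, sized for its own peak load.
    val dedicatedNodes = Map("Spark" -> 20, "Kafka" -> 10, "Storm" -> 10, "Cassandra" -> 15)
    val costPerNodePerYear = 4000.0 // USD per node per year, assumed

    val dedicatedTco = dedicatedNodes.values.sum * costPerNodePerYear

    // A shared cluster (for example, one managed by Mesos) can overlap the
    // peaks, so it needs fewer nodes than the sum of the dedicated clusters.
    val sharedNodes = 35 // assumed: the individual peaks do not all coincide
    val sharedTco = sharedNodes * costPerNodePerYear

    println(f"Dedicated: $$$dedicatedTco%.0f/year, shared: $$$sharedTco%.0f/year")

Sharing infrastructure in this way is precisely the role Mesos plays in the SMACK stack, which is why consolidating clusters tends to pull the TCO back down.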

The open-source predominance

Modern organizations adopt open source to escape two old and annoying dependencies: vendor lock-in and relying on an external entity for bug fixing.

In the past, the rules were dictated by the classic large high-tech enterprises and monopolies. Today, the rules come from the people, for the people; transparency is ensured through community-defined APIs and bodies such as the Apache Software Foundation or the Eclipse Foundation, which provide the guidelines, infrastructure, and tooling for the sustainable and fair advancement of these technologies.

There is no such thing as a free lunch, though. In the past, large enterprises hired big vendors partly so that they had someone to blame and sue in case of failure. Modern businesses should instead accept that risk and invest in training their people in open technologies.

Data store diversification

The era of the dominant, omnipotent relational database is being challenged by the proliferation of NoSQL stores.

You have to deal with the consequences: determining the system of record, synchronizing the different stores, and selecting the right data store for each workload.

Data gravity and data locality

Data gravity refers to the overall cost associated with moving data, in terms of both volume and tooling; think, for example, of restoring hundreds of terabytes in a disaster recovery scenario, as estimated below.
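
To make that cost concrete, here is a minimal sketch; the data volume and link bandwidth are assumptions chosen for illustration, and the result is a best case that ignores protocol overhead and restore-side processing:

    // Hypothetical restore-time estimate: 100 TB over a 10 Gbit/s link.
    val volumeTerabytes = 100.0      // assumed backup size
    val linkGbps = 10.0              // assumed network bandwidth

    val volumeBits = volumeTerabytes * 1e12 * 8     // TB -> bits
    val seconds = volumeBits / (linkGbps * 1e9)     // bits / (bits per second)

    println(f"Best-case restore time: ${seconds / 3600}%.1f hours") // about 22 hours

Nearly a full day for a single restore, under ideal conditions, is why data exerts gravity: past a certain volume, it is cheaper to move everything else toward the data.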

Data locality is the idea of bringing the computation to the data rather than the data to the computation. As a rule of thumb, the more of your services run on the same nodes as the data they process, the better prepared you are; the sketch below shows the pattern in Spark.
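
As a brief Spark illustration (the HDFS path and host name are placeholders, not real endpoints), the filter and count tasks below are scheduled on the nodes that store the file's blocks, and only the small final count travels back to the driver:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("locality-sketch").getOrCreate()

    // Spark schedules these tasks on the nodes holding the HDFS blocks
    // (data locality) instead of copying the file to a central machine.
    val errors = spark.sparkContext
      .textFile("hdfs://namenode:8020/logs/app.log") // placeholder path
      .filter(_.contains("ERROR"))
      .count()

    println(s"Errors found: $errors")
    spark.stop()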

DevOps rules

DevOps refers to the best practices for collaboration between the software development and operational sides of a company.

The development team should test locally in the same environment that is used in production. Spark, for example, lets the same application go from local testing to cluster submission, as sketched below.
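
A minimal sketch of that workflow, assuming Spark 2.x (the application and host names are placeholders): the code itself never changes; only the --master flag passed to spark-submit does.

    import org.apache.spark.sql.SparkSession

    // Leave the master unset in code and supply it at launch time,
    // so the exact same jar runs locally and on the cluster.
    val spark = SparkSession.builder.appName("devops-sketch").getOrCreate()

    val total = spark.sparkContext.parallelize(1 to 1000).sum()
    println(s"Sum: $total")
    spark.stop()

    // Local testing:      spark-submit --master local[*] app.jar
    // Cluster submission: spark-submit --master mesos://master:5050 app.jar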

The trend is to containerize the entire production pipeline.