Fast Data Processing Systems with SMACK Stack

By: Raúl Estrada

Overview of this book

SMACK is an open source full stack for big data architecture. It is a combination of Spark, Mesos, Akka, Cassandra, and Kafka. Developers have begun to use this stack to tackle critical real-time analytics for big data. This highly practical guide will teach you how to integrate these technologies to create a highly efficient data analysis system for fast data processing. We’ll start off with an introduction to SMACK and show you when to use it. First you’ll get to grips with functional thinking and problem solving using Scala. Next you’ll come to understand the Akka architecture. Then you’ll learn how to improve the data structure architecture and optimize resources using Apache Spark. Moving forward, you’ll learn how to achieve linear scalability in databases with Apache Cassandra. You’ll get to grips with high-throughput distributed messaging systems using Apache Kafka. We’ll show you how to build a cheap but effective cluster infrastructure with Apache Mesos. Finally, you will take a deep dive into the different aspects of SMACK using a few case studies. By the end of the book, you will be able to integrate all the components of the SMACK stack and use them together to achieve highly effective and fast data processing.

Modern data-processing challenges


We can enumerate four modern data-processing problems as follows:

  • Size matters: In modern times, data is getting bigger or, more accurately, the number of available data sources is increasing. In the previous decade, we could precisely identify our company's internal data sources: Customer Relationship Management (CRM), Point of Sale (POS), Enterprise Resource Planning (ERP), Supply Chain Management (SCM), and all our databases and legacy systems. The rule was easy: a system that is not internal is external. Today, the rule is exactly the same, except that not only do the data sources multiply over time, the amount of information flowing from external systems is also growing at almost exponential rates. New data sources include social networks, banking systems, stock systems, tracking and geolocation systems, monitoring systems, sensors, and the Internet of Things; if a company's architecture is incapable of handling these use cases, then it can't respond to upcoming challenges.
  • Sample data: Obtaining a sample of production data is becoming more difficult. In the past, data analysts could have a fresh copy of production data on their desks almost daily. Today, this is increasingly difficult, either because of the amount of data to be moved or because of its expiration date; in many modern business models, data from an hour ago is practically obsolete.
  • Data validity: An analysis becomes obsolete faster. Assuming that the fresh-copy problem is solved, how often is new data needed? Looking for a trend in the last year is different from looking for one in the last few hours. If samples from a year ago are needed, at what frequency should they be taken? Many modern businesses don't even have this information or, worse, they have it but only store it without ever using it.
  • Data Return on Investment (ROI): Data analysis becomes too slow to yield any return on investment from the information. Now, suppose you have solved the problems of sample data and data validity. The challenge is to analyze the information in a timely manner so that all our efforts produce a profitable return. Many companies invest in data, but never manage to turn the analysis into increased income.

We can enumerate modern data needs as follows:

  • Scalable infrastructure: Companies constantly have to weigh the time and money spent. Scalability in a data center means that the center should grow in proportion to business growth. Vertical scalability involves adding more layers of processing. Horizontal scalability means that once a layer faces more demand and requires more infrastructure, hardware can be added so that processing needs are met. One modern requirement is horizontal scaling on low-cost hardware.
  • Geographically dispersed data centers: Geographically centralized data centers are being displaced. This is because companies need to have multiple data centers in multiple locations for several reasons: cost, ease of administration, or access to users. This implies a huge challenge for data center management. On the other hand, data center unification is a complex task.
  • Allow data volumes to be scaled as the business needs: The volume of data must scale dynamically according to business demands. Just as demand can peak at a certain time of day, it can also peak in certain geographic regions. Scaling should be possible dynamically in both time and space, especially horizontally.
  • Faster processing: Today, being able to work in real time is fundamental. We live in an age where data freshness often matters more than the amount or size of data. If the data is not processed fast enough, it quickly becomes stale. Fresh information not only needs to be obtained quickly, it also has to be processed quickly.
  • Complex processing: In the past, data was smaller and simpler. Raw data doesn't help us much. The information must be processed efficiently by several layers. The first layers are usually purely technical and the last layers mainly business-oriented. Processing complexity can kill off the best business ideas.
  • Constant data flow: For cost reasons, the number of data warehouses is decreasing. The era when data warehouses served just to store data is ending; today, no one can afford a data warehouse that only stores information, because such warehouses are becoming very expensive and add little value. The business trend is towards flows or streams of data. Data no longer stagnates; it moves like a large river. Analyzing these large torrents of information is one of the objectives of modern businesses.
  • Visible, reproducible analysis: If we cannot reproduce phenomena, we cannot call ourselves scientists. Modern data science requires producing reports and graphs in real time in order to make timely decisions. The aim of data science is to make effective predictions based on observation. The process should be visible and reproducible.
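
To make the ideas of constant data flow and data freshness more concrete, here is a minimal sketch of a sliding-window average over a stream of events. It is only an illustration in plain Python, not code from the SMACK stack itself: the Event class, its fields, and the sliding_window_average function are hypothetical names introduced for this example, and the sketch assumes that events arrive in timestamp order.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Event:
    """A single measurement in the stream (hypothetical shape for this sketch)."""
    timestamp: float  # seconds since an arbitrary starting point
    value: float      # the measurement carried by the event


def sliding_window_average(
    events: Iterable[Event], window_seconds: float
) -> Iterator[Tuple[float, float]]:
    """Yield (timestamp, average) after each event.

    The average covers only events newer than `window_seconds` relative to
    the latest event. Assumes events arrive in timestamp order.
    """
    window: Deque[Event] = deque()
    running_sum = 0.0
    for event in events:
        window.append(event)
        running_sum += event.value
        # Evict stale events: anything at or before the window's lower bound.
        cutoff = event.timestamp - window_seconds
        while window[0].timestamp <= cutoff:
            stale = window.popleft()
            running_sum -= stale.value
        # The newest event is always inside the window, so it is never empty.
        yield event.timestamp, running_sum / len(window)


if __name__ == "__main__":
    stream = [Event(0.0, 10.0), Event(1.0, 20.0), Event(2.0, 30.0), Event(12.0, 40.0)]
    for timestamp, average in sliding_window_average(stream, window_seconds=10.0):
        print(timestamp, average)
```

The function never stores the whole stream: each event is processed as it arrives and is discarded once it falls out of the window, which is the difference between analyzing a flow and analyzing a warehouse. In the example stream, the last event arrives ten seconds after the third one, so all earlier events have expired and the average reflects only the fresh value.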