Preface
Architecting Data Intensive Applications explores the principles, capabilities, and patterns of a system that is architected and designed to handle a variety of workflows, such as reading, processing, writing, and analyzing data, from a variety of data sources that emit different volumes of data at a consistent pace. This book educates its readers about the various aspects of such a system, the pitfalls to avoid, and the use cases that point to the need for a system capable of handling large data.
It avoids comparison with Big Data systems. The reason is that, in the author's opinion, the phrase "Big Data" is already quite overloaded. How "Big" is really "Big" depends on the context in which the application is being built. Something that is "Big" for an organization with three employees that handles the Twitter feeds of 10,000 users may not be "Big" for Twitter, which handles millions of Twitter feeds every day. Therefore, this book tries to avoid any mention of, or comparison with, Big Data terminology.
Readers will find this book useful both as a technical guide and as a go-to book when they want to understand the aspects of dealing with data, such as data collection, data processing, data dissemination, and data governance. This book also contains example code in various places, mostly written in Java. All care has been taken to keep the examples simple and easy to understand, with sufficient description. Therefore, a working knowledge of Java is not mandatory, although it will speed up the process of grasping the concepts. Knowledge of OOP is essential, though.
Who this book is for
This book is for developers and data architects who have to code, test, deploy, and/or maintain large-scale, high-data-volume applications. It is also useful for system architects who need to understand the various non-functional aspects of data-intensive systems.
What this book covers
Chapter 1, Exploring the Data Ecosystem, will start with the data ecosystem and help you understand its characteristics. You will take a look at the 3Vs of the data ecosystem and discuss some data and information sharing standards and frameworks.
Chapter 2, Defining a Reference Architecture for Data-Intensive Systems, will give you an insight into a reference architecture for a data-intensive system and will then provide you with a variety of possible implementations of that architecture in different scenarios. You will also take a look at its architectural principles and capabilities.
Chapter 3, Patterns of the Data Intensive Architecture, will focus on various architectural patterns and discuss application and communication styles in detail. You will learn how to combine different application styles and dive deep into various architectural patterns, enabling you to understand the why as well as the how of a data-centric architecture.
Chapter 4, Discussing Data-Centric Architectures, will discuss the various reference architectures for a data-intensive system. This chapter will also look at the functional components that form the foundation of a distributed system and explain why the Lambda architecture is so popular with distributed systems. It will also provide an insight into the Kappa architecture, which is a simplified version of the Lambda architecture.
Chapter 5, Understanding Data Collection and Normalization Requirements and Techniques, will provide an in-depth design of a data collection system built from scratch, along with its requirements and techniques.
Chapter 6, Creating a Data Pipeline for Consistent Data Collection, Processing, and Dissemination, will help you learn how to design and implement a scalable and highly available data pipeline as part of your overall architecture. This chapter will also delve deeper into the different considerations of designing the data pipeline and take a look at various design patterns that will help you create a resilient data pipeline.
Chapter 7, Building a Robust and Fault-Tolerant Data Collection System, will focus on data collection systems that are available in the open source community, such as NiFi, a highly scalable and user-friendly system for defining data flows. It will also deal with Sqoop, which addresses the very specific use case of transferring data between HDFS and relational systems.
Chapter 8, Challenges of Data Processing, will act as a backbone for the chapters that follow. This chapter will discuss various challenges that an architect can face while creating a data processing system within their organization. You will learn how to enable the large-scale processing of data while keeping the overall system costs low, and how to keep the overall processing time within the defined SLA as the load on the processing system increases. You will also learn how to effectively consume the processed data.
Chapter 9, Let Us Process Data in Batches, will explore the creation of a batch processing system and the criteria necessary for designing one. This chapter will also discuss the Lambda architecture and its batch processing layer. Then, you will learn how distributed processing works and why Hadoop and MapReduce are the go-to technologies for implementing a batch processing system.
Chapter 10, Handling Streams of Data, will explore the concepts and capabilities of a streaming application and its association with the Lambda architecture. This chapter will also discuss the various sub-components of a stream-based system. You will take a look at the design considerations for a stream-based application and walk through the different components of a stream-based system in action.
Chapter 11, Let's Store the Data, will help you understand how to store a huge dataset. It will discuss HDFS and its storage formats, cover HBase, a columnar data store, and take a look at graph databases.
Chapter 12, When Data Dissemination is as Important as Data Itself, will explore how you can disseminate your data efficiently using indexing technologies and caching techniques. This chapter will also take a look at data governance and teach you how to design a dissemination architecture.
To get the most out of this book
- A working knowledge of Java is not mandatory, but it will help you grasp the concepts faster, because most of the example code in this book is written in Java.
- Knowledge of object-oriented programming (OOP) is essential.
Get in touch
Feedback from our readers is always welcome.
General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Reviews
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packtpub.com.