Scalable Data Architecture with Java

By: Sinchan Banerjee

Overview of this book

Java architectural patterns and tools help architects build reliable, scalable, and secure data engineering solutions that collect, manipulate, and publish data. This book will help you make the most of the available approaches to architecting data solutions, with clear and actionable advice from an expert. You’ll start with an overview of data architecture, exploring the responsibilities of a Java data architect and learning about various data formats, data storage options, databases, and data application platforms, as well as how to choose among them. Next, you’ll understand how to architect batch and real-time data processing pipelines. You’ll also get to grips with the various Java data processing patterns, before progressing to data security and governance. The later chapters will show you how to publish Data as a Service and how to architect it. Finally, you’ll focus on how to evaluate and recommend an architecture by developing performance benchmarks, estimations, and various decision metrics. By the end of this book, you’ll be able to successfully orchestrate data architecture solutions using Java and related technologies, and to evaluate and present the most suitable solution to your clients.
Table of Contents (19 chapters)
  • Section 1 – Foundation of Data Systems
  • Section 2 – Building Data Processing Pipelines
  • Section 3 – Enabling Data as a Service
  • Section 4 – Choosing Suitable Data Architecture

Designing the solution

To design the solution for the current problem statement, let’s analyze the data points or facts that are available to us right now:

  • The current problem is a batch-based data engineering problem
  • The problem at hand is a data ingestion problem
  • Our source is CSV files containing structured data
  • Our target is a PostgreSQL data warehouse
  • Our data warehouse follows a star schema, with one fact table, two dynamic dimension tables, and three static dimension tables
  • We should choose a technology that is independent of the deployment platform, since our solution may be migrated to the cloud in the future
  • For the context and scope of this book, we will explore optimal solutions built on Java-based technologies
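
To make these facts easier to reason about, the star schema can be modeled in a few lines of Java. The following is only a minimal sketch: the class name PipelinePlanner, the table names, and the rule that static dimension tables are loaded once and therefore need no recurring pipeline are all assumptions made for illustration, not details taken from the problem statement.

```java
import java.util.ArrayList;
import java.util.List;

public class PipelinePlanner {

    /** Kinds of tables in the star schema described in the list above. */
    public enum TableKind { FACT, DYNAMIC_DIMENSION, STATIC_DIMENSION }

    /** A warehouse table. The names used in exampleSchema() are placeholders. */
    public static final class WarehouseTable {
        public final String name;
        public final TableKind kind;

        public WarehouseTable(String name, TableKind kind) {
            this.name = name;
            this.kind = kind;
        }
    }

    /**
     * Assumption: static dimension tables are populated once, so only the
     * fact table and the dynamic dimension tables need a recurring pipeline.
     */
    public static List<WarehouseTable> tablesNeedingPipelines(List<WarehouseTable> schema) {
        List<WarehouseTable> result = new ArrayList<>();
        for (WarehouseTable table : schema) {
            if (table.kind != TableKind.STATIC_DIMENSION) {
                result.add(table);
            }
        }
        return result;
    }

    /** One fact table, two dynamic and three static dimensions (hypothetical names). */
    public static List<WarehouseTable> exampleSchema() {
        List<WarehouseTable> schema = new ArrayList<>();
        schema.add(new WarehouseTable("fact_table", TableKind.FACT));
        schema.add(new WarehouseTable("dynamic_dim_1", TableKind.DYNAMIC_DIMENSION));
        schema.add(new WarehouseTable("dynamic_dim_2", TableKind.DYNAMIC_DIMENSION));
        schema.add(new WarehouseTable("static_dim_1", TableKind.STATIC_DIMENSION));
        schema.add(new WarehouseTable("static_dim_2", TableKind.STATIC_DIMENSION));
        schema.add(new WarehouseTable("static_dim_3", TableKind.STATIC_DIMENSION));
        return schema;
    }

    public static void main(String[] args) {
        for (WarehouseTable table : tablesNeedingPipelines(exampleSchema())) {
            System.out.println(table.name + " (" + table.kind + ")");
        }
    }
}
```

Under that assumption, the filter leaves the fact table and the two dynamic dimension tables as the only candidates for recurring ingestion.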

Based on the preceding facts, we can conclude that we have to build three similar data ingestion pipelines – one for the fact table and two others for the dynamic dimension tables. At this point, we...
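
Because the three pipelines are similar, each one can be pictured as the same read-and-write loop pointed at a different CSV source and target table. The sketch below is a framework-free illustration of that loop, not the book's implementation: CsvChunkIngest and ChunkWriter are hypothetical names, and the parser assumes simple CSV with a header row and no quoted commas or embedded newlines. In a real pipeline, the writer could wrap a JDBC batch insert (PreparedStatement.addBatch() and executeBatch()) against PostgreSQL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CsvChunkIngest {

    /** Hypothetical sink for parsed rows, for example a JDBC batch insert. */
    public interface ChunkWriter {
        void write(List<String[]> rows);
    }

    /**
     * Skips the header line, splits each remaining line on commas, and hands
     * the rows to the writer in chunks of at most chunkSize.
     * Assumption: fields contain no quoted commas or embedded newlines.
     *
     * @return the number of data rows passed to the writer
     */
    public static int ingest(Iterable<String> lines, int chunkSize, ChunkWriter writer) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<String[]> chunk = new ArrayList<>(chunkSize);
        boolean headerSkipped = false;
        int total = 0;
        for (String line : lines) {
            if (!headerSkipped) {          // first line is the header row
                headerSkipped = true;
                continue;
            }
            if (line.trim().isEmpty()) {   // ignore blank lines
                continue;
            }
            chunk.add(line.split(",", -1)); // -1 keeps trailing empty fields
            total++;
            if (chunk.size() == chunkSize) {
                writer.write(chunk);
                chunk = new ArrayList<>(chunkSize);
            }
        }
        if (!chunk.isEmpty()) {            // flush the final partial chunk
            writer.write(chunk);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "id,name,amount", "1,alpha,10.5", "2,beta,20.0", "3,gamma,");
        List<Integer> chunkSizes = new ArrayList<>();
        int rows = ingest(sample, 2, chunk -> chunkSizes.add(chunk.size()));
        System.out.println(rows + " rows written in chunks of sizes " + chunkSizes);
    }
}
```

Writing in chunks keeps memory use bounded for large files and lets each chunk map onto one database round trip. Because the writer is an interface, the same loop can be reused for all three target tables.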