Architecting Data-Intensive Applications

By: Anuj Kumar

Overview of this book

Are you an architect or a developer who looks at your own applications gingerly while browsing through Facebook and silently applauding it for its data-intensive, yet fluent and efficient, behaviour? This book is your gateway to building smart data-intensive systems by incorporating the core data-intensive architectural principles, patterns, and techniques directly into your application architecture.

This book starts by taking you through the primary design challenges involved with architecting data-intensive applications. You will learn how to implement data curation and data dissemination, depending on the volume of your data. You will then implement your application architecture one step at a time. You will get to grips with implementing the correct message delivery protocols and creating a data layer that doesn’t fail under high traffic. This book will show you how you can divide your application into layers, each of which adheres to the single responsibility principle. By the end of this book, you will have learned to streamline your thoughts and make the right choice in terms of technologies and architectural principles based on the problem at hand.

Index

A

  • Airflow
    • about / AirFlow
    • features / AirFlow
  • Alternate Exchanges / Alternate Exchanges
  • AMQP, concepts
    • exchange / Event-Data Pipelines
    • exchange type / Event-Data Pipelines
    • message queue / Event-Data Pipelines
    • binding / Event-Data Pipelines
    • routing key / Event-Data Pipelines
    • broker / Event-Data Pipelines
    • channel / Event-Data Pipelines
    • Virtual Hosts / Event-Data Pipelines
  • Apache Atlas
    • about / Apache Atlas
    • high-level architecture / Apache Atlas high-level architecture
    • type system / Apache Atlas high-level architecture
    • Graph engine / Apache Atlas high-level architecture
    • ingest / Apache Atlas high-level architecture
    • export / Apache Atlas high-level architecture
  • Apache Falcon / Apache Falcon
  • Apache Flume
    • about / Apache Flume
    • event flow reliability / Flume event flow reliability
    • multi-agent flow / Flume multi-agent flow
    • multiplexer / Flow multiplexer
  • Apache HBase architecture
    • HMaster / Components of Apache HBase architecture
    • region server / Components of Apache HBase architecture
  • Apache Kafka
    • reference / Apache Kafka as an event bus
    • as event bus / Apache Kafka as an event bus
    • message persistence / Message persistence
    • Persistent Queue Design / Persistent Queue Design
    • message batch / Message batch
    • sendfile operation / Kafka and the sendfile operation
    • compression / Compression
    • properties / Compression
  • Apache Nifi
    • about / Apache Nifi
    • high-level use cases / Apache Nifi
    • components / Apache Nifi
  • Apache Sqoop
    • about / Apache Sqoop
    • use cases / Apache Sqoop
  • Apache YARN
    • about / Apache YARN
    • fundamentals, for configuration / Apache YARN
    • Resource Manager / Apache YARN
    • Scheduler / Apache YARN
    • Applications Manager / Apache YARN
    • Node Manager / Apache YARN
    • Application Master / Apache YARN
  • API Platform
    • about / API Platform
    • drawbacks / API Platform
    • benefits / API Platform
    • message-oriented application style / Message-oriented application style
    • Micro Services application styles / Micro Services application styles
    • Micro-Services Application Style, characteristics / Micro Services application styles
  • application styles
    • about / Application styles
    • combining / Combining different application styles
  • APTs (Advanced Persistent Threat Actors) / Elastic search and free text search queries
  • architectural assumptions
    • listing / Listing architectural assumptions
  • architectural capabilities
    • about / Architectural capabilities
    • logical layers / Architectural capabilities
    • UI capabilities / UI capabilities
    • service gateway/API gateway capabilities / Service gateway/API gateway capabilities
    • business service capabilities / Business service capabilities
    • data capabilities / Data partitioning
  • architectural patterns
    • about / Architectural patterns
    • retry pattern / The retry pattern
    • circuit breaker / The circuit breaker
    • throttling / Throttling
    • bulk heads / Bulk heads
    • Event-Sourcing / Event-sourcing
    • Command and Query Responsibility Segregation (CQRS) / Command and Query Responsibility Segregation
  • architectural principles
    • defining / Defining architectural principles, Principle 1, Principle 5, Principle 6, Principle 7
  • auto sharding / Horizontal scaling with automatic sharding of HBase tables
  • AWS (Amazon Web Services) / Data dissemination architecture in a threat intel sharing system
  • AWS API gateway / AWS API gateway
  • AWS Lambda / AWS Lambda
  • Azkaban
    • about / Azkaban
    • components / Azkaban
    • features / Azkaban

B

  • B-Tree / Relational Database Management Systems and Big data
  • Balanced Trees / Relational Database Management Systems and Big data
  • bandwidth / Relational Database Management Systems and Big data
  • basic shell component / Basic shell component
  • batch layer components
    • about / Batch layer components and subcomponents
    • read/extract component / Read/extract component
    • normalizer component / Normalizer component
    • Validation component / Validation component
    • processing component / Processing component
    • writer/formatter component / Writer/formatter component
  • batch processing
    • about / What do we mean by batch processing
    • defining / What do we mean by batch processing
    • principles / What do we mean by batch processing
    • and Lambda architecture / Lambda architecture and batch processing
  • beats
    • FileBeat / Beats
    • MetricBeat / Beats
    • PacketBeat / Beats
    • WinlogBeat / Beats
    • HeartBeat / Beats
  • Big data / Relational Database Management Systems and Big data
  • BITES
    • about / BITES – Unstructured/Semistructured document store
    • structured data extraction / Structured data extraction
    • text extraction / Text extraction
    • document queries / Document queries
    • highly-available clusters / Highly-available clusters
    • guarantees / Guarantees
    • scaling up / Scaling up
    • integration, with SPARQL / Integration with SPARQL
    • data formats / Data Formats
  • business service capabilities, architectural capabilities
    • about / Business service capabilities
    • microservices / Microservices
    • messaging / Messaging
    • distributed (batch/stream) processing / Distributed (batch/stream) processing

C

  • circuit breaker
    • about / The circuit breaker
    • closed state / The circuit breaker
    • open state / The circuit breaker
    • half-open / The circuit breaker
  • clustering
    • about / Clustering, Clustering and Network Partitions
    • mirrored queues / Mirrored queues
    • Persistent Messages / Persistent Messages
    • data manipulation / Data Manipulation and Security
    • security / Data Manipulation and Security
    • use cases / Use Case 1, Use Case 2
    • exchanges / Exchanges
    • guidelines, for selecting exchange type / Guidelines on choosing the right Exchange Type
    • headers, versus Topic exchanges / Headers versus Topic Exchanges
    • routing / Routing
  • communication protocol / Communication protocol
  • communication style
    • about / Communication styles
    • synchronous / Communication styles
    • asynchronous / Communication styles
    • Event-Driven / Communication styles
    • reactive / Communication styles
  • components, Apache Nifi
    • web server / Apache Nifi
    • flow controller / Apache Nifi
    • extensions / Apache Nifi
    • flow file repository / Apache Nifi
    • content repository / Apache Nifi
    • provenance repository / Apache Nifi
  • Consistency, Availability, and Partition Tolerance (CAP) theorem / Desired properties of a data-intensive system
  • coordination service
    • about / Coordination service
    • characteristics / Coordination service
    • use cases / Coordination service
  • customer premises equipment (CPEs) / Data insight

D

  • data
    • about / Making sense of the data
    • processing / What is data processing?
  • Data-Collection System
    • requisites / Data collection system requirements
    • architecture principles / Data collection system architecture principles
    • high-level component architecture / High-level component architecture
    • high-level architecture / High-level architecture
    • architecture technology mapping / Architecture technology mapping
  • data-intensive system, properties
    • robust and fault-tolerant / Desired properties of a data-intensive system
    • low latency reads and updates / Desired properties of a data-intensive system
    • scalable / Desired properties of a data-intensive system
    • general / Desired properties of a data-intensive system
    • extensible / Desired properties of a data-intensive system
    • ad-hoc queries, allowing / Desired properties of a data-intensive system
    • minimal maintenance / Desired properties of a data-intensive system
    • CAP theorem / Desired properties of a data-intensive system
  • data capabilities, architectural capabilities
    • data partitioning / Data partitioning
    • data replication / Data replication
  • Data Collector/Normalizer / Threat intel share – backend
  • data dissemination
    • about / Data dissemination
    • considerations, for defining architecture / Data dissemination
    • communication protocol / Communication protocol
    • target audience / Target audience
    • use case / Use case
    • response schema / Response schema
    • communication channel / Communication channel
    • in threat intel sharing system / Data dissemination architecture in a threat intel sharing system
    • threat intel share backend architecture / Threat intel share – backend
    • threat intel share frontend architecture / Threat intel share – frontend
    • AWS Lambda / AWS Lambda
    • AWS API gateway / AWS API gateway
    • cache population / Cache population
    • cache eviction / Cache eviction
    • non-functional aspects / Discussing the non-functional aspects of the preceding architecture
    • non-functional use cases / Non-functional use cases for dissemination architecture
  • data ecosystem
    • about / What is a data ecosystem?, What constitutes a data ecosystem?
    • interconnected data /
    • environment / Data environment
    • data sharing / Data sharing
  • Data Enricher / Threat intel share – backend
  • data explosion problem / The data explosion problem
  • data ingestion
    • batch ingestion / Data ingest
    • stream ingestion / Data ingest
  • data integrity, Stardog
    • strict parsing of RDF / Strict parsing of RDF
    • Integrity Constraint Validation / Integrity Constraint Validation
  • data lineage
    • about / Data lineage
    • Apache Atlas / Apache Atlas
    • Apache Falcon / Apache Falcon
  • DataNode / DataNode
  • data nodes / High-level architecture of HDFS
  • data partitioning
    • about / Distributed storage, Data partitioning
    • range-based partitioning / Range-based partitioning
    • hash-based partitioning / Hash-based partitioning
  • data pipeline / Data pipeline
  • data processing design
    • challenges / The 3 + 1 Vs and how they affect choice in data processing design
  • data quality / Data quality
  • data replication / Distributed storage
  • data sharing
    • about / Data sharing
    • traffic light protocol / Traffic light protocol
  • data sources
    • types / Types of data sources
    • transactional data / Types of data sources
    • User Data and Personnel Data / Types of data sources
    • social and demographic data / Types of data sources
    • about / Types of data sources
    • publicly-available data / Types of data sources
  • Dead-Letter Exchanges
    • URL / Dead-Letter Exchanges
    • about / Dead-Letter Exchanges
  • Dependency Hell
    • reference / Micro Services application styles
  • Direct Exchange type / Dead-Letter Exchanges
  • Distributed Configuration-Management Module
    • configuration-management service / An introduction to ETCD
    • configuration-management client / An introduction to ETCD
  • distributed data
    • centralized collection / Centralized collection of distributed data
  • distributed filesystems / Hadoop Distributed Filesystem
  • distributed processing
    • about / Distributed processing, Distributed processing
    • capabilities / Distributed processing, Distributed processing
  • distributed storage
    • about / Distributed storage
    • data partitioning / Distributed storage

E

  • elastic search / Elastic search and free text search queries
  • ElasticSearch-Logstash-Kibana (ELK)
    • about / ELK
    • beats / Beats
    • load balancing / Load-balancing
    • Logstash / Logstash
    • back pressure / Back pressure
    • high-availability / High-availability
  • Enterprise Service Bus (ESB) / Query-Data pipelines
  • ETCD
    • about / An introduction to ETCD
    • high-level capabilities / An introduction to ETCD
    • scheduler / Scheduler
    • Micro Service, designing / Designing the Micro Service
  • Event-Data Pipelines
    • about / Event-Data Pipelines
    • topologies / Topology 1, Topology 2, Topology 3
    • resilience / Resilience
    • high-availability / High-availability
    • clustering / Clustering
  • event-sourcing / Reliable messaging
  • event streams / Architectural concepts
  • executor component / Scheduler/executor component

F

  • Flume Deployment Topology
    • reference / Apache Flume
  • formation management reference architecture, Oracle
    • business view / Reference architecture – business view
  • formatter component / Writer/formatter component

G

  • General Data Protection Regulation (GDPR) / Data lineage
  • graph store
    • use case / Background of the use case
    • solution discussion / Solution discussion
    • bank fraud data model / Bank fraud data model (as can be designed in a property graph data store such as Neo4J)

H

  • Hadoop / What are Hadoop and HDFS, Introducing Hadoop, the Big Elephant
  • Hadoop Distributed File System (HDFS)
    • about / What are Hadoop and HDFS, The data explosion problem, Introducing Hadoop, the Big Elephant, Hadoop Distributed Filesystem
    • NameNode / NameNode
    • DataNode / DataNode
    • MapReduce / MapReduce
    • architecture principles / HDFS architecture principles (and assumptions)
    • high-level architecture / High-level architecture of HDFS
    • file formats / HDFS file formats
  • Hadoop MapReduce / Introducing Hadoop, the Big Elephant
  • Hadoop YARN / Introducing Hadoop, the Big Elephant
  • hash partitioning / Hash-based partitioning
  • HBase
    • about / HBase
    • basics / Understanding the basics of HBase
    • data model / HBase data model
    • architecture / HBase architecture
    • horizontal scaling, with automatic sharding of tables / Horizontal scaling with automatic sharding of HBase tables
    • region assignment / HMaster, region assignment, and balancing
    • HMaster / HMaster, region assignment, and balancing
    • balancing / HMaster, region assignment, and balancing
  • HBase cluster
    • performance tips / Tips for improved performance from your HBase cluster
  • HDFS file formats / HDFS file formats
  • High-Availability, Data Bus
    • about / High-availability
    • availability chart / Availability Chart
  • high-level architecture, Data-Collection System
    • about / High-level architecture
    • service gateway / Service gateway
    • discovery server / Discovery server
  • high-level reference architecture / High-level reference architecture

I

  • ICV constraint validations
    • examples / Integrity Constraint Validation
  • information-exchange, between nodes in DAG
    • dumb exchange / Data pipeline
    • smart exchange / Data pipeline
  • information management conceptual reference architecture, Oracle
    • about / Oracle's information management conceptual reference architecture
    • conceptual view / Conceptual view
    • event engine / Conceptual view
    • data reservoir / Conceptual view
    • data factory / Conceptual view
    • enterprise information store / Conceptual view
    • reporting / Conceptual view
    • discovery lab / Conceptual view
  • information management reference architecture, Oracle
    • about / Oracle's information management reference architecture
    • data process view / Data process view
    • use case examples / Real-life use case examples

J

  • Job / Architectural concepts
  • Job-Execution Context / Architectural concepts

K

  • Kafka streams
    • about / Kafka streams
    • features / Kafka streams
    • processing topology / Stream processing topology
  • Kappa architecture
    • about / Kappa architecture
    • No-Sql data stores, comparing / A brief comparison of different leading No-Sql data stores

L

  • Lambda architecture
    • about / Lambda architecture
    • data immutability / Lambda architecture
    • batch layer / Lambda architecture
    • serving layer / Lambda architecture
    • speed layer / Lambda architecture, Lambda architecture's speed layer
    • and batch processing / Lambda architecture and batch processing
  • Lambdas / Non-functional use cases for dissemination architecture
  • Listeners
    • Execution Listener / Designing the Micro Service
    • Execution State Listener / Designing the Micro Service
  • locking strategies
    • optimistic locking / Processing strategy
    • pessimistic locking / Processing strategy
  • Luigi
    • about / Luigi
    • features / Luigi

M

  • MapReduce framework
    • reference / DataNode
    • about / MapReduce
  • message-oriented application style / Message-oriented application style
  • micro-batch stream processing / Micro-batch stream processing
  • Micro Service, ETCD
    • components / Designing the Micro Service
    • scheduling / Designing the Micro Service
    • task, executing / Designing the Micro Service
    • Pagination Use Case, implementing / Designing the Micro Service
  • Micro Services application styles / Micro Services application styles
  • mirrored-queue / Mirrored queues
  • multi-processing / Processing strategy

N

  • NameNode
    • about / NameNode, High-level architecture of HDFS
    • reference / DataNode
  • network partitions / Clustering and Network Partitions
  • non-functional aspects, data dissemination
    • use cases / Non-functional use cases for dissemination architecture
    • elastic search / Elastic search and free text search queries
  • normalizer component / Normalizer component
  • notions of time, in streams
    • event time / Notion of time in stream processing
    • processing time / Notion of time in stream processing
    • ingestion time / Notion of time in stream processing

O

  • Oozie
    • about / Oozie
    • features / Oozie
  • optimistic locking / Processing strategy

P

  • parallel processing / Processing strategy
  • partitioning strategy
    • caveats / Hash-based partitioning
  • Persistent Messages / Persistent Messages
  • pessimistic locking / Processing strategy
  • processing application
    • performing / How to perform the processing
    • location / Where to perform the processing
    • data quality / Quality of data
    • networks / Networks are everywhere
    • effective consumption of data / Effective consumption of the data
  • processing component / Processing component
  • processing guarantees
    • about / Processing guarantees
    • exactly-once guarantee / Processing guarantees
    • at-least-once guarantee / Processing guarantees
    • at-most-once guarantee / Processing guarantees
  • processing strategy / Processing strategy

Q

  • Quartz / Architecture technology mapping
  • Quartz Scheduler
    • components / Scheduler
    • reference / Scheduler
  • Query-Data Pipelines / Query-Data pipelines

R

  • range-based partitioning / Range-based partitioning
  • read component / Read/extract component
  • reference architecture
    • about / What is a reference architecture?
    • problem statement / Problem statement
    • for data-intensive system / Reference architecture for a data-intensive system
  • reference architecture, for data-intensive system
    • about / Reference architecture for a data-intensive system
    • component view / Component view
    • data ingestion / Data ingest
    • data preparation / Data preparation
    • data, processing / Data processing
    • workflow management / Workflow management
    • data, accessing / Data access
    • data insight / Data insight
    • data governance / Data governance
    • data pipeline / Data pipeline
  • regional API endpoint / Non-functional use cases for dissemination architecture
  • Relational Database Management System / Relational Database Management Systems and Big data
  • Reliability guarantees
    • at-least-once delivery / Reliable messaging
    • at-most-once delivery / Reliable messaging
    • exactly-once delivery / Reliable messaging
  • reliable messaging / Reliable messaging
  • resources
    • sharing, among processing applications / Sharing resources among processing applications
  • retry pattern
    • about / The retry pattern
    • considerations / The retry pattern
  • routing
    • about / Routing
    • Header-Based Content Routing / Header-Based Content Routing
    • Topic-Based Content Routing / Topic-Based Content Routing

S

  • Samza
    • stream processing API / Samza's stream processing API
  • Samza architecture
    • about / Samza architecture
    • concepts / Architectural concepts
    • event-streaming layer / Event-streaming layer
  • scheduler component / Scheduler/executor component
  • seeking / Relational Database Management Systems and Big data
  • semantic graph
    • about / Semantic graph
    • linked data / Linked data
    • vocabularies / Vocabularies
    • Semantic Query Language / Semantic Query Language
    • inference / Inference
  • service gateway/API gateway capabilities, architectural capabilities
    • about / Service gateway/API gateway capabilities
    • security / Security
    • traffic control / Traffic control
    • mediation / Mediation
    • caching / Caching
    • routing / Routing
    • service orchestration / Service orchestration
  • session window / Types of windows
  • sink processor / Stream processing topology
  • sliding window / Types of windows
  • Solid State Drive (SSD)
    • reference / The data explosion problem
  • source processor / Stream processing topology
  • SPARQL
    • reference / Semantic Query Language
  • Stardog
    • about / Stardog
    • GraphQL queries / GraphQL queries
    • Gremlin / Gremlin
    • Virtual Graphs / Virtual Graphs – a Unifying DAO
    • structured data / Structured data
    • CVs / CVS
    • constraints, validating / Data integrity and validating constraints
    • data integrity / Data integrity and validating constraints
    • monitoring and operation / Monitoring and operation
    • performance / Performance
    • reference / Performance
  • strategies, for loading configuration properties
    • fallback strategy / An introduction to ETCD
    • local only / An introduction to ETCD
    • remote only / An introduction to ETCD
  • stream / Stream processing topology
  • streaming application
    • real time views, computing / Computing real time views
  • streaming architecture
    • scheduler/executor component / The scheduler/executor component of the streaming architecture
  • streaming system / What is a streaming system?
  • stream partition / Architectural concepts
  • stream processing
    • notions of time / Notion of time in stream processing
  • stream processing application / Stream processing topology
  • stream processor / Stream processing topology

T

  • target audience / Target audience
  • Task / Architectural concepts
  • TaskTracker
    • reference / DataNode
  • Technopedia
    • reference / Content mashup
  • threat intel share backend architecture
    • about / Threat intel share – backend
    • RT query processor / RT query processor
    • view builder component / View builder
  • threat intel share frontend architecture / Threat intel share – frontend
  • throttling
    • strategies / Throttling
  • top-level objects
    • indicator / Target audience
    • vulnerability / Target audience
    • campaign / Target audience
    • threat actor / Target audience
  • Topic Exchange type / Alternate Exchanges
  • traffic light protocol / Traffic light protocol
  • tumbling window / Types of windows
  • types, data
    • structured data / What constitutes a data ecosystem?
    • semi-structured data / What constitutes a data ecosystem?
    • unstructured data / What constitutes a data ecosystem?

U

  • UI capabilities, architectural capabilities
    • about / UI capabilities
    • content mashup / Content mashup
    • multi-channel support / Multi-channel support
    • user workflow / User workflow
    • AR/VR support / AR/VR support
  • unstructured data / HDFS file formats
  • use case
    • scenario / Scenario
  • use case examples, information management reference architecture
    • machine learning use case /
    • data enrichment use case / Data enrichment use case
    • extract transform load use case / Extract transform load use case

V

  • 3 + 1 Vs
    • about / The 3 + 1 Vs and how they affect choice in data processing design
    • cost associated with latency / Cost associated with latency
    • classic way of doing things / Classic way of doing things
  • validation component / Validation component

W

  • windowing / Windowing
  • windows
    • sliding window / Types of windows
    • tumbling window / Types of windows
    • session window / Types of windows
  • writer/formatter component
    • basic shell component / Basic shell component
    • scheduler/executor component / Scheduler/executor component
  • writer component / Writer/formatter component

Y

  • YARN (Yet Another Resource Negotiator) / Samza's stream processing API