Index
A
- abstract syntax tree (AST) / SparkSQL and DataFrames
- aggregate functions
- about / Aggregate functions
- count / count
- first / first
- last / last
- approx_count_distinct / approx_count_distinct
- min / min
- max / max
- avg / avg
- sum / sum
- kurtosis / kurtosis
- skewness / skewness
- variance / Variance
- standard deviation / Standard deviation
- covariance / Covariance
- groupBy / groupBy
- rollup / Rollup
- cube / Cube
- aggregation patterns
- about / Aggregation patterns
- average temperature by city / Average temperature by city
- aggregations
- about / Aggregations
- aggregate functions / Aggregate functions
- window functions / Window functions
- ntiles / ntiles
- Amazon DynamoDB / Amazon DynamoDB
- Amazon DynamoDB Encryption at Rest
- reference link / Amazon DynamoDB
- Amazon EC2 Auto Scaling / Elastic web-scale computing
- Amazon EC2 instances
- reference link / Instances
- Amazon Elastic Block Store (Amazon EBS)
- about / Amazon Elastic Block Store
- reference link / Amazon Elastic Block Store
- Amazon Elastic Compute Cloud (Amazon EC2)
- about / Amazon Elastic Compute Cloud
- regions / Regions and availability zones, Regions
- availability zones / Regions and availability zones
- availability zone, selecting / Availability zones
- and Amazon Virtual Private Cloud / Amazon EC2 and Amazon Virtual Private Cloud
- instance store / Amazon EC2 instance store
- Amazon EMR cluster
- about / Amazon EMR
- creating / Practical AWS EMR cluster
- Amazon Machine Image (AMI)
- about / Instances and Amazon Machine Images, AMIs
- instances, launching / Launching multiple instances of an AMI
- instance types / Instance types
- Amazon Macie / Comprehensive security and compliance capabilities
- Amazon Redshift Spectrum / Query in place
- Amazon Relational Database Service (Amazon RDS) / Integration
- Amazon S3
- about / Introduction to Amazon S3, Getting started with Amazon S3
- reference link / Most supported platform with the largest ecosystem
- Amazon S3 Transfer Acceleration / Easy and flexible data transfer
- Amazon Simple Storage Service (Amazon S3) / Integration
- Amazon Virtual Private Cloud (Amazon VPC)
- about / Integration, Amazon EC2 and Amazon Virtual Private Cloud
- documentation link / Amazon EC2 and Amazon Virtual Private Cloud
- Amazon Web Services (AWS)
- about / AMIs
- region / Region and availability zone concepts
- availability zone / Region and availability zone concepts
- available regions / Available regions
- regions and endpoints / Regions and endpoints
- Anaconda
- installing / Installing Anaconda
- download link / Installing Anaconda
- using / Using Conda
- Apache Flink
- about / Introduction to Apache Flink
- continuous processing, for unbounded datasets / Continuous processing for unbounded datasets
- bounded dataset / Flink, the streaming model, and bounded datasets
- streaming model / Flink, the streaming model, and bounded datasets
- installing / Installing Flink, Installing Flink
- downloading / Downloading Flink
- local cluster, starting / Starting a local Flink cluster
- Apache Hadoop
- used, for distributed computing / Distributed computing using Apache Hadoop
- Apache Kafka / Continuous processing for unbounded datasets
- Apache Spark
- about / Apache Spark
- stack / Apache Spark
- at-least-once processing paradigm / At-least-once processing
- at-most-once processing paradigm / At-most-once processing
- average temperature by city
- count, recording / Record count
- min/max/count / Min/max/count
- average/median/standard deviation / Average/median/standard deviation
- AWS Auto Scaling / Elastic web-scale computing
- AWS Cloud Security
- reference link / Comprehensive security and compliance capabilities
- AWS CloudTrail / Comprehensive security and compliance capabilities
- AWS CodeBuild / What is AWS Lambda?
- AWS CodePipeline / What is AWS Lambda?
- AWS Compliance
- reference link / Comprehensive security and compliance capabilities
- AWS data archiving
- reference link / Data archiving
- AWS data lakes and big data analytics
- reference link / Data lakes and big data analytics
- AWS disaster recovery
- reference link / Disaster recovery
- AWS Glue
- about / AWS Glue
- using / When should I use AWS Glue?
- AWS Glue Data Catalog / AWS Glue
- AWS hybrid Cloud storage
- reference link / Hybrid Cloud storage
- AWS Lambda
- about / What is AWS Lambda?
- reasons, for using / When should I use AWS Lambda?
- AWS Snowball Edge / Easy and flexible data transfer
- AWS Storage Gateway / Easy and flexible data transfer
B
- bar chart / Bar chart
- batch analytics
- about / Batch analytics
- file, reading / Reading file
- transformations / Transformations
- groupBy operation, using / GroupBy
- aggregation operation / Aggregation
- joins / Joins
- big data
- about / Introduction to big data
- variety / Variety of data
- velocity / Velocity of data
- volume / Volume of data
- veracity / Veracity of data
- variability / Variability of data
- visualization / Visualization
- value / Value
- big data visualization tools
- about / Big data visualization tools
- IBM Cognos Analytics / Big data visualization tools
- Microsoft PowerBI / Big data visualization tools
- Oracle Visual Analyzer / Big data visualization tools
- SAP Lumira / Big data visualization tools
- SAS Visual Analytics / Big data visualization tools
- Tableau Desktop / Big data visualization tools
- TIBCO Spotfire / Big data visualization tools
- binaries, Hive
- downloading / Downloading and extracting the Hive binaries
- extracting / Downloading and extracting the Hive binaries
- broadcast join / Broadcast join
- built-in functions, Hive / Built-in functions
- built-in operators, Hive
- relational operators / Built-in operators
- arithmetic operators / Built-in operators
- logical operators / Built-in operators
- business intelligence (BI) / Inside the data analytics process
C
- Cassandra connector
- reference / Cassandra connector
- about / Cassandra connector
- sinking with / Cassandra connector
- changes, Hadoop
- shell script rewrite / Shell script rewrite
- shaded-client JARs / Shaded-client JARs
- changes, Hadoop 3
- about / Other changes
- minimum required Java version / Minimum required Java version
- characteristics, Cloud
- on-demand usage / On-demand usage
- ubiquitous access / Ubiquitous access
- multi-tenancy (and resource pooling) / Multi-tenancy (and resource pooling)
- elasticity / Elasticity
- measured usage / Measured usage
- resiliency / Resiliency
- charts
- about / Chart types
- line charts / Line charts
- pie chart / Pie chart
- bar charts / Bar chart
- heat map / Heat map
- checkpointing
- about / Checkpointing
- metadata checkpointing / Metadata checkpointing
- data checkpointing / Data checkpointing
- CLI (Command Line Interface) / Easy to start
- Cloud
- concepts / Concepts and terminology
- about / Cloud
- increased scalability / Increased scalability
- increased availability and reliability / Increased availability and reliability
- risks and challenges / Risks and challenges
- characteristics / Cloud characteristics
- cloud consumer / Cloud consumers and Cloud providers
- Cloud consumers
- about / Cloud consumers and Cloud providers, Cloud consumer
- benefits / Goals and benefits
- Cloud data migration
- reference link / Easy and flexible data transfer
- Cloud delivery models
- combining / Combining Cloud delivery models, IaaS + PaaS, IaaS + PaaS + SaaS
- cloud provider / Cloud consumers and Cloud providers
- Cloud resource administrator
- about / Cloud resource administrator
- organizational boundary / Organizational boundary
- trust boundary / Trust boundary
- Cloud service owner / Cloud service owner
- collection-based sources
- reading / Collection-based
- comma-separated values (CSV) / Implicit schema
- command-line tools / Easy to start
- community Clouds / Community Clouds
- confirmatory data analysis (CDA) / Introduction to data analytics
- connectors
- Kafka connector / Kafka connector
- Twitter connector / Twitter connector
- RabbitMQ connector / RabbitMQ connector
- Elasticsearch connector / Elasticsearch connector
- Cassandra connector / Cassandra connector
- containers
- guaranteed container / Types of container execution
- opportunistic containers / Types of container execution
- Cross-Region Replication (CRR) / Disaster recovery
- cross join / Cross join
D
- data
- analyzing / Data analysis
- data analytics
- performing / Data analytics
- data analytics process
- about / Introduction to data analytics
- exploring / Inside the data analytics process
- data checkpointing / Data checkpointing
- DataFrame
- about / SparkSQL and DataFrames
- creating / DataFrame APIs and the SQL API
- API / DataFrame APIs and the SQL API
- pivots / Pivots
- filters / Filters
- data processing
- about / Data processing using the DataStream API
- execution environment / Execution environment
- data sources / Data sources
- data transformations / Transformations
- windowAll function / windowAll
- union function / union
- Window join / Window join
- split function / split
- select function / Select
- project function / Project
- physical partitioning / Physical partitioning
- rescaling / Rescaling
- broadcasting / Broadcasting
- event time / Event time and watermarks
- time / Event time and watermarks
- ingestion time / Event time and watermarks
- connectors / Connectors
- datasets
- loading / Loading datasets
- unbounded / Continuous processing for unbounded datasets
- bounded / Continuous processing for unbounded datasets
- data sources
- about / Data sources
- socket-based data sourcing / Socket-based
- file-based data sourcing / File-based
- data steward role / Inside the data analytics process
- DataStream API
- used, for data processing / Data processing using the DataStream API
- reference / Data sources
- data transformation
- data visualization
- Python, using / Using Python to visualize data
- R, using / Using R to visualize data
- delivery models, Cloud
- about / Cloud delivery models
- Infrastructure as a Service (IaaS) / Infrastructure as a Service
- Platform as a Service (PaaS) / Platform as a Service
- Software as a Service (SaaS) / Software as a Service
- Deploying Lambda-based Applications
- reference link / What is AWS Lambda?
- deployment models, Cloud
- about / Cloud deployment models
- public Clouds / Public Clouds
- community Clouds / Community Clouds
- private Cloud / Private Clouds
- hybrid Cloud / Hybrid Clouds
- Derby
- installing / Installing Derby
- installation link / Installing Derby
- Directed Acyclic Graphs (DAGs) / Complex stream processing
- direct stream approach
- about / Direct Stream
- properties / Direct Stream
- Discretized Streams (DStreams) / Discretized Streams
- distributed computing
- Apache Hadoop, using / Distributed computing using Apache Hadoop
- driver failure recovery / Driver failure recovery
E
- Elasticsearch connector
- about / Elasticsearch connector
- node mode / Elasticsearch connector
- client mode / Elasticsearch connector
- encoders / Encoders
- event time and late data
- handling / Handling event time and late data
- exactly-once processing / Exactly-once processing
- execution models
- explicit schema / Explicit schema
- exploratory data analysis (EDA) / Introduction to data analytics
- extract, transform, and load (ETL) / AWS Glue
F
- fault-tolerance semantics / Fault-tolerance semantics
- features, Amazon Elastic Compute Cloud (Amazon EC2)
- elastic web-scale computing / Elastic web-scale computing
- operations, controlling / Complete control of operations
- hosting services / Flexible Cloud hosting services
- integration / Integration
- high reliability / High reliability
- security / Security
- inexpensive / Inexpensive
- pricing, reference link / Inexpensive
- ease of starting / Easy to start
- instances / Instances and Amazon Machine Images
- Amazon Machine Image (AMI) / Instances and Amazon Machine Images
- features, Amazon S3
- comprehensive security / Comprehensive security and compliance capabilities
- compliance capabilities / Comprehensive security and compliance capabilities
- query access / Query in place
- flexible management / Flexible management
- supported platform / Most supported platform with the largest ecosystem
- easy data transfer / Easy and flexible data transfer
- flexible data transfer / Easy and flexible data transfer
- data backup / Backup and recovery
- data recovery / Backup and recovery
- data archiving / Data archiving
- data lakes analytics / Data lakes and big data analytics
- big data analytics / Data lakes and big data analytics
- hybrid Cloud storage / Hybrid Cloud storage
- cloud-native application data / Cloud-native application data
- disaster recovery / Disaster recovery
- file
- writing to / Writing to a file
- file-based sources
- reading / File-based
- fileStream
- about / fileStream
- textFileStream / textFileStream
- binaryRecordsStream / binaryRecordsStream
- queueStream / queueStream
- Discretized Streams (DStreams) / Discretized Streams
- filtering patterns / Filtering patterns
- filters / Filters
- finite stream / Flink, the streaming model, and bounded datasets
- Flink
- reference / Installing Flink
- Flink cluster UI
- using / Using the Flink cluster UI
- following
- operators on complex types / Built-in operators
G
- generic sources
- reading / Generic
- Google File System (GFS) / Distributed computing using Apache Hadoop
H
- Hadoop 3
- installing / Installing Hadoop 3
- installing, reference / Installing Hadoop 3
- data, connecting / Install R on workstations and connect to the data in Hadoop
- Hadoop 3 installation
- prerequisites, reference / Prerequisites
- version, downloading / Downloading
- about / Installation
- password-less SSH, setting up / Setup password-less ssh
- NameNode, setting up / Setting up the NameNode
- HDFS, starting / Starting HDFS
- YARN service, setting up / Setting up the YARN service
- Intra-DataNode balancer / Intra-DataNode balancer
- YARN timeline service v.2, installing / Installing YARN timeline service v.2
- Hadoop Distributed File System (HDFS)
- about / Hadoop Distributed File System, Distributed computing using Apache Hadoop, The MapReduce framework
- NameNode / Hadoop Distributed File System
- DataNode / Hadoop Distributed File System
- high availability / High availability
- Intra-DataNode balancer / Intra-DataNode balancer
- erasure coding / Erasure coding
- port numbers / Port numbers
- heat map / Heat map
- Hive
- about / Hive
- reference / Hive
- binaries, downloading / Downloading and extracting the Hive binaries
- Derby, installing / Installing Derby
- using / Using Hive
- partitions / Using Hive
- buckets / Using Hive
- tables / Using Hive
- database, creating / Creating a database
- table, creating / Creating a table
- SELECT statement syntax / SELECT statement syntax
- INSERT statement syntax / INSERT statement syntax
- primitive types / Primitive types
- complex types / Complex types
- built-in operators and functions / Built-in operators and functions
- language capabilities / Language capabilities
- information, retrieving from cheat sheet / A cheat sheet on retrieving information
- horizontal scaling
- scaling in / Horizontal scaling
- scaling out / Horizontal scaling
- about / Horizontal scaling
- hybrid Clouds / Hybrid Clouds
I
- IBM Cognos Analytics
- reference / Big data visualization tools
- implicit schema / Implicit schema
- Infrastructure as a Service (IaaS) / Infrastructure as a Service
- inner join / Inner join
- input streams, StreamingContext
- receiverStream / receiverStream
- socketTextStream / socketTextStream
- rawSocketStream / rawSocketStream
- instances / Instances
- instance types, Amazon Machine Image (AMI)
- Tags / Tag basics
- Amazon EC2 key pairs / Amazon EC2 key pairs
- Amazon EC2 security groups, for Linux instances / Amazon EC2 security groups for Linux instances
- elastic IP addresses / Elastic IP addresses
- intermediate keys and values / The MapReduce framework
- intermediate output of mapper / Map
- Internet of Things (IoT) / Data processing using the DataStream API
- IT / Cloud consumers and Cloud providers
- IT resource / IT resource
J
- job types, MapReduce
- about / MapReduce job types
- single mapper job / Single mapper job
- single mapper reducer job / Single mapper reducer job
- multiple mappers reducer job / Multiple mappers reducer job
- SingleMapperCombinerReducer job / SingleMapperCombinerReducer job
- scenario / Scenario
- join patterns
- about / Join patterns
- inner joins / Inner join
- left anti join / Left anti join
- left outer join / Left outer join
- right outer join / Right outer join
- full outer join / Full outer join
- left semi join / Left semi join
- cross join / Cross join
- joins
- about / Joins, Joins
- inner working / Inner workings of join
- shuffle join / Shuffle join
- broadcast join / Broadcast join
- types / Join types
- inner join / Inner join, Inner join
- left outer join / Left outer join, Left outer join
- right outer join / Right outer join, Right outer join
- outer join / Outer join
- left anti join / Left anti join
- left semi join / Left semi join
- cross join / Cross join
- performance implications / Performance implications of join
- full outer join / Full outer join
- Jupyter Notebook
- standard Python, installing / Installing standard Python
- Jupyter Notebook installation
- about / Installation
- standard Python, installing / Installing standard Python
- Anaconda, installing / Installing Anaconda
K
- key / Chart types
- key pair / Amazon EC2 key pairs
- Kinesis Data Streams
L
- left anti join / Left anti join
- left outer join / Left outer join
- left semi join / Left semi join
- line charts / Line charts
M
- MapReduce
- R, executing with RMR2 / Execute R inside of MapReduce using RMR2
- MapReduce framework
- about / MapReduce framework, The MapReduce framework, The MapReduce framework
- task-level native optimization / Task-level native optimization
- dataset / Dataset
- record reader / Record reader
- map / Map
- combiner / Combiner
- partitioner / Partitioner
- shuffle and sort / Shuffle and sort
- reduce / Reduce
- output format / Output format
- MapReduce patterns
- about / MapReduce patterns
- aggregation patterns / Aggregation patterns
- filtering patterns / Filtering patterns
- join patterns / Join patterns
- massively parallel processing (MPP) / Distributed computing using Apache Hadoop
- metadata checkpointing / Metadata checkpointing
- methods, for R and Hadoop integration
- about / Methods of integrating R and Hadoop
- RHadoop / RHADOOP – install R on workstations and connect to data in Hadoop
- R and Hadoop Integrated Programming Environment (RHIPE) / RHIPE – execute R inside Hadoop MapReduce
- Hadoop Streaming API / R and Hadoop Streaming
- RHIVE / RHIVE – install R on workstations and connect to data in Hadoop
- ORCH / ORCH – Oracle connector for Hadoop
- Microsoft PowerBI
- reference / Big data visualization tools
- multiple mappers reducer job / Multiple mappers reducer job
N
- ntiles / ntiles
O
- on-demand self-service usage / On-demand usage
- on-premise / On-premise
- opportunistic containers
- container execution, types / Types of container execution
- Oracle Visual Analyzer
- reference / Big data visualization tools
- ORCH / ORCH – Oracle connector for Hadoop
- outer join / Outer join
P
- Petabytes (PB) / Distributed computing using Apache Hadoop
- physical partitioning
- custom partitioning / Custom partitioning
- random partitioning / Random partitioning
- rebalancing partitioning / Rebalancing partitioning
- pie charts / Pie chart
- pivoting / Pivots
- Platform as a Service (PaaS) / Platform as a Service
- private Clouds / Private Clouds
- proportional cost / Goals and benefits
- public Cloud / Public Clouds
- Python
- installing / Installing standard Python
- used, for data visualization / Using Python to visualize data
- Python release
- for Windows, reference / Installing standard Python
- for macOS, reference / Installing standard Python
- for Linux and Unix, reference / Installing standard Python
Q
- QlikSense
- reference / Big data visualization tools
- queueStream
- textFileStream example / textFileStream example
- twitterStream example / twitterStream example
R
- R
- building, options / Introduction
- installing, on workstations / Install R on workstations and connect to the data in Hadoop
- installing, on shared server / Install R on a shared server and connect to Hadoop
- connecting, to Hadoop / Install R on a shared server and connect to Hadoop
- pure open source options, summarizing / Summary and outlook for pure open source options
- and Hadoop integration, methods / Methods of integrating R and Hadoop
- used, for data visualization / Using R to visualize data
- RabbitMQ connector
- reference / RabbitMQ connector
- about / RabbitMQ connector
- options, on stream deliveries / RabbitMQ connector
- receiver-based approach / Receiver-based
- resilient distributed dataset (RDD) / SparkSQL and DataFrames
- Revolution R Open (RRO)
- utilizing / Utilize Revolution R Open
- RHadoop / RHADOOP – install R on workstations and connect to data in Hadoop
- RHIPE / RHIPE – execute R inside Hadoop MapReduce
- RHIVE / RHIVE – install R on workstations and connect to data in Hadoop
- right outer join / Right outer join
- risks and challenges
- increased security vulnerabilities / Increased security vulnerabilities
- reduced operational governance control / Reduced operational governance control
- limited portability, between Cloud providers / Limited portability between Cloud providers
- RMR2
- used, for executing R inside MapReduce / Execute R inside of MapReduce using RMR2
- roles, Cloud resource administrator
- cloud auditor / Additional roles
- cloud broker / Additional roles
- cloud carrier / Additional roles
- roles and boundaries
- about / Roles and boundaries
- cloud provider / Cloud provider
- Cloud consumer / Cloud consumer
- Cloud service owner / Cloud service owner
- Cloud resource administrator / Cloud resource administrator
S
- SAP Lumira
- reference / Big data visualization tools
- SAS Visual Analytics
- reference / Big data visualization tools
- scaling
- about / Scaling
- types / Types of scaling
- horizontal scaling / Horizontal scaling
- vertical scaling / Vertical scaling
- cloud service / Cloud service
- cloud service consumer / Cloud service consumer
- schema
- about / Schema – structure of data
- implicit schema / Implicit schema
- explicit schema / Explicit schema
- encoders / Encoders
- SELECT statement syntax
- about / SELECT statement syntax
- WHERE clauses / WHERE clauses
- service-level agreement (SLA) / High reliability
- shuffle join / Shuffle join
- SingleMapperCombinerReducer job / SingleMapperCombinerReducer job
- single mapper job / Single mapper job
- single mapper reducer job / Single mapper reducer job
- Social Security Numbers (SSNs) / Inside the data analytics process
- Software as a Service (SaaS) / Software as a Service
- solid state disks (SSDs) / Amazon DynamoDB
- SparkSQL
- about / SparkSQL and DataFrames
- reference / SparkSQL and DataFrames
- API / DataFrame APIs and the SQL API
- user-defined functions (UDFs) / User-defined functions
- Spark Streaming
- about / Spark Streaming
- StreamingContext / StreamingContext
- StreamingContext, creating / Creating StreamingContext
- StreamingContext, starting / Starting StreamingContext
- StreamingContext, stopping / Stopping StreamingContext
- stateful transformations / Stateful/stateless transformations, Stateful transformations
- stateless transformations / Stateful/stateless transformations, Stateless transformations
- streaming
- about / Streaming
- at-least-once processing paradigm / At-least-once processing
- at-most-once processing paradigm / At-most-once processing
- exactly-once processing / Exactly-once processing
- StreamingContext
- input streams / Input streams
- streaming execution model / Introduction to streaming execution model
- streaming platforms
- interoperability / Interoperability with streaming platforms (Apache Kafka)
- receiver-based approach / Receiver-based
- direct stream approach / Direct Stream
- structured streaming / Structured Streaming
- structured streaming
T
- Tableau
- used, for visualization / Visualization using Tableau
- setting up / Tableau
- reference / Tableau
- about / Tableau
- Tableau Desktop
- reference / Big data visualization tools
- table stakes / Summary and outlook for pure open source options
- TIBCO Spotfire
- reference / Big data visualization tools
- timeline service v.2
- enabling / Enabling timeline service v.2
- executing / Running timeline service v.2
- writing, with MapReduce / Enabling MapReduce to write to timeline service v.2
- transformation patterns / Filtering patterns
- transformations
- about / Transformations
- windows operations / Windows operations
- stateless transformations / Stateful/stateless transformations
- stateful transformations / Stateful/stateless transformations
- reference / Transformations
- Twitter connector / Twitter connector
U
- user-defined aggregation functions (UDAF) / Aggregate functions
- user-defined functions (UDFs) / User-defined functions
V
- vertical scaling
- about / Vertical scaling
- scaling up / Vertical scaling
- scaling down / Vertical scaling
- visualization
- Tableau, using / Visualization using Tableau
- reference / Heat map
W
- window functions
- about / Window functions, window
- global windows / Global windows
- tumbling windows / Tumbling windows
- sliding windows / Sliding windows
- session windows / Session windows
- windows operations / Windows operations
- write ahead log (WAL) / Receiver-based
Y
- YARN timeline service v.2
- about / YARN timeline service v.2
- scalability and reliability, enhancing / Enhancing scalability and reliability
- usability improvements / Usability improvements
- YARN timeline service v.2 installation
- about / Installing YARN timeline service v.2
- HBase cluster, setting up / Setting up the HBase cluster
- co-processor, enabling / Enabling the co-processor
- enabling step / Enabling timeline service v.2
- Yet Another Resource Negotiator (YARN)
- about / YARN
- opportunistic containers / Opportunistic containers
- timeline service v.2 / YARN timeline service v.2