Business Intelligence with Databricks SQL

By: Vihag Gupta

Overview of this book

In this new era of data platform system design, data lakes and data warehouses are giving way to the lakehouse – a new type of data platform system that aims to unify all data analytics into a single platform. Databricks, with its Databricks SQL product suite, is the hottest lakehouse platform out there, harnessing the power of Apache Spark™, Delta Lake, and other innovations to enable data warehousing capabilities on the lakehouse with data lake economics. This book is a comprehensive hands-on guide that helps you explore all the advanced features, use cases, and technology components of Databricks SQL. You’ll start with the lakehouse architecture fundamentals and understand how Databricks SQL fits into it. The book then shows you how to use the platform, from exploring data, executing queries, building reports, and using dashboards through to learning the administrative aspects of the lakehouse – data security, governance, and management of the computational power of the lakehouse. You’ll also delve into the core technology enablers of Databricks SQL – Delta Lake and Photon. Finally, you’ll get hands-on with advanced SQL commands for ingesting data and maintaining the lakehouse. By the end of this book, you’ll have mastered Databricks SQL and be able to deploy and deliver fast, scalable business intelligence on the lakehouse.

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Share Your Thoughts

Part 1: Databricks SQL on the Lakehouse

Chapter 1: Introduction to Databricks

Technical requirements

An overview of Databricks, the company

An overview of the Lakehouse architecture

An overview of the Databricks Lakehouse platform

Summary

Chapter 2: The Databricks Product Suite – A Visual Tour

Technical requirements

Basic navigation with the sidebar

The SQL persona view

The Machine Learning persona view

The Data Science and Engineering persona view

Summary

Chapter 3: The Data Catalog

Technical requirements

Understanding the data organization model in Databricks SQL

Exploring data visually with the Data Catalog

Exploring the data programmatically with SQL statements

Summary

Chapter 4: The Security Model

Technical requirements

The Databricks SQL security model

User-facing table access control

The internals of cloud storage access

Summary

Chapter 5: The Workbench

Technical requirements

Working with queries

Visualizing query results

Creating and publishing dashboards

Administering and governing artifacts

Summary

Chapter 6: The SQL Warehouses

Technical requirements

Understanding the SQL Warehouse architecture

Creating and configuring SQL Warehouses

The art of SQL Warehouse sizing

Organizing and governing SQL Warehouses

Using Serverless SQL

Summary

Chapter 7: Using Business Intelligence Tools with Databricks SQL

Technical requirements

Connecting from validated BI tools

Connecting from non-validated BI tools

Connecting programmatically

Databricks Partner Connect

Summary

Part 2: Internals of Databricks SQL

Chapter 8: The Delta Lake

Technical requirements

Fundamentals of the Delta Lake storage format

Built-in performance-boosting features of Delta Lake

Configurable performance-boosting features of Delta Lake

Summary

Chapter 9: The Photon Engine

Technical requirements

Understanding Photon Engine

Understanding vectorization

Discussing the Photon product roadmap

Summary

Further reading

Chapter 10: Warehouse on the Lakehouse

Technical requirements

Organizing data on the Lakehouse

Implementing data modeling techniques

Summary

Part 3: Databricks SQL Commands

Chapter 11: SQL Commands – Part 1

Technical requirements

Working with data definition language commands

Working with data manipulation language commands

Working with the inbuilt functions in Databricks SQL

Summary

Chapter 12: SQL Commands – Part 2

Technical requirements

Working with Delta Lake maintenance commands

Working with data security commands

Working with metadata commands

Summary

Part 4: TPC-DS, Experiments, and Frequently Asked Questions

Chapter 13: Playing with the TPC-DS Dataset

Technical requirements

Understanding the TPC-DS dataset

Generating TPC-DS data

Running automated benchmarks

Experimenting with TPC-DS in Databricks SQL

Summary

Chapter 14: Ask Me Anything

Frequently asked questions

Summary

Index

Why subscribe?

Other Books You May Enjoy

Packt is searching for authors like you

Share Your Thoughts

Built-in performance-boosting features of Delta Lake

Delta Lake provides built-in performance boosters that complement the data layout strategies discussed in the Optimizing the data layout section. When a sound data layout strategy is in place, the boosters accelerate performance further. When the layout strategy is lacking, or is limited because queries filter the data in a wide variety of ways, the boosters still improve performance by reducing unnecessary I/O. Let’s learn about these performance boosters.

Automatic statistics collection

The first, and arguably the most important, performance booster is automatic statistics collection (stats collection for short), which enables a process called data skipping. Stats collection is an automatic process on Delta Lake. For every data file written, the stats collection process computes the minimum and maximum values for the columns present in the file.
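To make the link between per-file statistics and data skipping concrete, here is a minimal sketch in plain Python. It is a toy model and not how Delta Lake is implemented: the file names, the `FileStats` class, and the `collect_stats` and `files_to_scan` helpers are all hypothetical, and the only idea taken from the text above is that each data file carries the minimum and maximum value of its columns.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FileStats:
    """Hypothetical per-file statistics: min and max value of each column."""
    path: str
    min_values: Dict[str, int]
    max_values: Dict[str, int]


def collect_stats(path: str, rows: List[Dict[str, int]]) -> FileStats:
    """Compute min/max for every column of one (non-empty) data file."""
    columns = rows[0].keys()
    return FileStats(
        path=path,
        min_values={c: min(r[c] for r in rows) for c in columns},
        max_values={c: max(r[c] for r in rows) for c in columns},
    )


def files_to_scan(stats: List[FileStats], column: str, value: int) -> List[str]:
    """Keep only files whose [min, max] range could contain `value`.

    Files outside the range are skipped without being read.
    """
    return [
        s.path
        for s in stats
        if s.min_values[column] <= value <= s.max_values[column]
    ]


# Illustrative data: three files, laid out so that order_id ranges do not overlap.
files = {
    "part-0000.parquet": [{"order_id": 1, "amount": 20}, {"order_id": 50, "amount": 75}],
    "part-0001.parquet": [{"order_id": 51, "amount": 10}, {"order_id": 100, "amount": 30}],
    "part-0002.parquet": [{"order_id": 101, "amount": 5}, {"order_id": 150, "amount": 90}],
}

table_stats = [collect_stats(path, rows) for path, rows in files.items()]

# Equivalent in spirit to: SELECT ... WHERE order_id = 72
candidates = files_to_scan(table_stats, "order_id", 72)
print(candidates)
```

In this toy layout, a filter on `order_id` narrows the scan to a single file because the files hold non-overlapping ranges of that column. A filter on `amount`, whose values are spread across every file, would skip nothing, which is why statistics and a good data layout work best together.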

By default, stats collection...
