The Applied SQL Data Analytics Workshop - Second Edition

By : Matt Goldwasser, Upom Malik, Benjamin Johnston

Overview of this book

Every day, businesses operate around the clock and a huge amount of data is generated at a rapid pace. Hidden in this data are key patterns and behaviors that can help you and your business understand your customers at a deep, fundamental level. Are you ready to enter the exciting world of data analytics and unlock these useful insights? Written by a team of expert data scientists who have used their data analytics skills to transform businesses of all shapes and sizes, The Applied SQL Data Analytics Workshop is a great way to get started with data analysis, showing you how to effectively sieve and process information from raw data, even without any prior experience. The book begins by showing you how to form hypotheses and generate descriptive statistics that can provide key insights into your existing data. As you progress, you'll learn how to write SQL queries to aggregate, calculate, and combine SQL data from sources outside of your current dataset. You'll also discover how to work with different data types, like JSON. By exploring advanced techniques, such as geospatial analysis and text analysis, you'll be able to understand your business at a deeper level. Finally, the book lets you in on the secret to getting information faster and more effectively by using advanced techniques like profiling and automation. By the end of The Applied SQL Data Analytics Workshop, you'll have the skills you need to start identifying patterns and unlocking insights in your own data. You will be capable of looking at and assessing data with the critical eye of a skilled data analyst.

Relational Databases and SQL

A relational database is a database that uses the relational model of data. The relational model, introduced by Edgar F. Codd in 1970, organizes data as relations, or sets of tuples. Each tuple consists of a series of attributes that describe it. For example, we could imagine a customer relation in which each tuple represents a customer. Each tuple would then have attributes describing a single customer, such as the last name, first name, and age, perhaps in the format (Smith, John, 27). One or more of the attributes are used to uniquely identify a tuple in a relation; this attribute, or set of attributes, is called the relational key. The relational model then allows logical operations to be performed between relations.

In a relational database, relations are usually implemented as tables, much like a sheet in an Excel spreadsheet. Each row of the table is a tuple, and the attributes are represented as columns of the table. While not technically required, most tables in a relational database have a column referred to as the primary key, which uniquely identifies each row of the table. Every column also has a data type, which describes the kind of data stored in the column. Tables are usually grouped into collections called schemas, and they are usually loaded by processes known as Extract, Transform, Load (ETL) jobs.
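
As a minimal sketch of these ideas, the following Python snippet uses the standard-library sqlite3 module with an in-memory database. SQLite is an assumption made so that the example runs anywhere, and the table and column names (customers, customer_id, and so on) are hypothetical. It declares a data type for each column, stores the tuple (Smith, John, 27) as a row, and shows that the database rejects a second row that reuses the same primary key value.

```python
import sqlite3

# In-memory SQLite database: an assumption made purely for illustration.
conn = sqlite3.connect(":memory:")

# Every column has a declared data type; customer_id is the primary key.
conn.execute(
    """
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        last_name   TEXT NOT NULL,
        first_name  TEXT NOT NULL,
        age         INTEGER
    )
    """
)

# One row corresponds to one tuple, here (Smith, John, 27) plus its key.
conn.execute("INSERT INTO customers VALUES (1, 'Smith', 'John', 27)")

# Reusing an existing primary key value is rejected by the database.
try:
    conn.execute("INSERT INTO customers VALUES (1, 'Doe', 'Jane', 35)")
    duplicate_key_rejected = False
except sqlite3.IntegrityError:
    duplicate_key_rejected = True

rows = conn.execute(
    "SELECT last_name, first_name, age FROM customers"
).fetchall()
print(rows)
print(duplicate_key_rejected)
```

Only the first insert survives, because the primary key guarantees that each row can be identified uniquely.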

Tables are usually referred to in queries in the format [schema].[table]. For example, a products table in the analytics schema would generally be referred to as analytics.products. However, there is also a special schema called the public schema. This is the default schema: if you do not explicitly mention a schema, the database uses the public schema. For example, public.products and products refer to the same table.
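
The schema-qualified naming described above can be tried out with a small sketch. It assumes SQLite through Python's sqlite3 module, where the default schema is named main rather than public, and where a second schema can be simulated by attaching another in-memory database under the name analytics. The table contents are hypothetical; only the naming behavior is the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Attach a second in-memory database under the schema name "analytics".
# In SQLite the default schema is "main"; it stands in for "public" here.
conn.execute("ATTACH DATABASE ':memory:' AS analytics")

conn.execute("CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE analytics.products (product_id INTEGER PRIMARY KEY, name TEXT)"
)
conn.execute("INSERT INTO products VALUES (1, 'Scooter')")
conn.execute("INSERT INTO analytics.products VALUES (1, 'Bicycle')")

# An unqualified name resolves to the default schema.
unqualified = conn.execute("SELECT name FROM products").fetchall()
default_qualified = conn.execute("SELECT name FROM main.products").fetchall()

# A qualified name selects the table in the named schema.
analytics_qualified = conn.execute(
    "SELECT name FROM analytics.products"
).fetchall()

print(unqualified, default_qualified, analytics_qualified)
```

The first two queries return the same row, while the third reads from a different table that happens to share the name products.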

The software used to manage relational databases on a computer is referred to as a relational database management system (RDBMS). SQL is the language utilized by users of an RDBMS to access and interact with a relational database.

Note

Virtually all relational databases that use SQL deviate from the relational model in some basic way. For example, not every table has a specified relational key. Additionally, the relational model does not technically allow for duplicate rows, but you can have duplicate rows in a relational database. These differences are minor and will not matter for the vast majority of readers of this book.
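
The duplicate-row deviation can be illustrated with a short sketch, again assuming SQLite through Python's sqlite3 module and a hypothetical customers table declared without any key. A Python set stands in for a relation in the strict relational model, where a repeated tuple collapses into one.

```python
import sqlite3

# In the relational model a relation is a set, so a repeated tuple collapses.
relation = {("Smith", "John", 27), ("Smith", "John", 27)}

conn = sqlite3.connect(":memory:")

# A table declared without any key: nothing prevents identical rows.
conn.execute(
    "CREATE TABLE customers (last_name TEXT, first_name TEXT, age INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Smith", "John", 27), ("Smith", "John", 27)],
)

row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
distinct_count = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT * FROM customers)"
).fetchone()[0]

print(len(relation), row_count, distinct_count)
```

The table holds two identical rows, whereas the set and the DISTINCT query each contain only one.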

Advantages and Disadvantages of SQL Databases

Since the release of Oracle Database in 1979, SQL has become an industry standard for data in nearly all computer applications, and for good reason. SQL databases provide many advantages that make them the de facto choice for many applications:

  • Intuitive: Relations represented as tables are a common data structure that almost everyone understands. As such, working with and reasoning about relational databases is much easier than doing so with other models.
  • Efficient: Using a technique known as normalization, relational databases allow the representation of data without unnecessarily repeating it. As such, relational databases can represent large amounts of information while using less space. This smaller storage footprint also reduces the cost of operations, making well-designed relational databases quick to query.
  • Declarative: SQL is a declarative language, meaning that when you write code, you only need to tell the computer what data you want, and the RDBMS takes care of determining how to execute the SQL code. You never have to worry about telling the computer how to access and pull data in the table.
  • Robust: Most popular SQL databases have a property known as atomicity, consistency, isolation, and durability (ACID) compliance, which guarantees the validity of the data, even if the hardware fails.
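
The atomicity part of ACID mentioned in the last point can be sketched as follows. This is an illustration only, assuming SQLite through Python's sqlite3 module; the accounts table, the transfer function, and the balances are hypothetical. Two updates are wrapped in one transaction, and when the second update violates a constraint, the first is rolled back as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE accounts (
        id      INTEGER PRIMARY KEY,
        balance INTEGER NOT NULL CHECK (balance >= 0)
    )
    """
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; either both updates apply or neither."""
    try:
        # "with conn" commits on success and rolls back on an exception.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the credit above was undone too.
        return False


failed = transfer(conn, 1, 2, 500)     # would overdraw account 1
succeeded = transfer(conn, 1, 2, 30)   # a valid transfer
balances = conn.execute(
    "SELECT id, balance FROM accounts ORDER BY id"
).fetchall()
print(failed, succeeded, balances)
```

The rejected transfer leaves no partial credit behind, so the final balances reflect only the valid transfer.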

That said, there are still some downsides to SQL databases, which are as follows:

  • Relatively Lower Specificity: While SQL is declarative, its functionality is often limited to what has already been programmed into it. Although most popular RDBMS software is updated constantly, with new functionality being built all the time, it can be difficult to process and work with data structures and algorithms that are not programmed into an RDBMS.
  • Limited Scalability: SQL databases are incredibly robust, but this robustness comes at a cost. As the amount of information you have doubles, the cost of resources more than doubles. When very large volumes of information are involved, other data stores, such as NoSQL databases, may actually be better.
  • Object-Relational Impedance Mismatch: While tables are a very intuitive data structure, they are not necessarily the best format for representing objects in a computer. This primarily occurs because objects often have attributes that have many-to-many relationships. For instance, a customer of a company may own multiple products, but each product may have multiple customers. For an object in a computer, we could easily represent this as a list attribute under the customer object. However, in a normalized database, a customer's products would potentially have to be represented using three different tables, each of which must be updated for every new purchase, recall, and return.
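
The customer and product example from the last point can be sketched as follows, assuming SQLite through Python's sqlite3 module. All table names, column names, and values are hypothetical. The object view is a plain dictionary of lists, while the normalized view needs three tables (customers, products, and a linking table called customer_products here) plus two joins to rebuild the same structure.

```python
import sqlite3

# Object view: each customer simply carries a list of product names.
customer_objects = {"Smith": ["Scooter", "Bicycle"], "Doe": ["Scooter"]}

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        last_name   TEXT NOT NULL
    );
    CREATE TABLE products (
        product_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL
    );
    -- Linking table: one row per (customer, product) ownership.
    CREATE TABLE customer_products (
        customer_id INTEGER NOT NULL REFERENCES customers (customer_id),
        product_id  INTEGER NOT NULL REFERENCES products (product_id),
        PRIMARY KEY (customer_id, product_id)
    );
    INSERT INTO customers VALUES (1, 'Smith'), (2, 'Doe');
    INSERT INTO products VALUES (1, 'Scooter'), (2, 'Bicycle');
    INSERT INTO customer_products VALUES (1, 1), (1, 2), (2, 1);
    """
)

# Rebuilding the object view requires joining all three tables.
rows = conn.execute(
    """
    SELECT c.last_name, p.name
    FROM customers AS c
    JOIN customer_products AS cp ON cp.customer_id = c.customer_id
    JOIN products AS p ON p.product_id = cp.product_id
    ORDER BY c.customer_id, p.product_id
    """
).fetchall()

rebuilt = {}
for last_name, product_name in rows:
    rebuilt.setdefault(last_name, []).append(product_name)

print(rebuilt == customer_objects)
```

The extra tables and joins are the practical cost of the mismatch: the data is the same, but the relational form takes more work to translate to and from objects.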