Data Engineering with Scala and Spark
Deequ can compute statistics, called metrics, on data. For example, we can use Deequ to count the records in a dataset, check whether the values in a particular column are unique, measure the degree of correlation between two columns, and so on. Deequ exposes this functionality through case classes such as ApproxCountDistinct, Completeness, and Correlation, which are defined in the com.amazon.deequ.analyzers package. For a complete list of metrics along with their definitions, please refer to https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/.
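To make the idea of analyzers concrete, here is a minimal sketch of how a few of them might be run against a small in-memory DataFrame. The object name DeequMetricsSketch, the helper computeMetrics, and the columns id and city are hypothetical and chosen only for illustration. The sketch also assumes that Spark and a Deequ release built for the same Spark version are on the classpath.

```scala
import com.amazon.deequ.analyzers.runners.AnalysisRunner
import com.amazon.deequ.analyzers.runners.AnalyzerContext.successMetricsAsDataFrame
import com.amazon.deequ.analyzers.{ApproxCountDistinct, Completeness, Size}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical object and column names, used only for illustration.
object DeequMetricsSketch {

  // Runs three analyzers on `data` and returns the metrics as a DataFrame
  // with the columns entity, instance, name, and value.
  def computeMetrics(spark: SparkSession, data: DataFrame): DataFrame = {
    val result = AnalysisRunner
      .onData(data)
      .addAnalyzer(Size())                      // number of records
      .addAnalyzer(Completeness("city"))        // fraction of non-NULL values in city
      .addAnalyzer(ApproxCountDistinct("city")) // approximate distinct count of city
      .run()
    successMetricsAsDataFrame(spark, result)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("deequ-metrics-sketch")
      .master("local[*]") // assumption: local run, no cluster
      .getOrCreate()
    import spark.implicits._

    // Toy data: four rows, one of them with a NULL city.
    val data = Seq[(Int, String)](
      (1, "Boston"),
      (2, "Denver"),
      (3, null),
      (4, "Boston")
    ).toDF("id", "city")

    computeMetrics(spark, data).show(truncate = false)
    spark.stop()
  }
}
```

Each analyzer contributes one row to the resulting DataFrame, so further metrics can be added by chaining more addAnalyzer calls before run().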
In the following example, we use the flight data that we loaded into a MySQL table named flights. We analyze this data to check the count of records, whether the airline column contains any NULL values, the approximate distinct count of origin_airport, and so on. The result set is then converted into a DataFrame and finally printed on the screen:
package com...