Using Bloom filters
Bloom filters are a way of efficiently filtering records in a database based on a condition. They are probabilistic data structures used to test whether an element is a member of a set: a lookup can return a false positive, but never a false negative. In other words, the filter can tell us that an element is possibly in the set or definitely not in it. Bloom filters were developed as a mathematical construct for cases where the amount of data to scan is too large to be read in full, and they are based on hashing techniques.
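To make the false-positive and false-negative behavior concrete, here is a minimal sketch in Python. It is a toy illustration only and not how Delta Lake implements its filters; the class name `BloomFilter`, the parameters `num_bits` and `num_hashes`, and the use of salted SHA-256 digests as the hash functions are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a fixed-length bit array plus several hash functions.

    Hypothetical class for illustration; names and defaults are assumptions.
    """

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits  # every bit starts at zero

    def _positions(self, item: str):
        # Derive one bit position per hash function by salting SHA-256
        # with the index of the hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        # Set every bit that the item hashes to.
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item: str) -> bool:
        # True means "possibly in the set"; False means "definitely not".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
for key in ["alice", "bob", "carol"]:
    bf.add(key)

print(bf.might_contain("alice"))  # an added element is always reported
print(bf.might_contain("zed"))    # usually False, but a false positive is possible
```

An element that was added always reports `True` because its bits are never cleared, which is why false negatives cannot occur. An element that was never added can still report `True` if other elements happened to set all of its bits, and this becomes more likely as the array fills up.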
Delta Lake lets us apply Bloom filters to our queries to further improve performance. In this section, we will see how they work at a basic level and how they can be applied in Delta Lake.
Understanding Bloom filters
As mentioned in the introduction to this section, Bloom filters are probabilistic data structures used to test whether an element belongs to a set. The structure is a fixed-length bit array that is populated using one or more hash functions, which map each element to positions in the array whose bits are then set to one. The length of the array depends on the number...