Understanding bloom filter indexing
A bloom filter index is a data structure that enables data skipping on selected columns, and it is especially useful for fields containing arbitrary text. For each data file, the filter answers a lookup in one of two ways: the value is definitely not in the file, or the value is probably in the file. How often "probably" turns out to be wrong is governed by the false positive probability (FPP). A bloom filter index can speed up needle-in-a-haystack queries, which other techniques do not speed up.
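To make the "definitely not" versus "probably" behavior concrete, here is a minimal, self-contained Python sketch of a generic bloom filter. It is not the Databricks implementation: the class name BloomFilter, the method names, the SHA-256 double-hashing scheme, and the sample values are all assumptions chosen for illustration. The sizing uses the standard textbook formulas for the number of bits and hash functions given an expected item count and a target FPP.

```python
import hashlib
import math


class BloomFilter:
    """Toy bloom filter (illustrative only, not the Databricks internals)."""

    def __init__(self, expected_items: int, fpp: float) -> None:
        # Textbook sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(
            1, math.ceil(-expected_items * math.log(fpp) / (math.log(2) ** 2))
        )
        self.num_hashes = max(
            1, round(self.num_bits / expected_items * math.log(2))
        )
        # One byte per bit keeps the sketch simple; a real filter packs bits.
        self.bits = bytearray(self.num_bits)

    def _positions(self, value: str):
        # Double hashing: derive k positions from two halves of one digest.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd, never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos] = 1

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "probably present".
        return all(self.bits[pos] for pos in self._positions(value))


# Hypothetical "file" holding 1,000 values, with a 1% target FPP.
file_filter = BloomFilter(expected_items=1000, fpp=0.01)
stored = [f"user-{i}" for i in range(1000)]
for value in stored:
    file_filter.add(value)

# Values that were never added: any hit here is a false positive.
absent = [f"missing-{i}" for i in range(10000)]
false_positives = sum(file_filter.might_contain(v) for v in absent)
observed_fpp = false_positives / len(absent)

print("stored value found:", file_filter.might_contain("user-42"))
print("observed false positive rate:", observed_fpp)
```

A value that was added is always reported as possibly present, so a file is never skipped by mistake. The only cost of a false positive is reading a file that turns out not to contain the value. Lowering the FPP reduces such wasted reads but makes the filter larger.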
Let's go through a worked example that illustrates the performance benefits of using a bloom filter index:
- We will start by checking the Spark configuration for bloom filter indexes. Run the following line of code in a new cell:
spark.conf.get('spark.databricks.io.skipping.bloomFilter.enabled')
By default, this setting is true.
- Now, we can start creating our very first bloom filter index! To begin with, let's create a Delta table using the following block of code:
%sql CREATE OR REPLACE TABLE bloom_filter_test...