In this recipe, we explore the groupBy() and reduceByKey() methods, which allow us to group values corresponding to a key. Grouping is an expensive operation because of the internal shuffling it triggers. We first demonstrate groupBy() in more detail and then cover reduceByKey() to show how similar the two are to code, while highlighting the advantage of the reduceByKey() operator.
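As a preview of the comparison, here is a minimal sketch that computes per-key sums both ways. It assumes Spark is on the classpath and runs with a local master. The object name GroupVsReduceSketch, the helper sums(), and the sample pairs are hypothetical and chosen only for illustration. Spark's RDD API exposes the key-wise reducer as reduceByKey().

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: the object name, helper, and sample data are illustrative only.
object GroupVsReduceSketch {

  // Returns per-key sums computed two ways: (via groupBy, via reduceByKey).
  def sums(spark: SparkSession): (Map[String, Int], Map[String, Int]) = {
    val sc = spark.sparkContext

    // Assumed sample data: (key, value) pairs
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)))

    // groupBy: every record is shuffled to its key's partition,
    // and only then are the values of each group summed
    val grouped = pairs
      .groupBy(_._1)
      .mapValues(group => group.map(_._2).sum)
      .collect()
      .toMap

    // reduceByKey: values are combined per key inside each partition first,
    // so fewer records cross the network during the shuffle
    val reduced = pairs.reduceByKey(_ + _).collect().toMap

    (grouped, reduced)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("GroupVsReduceSketch")
      .getOrCreate()

    val (grouped, reduced) = sums(spark)
    println(grouped)
    println(reduced)

    spark.stop()
  }
}
```

Both maps hold the same sums (a -> 4, b -> 6, c -> 5). The difference lies in how much data is shuffled to get there, which is where reduceByKey() has the advantage on large datasets.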
- Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary JAR files are included.
- Set up the package location where the program will reside:
package spark.ml.cookbook.chapter3
- Import the necessary packages:
import breeze.numerics.pow
import org.apache.spark.sql.SparkSession
import Array._
- Import the packages for setting up the logging level for log4j. This step is optional, but we highly recommend it (change the level appropriately as you move through the development cycle):
import org.apache.log4j.Logger
import org.apache.log4j.Level
- Set up the logging level to warning...