In this recipe, we explore how to manipulate DataFrames with code and method calls only (without SQL). DataFrames have their own methods that allow you to perform SQL-like operations using a programmatic approach. We demonstrate some of these commands, such as select(), show(), and explain(), to get the point across that the DataFrame itself is capable of wrangling and manipulating the data without using SQL.
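As a rough illustration of the three methods named above, the following is a minimal sketch rather than the recipe's own code. It assumes Spark 2.x or later with a local SparkSession; the object name, application name, and the sample rows and column names are hypothetical and chosen only for the example.

```scala
import org.apache.spark.sql.SparkSession

object DataFrameMethodsSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("DataFrameMethodsSketch")
      .getOrCreate()

    // Needed for the toDF() conversion on local Scala collections.
    import spark.implicits._

    // Hypothetical sample data, not taken from the recipe.
    val df = Seq(
      (1, "widget", 9.99),
      (2, "gadget", 24.50),
      (3, "gizmo", 4.25)
    ).toDF("id", "name", "price")

    // select() projects columns, much like SELECT id, name in SQL.
    val projected = df.select("id", "name")

    // show() prints the first rows of the DataFrame to stdout.
    projected.show()

    // explain() prints the physical plan Spark would execute.
    projected.explain()

    spark.stop()
  }
}
```

Note that select() only describes a new DataFrame; nothing is computed until an action such as show() runs, which is why explain() can print the plan without executing it.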
- Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary JAR files are included.
- Set up the package location where the program will reside:
package spark.ml.cookbook.chapter3
- Set up the imports related to DataFrames and the required data structures, and create the RDDs as needed for the example:
import org.apache.spark.sql._
- Import the packages for setting up the logging level for log4j. This step is optional, but we highly recommend it (change the level appropriately as you move through the development cycle...