After using the previous Scala example to create a data frame from a JSON input file on HDFS, we can now define a temporary table based on the data frame and run SQL against it.
The following example shows you the temporary table called washing_flat being defined and a row count being created using count(*):
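Since the listing itself is not reproduced here, the following is only a minimal sketch of what such a step could look like, assuming Spark 2.2 or later. The in-memory JSON sample, its field names, and the variable name washingFlatDf are hypothetical stand-ins for the DataFrame that the previous example read from HDFS; only the table name washing_flat and the use of count(*) come from the text.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder()
  .appName("washing-sql-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the DataFrame read from the JSON file on HDFS.
val washingFlatDf = spark.read.json(Seq(
  """{"id": "a", "temperature": 40}""",
  """{"id": "b", "temperature": 60}"""
).toDS())

// Register the DataFrame as a temporary table (view) for this session.
washingFlatDf.createOrReplaceTempView("washing_flat")

// Run SQL against the temporary table and display the row count.
val rowCount = spark.sql("SELECT count(*) AS cnt FROM washing_flat")
rowCount.show()
```

The temporary table exists only for the lifetime of the SparkSession, so it has to be registered again in every new session.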
The schema for this data was created on the fly (inferred). This is a very nice feature of the Apache Spark DataSource API, which was used when reading the JSON file from HDFS using the SparkSession object. However, if you want to specify the schema yourself, you can do so.
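Before replacing an inferred schema, it can help to look at what the inference actually produced. The sketch below assumes Spark 2.2 or later and uses a hypothetical one-row JSON sample with made-up field names rather than the washing data from the text.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType}

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder()
  .appName("schema-inference-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample; the real data would come from the JSON file on HDFS.
val sample = spark.read.json(Seq("""{"id": "a", "temperature": 40}""").toDS())

// Print the inferred schema as a tree of column names and types.
sample.printSchema()

// Collect the inferred types per column for programmatic inspection.
val inferredTypes = sample.schema.fields.map(f => f.name -> f.dataType).toMap
```

Integer values in JSON are inferred as long and quoted values as string, which is one reason to supply a schema explicitly when narrower or different types are wanted.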
First, we have to import some classes, as follows:
import org.apache.spark.sql.types._
Next, let's define a schema for a CSV file. In order to create such a file, we can simply write the DataFrame from the previous section to HDFS (again using the Apache Spark DataSource API):
washing_flat.write.csv("hdfs://localhost:9000/tmp/washing_flat.csv")
Let's double-check the contents...
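One way to check the written data from within Spark itself is to read it back, as in the following minimal sketch. It is an assumption about how such a check might be done, not the method used in the text: the sample rows, column names, and the local temporary directory are hypothetical, and with the setup described above the path would be hdfs://localhost:9000/tmp/washing_flat.csv instead.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder()
  .appName("csv-check-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in data and a local target path (assumption, not from the text).
val df = Seq(("a", 40L), ("b", 60L)).toDF("id", "temperature")
val path = Files.createTempDirectory("washing")
  .resolve("washing_flat.csv").toUri.toString

// write.csv creates a directory of part files and, by default, no header row.
df.write.csv(path)

// Reading back without a schema yields columns _c0, _c1, ... typed as string.
val readBack = spark.read.csv(path)
readBack.show()
```

Because the CSV files carry neither column names nor types, the data read back this way loses the original schema, which is what makes an explicitly defined schema useful for CSV input.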