Spark SQL can read data from external storage systems such as files, Hive tables, and JDBC databases through the DataFrameReader
interface.
The general form of the API call is spark.read.<inputtype>, where <inputtype> is one of the supported formats:
- Parquet
- CSV
- Hive table
- JDBC
- ORC
- Text
- JSON
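As a sketch of what these calls look like in practice, the snippet below reads each supported format through the DataFrameReader API; the file paths, table name, and JDBC connection details are hypothetical placeholders, and the Hive and JDBC examples assume Hive support is enabled and a JDBC driver is on the classpath:

val parquetDF = spark.read.parquet("data/states.parquet")
val jsonDF    = spark.read.json("data/states.json")
val orcDF     = spark.read.orc("data/states.orc")
val textDF    = spark.read.text("data/states.txt")   // yields a single "value" column
val hiveDF    = spark.read.table("default.states")   // Hive table; needs enableHiveSupport()
val jdbcDF    = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/statesdb")
  .option("dbtable", "states")
  .option("user", "spark")
  .load()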
Let's look at a couple of simple examples of reading CSV files into DataFrames:
scala> val statesPopulationDF = spark.read.option("header", "true").option("inferschema", "true").option("sep", ",").csv("statesPopulation.csv")
statesPopulationDF: org.apache.spark.sql.DataFrame = [State: string, Year: int ... 1 more field]

scala> val statesTaxRatesDF = spark.read.option("header", "true").option("inferschema", "true").option("sep", ",").csv("statesTaxRates.csv")
statesTaxRatesDF: org.apache.spark.sql.DataFrame = [State: string, TaxRate: double]
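Once loaded, the DataFrames can be inspected to confirm that the header and inferred schema options took effect; a minimal follow-up, using the column names shown in the REPL output above, might be:

// Verify the schema inferred from the CSV header and sample the data
statesPopulationDF.printSchema()
statesPopulationDF.show(5)
statesTaxRatesDF.show(5)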