DataFrames have a schema; RDDs do not, unless the RDD is composed of Row(...)
objects, whose named fields give Spark the information it needs to infer one.
In this recipe, we will learn how to create DataFrames by inferring the schema using reflection.
To execute this recipe, you need to have a working Spark 2.3 environment.
There are no other requirements.
In this example, we will first read our CSV sample data into an RDD and then create a DataFrame from it. Here's the code:
import pyspark.sql as sql

sample_data_rdd = sc.textFile('../Data/DataFrames_sample.csv')
header = sample_data_rdd.first()

sample_data_rdd_row = (
    sample_data_rdd
    .filter(lambda row: row != header)
    .map(lambda row: row.split(','))
    .map(lambda row: sql.Row(
        Id=int(row[0])
        , Model=row[1]
        , Year=int(row[2])
        , ScreenSize=row[3]
        , RAM=row[4]
        , HDD=row[5]
        , W=float(row[6])
        , D=float(row[7])
        ...