In the previous section of this chapter, we learnt many different ways of creating DataFrames. In this section, we focus on the operations that can be performed on them. Developers chain multiple operations to filter, transform, aggregate, and sort the data in a DataFrame, and the underlying Catalyst optimizer ensures that these operations execute efficiently. The functions you find here are similar to those you commonly find in SQL operations on tables.
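To make the SQL analogy concrete before involving Spark, here is a minimal sketch in plain Python of a transform, filter, aggregate, and sort chain over the same small list of colors used in this section. It is not Spark code; the variable names (rows, long_rows, counts, result) and the length threshold are illustrative assumptions, and each step only roughly corresponds to DataFrame methods such as filter, groupBy, and orderBy.

```python
from collections import defaultdict

# Sample data: the same colors used in this section's listing.
colors = ['white', 'green', 'yellow', 'red', 'brown', 'pink']

# Transform: pair each color with its length
# (similar to SELECT color, length(color)).
rows = [(c, len(c)) for c in colors]

# Filter: keep colors longer than three characters
# (similar to WHERE length > 3). The threshold is an arbitrary choice.
long_rows = [(c, n) for (c, n) in rows if n > 3]

# Aggregate: count how many colors share each length
# (similar to GROUP BY length with COUNT(*)).
counts = defaultdict(int)
for _, n in long_rows:
    counts[n] += 1

# Sort: order the groups by length, longest first
# (similar to ORDER BY length DESC).
result = sorted(counts.items(), key=lambda kv: kv[0], reverse=True)

print(result)
```

In Spark, a chain like this is declared on the DataFrame and planned as a whole by the optimizer, whereas the plain-Python version runs each step eagerly on local data.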
Python:
# Run in the pyspark shell, where sc (the SparkContext) is predefined
# Create a local collection of colors first
>>> colors = ['white','green','yellow','red','brown','pink']
# Distribute the local collection to form an RDD
# Apply a map function on that RDD to get another RDD of (color, length) tuples,
# then convert that RDD to a DataFrame
>>> color_df = sc.parallelize(colors).map(lambda x: (x, len(x))).toDF(['color','length'])
# Check the object type
>>> color_df
DataFrame[color: string, length: bigint]
# Check the schema...