HDFS is one of the most widely used big data storage systems. One reason for its wide adoption is schema-on-read: HDFS places no restrictions on data at write time. Data of any kind is welcome and can be stored in its raw format, which makes HDFS well suited to storing raw unstructured and semi-structured data.
When it comes to reading data, even unstructured data needs to be given some structure to make sense. Hadoop uses InputFormat to determine how to read the data. Spark provides complete support for Hadoop's InputFormat, so anything that can be read by Hadoop can be read by Spark as well.
The default InputFormat is TextInputFormat. TextInputFormat takes the byte offset of a line as a key and the content of a line as a value. Spark uses the sc.textFile method to read using TextInputFormat. It ignores the byte offset and creates an RDD of strings.
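To make the key/value shape concrete, here is a minimal plain-Python sketch. It is not Spark or Hadoop code: the helper names text_input_records and text_file are hypothetical and only imitate the behavior described above, namely records keyed by byte offset, with the keys dropped to leave a collection of strings.

```python
import io


def text_input_records(data: bytes):
    """Yield (byte_offset, line) pairs, imitating TextInputFormat's record shape.

    Hypothetical helper for illustration only; not part of Hadoop or Spark.
    """
    offset = 0
    for raw in io.BytesIO(data).readlines():
        # The key is the position of the line's first byte in the file;
        # the value is the line content without its line terminator.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


def text_file(data: bytes):
    """Keep only the values, as the text describes sc.textFile doing."""
    return [line for _, line in text_input_records(data)]


sample = b"to be\nor not\nto be\n"
records = list(text_input_records(sample))
lines = text_file(sample)

print(records)  # [(0, 'to be'), (6, 'or not'), (13, 'to be')]
print(lines)    # ['to be', 'or not', 'to be']
```

The offsets advance by the full byte length of each line, including the newline, which is why the second record starts at 6 rather than 5. In a real Spark shell, the equivalent of the last step would be a call to sc.textFile with a file path, which returns the lines as an RDD of strings.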
Sometimes the filename itself contains useful information...