It is often said that around 50% of a data scientist's time goes into transforming raw data into a usable format. Raw data can come in any format and size. It can be structured, as in an RDBMS; semi-structured, as in CSV files; or unstructured, as in plain text files. All of these contain valuable information, but to extract it, the data must first be converted into a data structure or format from which an algorithm can derive insights. A usable format, then, is data in a form that can be consumed in the data science process, and that form differs from use case to use case.
This chapter guides you through data munging, the process of preparing data for analysis. It covers the following topics:
What is data munging?
DataFrames.jl
Uploading data from a file
Finding the required data
Joins and indexing
Split-Apply-Combine strategy
Reshaping the data
Formula (ModelFrame and ModelMatrix)
PooledDataArray
Web scraping