In this chapter, we used a sample dataset, a collection of tweets called Sentiment140, to learn how to clean and manipulate data in a relational database management system (RDBMS). We performed a few basic cleaning procedures in Excel, and then we reviewed how to get the data out of a CSV file and into the database. From that point on, the remaining cleaning procedures were performed inside the RDBMS itself. We learned how to convert strings into proper dates, and then we worked on extracting three kinds of data from within the tweet text, ultimately moving these extracted values to new, clean tables. Next, we learned how to create a lookup table for values that were stored inefficiently, which allowed us to update the original table with efficient, numeric lookup values. Finally, because we performed many steps, and because there is always the potential for mistakes or miscommunication about what we did, we reviewed some strategies for documenting our cleaning procedures.
In the next...