Clean Data

By: Megan Squire

Overview of this book

Is much of your time spent doing tedious tasks such as cleaning dirty data, accounting for lost data, and preparing data to be used by others? If so, then having the right tools makes a critical difference, and will be a great investment as you grow your data science expertise.

The book starts by highlighting the importance of data cleaning in data science, and will show you how to reap rewards from reforming your cleaning process. Next, you will cement your knowledge of the basic concepts that the rest of the book relies on: file formats, data types, and character encodings. You will also learn how to extract and clean data stored in RDBMS, web files, and PDF documents, through practical examples.

At the end of the book, you will be given a chance to tackle a couple of real-world projects.

Preparing a clean data package


In this section, we delve into the many important questions that need to be answered before you can release a data package for general consumption.

How do you want people to access your data? If it is in a database, do you want users to be able to log in and run SQL commands on it? Or do you want to create downloadable flat text files for them to use? Do you need to create an API for the data? How much data do you have anyway, and do you want different levels of access for different parts of the data?
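To make the flat-file option concrete, here is a minimal sketch of how a table might be exported to a compressed plain text file for download. It assumes the clean data lives in a SQLite database, which is only a stand-in for whatever RDBMS you actually use. The function name `export_table_to_gzip_csv` and the table and file names in the usage note are hypothetical, chosen purely for illustration.

```python
import csv
import gzip
import sqlite3


def export_table_to_gzip_csv(conn, table, out_path):
    """Write every row of `table` to a gzip-compressed CSV file.

    Hypothetical helper for illustration only. Returns the number of
    data rows written (the header row is not counted).
    """
    # Table names cannot be bound as SQL parameters, so we check the
    # requested name against the tables that actually exist.
    known = {
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    }
    if table not in known:
        raise ValueError(f"unknown table: {table!r}")

    cursor = conn.execute(f'SELECT * FROM "{table}"')
    header = [col[0] for col in cursor.description]

    count = 0
    # "wt" opens the gzip stream in text mode; newline="" lets the csv
    # module control line endings itself.
    with gzip.open(out_path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for row in cursor:
            writer.writerow(row)
            count += 1
    return count
```

With an open connection, a call such as `export_table_to_gzip_csv(conn, "people", "people.csv.gz")` would produce a single file that users can download and decompress with standard tools. Whether this is enough, or whether you need database logins or an API instead, depends on your answers to the questions above.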

The technical aspects of how you share your clean data are extremely important. In general, it is a good idea to start simple and move to a more sophisticated distribution plan only when and if you need to. The following are some options for distributing data, ordered from least to most complicated. Of course, with greater sophistication come greater benefits:

  • Compressed plain text – This is a very low-stakes distribution...