Clean Data

By: Megan Squire

Overview of this book

Is much of your time spent doing tedious tasks such as cleaning dirty data, accounting for lost data, and preparing data to be used by others? If so, then having the right tools makes a critical difference, and will be a great investment as you grow your data science expertise.

The book starts by highlighting the importance of data cleaning in data science, and will show you how to reap rewards from reforming your cleaning process. Next, you will cement your knowledge of the basic concepts that the rest of the book relies on: file formats, data types, and character encodings. You will also learn how to extract and clean data stored in RDBMS, web files, and PDF documents, through practical examples.

At the end of the book, you will be given a chance to tackle a couple of real-world projects.

Step five – clean other mystery characters


While perusing the tweet_text column, we may have noticed a few odd tweets, such as those with tweet IDs 613 and 2086:

613, Talk is Cheap: Bing that, I?ll stick with Google
2086, Stanford University?s Facebook Profile

The ? character is what we should be concerned about. As with the HTML-encoded characters we saw earlier, this character issue is also very likely an artifact of a prior conversion between character sets. In this case, there was probably some kind of high-ASCII or Unicode apostrophe (sometimes called a smart quote) in the original tweet, but when the data was converted into a lower-order character set, such as plain ASCII, that particular flavor of apostrophe was simply changed to a ?.
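To make the suspected cause concrete, here is a minimal Python sketch of how such a lossy conversion can produce a ?. It assumes the original tweet contained the Unicode right single quotation mark (U+2019); the source data does not tell us which character was actually there, so this is an illustration of the mechanism rather than a reconstruction of what happened to this dataset.

```python
# Assumption: the original tweet used U+2019 (right single quotation mark),
# one common "smart quote" apostrophe. The real original is unknown.
original = "Talk is Cheap: Bing that, I\u2019ll stick with Google"

# Encoding to plain ASCII with errors="replace" substitutes "?" for every
# character that ASCII cannot represent.
ascii_bytes = original.encode("ascii", errors="replace")
converted = ascii_bytes.decode("ascii")

print(converted)  # Talk is Cheap: Bing that, I?ll stick with Google
```

Once the replacement has happened, the original character is gone: nothing in the converted string records whether the ? was a smart quote, an accented letter, or a real question mark.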

Depending on what we plan to do with the data, we might not want to leave the ? character as it is. For example, if we are performing word counting or text mining, it may be very important that we convert I?ll to I'll and University?s to University's. If we decide...
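One possible way to perform the I?ll to I'll conversion is sketched below in Python. This is not necessarily the approach taken in the book; it assumes the tweets are available as Python strings, and the function name `restore_apostrophes` is hypothetical. The heuristic treats a ? with a letter, digit, or underscore on both sides as a lost apostrophe, which leaves ordinary sentence-ending question marks alone.

```python
import re

# Heuristic (an assumption, not a rule from the source): a "?" with a word
# character immediately before and after it is a damaged apostrophe.
MID_WORD_QUESTION = re.compile(r"(?<=\w)\?(?=\w)")


def restore_apostrophes(text: str) -> str:
    """Replace a mid-word '?' with a plain ASCII apostrophe."""
    return MID_WORD_QUESTION.sub("'", text)


# The two example tweets from the text, keyed by tweet ID.
tweets = {
    613: "Talk is Cheap: Bing that, I?ll stick with Google",
    2086: "Stanford University?s Facebook Profile",
}

cleaned = {tweet_id: restore_apostrophes(text) for tweet_id, text in tweets.items()}

for tweet_id, text in cleaned.items():
    print(tweet_id, text)
```

A question mark followed by a space or the end of the string is left unchanged, but text such as a URL query string (for example `page?id=1`) would also match, so it is worth reviewing the affected rows before applying a rule like this to the whole column.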