While perusing the tweet_text column, we may have noticed a few odd tweets, such as tweet IDs 613 and 2086:
613, Talk is Cheap: Bing that, I?ll stick with Google
2086, Stanford University?s Facebook Profile
The ? character is what we should be concerned about. As with the HTML-encoded characters we saw earlier, this character issue is also very likely an artifact of a prior conversion between character sets. In this case, there was probably some kind of high-ASCII or Unicode apostrophe (sometimes called a smart quote) in the original tweet, but when the data was converted into a lower-order character set, such as plain ASCII, that particular flavor of apostrophe was simply changed to a ?.
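To make the suspected cause concrete, here is a minimal Python sketch of one way such a loss can occur. It is an illustration only: we do not know which tool or conversion step actually produced the ? in this dataset, and the sample string is a hypothetical reconstruction of tweet 613 that assumes the original apostrophe was the Unicode right single quotation mark (U+2019).

```python
# Hypothetical reconstruction of the original tweet text, assuming the
# apostrophe was a Unicode right single quotation mark (U+2019).
original = "Bing that, I\u2019ll stick with Google"

# Encoding to plain ASCII with errors="replace" substitutes "?" for every
# character that ASCII cannot represent, which is one common way a smart
# quote ends up as a question mark.
converted = original.encode("ascii", errors="replace").decode("ascii")

print(converted)  # Bing that, I?ll stick with Google
```

Once the substitution has happened, the original character cannot be recovered from the ? alone, because any unrepresentable character would have produced the same result.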
Depending on what we plan to do with the data, we might not want to leave the ? character as it is. For example, if we are performing word counting or text mining, it may be very important that we convert I?ll to I'll and University?s to University's. If we decide...