Index
Symbols
- & character
- cleaning / Step four – clean the & character
A
- abnormalities
- American Standard Code for Information Interchange (ASCII)
- about / Character encodings
- Application Programming Interface (API)
- archive file
- about / Archive files
- tar / tar
- archiving
- about / Archiving and compression
- Association of Internet Researchers (AoIR)
- about / Common terms of use
B
- BeautifulSoup
- about / Method two – Python and BeautifulSoup
- file, finding for experimenting / Step one – find and save a file for experimenting
- file, saving for experimenting / Step one – find and save a file for experimenting
- installing / Step two – install BeautifulSoup
- Python program used, for extracting data / Step three – write a Python program to extract the data
- file, viewing / Step four – view the file and make sure it is clean
- BigQuery Google Group
- binary files
- versus text files / Text files versus binary files
- about / Text files versus binary files
- Bzip2
- about / Rules of thumb
C
- character encodings
- about / Character encodings
- example / Example one – finding multibyte characters in MySQL data, Example two – finding the UTF-8 and Latin-1 equivalents of Unicode characters stored in MySQL, Example three – handling UTF-8 encoding at the file level, Option two – write the file in a way that can handle UTF-8 characters
- child
- about / The tree structure model
- child items
- Chrome Scraper
- about / Method three – Chrome Scraper
- Scraper Chrome extension, installing / Step one – install the Scraper Chrome extension
- data, collecting from website / Step two – collect data from the website
- final cleaning, on data columns / Step three – final cleaning on the data columns
- clean data package
- preparing / Preparing a clean data package
- cleaning solution, Adobe Systems
- cleaning solution, copying
- about / Try simple solutions first – copying
- PDF file, using / Our experimental file
- data, copying out / Step one – try copying out the data we want
- copied data, pasting into text editor / Step two – try pasting the copied data into a text editor
- smaller version, creating of file / Step three – make a smaller version of the file
- cleaning solution, pdfMiner
- about / Another technique to try – pdfMiner
- installing / Step one – install pdfMiner
- text, pulling from PDF file / Step two – pull text from the PDF file
- cleaning solution, Tabula
- about / Third choice – Tabula
- downloading / Step one – download Tabula
- running / Step two – run Tabula
- used, for extracting data / Step three – direct Tabula to extract the data
- data, copying out / Step four – copy the data out
- data cleaning / Step five – more cleaning
- CMS
- about / Documentation wiki or CMS
- documentation / Documentation wiki or CMS
- project / Documentation wiki or CMS
- data, obtaining / Documentation wiki or CMS
- data, using / Documentation wiki or CMS
- column mode
- about / The column mode
- limitations / The column mode
- Comma-Separated Values (CSV)
- about / The delimited format
- compressed files
- about / Compressed files
- creating / How to compress files
- compression
- about / Archiving and compression
- with zip / Compression with zip, gzip, and bzip2
- with gzip / Compression with zip, gzip, and bzip2
- with bzip2 / Compression with zip, gzip, and bzip2
- options / Compression options
- using / Which compression program should I use?
- factors / Which compression program should I use?
- thumb rules / Rules of thumb
- concatenate() function
- about / Concatenating strings
- contents
- conversions, spreadsheet to JSON
- about / Spreadsheet to JSON
- Google spreadsheet, publishing to Web / Step one – publish Google spreadsheet to the Web
- correct URL, creating / Step two – create the correct URL
- Creative Commons (CC) licenses
- about / Creative Commons
- URL / Creative Commons
- csvkit
- used, for converting CSV to JSON / CSV to JSON using csvkit
- about / CSV to JSON using csvkit
D
- D3 galleries
- data
- importing, into MySQL / Step three – import the data into MySQL in a single table
- abnormalities, detecting / Detecting and cleaning abnormalities
- abnormalities, cleaning / Detecting and cleaning abnormalities
- table, creating / Creating our table
- documenting / Documenting your data
- README files / README files
- file headers / File headers
- models / Data models and diagrams
- diagrams / Data models and diagrams
- CMS / Documentation wiki or CMS
- publicizing / Publicizing your data
- lists, of datasets / Lists of datasets
- Open Data, on Stack Exchange / Open Data on Stack Exchange
- hackathons / Hackathons
- visualizing / Step five – visualizing the data, Step five – visualizing the data
- data, cleaning from web forums
- program status / Program status
- program output / Program output
- extraction code / Extraction code
- data analysis
- about / Step four – analyzing the data
- popular paste sites, finding / Which paste sites are most popular?
- popular paste sites, in questions / Which paste sites are popular in questions and which are popular in answers?
- popular paste sites, in answers / Which paste sites are popular in questions and which are popular in answers?
- URL-containing posts / Do posts contain both URLs to paste sites and source code?
- database import
- cleaning / Step two – clean for database import
- Database Management System (DBMS)
- about / Integers
- database tables
- tweet table / Creating database tables
- hashtag table / Creating database tables
- URL table / Creating database tables
- mentions table / Creating database tables
- data cleaning
- communicating about / Communicating about data cleaning
- environment / Our data cleaning environment
- tools and technologies / Our data cleaning environment
- introductory example / An introductory example
- about / Step three – cleaning the data, Step three – data cleaning
- new tables, creating / Creating the new tables
- URLs, extracting / Extracting URLs and populating the new tables
- new tables, populating / Extracting URLs and populating the new tables, Extracting code and populating new tables
- code, extracting / Extracting code and populating new tables
- database tables, creating / Creating database tables
- new tables, populating in Python / Populating the new tables in Python
- data cleaning tasks
- performing / Step five – convert data to the Pajek file format
- data collection
- defining / Step two – collecting the data
- Ferguson file, downloading / Download and extract the Ferguson file
- Ferguson file, extracting / Download and extract the Ferguson file
- test version, creating of file / Create a test version of the file
- tweet IDs, hydrating / Hydrate the tweet IDs
- data license
- data loss
- risk factors / Data loss
- data science
- perspective / A fresh perspective
- process / The data science process
- dataset
- about / Experimenting with JSON
- data types
- about / Data types, nulls, and encodings, Data types
- numeric data / Numeric data
- dates and times / Dates and time
- strings / Strings
- sets/enums / Other data types
- booleans / Other data types
- blobs / Other data types
- converting between / Converting between data types
- data loss / Data loss
- date
- about / Step six – clean the dates
- dates
- cleaning / Step six – clean the dates
- datetime
- about / Step six – clean the dates
- Dave Heaton
- about / Method three – Chrome Scraper
- decimal point (scale)
- about / Numbers with decimals
- delimited format
- about / The delimited format
- invisible characters, observing / Seeing invisible characters
- values, enclosing to trap errant characters / Enclosing values to trap errant characters
- characters, escaping / Escaping characters
- JSON format / The JSON format
- HTML format / The HTML format
- delimiter model / The line-by-line delimiter model
- delimiters
- Dev-Zone developer
- Django IRC log
- Django IRC log archive
- DocuSign
E
- elements
- empties
- empty
- about / Blanks
- encodings
- about / Data types, nulls, and encodings
- Enron
- reference link / An introductory example
- about / An introductory example
- enron database
- about / SQL to JSON using PHP
- Enron e-mail corpus
- reference link / An introductory example
- entity-relationship diagram (ERD)
- about / Data models and diagrams
- creating / Data models and diagrams
- example project
- about / An example project
- problem, stating / Step one – state the problem
- data collection / Step two – data collection
- data, downloading / Download the data
- data, defining / Get familiar with the data
- data cleaning / Step three – data cleaning
- relevant lines, extracting / Extracting relevant lines
- spreadsheet, using / Using a spreadsheet
- text editor, using / Using a text editor
- lines, transforming / Transform the lines
- data analysis / Step four – data analysis
- data, extracting from e-mail / Example project – Extracting data from e-mail and web forums
- data, extracting from web forums / Example project – Extracting data from e-mail and web forums
- project background / The background of the project
- data, cleaning from Google Groups e-mail / Part one – cleaning data from Google Groups e-mail
- Google Groups messages, collecting / Step one – collect the Google Groups messages
- data, extracting from Google Groups messages / Step two – extract data from the Google Groups messages
- data, cleaning from web forums / Part two – cleaning data from web forums
- RSS, collecting / Step one – collect some RSS that points us to HTML files
- URLs, extracting from RSS / Step two – Extract URLs from RSS; collect and parse HTML
- HTML, collecting / Step two – Extract URLs from RSS; collect and parse HTML
- HTML, parsing / Step two – Extract URLs from RSS; collect and parse HTML
- example project, data conversions
- about / The example project
- Facebook social network, investigating / The example project
- Facebook data, downloading as GDF / Step one – download Facebook data as GDF
- GDF file format, in text editor / Step two – look at the GDF file format in a text editor
- GDF, converting into JSON / Step three – convert the GDF file into JSON
- D3 diagram, building / Step four – build a D3 diagram
- data, converting to Pajek file format / Step five – convert data to the Pajek file format
- simple network metrics, calculating / Step six – calculate simple network metrics
F
- false positives
- Ferguson tweets
- file, via JSON
- file-based manipulation / Strategies for conversion
- file extensions
- file formats
- about / File formats
- text files, versus binary files / Text files versus binary files
- used, for text files / Common formats for text files
- delimited format / The delimited format
- file headers / File headers
- files
- opening / Opening and reading files
- reading / Opening and reading files
- unknown file, opening / Peeking inside files
- on OSX / On OSX or Linux
- on Linux / On OSX or Linux
- on Windows / On Windows
- uncompressing / How to uncompress files
- find-replace combinations
- about / Heavy duty find and replace
- FLOSSmole
- about / README files
- force option
- about / Compression options
- full set of tweets
- collecting / Moving this process into full (non-test) tables
- full tables
- test tables, moving to / Moving from test tables to full tables
G
- GDF file
- converting, into Pajek format / Step five – convert data to the Pajek file format
- Git
- Github
- used, for distributing data / A word of caution – Using GitHub to distribute data
- Google Groups messages
- collecting / Step one – collect the Google Groups messages
- extraction code / Extraction code
- program output / Program output
- Google spreadsheet
- converting, into JSON representation / Spreadsheet to JSON
- URL / Step two – create the correct URL
- list / Step two – create the correct URL
- key / Step two – create the correct URL
- sheet / Step two – create the correct URL
- Graph Description Format (GDF)
- about / The example project
- Gzip
- about / Rules of thumb
H
- hackathons / Hackathons
- hashtags
- about / Step seven – separate user mentions, hashtags, and URLs
- extracting / Extract hashtags
- header row
- about / The delimited format
- headers
- Help guide, Github
- HTML format / The HTML format
- HTML page structure
- defining / Understanding the HTML page structure
- delimiter model / The line-by-line delimiter model
- tree structure model / The tree structure model
- hydrating
- about / Hydrate the tweet IDs
I
- Institutional Review Boards (IRB)
- about / Common terms of use
- interactive paste
- invalid user mentions
- about / Extract user mentions
- IRC chat
- URL / Text to columns in Excel
- iTunes API
- URL / Experimenting with JSON
J
- janitor work
- about / A fresh perspective
- JSON
- experimenting with / Experimenting with JSON
- JSON format
- about / The JSON format
L
- link rot
- LOAD XML syntax
- log
- maintaining / Step nine – document what you did
- log, for data cleaning
- example / Communicating about data cleaning
- about / Communicating about data cleaning
- lookup tables
- cleaning / Step eight – cleaning for lookup tables
- creating / Step eight – cleaning for lookup tables
M
- meta-collections
- about / Lists of datasets
- metadata
- microblogging platform
- Mike Bostock
- mnmldave
- about / Method three – Chrome Scraper
- movielens dataset
- MySQL documentation
- mystery characters
- cleaning / Step five – clean other mystery characters
N
- name-value pairs
- about / Experimenting with JSON
- netvizz app
- networkx
- new tables
- creating / Create some new tables
- node
- about / The tree structure model
- null
- defining / If a null falls in a forest…
- zero / Zero
- empties / Empties
- about / Null
- using / Why is the middle name example "empty" and not NULL?, Is it ever useful to clean data using a zero instead of an empty or null?
- nulls
- about / Data types, nulls, and encodings
- number (precision)
- about / Numbers with decimals
- numeric data
- about / Numeric data
- integers / Integers
- numbers with decimals / Numbers with decimals
- non-numeric data, defining / When numbers are not numeric
O
- ODbL
- about / ODbL and Open Data Commons
- online regex testers
- using / A word of caution
- Open Data, on Stack Exchange
- about / Open Data on Stack Exchange
- URL / Open Data on Stack Exchange
- Open Data Commons
- about / ODbL and Open Data Commons
- URL / ODbL and Open Data Commons
- Open Data Handbook
- Open Knowledge Foundation (OKF)
- about / ODbL and Open Data Commons
- options, distribute data
- about / Preparing a clean data package
- compressed plain text / Preparing a clean data package
- compressed SQL files / Preparing a clean data package
- live database access / Preparing a clean data package
- API / Preparing a clean data package
- options, for errant characters
- originals[]
P
- Pajek
- about / The example project
- Pajek file format
- parent
- about / The tree structure model
- Pastebin
- paste site
- PDF files
- cleaning / Why is cleaning PDF files difficult?
- pdfMiner
- peer
- about / Step one – state the problem
- PewResearch website
- URL / Our experimental file
- PHP
- used, for conversion / Converting with PHP
- used, for converting SQL to JSON / SQL to JSON using PHP
- used, for converting SQL to CSV / SQL to CSV using PHP
- used, for converting JSON to CSV / JSON to CSV using PHP
- used, for converting CSV to JSON / CSV to JSON using PHP
- phpMyAdmin tool
- Portable Document Format (PDF)
- problem resolution
- Process Lines Containing / Process Lines Containing
- Python
- used, for conversion / Converting with Python
- used, for converting CSV to JSON / CSV to JSON using Python
- used, for converting JSON to CSV / Python JSON to CSV
- about / Method two – Python and BeautifulSoup
- Python and regular expressions
- about / Method one – Python and regular expressions
- Web file, finding for experimenting / Step one – find and save a Web file for experimenting
- Web file, saving for experimenting / Step one – find and save a Web file for experimenting
- several things, extracting / Step two – look into the file and decide what is worth extracting
- Python program, writing / Step three – write a Python program to pull out the interesting pieces and save them to a CSV file
- file, viewing / Step four – view the file and make sure it is clean
- HTML parsing, limitations / The limitations of parsing HTML using regular expressions
- Python regex tester
R
- RAR
- about / Rules of thumb
- rate limits, Twitter
- README files / README files
- Really Simple Syndication (RSS)
- record-oriented files
- about / The delimited format
- regex
- using / Extraction code
- regular expression (regex) / Heavy duty find and replace
- regular expression symbols
- replacing
- about / Compression options
- replies[]
- rewriting process
S
- Scraper
- semi-structured data
- about / The JSON format
- Sentiment140 dataset
- about / Getting ready
- downloading / Step one – download and examine Sentiment140
- examining / Step one – download and examine Sentiment140
- Sentiment140 project
- smart quote
- Social Network Analysis (SNA)
- space
- about / Blanks
- spreadsheet
- CSV, creating from / Creating CSV from a spreadsheet
- used, for generating SQL / Generating SQL using a spreadsheet
- spreadsheet, for data cleaning
- about / Spreadsheet data cleaning
- text to columns, in Excel / Text to columns in Excel
- strings, splitting / Splitting strings
- strings, concatenating / Concatenating strings
- SQL-based manipulation / Strategies for conversion
- Stack Exchange
- Stack Overflow
- Stack Overflow data
- collecting / Step two – collecting and storing the Stack Overflow data
- storing / Step two – collecting and storing the Stack Overflow data
- data dump, downloading / Downloading the Stack Overflow data dump
- files, unarchiving / Unarchiving the files
- MySQL tables, creating / Creating MySQL tables and loading data
- data, loading / Creating MySQL tables and loading data
- test tables, building / Building test tables
- storage engine
- URL / Creating our table
- strategies, for conversion
- about / Strategies for conversion
- SQL-based manipulation / Strategies for conversion
- file-based manipulation / Strategies for conversion
- type conversion, at SQL level / Type conversion at the SQL level
- type conversion, at file level / Type conversion at the file level
- string functions
- strings / Strings
- strings, concatenating
- conditional formatting, to find unusual values / Conditional formatting to find unusual values
- unusual values, finding / Sorting to find unusual values
- spreadsheet data, importing into MySQL / Importing spreadsheet data into MySQL
T
- Tab-Separated Values (TSV)
- Tabula
- about / Third choice – Tabula
- tags
- Tape ARchive (TAR) files
- about / tar
- tar / tar
- terms, data
- using / Common terms of use
- citations / Common terms of use
- privacy / Common terms of use
- appropriate uses for data / Common terms of use
- contact / Common terms of use
- Creative Commons (CC) licenses / Creative Commons
- ODbL / ODbL and Open Data Commons
- terms and licenses, data
- setting / Setting terms and licenses for your data
- Terms of Service (ToS)
- about / Step two – collecting the data
- Terms of Use (ToU)
- test tables
- moving, to full tables / Moving from test tables to full tables
- creating / Moving this process into full (non-test) tables
- text editor
- used, for data cleaning / Text editor data cleaning
- text tweaking / Text tweaking
- changing case / Text tweaking
- zapping gremlins / Text tweaking
- column mode / The column mode
- find and replace / Heavy duty find and replace
- text sorting / Text sorting and processing duplicates
- duplicates, processing / Text sorting and processing duplicates
- Process Lines Containing / Process Lines Containing
- text files
- versus binary files / Text files versus binary files
- about / Text files versus binary files
- formats / Common formats for text files
- types / Common formats for text files
- text sorting / Text sorting and processing duplicates
- text tweaking / Text tweaking
- Text Wrangler
- thumb rules / Rules of thumb
- time
- about / Step six – clean the dates
- tool-based conversions
- about / Quick tool-based conversions
- spreadsheet to CSV / Spreadsheet to CSV
- spreadsheet to JSON / Spreadsheet to JSON
- SQL to CSV, phpMyAdmin used / SQL to CSV or JSON using phpMyAdmin
- SQL to JSON, phpMyAdmin used / SQL to CSV or JSON using phpMyAdmin
- tree structure model / The tree structure model
- trim() function / Text to columns in Excel
- true positives
- twarc
- about / Hydrate the tweet IDs
- installing / Installing twarc
- URL / Installing twarc
- running / Running twarc
- tweet archives
- question, posing for / Step one – posing a question about an archive of tweets
- about / Step one – posing a question about an archive of tweets
- tweet IDs
- hydrating / Hydrate the tweet IDs
- Twitter developer account, setting up / Setting up a Twitter developer account
- twarc, installing / Installing twarc
- twarc, running / Running twarc
- tweets
- Twitter account
- Twitter authentication
- Twitter developer account
- setting up / Setting up a Twitter developer account
- URL / Setting up a Twitter developer account
- type conversion, at file level
- type conversion, at SQL level
U
- Unarchiver
- URL / Unarchiving the files
- Unicode
- about / Character encodings
- updating
- about / Compression options
- URL pattern matching routine
- URL / Extract URLs
- URLs
- about / Step seven – separate user mentions, hashtags, and URLs
- extracting / Extract URLs
- user mentions
- about / Step seven – separate user mentions, hashtags, and URLs
- extracting / Extract user mentions
- UTF-8
- about / Character encodings
- utf8mb4 collation
- about / Creating database tables
V
- valid user mentions
- about / Extract user mentions
- version control systems / Communicating about data cleaning
- Vi
- using / Seeing invisible characters
W
- weather data
- URL / The HTML format
Z
- 7-Zip software
- URL / Unarchiving the files
- Zip
- about / Rules of thumb