Index
Symbols
- & character
- cleaning / Step four – clean the & character
A
- abnormalities
- American Standard Code for Information Interchange (ASCII)
- about / Character encodings
- Application Programming Interface (API)
- archive file
- about / Archive files
- tar / tar
- archiving
- about / Archiving and compression
- Association of Internet Researchers (AoIR)
- about / Common terms of use
B
- BeautifulSoup
- about / Method two – Python and BeautifulSoup
- file, finding for experimenting / Step one – find and save a file for experimenting
- file, saving for experimenting / Step one – find and save a file for experimenting
- installing / Step two – install BeautifulSoup
- Python program used, for extracting data / Step three – write a Python program to extract the data
- file, viewing / Step four – view the file and make sure it is clean
- BigQuery Google Group
- binary files
- versus text files / Text files versus binary files
- about / Text files versus binary files
- Bzip2
- about / Rules of thumb
C
- character encodings
- about / Character encodings
- example / Example one – finding multibyte characters in MySQL data, Example two – finding the UTF-8 and Latin-1 equivalents of Unicode characters stored in MySQL, Example three – handling UTF-8 encoding at the file level, Option two – write the file in a way that can handle UTF-8 characters
- child
- about / The tree structure model
- child items
- Chrome Scraper
- about / Method three – Chrome Scraper
- Scraper Chrome extension, installing / Step one – install the Scraper Chrome extension
- data, collecting from website / Step two – collect data from the website
- final cleaning, on data columns / Step three – final cleaning on the data columns
- clean data package
- preparing / Preparing a clean data package
- cleaning solution, Adobe Systems
- cleaning solution, copying
- about / Try simple solutions first – copying
- PDF file, using / Our experimental file
- data, copying out / Step one – try copying out the data we want
- copied data, pasting into text editor / Step two – try pasting the copied data into a text editor
- smaller version, creating of file / Step three – make a smaller version of the file
- cleaning solution, pdfMiner
- about / Another technique to try – pdfMiner
- installing / Step one – install pdfMiner
- text, pulling from PDF file / Step two – pull text from the PDF file
- cleaning solution, Tabula
- about / Third choice – Tabula
- downloading / Step one – download Tabula
- running / Step two – run Tabula
- used, for extracting data / Step three – direct Tabula to extract the data
- data, copying out / Step four – copy the data out
- data cleaning / Step five – more cleaning
- CMS
- about / Documentation wiki or CMS
- documentation / Documentation wiki or CMS
- project / Documentation wiki or CMS
- data, obtaining / Documentation wiki or CMS
- data, using / Documentation wiki or CMS
- column mode
- about / The column mode
- limitations / The column mode
- Comma-Separated Values (CSV)
- about / The delimited format
- compressed files
- about / Compressed files
- creating / How to compress files
- compression
- about / Archiving and compression
- with zip / Compression with zip, gzip, and bzip2
- with gzip / Compression with zip, gzip, and bzip2
- with bzip2 / Compression with zip, gzip, and bzip2
- options / Compression options
- using / Which compression program should I use?
- factors / Which compression program should I use?
- thumb rules / Rules of thumb
- concatenate() function
- about / Concatenating strings
- contents
- conversions, spreadsheet to JSON
- about / Spreadsheet to JSON
- Google spreadsheet, publishing to Web / Step one – publish Google spreadsheet to the Web
- correct URL, creating / Step two – create the correct URL
- Creative Commons (CC) licenses
- about / Creative Commons
- URL / Creative Commons
- csvkit
- used, for converting CSV to JSON / CSV to JSON using csvkit
- about / CSV to JSON using csvkit
D
- D3 galleries
- data
- importing, into MySQL / Step three – import the data into MySQL in a single table
- abnormalities, detecting / Detecting and cleaning abnormalities
- abnormalities, cleaning / Detecting and cleaning abnormalities
- table, creating / Creating our table
- documenting / Documenting your data
- README files / README files
- file headers / File headers
- models / Data models and diagrams
- diagrams / Data models and diagrams
- CMS / Documentation wiki or CMS
- publicizing / Publicizing your data
- lists, of datasets / Lists of datasets
- Open Data, on Stack Exchange / Open Data on Stack Exchange
- hackathons / Hackathons
- visualizing / Step five – visualizing the data, Step five – visualizing the data
- data, cleaning from web forums
- program status / Program status
- program output / Program output
- extraction code / Extraction code
- data analysis
- about / Step four – analyzing the data
- popular paste sites, finding / Which paste sites are most popular?
- popular paste sites, in questions / Which paste sites are popular in questions and which are popular in answers?
- popular paste sites, in answers / Which paste sites are popular in questions and which are popular in answers?
- URL-containing posts / Do posts contain both URLs to paste sites and source code?
- database import
- cleaning / Step two – clean for database import
- Database Management System (DBMS)
- about / Integers
- database tables
- tweet table / Creating database tables
- hashtag table / Creating database tables
- URL table / Creating database tables
- mentions table / Creating database tables
- data cleaning
- communicating about / Communicating about data cleaning
- environment / Our data cleaning environment
- tools and technologies / Our data cleaning environment
- introductory example / An introductory example
- about / Step three – cleaning the data, Step three – data cleaning
- new tables, creating / Creating the new tables
- URLs, extracting / Extracting URLs and populating the new tables
- new tables, populating / Extracting URLs and populating the new tables, Extracting code and populating new tables
- code, extracting / Extracting code and populating new tables
- database tables, creating / Creating database tables
- new tables, populating in Python / Populating the new tables in Python
- data cleaning tasks
- performing / Step five – convert data to the Pajek file format
- data collection
- defining / Step two – collecting the data
- Ferguson file, downloading / Download and extract the Ferguson file
- Ferguson file, extracting / Download and extract the Ferguson file
- test version, creating of file / Create a test version of the file
- tweet IDs, hydrating / Hydrate the tweet IDs
- data license
- data loss
- risk factors / Data loss
- data science
- perspective / A fresh perspective
- process / The data science process
- dataset
- about / Experimenting with JSON
- data types
- about / Data types, nulls, and encodings, Data types
- numeric data / Numeric data
- dates and times / Dates and time
- strings / Strings
- sets/enums / Other data types
- booleans / Other data types
- blobs / Other data types
- converting between / Converting between data types
- data loss / Data loss
- date
- about / Step six – clean the dates
- dates
- cleaning / Step six – clean the dates
- datetime
- about / Step six – clean the dates
- Dave Heaton
- about / Method three – Chrome Scraper
- decimal point (scale)
- about / Numbers with decimals
- delimited format
- about / The delimited format
- invisible characters, observing / Seeing invisible characters
- values, enclosing to trap errant characters / Enclosing values to trap errant characters
- characters, escaping / Escaping characters
- JSON format / The JSON format
- HTML format / The HTML format
- delimiter model / The line-by-line delimiter model
- delimiters
- Dev-Zone developer
- Django IRC log
- Django IRC log archive
- DocuSign
E
- elements
- empties
- empty
- about / Blanks
- encodings
- about / Data types, nulls, and encodings
- Enron
- reference link / An introductory example
- about / An introductory example
- enron database
- about / SQL to JSON using PHP
- Enron e-mail corpus
- reference link / An introductory example
- entity-relationship diagram (ERD)
- about / Data models and diagrams
- creating / Data models and diagrams
- example project
- about / An example project
- problem, stating / Step one – state the problem
- data collection / Step two – data collection
- data, downloading / Download the data
- data, defining / Get familiar with the data
- data cleaning / Step three – data cleaning
- relevant lines, extracting / Extracting relevant lines
- spreadsheet, using / Using a spreadsheet
- text editor, using / Using a text editor
- lines, transforming / Transform the lines
- data analysis / Step four – data analysis
- data, extracting from e-mail / Example project – Extracting data from e-mail and web forums
- data, extracting from web forums / Example project – Extracting data from e-mail and web forums
- project background / The background of the project
- data, cleaning from Google Groups e-mail / Part one – cleaning data from Google Groups e-mail
- Google Groups messages, collecting / Step one – collect the Google Groups messages
- data, extracting from Google Groups messages / Step two – extract data from the Google Groups messages
- data, cleaning from web forums / Part two – cleaning data from web forums
- RSS, collecting / Step one – collect some RSS that points us to HTML files
- URLs, extracting from RSS / Step two – Extract URLs from RSS; collect and parse HTML
- HTML, collecting / Step two – Extract URLs from RSS; collect and parse HTML
- HTML, parsing / Step two – Extract URLs from RSS; collect and parse HTML
- example project, data conversions
- about / The example project
- Facebook social network, investigating / The example project
- Facebook data, downloading as GDF / Step one – download Facebook data as GDF
- GDF file format, in text editor / Step two – look at the GDF file format in a text editor
- GDF, converting into JSON / Step three – convert the GDF file into JSON
- D3 diagram, building / Step four – build a D3 diagram
- data, converting to Pajek file format / Step five – convert data to the Pajek file format
- simple network metrics, calculating / Step six – calculate simple network metrics
F
- false positives
- Ferguson tweets
- file, via JSON
- file-based manipulation / Strategies for conversion
- file extensions
- file formats
- about / File formats
- text files, versus binary files / Text files versus binary files
- used, for text files / Common formats for text files
- delimited format / The delimited format
- file headers / File headers
- files
- opening / Opening and reading files
- reading / Opening and reading files
- unknown file, opening / Peeking inside files
- on OSX / On OSX or Linux
- on Linux / On OSX or Linux
- on Windows / On Windows
- uncompressing / How to uncompress files
- find-replace combinations
- about / Heavy duty find and replace
- FLOSSmole
- about / README files
- force option
- about / Compression options
- full set of tweets
- collecting / Moving this process into full (non-test) tables
- full tables
- test tables, moving to / Moving from test tables to full tables
G
- GDF file
- converting, into Pajek format / Step five – convert data to the Pajek file format
- Git
- Github
- used, for distributing data / A word of caution – Using GitHub to distribute data
- Google Groups messages
- collecting / Step one – collect the Google Groups messages
- extraction code / Extraction code
- program output / Program output
- Google spreadsheet
- converting, into JSON representation / Spreadsheet to JSON
- URL / Step two – create the correct URL
- list / Step two – create the correct URL
- key / Step two – create the correct URL
- sheet / Step two – create the correct URL
- Graph Description Format (GDF)
- about / The example project
- Gzip
- about / Rules of thumb
H
- hackathons / Hackathons
- hashtags
- about / Step seven – separate user mentions, hashtags, and URLs
- extracting / Extract hashtags
- header row
- about / The delimited format
- headers
- Help guide, Github
- HTML format / The HTML format
- HTML page structure
- defining / Understanding the HTML page structure
- delimiter model / The line-by-line delimiter model
- tree structure model / The tree structure model
- hydrating
- about / Hydrate the tweet IDs
I
- Institutional Review Boards (IRB)
- about / Common terms of use
- interactive paste
- invalid user mentions
- about / Extract user mentions
- IRC chat
- URL / Text to columns in Excel
- iTunes API
- URL / Experimenting with JSON
J
- janitor work
- about / A fresh perspective
- JSON
- experimenting with / Experimenting with JSON
- JSON format
- about / The JSON format
L
- link rot
- LOAD XML syntax
- log
- maintaining / Step nine – document what you did
- log, for data cleaning
- example / Communicating about data cleaning
- about / Communicating about data cleaning
- lookup tables
- cleaning / Step eight – cleaning for lookup tables
- creating / Step eight – cleaning for lookup tables
M
- meta-collections
- about / Lists of datasets
- metadata
- microblogging platform
- Mike Bostock
- mnmldave
- about / Method three – Chrome Scraper
- movielens dataset
- MySQL documentation
- mystery characters
- cleaning / Step five – clean other mystery characters
N
- name-value pairs
- about / Experimenting with JSON
- netvizz app
- networkx
- new tables
- creating / Create some new tables
- node
- about / The tree structure model
- null
- defining / If a null falls in a forest…
- zero / Zero
- empties / Empties
- about / Null
- using / Why is the middle name example "empty" and not NULL?, Is it ever useful to clean data using a zero instead of an empty or null?
- nulls
- about / Data types, nulls, and encodings
- number (precision)
- about / Numbers with decimals
- numeric data
- about / Numeric data
- integers / Integers
- numbers with decimals / Numbers with decimals
- non-numeric data, defining / When numbers are not numeric
O
- ODbL
- about / ODbL and Open Data Commons
- online regex testers
- using / A word of caution
- Open Data, on Stack Exchange
- about / Open Data on Stack Exchange
- URL / Open Data on Stack Exchange
- Open Data Commons
- about / ODbL and Open Data Commons
- URL / ODbL and Open Data Commons
- Open Data Handbook
- Open Knowledge Foundation (OKF)
- about / ODbL and Open Data Commons
- options, distribute data
- about / Preparing a clean data package
- compressed plain text / Preparing a clean data package
- compressed SQL files / Preparing a clean data package
- live database access / Preparing a clean data package
- API / Preparing a clean data package
- options, for errant characters
- originals[]
P
- Pajek
- about / The example project
- Pajek file format
- parent
- about / The tree structure model
- Pastebin
- paste site
- PDF files
- cleaning / Why is cleaning PDF files difficult?
- pdfMiner
- peer
- about / Step one – state the problem
- PewResearch website
- URL / Our experimental file
- PHP
- used, for conversion / Converting with PHP
- used, for converting SQL to JSON / SQL to JSON using PHP
- used, for converting SQL to CSV / SQL to CSV using PHP
- used, for converting JSON to CSV / JSON to CSV using PHP
- used, for converting CSV to JSON / CSV to JSON using PHP
- phpMyAdmin tool
- Portable Document Format (PDF)
- problem resolution
- Process Lines Containing / Process Lines Containing
- Python
- used, for conversion / Converting with Python
- used, for converting CSV to JSON / CSV to JSON using Python
- used, for converting JSON to CSV / Python JSON to CSV
- about / Method two – Python and BeautifulSoup
- Python and regular expressions
- about / Method one – Python and regular expressions
- Web file, finding for experimenting / Step one – find and save a Web file for experimenting
- Web file, saving for experimenting / Step one – find and save a Web file for experimenting
- several things, extracting / Step two – look into the file and decide what is worth extracting
- Python program, writing / Step three – write a Python program to pull out the interesting pieces and save them to a CSV file
- file, viewing / Step four – view the file and make sure it is clean
- HTML parsing, limitations / The limitations of parsing HTML using regular expressions
- Python regex tester
R
- RAR
- about / Rules of thumb
- rate limits, Twitter
- README files / README files
- Really Simple Syndication (RSS)
- record-oriented files
- about / The delimited format
- regex
- using / Extraction code
- regular expression (regex) / Heavy duty find and replace
- regular expression symbols
- replacing
- about / Compression options
- replies[]
- rewriting process
S
- Scraper
- semi-structured data
- about / The JSON format
- Sentiment140 dataset
- about / Getting ready
- downloading / Step one – download and examine Sentiment140
- examining / Step one – download and examine Sentiment140
- Sentiment140 project
- smart quote
- Social Network Analysis (SNA)
- space
- about / Blanks
- spreadsheet
- CSV, creating from / Creating CSV from a spreadsheet
- used, for generating SQL / Generating SQL using a spreadsheet
- spreadsheet, for data cleaning
- about / Spreadsheet data cleaning
- text to columns, in Excel / Text to columns in Excel
- strings, splitting / Splitting strings
- strings, concatenating / Concatenating strings
- SQL-based manipulation / Strategies for conversion
- Stack Exchange
- Stack Overflow
- Stack Overflow data
- collecting / Step two – collecting and storing the Stack Overflow data
- storing / Step two – collecting and storing the Stack Overflow data
- data dump, downloading / Downloading the Stack Overflow data dump
- files, unarchiving / Unarchiving the files
- MySQL tables, creating / Creating MySQL tables and loading data
- data, loading / Creating MySQL tables and loading data
- test tables, building / Building test tables
- storage engine
- URL / Creating our table
- strategies, for conversion
- about / Strategies for conversion
- SQL-based manipulation / Strategies for conversion
- file-based manipulation / Strategies for conversion
- type conversion, at SQL level / Type conversion at the SQL level
- type conversion, at file level / Type conversion at the file level
- string functions
- strings / Strings
- strings, concatenating
- conditional formatting, to find unusual values / Conditional formatting to find unusual values
- unusual values, finding / Sorting to find unusual values
- spreadsheet data, importing into MySQL / Importing spreadsheet data into MySQL
T
- Tab-Separated Values (TSV)
- Tabula
- about / Third choice – Tabula
- tags
- Tape ARchive (TAR) files
- about / tar
- tar / tar
- terms, data
- using / Common terms of use
- citations / Common terms of use
- privacy / Common terms of use
- appropriate uses for data / Common terms of use
- contact / Common terms of use
- Creative Commons (CC) licenses / Creative Commons
- ODbL / ODbL and Open Data Commons
- terms and licenses, data
- setting / Setting terms and licenses for your data
- Terms of Service (ToS)
- about / Step two – collecting the data
- Terms of Use (ToU)
- test tables
- moving, to full tables / Moving from test tables to full tables
- creating / Moving this process into full (non-test) tables
- text editor
- used, for data cleaning / Text editor data cleaning
- text tweaking / Text tweaking
- changing case / Text tweaking
- zapping gremlins / Text tweaking
- column mode / The column mode
- find and replace / Heavy duty find and replace
- text sorting / Text sorting and processing duplicates
- duplicates, processing / Text sorting and processing duplicates
- Process Lines Containing / Process Lines Containing
- text files
- versus binary files / Text files versus binary files
- about / Text files versus binary files
- formats / Common formats for text files
- types / Common formats for text files
- text sorting / Text sorting and processing duplicates
- text tweaking / Text tweaking
- Text Wrangler
- thumb rules / Rules of thumb
- time
- about / Step six – clean the dates
- tool-based conversions
- about / Quick tool-based conversions
- spreadsheet to CSV / Spreadsheet to CSV
- spreadsheet to JSON / Spreadsheet to JSON
- SQL to CSV, phpMyAdmin used / SQL to CSV or JSON using phpMyAdmin
- SQL to JSON, phpMyAdmin used / SQL to CSV or JSON using phpMyAdmin
- tree structure model / The tree structure model
- trim() function / Text to columns in Excel
- true positives
- twarc
- about / Hydrate the tweet IDs
- installing / Installing twarc
- URL / Installing twarc
- running / Running twarc
- tweet archives
- question, posing for / Step one – posing a question about an archive of tweets
- about / Step one – posing a question about an archive of tweets
- tweet IDs
- hydrating / Hydrate the tweet IDs
- Twitter developer account, setting up / Setting up a Twitter developer account
- twarc, installing / Installing twarc
- twarc, running / Running twarc
- tweets
- Twitter account
- Twitter authentication
- Twitter developer account
- setting up / Setting up a Twitter developer account
- URL / Setting up a Twitter developer account
- type conversion, at file level
- type conversion, at SQL level
U
- Unarchiver
- URL / Unarchiving the files
- Unicode
- about / Character encodings
- updating
- about / Compression options
- URL pattern matching routine
- URL / Extract URLs
- URLs
- about / Step seven – separate user mentions, hashtags, and URLs
- extracting / Extract URLs
- user mentions
- about / Step seven – separate user mentions, hashtags, and URLs
- extracting / Extract user mentions
- UTF-8
- about / Character encodings
- utf8mb4 collation
- about / Creating database tables
V
- valid user mentions
- about / Extract user mentions
- version control systems / Communicating about data cleaning
- Vi
- using / Seeing invisible characters
W
- weather data
- URL / The HTML format
Z
- 7-Zip software
- URL / Unarchiving the files
- Zip
- about / Rules of thumb