Index
A
- abstractive summarization
- accuracy
- adjacency list
- about / Edge lists and adjacency lists
- adjacency list format
- about / Adjacency list format
- adjacency matrix
- about / Adjacency matrix
- Anaconda Python distribution
- download link / How do we set up our data mining work environment?
- annotated corpus
- about / Tagging parts of speech
- anomaly
- about / What are data anomalies?
- antecedent
- about / Association rules
- Apache
- references / Apache Board meeting minutes
- Apache board meeting minutes
- about / Apache Board meeting minutes
- apache_twitter
- about / Locating missing data
- Apriori
- aspects
- about / The structure of an opinion
- association rules
- about / Towards association rules, Association rules
- metrics / Towards association rules
- support / Support
- confidence / Confidence
- data, example / An example with data
- value - fixing flaw, adding / Added value – fixing a flaw in the plan
- frequent itemsets, finding methods / Methods for finding frequent itemsets
- discovering, in software project tags / A project – discovering association rules in software project tags
- atomic
- about / Merging data
- attribute-based similarity matching
- about / Attribute-based similarity matching
- pairwise comparisons / Be careful of pairwise comparisons
- rare values, leveraging / Leverage rare values
- attributes
- about / The structure of an opinion
- attributes matching, methods
- about / Methods for matching attributes
- range-based, from target / Range-based or distance from target
- distance, from target / Range-based or distance from target
- string edit distance / String edit distance
- Hamming distance / Hamming distance
- Levenshtein distance / Levenshtein distance
- Soundex / Soundex
- attributes of edges
- about / What is a network?
- attributes of nodes
- about / What is a network?
- automatic text summarization
- automatic text summarization techniques
- reference / What is automatic text summarization?
B
- bag of words
- about / Important features of opinions
- bag of words (bow)
- about / Gensim for topic modeling
- betweenness centrality
- about / Betweenness centrality
- big data
- about / What is data mining?
- blank data values
- about / Missing data
- blocking methods
- bonus words
- about / Sumy's Edmundson summarizer
- boundary errors
- about / NER and partial matches
- box-and-whisker plot
- boxplot
- Brown Corpus
- about / Tagging parts of speech
- reference link / Tagging parts of speech
C
- CamelCase
- about / Why look for named entities?
- change detection problems
- check constraint
- about / Logic or semantic errors
- classification problems
- closed path
- closeness centrality
- about / Closeness centrality
- clustering-based outlier
- reference link / Detecting outliers with machine learning
- clustering problems
- code, entity matching project
- code and text files, NER project
- reference link / A simple NER tool
- coding
- about / General-purpose data collections
- components
- about / The structure of an opinion
- compound score
- confidence, association rules
- about / Confidence
- consequent
- about / Association rules
- context-based similarity matching
- corpus
- about / Tagging parts of speech
- CREATE and INSERT statements
- CRISP-DM process
- about / The CRISP-DM process
- business understanding / The CRISP-DM process
- data understanding / The CRISP-DM process
- data preparation / The CRISP-DM process
- modeling / The CRISP-DM process
- evaluation / The CRISP-DM process
- deployment / The CRISP-DM process
- cues
- about / Sumy's Edmundson summarizer
D
- data
- merging / Merging data
- sets, merging vertically / Merging datasets vertically
- sets, merging horizontally / Merging datasets horizontally
- exploring / Exploring the data
- data, exploring
- datasources table / Exploring the data
- rf_developer_projects table / Exploring the data
- data, importing into graph structure
- about / Importing data into a graph structure
- adjacency list format / Adjacency list format
- edge list format / Edge list format
- GEXF format / GEXF and GraphML
- GraphML format / GEXF and GraphML
- graph data format (GDF) / GDF
- Python pickle / Python pickle
- JavaScript Object Notation (JSON) / JSON
- JSON node series / JSON node and link series
- JSON link series / JSON node and link series
- JSON trees / JSON trees
- Pajek format / Pajek format
- data, social network
- simple network metrics, generating / Generating simple network metrics
- network parameters / Playing with the parameters of a network
- subgraphs, analyzing / Analyzing subgraphs
- cliques, analyzing / Analyzing cliques and centrality in the subgraphs
- centrality in subgraphs, analyzing / Analyzing cliques and centrality in the subgraphs
- change over time, finding / Looking for change over time
- data anomalies
- about / What are data anomalies?
- missing data / Missing data
- missing data, fixing / Fixing missing data
- data errors / Data errors
- outliers / Outliers
- data append
- about / Merging datasets vertically
- data errors
- about / Data errors
- truncated fields / Truncated fields
- data type errors / Data type and character set errors
- character set errors / Data type and character set errors
- logic errors / Logic or semantic errors
- semantic errors / Logic or semantic errors
- data file
- datafiles
- reference link / Generating the network files
- data mining
- about / What is data mining?
- machine learning / What is data mining?
- predictive analytics / What is data mining?
- big data / What is data mining?
- data science / What is data mining?
- performing / How do we do data mining?
- Fayyad et al. KDD process / The Fayyad et al. KDD process
- Han et al. KDD process / The Han et al. KDD process
- CRISP-DM process / The CRISP-DM process
- Six Steps process / The Six Steps process
- methodology / Which data mining methodology is the best?
- techniques / What are the techniques used in data mining?, What techniques are we going to use in this book?
- development environment, setting up / How do we set up our data mining work environment?
- data quality
- about / Merging data
- data science
- about / What is data mining?
- dataset, entity matching project
- about / The dataset
- datasources table
- datasource_id / Exploring the data
- date_donated / Exploring the data
- comments / Exploring the data
- data type errors
- example / Data type and character set errors
- degree
- about / Degree of a network
- degree centrality
- about / Degree centrality
- dependency modeling problems
- details
- about / Locating missing data
- developer channel, Ubuntu
- reference for text archive / Data preparation
- deviation detection problems
- diameter
- about / Diameter of a network
- directed network
- about / What is a network?
- direction
- about / What is a network?
- disjoint sets
- leveraging / Leveraging disjoint sets
- about / Leveraging disjoint sets
- distance
- about / Diameter of a network
- Django IRC chat
- about / Django IRC chat
- reference link / Django IRC chat
- doc2bow()
- about / Gensim for topic modeling
- document level
- domain
- about / Frequent itemset mining basics
- domain knowledge
- about / What is entity matching?
- doubletons
- about / Frequent itemset mining basics
E
- edge list
- about / Edge lists and adjacency lists
- edge list format
- about / Edge list format
- edges
- about / What is a network?
- entity
- about / The structure of an opinion
- entity matching
- about / What is entity matching?
- data, merging / Merging data
- techniques / Techniques for matching
- attribute-based similarity matching / Attribute-based similarity matching
- attributes matching, methods / Methods for matching attributes
- disjoint sets, leveraging / Leveraging disjoint sets
- context-based similarity matching / Context-based similarity matching
- machine learning-based entity matching / Machine learning-based entity matching
- entity matching project
- about / Entity matching project
- difficulties, with matching software projects / Difficulties with matching software projects
- project names, matching / Matching on project names
- people names, matching / Matching on people names
- URLs, matching / Matching on URLs
- topics and description keywords, matching / Matching on topics and description keywords
- dataset / The dataset
- code / The code
- results / The results
- entity matching techniques
- errors
- about / What are data anomalies?
- explicit
- about / The structure of an opinion
- extractive method
F
- Facebook Research blog
- download link / What is topic modeling?
- Fayyad et al. KDD process
- data selection / The Fayyad et al. KDD process
- data pre-processing / The Fayyad et al. KDD process
- data transformation / The Fayyad et al. KDD process
- data mining / The Fayyad et al. KDD process
- data interpretation / The Fayyad et al. KDD process
- data evaluation / The Fayyad et al. KDD process
- feature engineering
- about / Sentiment analysis algorithms
- flaccid designator
- fliers
- FLOSSmole
- URL / A project – discovering association rules in software project tags
- reference link / GnuIRC summaries
- FLOSSmole.org
- references / Exploring the data
- FLOSSmole data
- about / The dataset
- database tables / The dataset
- FLOSSmole project
- URL / The dataset
- frequent itemsets
- about / What are frequent itemsets?
- diapers and beer urban legend example / The diapers and beer urban legend
- mining basics / Frequent itemset mining basics
G
- gazetteer
- about / Why look for named entities?
- GDF format
- reference link / GDF
- general-purpose data collections
- Hu and Liu's sentiment analysis lexicon / Hu and Liu's sentiment analysis lexicon
- SentiWordNet / SentiWordNet
- Vader sentiment / Vader sentiment
- generalizable
- general user channel, Ubuntu
- reference for text archive / Data preparation
- Gensim
- about / How do we set up our data mining work environment?
- used, for text summarization / Text summarization using Gensim
- used, for topic modeling / Gensim for topic modeling
- Gensim approach
- reference / Text summarization using Gensim
- Gensim changelog
- reference / Text summarization using Gensim
- Gensim documentation
- reference link / Serializing a corpus
- Gensim LDA
- download link / Latent Dirichlet Allocation
- larger project / Gensim LDA for a larger project
- Gensim LDA model
- applying, to documents / Applying a Gensim LDA model to new documents
- Gensim LDA objects
- serializing / Serializing Gensim LDA objects
- dictionary, serializing / Serializing a dictionary
- corpus, serializing / Serializing a corpus
- model, serializing / Serializing a model
- Gensim LDA passes
- about / Understanding Gensim LDA passes
- Gensim LDA topics
- about / Understanding Gensim LDA topics
- example / Understanding Gensim LDA topics
- GEXF format
- about / GEXF and GraphML
- glosses
- about / SentiWordNet
- gnueIRCsummary.txt
- reference link / GnuIRC summaries
- GnuIRC summaries
- about / GnuIRC summaries
- graph
- about / What is a network?
- graph data
- representing / Representing graph data
- graph data, representing
- adjacency matrix / Adjacency matrix
- edge list / Edge lists and adjacency lists
- adjacency list / Edge lists and adjacency lists
- graph data structures, differences / Differences between graph data structures
- data, importing into graph structure / Importing data into a graph structure
- graph data format (GDF)
- about / GDF
- GraphML format
- about / GEXF and GraphML
- graph trail
- graph walk
- Grubbs' test
- gzipped
- download link / The dataset
H
- Hamming distance
- about / Hamming distance
- Han et al. KDD process
- data cleaning / The Han et al. KDD process
- data integration / The Han et al. KDD process
- data selection / The Han et al. KDD process
- data transformation / The Han et al. KDD process
- data mining / The Han et al. KDD process
- pattern evaluation / The Han et al. KDD process
- knowledge representation / The Han et al. KDD process
- hapax
- about / Important features of opinions
- horizontal merge
- example / Merging datasets horizontally
- hot deck imputation
- about / Use a similar value
I
- 2-itemsets
- about / Frequent itemset mining basics
- 3-itemsets
- about / Frequent itemset mining basics
- implicit
- about / The structure of an opinion
- impute
- about / Use a central measure
- in-degree
- about / Degree of a network
- InterCaps
- about / Why look for named entities?
- interestingness measures for association rules
- isolates
- about / Components of a network
J
- JavaScript Object Notation (JSON)
- about / JSON
- JSON link series
- about / JSON node and link series
- JSON node series
- about / JSON node and link series
- JSON trees
- about / JSON trees
K
- knowledge discovery in databases (KDD)
- about / What is data mining?
- knowledge discovery process
- about / What is data mining?
L
- Last Observation Carried Forward (LOCF)
- Latent Dirichlet Allocation (LDA)
- about / Latent Dirichlet Allocation
- reference link / Latent Dirichlet Allocation
- download link / Latent Dirichlet Allocation
- Latent Semantic Analysis (LSA)
- reference / Sumy's LSA summarizer
- Levenshtein distance
- about / Levenshtein distance
- lexicon
- link analysis problems
- links
- about / What is a network?
- linusrants
- Linux Kernel Mailing List (LKML)
- about / Data analysis of e-mail messages
- LKML e-mails
- about / LKML e-mails
- lkmlLinusAll.txt
- reference link / Gensim LDA for a larger project
- logic errors
- about / Logic or semantic errors
M
- machine learning
- reference link / What is topic modeling?
- outliers, detecting with / Detecting outliers with machine learning
- machine learning-based entity matching
- manually, fixing
- example / Fix the problem manually
- market basket analysis
- about / What are frequent itemsets?
- market / Frequent itemset mining basics
- basket / Frequent itemset mining basics
- items / Frequent itemset mining basics
- Matrix Market (MM) format
- about / Serializing a corpus
- reference link / Serializing a corpus
- maximum normalized residual test
- Message Understanding Conference (MUC)
- about / Handling partial matches
- minimum support threshold
- about / Support
- missing data
- about / Missing data
- locating / Locating missing data
- zero values / Zero values
- missing data, fixing
- about / Fixing missing data
- rows, ignoring / Ignore the problem rows
- manually, fixing / Fix the problem manually
- fabricated value used / Use a fabricated value
- central measure used / Use a central measure
- Last Observation Carried Forward (LOCF) used / Use Last Observation Carried Forward
- similar value used / Use a similar value
- most likely value used / Use the most likely value
- modified z-score
- modified z-scores
- outliers, detecting with / Detecting outliers with modified z-scores
- multi-document
- multiple components
- about / Components of a network
- multivariate data sets
- MySQL
N
- named entity recognition (NER)
- about / Why look for named entities?
- techniques / Techniques for named entity recognition
- part of speech (POS), tagging / Tagging parts of speech
- named entity recognition (NER) project
- about / Named entity recognition project
- NER tool / A simple NER tool
- named entity recognition (NER) systems
- building / Building and evaluating NER systems
- evaluating / Building and evaluating NER systems
- partial matches / NER and partial matches
- partial matches, handling / Handling partial matches
- named entity recognition (NER) tool
- about / A simple NER tool
- Apache board meeting minutes / Apache Board meeting minutes
- Django IRC chat / Django IRC chat
- GnuIRC summaries / GnuIRC summaries
- LKML e-mails / LKML e-mails
- natural language processing (NLP)
- about / The basics of sentiment analysis
- Natural Language Toolkit (NLTK)
- negation words
- about / Important features of opinions
- network
- about / What is a network?
- measuring / Measuring a network
- network, measuring
- degree / Degree of a network
- diameter / Diameter of a network
- graph walk / Walks, paths, and trails in a network
- graph trail / Walks, paths, and trails in a network
- path / Walks, paths, and trails in a network
- components / Components of a network
- centrality / Closeness centrality
- degree centrality / Degree centrality
- betweenness centrality / Betweenness centrality
- centrality, measures / Other measures of centrality
- NetworkX
- installing / Understanding our data as a network
- NetworkX file formats
- reference link / Pajek format
- neutral word
- about / SentiWordNet
- NLTK
- used, for naive text summarization / Naive text summarization using NLTK
- NLTK documentation page
- nodes
- about / What is a network?
- novelty
- about / Outliers
- nullable
- about / Locating missing data
- null data values
- about / Missing data
- null words
- about / Sumy's Edmundson summarizer
O
- objectivity score
- about / SentiWordNet
- opinion mining
- about / What is sentiment analysis?
- reference / What is sentiment analysis?
- opinion shifters
- about / Important features of opinions
- opinion words
- about / Important features of opinions
- out-degree
- about / Degree of a network
- outlier
- about / What are data anomalies?, Outliers
- outlier detection
- reference link / Detecting outliers with machine learning
- outliers
- visual mining / Visual mining for outliers
- statistical detection / Statistical detection of outliers
- outliers, statistical detection
- outliers, detecting with modified z-scores / Detecting outliers with modified z-scores
- outliers, detecting by combining statistics / Detecting outliers by combining statistics and visual mining
- outliers, detecting by combining visual mining / Detecting outliers by combining statistics and visual mining
- outliers, detecting with machine learning / Detecting outliers with machine learning
- overfitting
- about / Sentiment analysis algorithms
P
- Pajek format
- about / Pajek format
- partial matches
- about / NER and partial matches
- strict scoring / NER and partial matches
- lenient scoring / NER and partial matches
- partial scoring / NER and partial matches
- part of speech (POS)
- about / Tagging parts of speech
- tagging / Tagging parts of speech
- named entities, classes / Classes of named entities
- part of speech, abbreviations
- reference link / Tagging parts of speech
- parts of speech
- about / Important features of opinions
- path
- pendant nodes
- Penn, noun abbreviations
- example / Tagging parts of speech
- Penn Treebank tagger
- about / Tagging parts of speech
- position of word
- about / Important features of opinions
- POS tagger
- about / Tagging parts of speech
- precision
- profile
- about / Leveraging disjoint sets
- Python pickle
- about / Python pickle
Q
- question answering (QA) systems
- about / Why look for named entities?
R
- real-world project, network
- about / A real project
- data, exploring / Exploring the data
- network files, generating / Generating the network files
- data, social network / Understanding our data as a network
- recall
- regression problems
- relational database management systems (RDBMS)
- about / Locating missing data
- results, entity matching project
- about / The results, How many entity matches did we find?
- entity matches / How many entity matches did we find?
- pairs, identifying / How good are the pairs we found?
- rf_developer_projects table
- datasource_id / Exploring the data
- dev_loginname / Exploring the data
- proj_unixname / Exploring the data
- rigid designator
- Rmagick on RubyForge
- about / Two examples
- references / Two examples
- Rmagick on RubyGems
- about / Two examples
- references / Two examples
- RubyForge
- URL / Matching on URLs
- Ruby on Rails
S
- Scikit-learn tutorial
- semantic errors
- about / Logic or semantic errors
- example / Logic or semantic errors
- sentiment analysis
- about / What is sentiment analysis?
- reference / What is sentiment analysis?
- algorithms / Sentiment analysis algorithms
- general-purpose data collections / General-purpose data collections
- sentiment analysis, basics
- about / The basics of sentiment analysis
- opinion, structure / The structure of an opinion
- document-level analysis / Document-level and sentence-level analysis
- sentence-level analysis / Document-level and sentence-level analysis
- opinions, features / Important features of opinions
- sentiment intensity
- about / Vader sentiment
- sentiment mining application
- about / Sentiment mining application
- project, motivating / Motivating the project
- data preparation / Data preparation
- chat messages, data analysis / Data analysis of chat messages
- e-mail messages, data analysis / Data analysis of e-mail messages
- sentiment score
- sentiment words
- about / Important features of opinions
- SentiWordNet
- URL / SentiWordNet
- sequence analysis problems
- set notation
- about / Frequent itemset mining basics
- significant words
- simpleTextSummaryNLTK.py
- reference / Naive text summarization using NLTK
- single-document
- Six Steps process
- problem statement / The Six Steps process
- data collection / The Six Steps process
- data storage / The Six Steps process
- data cleaning / The Six Steps process
- data mining / The Six Steps process
- representation / The Six Steps process
- visualization / The Six Steps process
- problem resolution / The Six Steps process
- software project tags
- association rules, discovering / A project – discovering association rules in software project tags
- Soundex
- about / Soundex
- source lines of code (SLOC)
- about / Outliers
- specificity
- stigma words
- about / Sumy's Edmundson summarizer
- stopwords
- string edit distance
- about / String edit distance
- subgraphs
- reference link / Analyzing subgraphs
- subjectivity classification
- summarization problems
- SUMMRY
- about / Tools for text summarization
- reference / Tools for text summarization
- Sumy
- used, for text summarization / Text summarization using Sumy
- references / Text summarization using Sumy
- Sumy's Edmundson summarizer
- reference / Sumy's Edmundson summarizer
- about / Sumy's Edmundson summarizer
- Sumy's LSA summarizer
- about / Sumy's LSA summarizer
- Sumy's Luhn summarizer
- about / Sumy's Luhn summarizer
- Sumy's TextRank summarizer
- about / Sumy's TextRank summarizer
- sustainable
T
- target
- about / The structure of an opinion
- target data
- about / The Fayyad et al. KDD process
- terms
- about / Important features of opinions
- text samples
- download link / Gensim for topic modeling
- text summarization
- tools / Tools for text summarization
- naive text summarization, NLTK used / Naive text summarization using NLTK
- using Gensim / Text summarization using Gensim
- Sumy used / Text summarization using Sumy
- text summarization, methods
- Sumy's Luhn summarizer / Sumy's Luhn summarizer
- Sumy's TextRank summarizer / Sumy's TextRank summarizer
- Sumy's LSA summarizer / Sumy's LSA summarizer
- Sumy's Edmundson summarizer / Sumy's Edmundson summarizer
- topic modeling
- about / What is topic modeling?
- Gensim used / Gensim for topic modeling
- Gensim LDA topics / Understanding Gensim LDA topics
- Gensim LDA passes / Understanding Gensim LDA passes
- Gensim LDA model, applying to documents / Applying a Gensim LDA model to new documents
- Gensim LDA objects, serializing / Serializing Gensim LDA objects
- training examples
- about / Sentiment analysis algorithms
- tree structure
- about / JSON trees
- tripletons
- about / Frequent itemset mining basics
- true positives (TP)
- about / How good are the pairs we found?
- type errors
U
- Ubuntu
- URL / Data preparation
- undirected network
- about / What is a network?
- univariate data sets
- unsupervised
- about / What is topic modeling?
- upward closure property
V
- Vader sentiment
- URL / Vader sentiment
- URL, for specific lexicon / Vader sentiment
- Vapor on RubyForge
- about / Two examples
- references / Two examples
- Vapor on RubyGems
- about / Two examples
- references / Two examples
- vertical merge
- example / Merging datasets vertically
- vertices
- about / What is a network?
- visual mining
- about / Visual mining for outliers
W
- weighted network
- about / What is a network?
Z
- z-score
- about / Detecting outliers with modified z-scores
- reference link / Detecting outliers with modified z-scores