Mastering Data Mining with Python - Find patterns hidden in your data

By: Megan Squire

Overview of this book

Data mining is an integral part of the data science pipeline and the foundation of any successful data-driven strategy; without it, you are unlikely to uncover truly transformative insights. Since data is vital to just about every modern organization, it is worth taking the next step to unlock greater value and more meaningful understanding. If you already know the fundamentals of data mining with Python, you are ready to experiment with more advanced data analytics techniques using Python's easy-to-use interface and extensive range of libraries. In this book, you'll go deeper into many often-overlooked areas of data mining, including association rule mining, entity matching, network mining, sentiment analysis, named entity recognition, text summarization, topic modeling, and anomaly detection. For each data mining technique, we'll review the state of the art and current best practices before comparing a wide variety of strategies for solving each problem. We'll then implement example solutions using real-world data from the domain of software engineering, and we'll spend time learning how to understand and interpret the results. By the end of this book, you will have solid experience implementing some of the most interesting and relevant data mining techniques available today, and you will have achieved greater fluency in the important field of Python data analytics.

Mastering Data Mining with Python – Find patterns hidden in your data

Credits

About the Author

About the Reviewers

www.PacktPub.com

Preface

Expanding Your Data Mining Toolbox

What is data mining?

How do we do data mining?

What are the techniques used in data mining?

How do we set up our data mining work environment?

Association Rule Mining

What are frequent itemsets?

Towards association rules

A project – discovering association rules in software project tags

Entity Matching

What is entity matching?

Entity matching project

Network Analysis

What is a network?

Measuring a network

Representing graph data

Sentiment Analysis in Text

What is sentiment analysis?

The basics of sentiment analysis

Sentiment analysis algorithms

Sentiment mining application

Named Entity Recognition in Text

Why look for named entities?

Techniques for named entity recognition

Building and evaluating NER systems

Named entity recognition project

Automatic Text Summarization

What is automatic text summarization?

Tools for text summarization

Topic Modeling in Text

What is topic modeling?

Latent Dirichlet Allocation

Gensim for topic modeling

Gensim LDA for a larger project

Mining for Data Anomalies

What are data anomalies?

Index

Index

A

abstractive summarization
- about / What is automatic text summarization?
accuracy
- about / Effectiveness – how accurate are the matches that we generate?
adjacency list
- about / Edge lists and adjacency lists
adjacency list format
- about / Adjacency list format
adjacency matrix
- about / Adjacency matrix
Anaconda Python distribution
- download link / How do we set up our data mining work environment?
annotated corpus
- about / Tagging parts of speech
anomaly
- about / What are data anomalies?
antecedent
- about / Association rules
Apache
- references / Apache Board meeting minutes
Apache board meeting minutes
- about / Apache Board meeting minutes
apache_twitter
- about / Locating missing data
Apriori
- about / Methods for finding frequent itemsets
aspects
- about / The structure of an opinion
association rules
- about / Towards association rules, Association rules
- metrics / Towards association rules
- support / Support
- confidence / Confidence
- data, example / An example with data
- added value, fixing a flaw / Added value – fixing a flaw in the plan
- frequent itemsets, finding methods / Methods for finding frequent itemsets
- discovering, in software project tags / A project – discovering association rules in software project tags
atomic
- about / Merging data
attribute-based similarity matching
- about / Attribute-based similarity matching
- pairwise comparisons / Be careful of pairwise comparisons
- rare values, leveraging / Leverage rare values
attributes
- about / The structure of an opinion
attributes matching, methods
- about / Methods for matching attributes
- range-based, from target / Range-based or distance from target
- distance, from target / Range-based or distance from target
- string edit distance / String edit distance
- Hamming distance / Hamming distance
- Levenshtein distance / Levenshtein distance
- Soundex / Soundex
attributes of edges
- about / What is a network?
attributes of nodes
- about / What is a network?
automatic text summarization
- about / What is automatic text summarization?
automatic text summarization techniques
- reference / What is automatic text summarization?

B

bag of words
- about / Important features of opinions
bag of words (bow)
- about / Gensim for topic modeling
betweenness centrality
- about / Betweenness centrality
big data
- about / What is data mining?
blank data values
- about / Missing data
blocking methods
- about / Efficiency – how long does it take to do the matching?
bonus words
- about / Sumy's Edmundson summarizer
boundary errors
- about / NER and partial matches
box-and-whisker plot
- about / Detecting outliers by combining statistics and visual mining
boxplot
- about / Detecting outliers by combining statistics and visual mining
- reference link / Detecting outliers by combining statistics and visual mining
Brown Corpus
- about / Tagging parts of speech
- reference link / Tagging parts of speech

C

CamelCase
- about / Why look for named entities?
change detection problems
- about / What are the techniques used in data mining?
check constraint
- about / Logic or semantic errors
classification problems
- about / What are the techniques used in data mining?
closed path
- about / Walks, paths, and trails in a network
closeness centrality
- about / Closeness centrality
clustering-based outlier
- reference link / Detecting outliers with machine learning
clustering problems
- about / What are the techniques used in data mining?
code, entity matching project
- about / The code
- reference / The code
code and text files, NER project
- reference link / A simple NER tool
coding
- about / General-purpose data collections
components
- about / The structure of an opinion
compound score
- URL / Data analysis of chat messages
confidence, association rules
- about / Confidence
consequent
- about / Association rules
context-based similarity matching
- about / Context-based similarity matching
corpus
- about / Tagging parts of speech
CREATE and INSERT statements
- URL / Data analysis of e-mail messages
CRISP-DM process
- about / The CRISP-DM process
- business understanding / The CRISP-DM process
- data understanding / The CRISP-DM process
- data preparation / The CRISP-DM process
- modeling / The CRISP-DM process
- evaluation / The CRISP-DM process
- deployment / The CRISP-DM process
cues
- about / Sumy's Edmundson summarizer

D

data
- merging / Merging data
- datasets, merging vertically / Merging datasets vertically
- datasets, merging horizontally / Merging datasets horizontally
- exploring / Exploring the data
data, exploring
- datasources table / Exploring the data
- rf_developer_projects table / Exploring the data
data, importing into graph structure
- about / Importing data into a graph structure
- adjacency list format / Adjacency list format
- edge list format / Edge list format
- GEXF format / GEXF and GraphML
- GraphML format / GEXF and GraphML
- Graph Data Format (GDF) / GDF
- Python pickle / Python pickle
- JavaScript Serialized Object Notation (JSON) / JSON
- JSON node series / JSON node and link series
- JSON link series / JSON node and link series
- JSON trees / JSON trees
- Pajek format / Pajek format
data, social network
- simple network metrics, generating / Generating simple network metrics
- network parameters / Playing with the parameters of a network
- subgraphs, analyzing / Analyzing subgraphs
- cliques, analyzing / Analyzing cliques and centrality in the subgraphs
- centrality in subgraphs, analyzing / Analyzing cliques and centrality in the subgraphs
- change over time, finding / Looking for change over time
data anomalies
- about / What are data anomalies?
- missing data / Missing data
- missing data, fixing / Fixing missing data
- data errors / Data errors
- outliers / Outliers
data append
- about / Merging datasets vertically
data errors
- about / Data errors
- truncated fields / Truncated fields
- data type errors / Data type and character set errors
- character set errors / Data type and character set errors
- logic errors / Logic or semantic errors
- semantic errors / Logic or semantic errors
data file
- URL / A project – discovering association rules in software project tags
datafiles
- reference link / Generating the network files
data mining
- about / What is data mining?
- machine learning / What is data mining?
- predictive analytics / What is data mining?
- big data / What is data mining?
- data science / What is data mining?
- performing / How do we do data mining?
- Fayyad et al. KDD process / The Fayyad et al. KDD process
- Han et al. KDD process / The Han et al. KDD process
- CRISP-DM process / The CRISP-DM process
- Six Steps process / The Six Steps process
- methodology / Which data mining methodology is the best?
- techniques / What are the techniques used in data mining?, What techniques are we going to use in this book?
- development environment, setting up / How do we set up our data mining work environment?
data quality
- about / Merging data
data science
- about / What is data mining?
dataset, entity matching project
- about / The dataset
datasources table
- datasource_id / Exploring the data
- date_donated / Exploring the data
- comments / Exploring the data
data type errors
- example / Data type and character set errors
degree
- about / Degree of a network
degree centrality
- about / Degree centrality
dependency modeling problems
- about / What are the techniques used in data mining?
details
- about / Locating missing data
developer channel, Ubuntu
- reference for text archive / Data preparation
deviation detection problems
- about / What are the techniques used in data mining?
diameter
- about / Diameter of a network
directed network
- about / What is a network?
direction
- about / What is a network?
disjoint sets
- leveraging / Leveraging disjoint sets
- about / Leveraging disjoint sets
distance
- about / Diameter of a network
Django IRC chat
- about / Django IRC chat
- reference link / Django IRC chat
doc2bow()
- about / Gensim for topic modeling
document level
- about / Document-level and sentence-level analysis
domain
- about / Frequent itemset mining basics
domain knowledge
- about / What is entity matching?
doubletons
- about / Frequent itemset mining basics

E

edge list
- about / Edge lists and adjacency lists
edge list format
- about / Edge list format
edges
- about / What is a network?
entity
- about / The structure of an opinion
entity matching
- about / What is entity matching?
- data, merging / Merging data
- techniques / Techniques for matching
- attribute-based similarity matching / Attribute-based similarity matching
- attributes matching, methods / Methods for matching attributes
- disjoint sets, leveraging / Leveraging disjoint sets
- context-based similarity matching / Context-based similarity matching
- machine learning based entity matching / Machine learning-based entity matching
entity matching project
- about / Entity matching project
- difficulties, with matching software projects / Difficulties with matching software projects
- project names, matching / Matching on project names
- people names, matching / Matching on people names
- URLs, matching / Matching on URLs
- topics and description keywords, matching / Matching on topics and description keywords
- dataset / The dataset
- code / The code
- results / The results
entity matching techniques
- efficiency / Efficiency – how long does it take to do the matching?
- effectiveness / Effectiveness – how accurate are the matches that we generate?
- usefulness / Usefulness – how practical is the matching procedure to use?
errors
- about / What are data anomalies?
explicit
- about / The structure of an opinion
extractive method
- about / What is automatic text summarization?

F

Facebook Research blog
- download link / What is topic modeling?
Fayyad et al. KDD process
- data selection / The Fayyad et al. KDD process
- data pre-processing / The Fayyad et al. KDD process
- data transformation / The Fayyad et al. KDD process
- data mining / The Fayyad et al. KDD process
- data interpretation / The Fayyad et al. KDD process
- data evaluation / The Fayyad et al. KDD process
feature engineering
- about / Sentiment analysis algorithms
flaccid designator
- about / Techniques for named entity recognition
fliers
- about / Detecting outliers by combining statistics and visual mining
- reference link / Detecting outliers by combining statistics and visual mining
FLOSSmole
- URL / A project – discovering association rules in software project tags
- reference link / GnuIRC summaries
FLOSSmole.org
- references / Exploring the data
FLOSSmole data
- about / The dataset
- database tables / The dataset
FLOSSmole project
- URL / The dataset
frequent itemsets
- about / What are frequent itemsets?
- diapers and beer urban legend example / The diapers and beer urban legend
- mining basics / Frequent itemset mining basics

G

gazetteer
- about / Why look for named entities?
GDF format
- reference link / GDF
general-purpose data collections
- Hu and Liu's sentiment analysis lexicon / Hu and Liu's sentiment analysis lexicon
- SentiWordNet / SentiWordNet
- Vader sentiment / Vader sentiment
generalizable
- about / Usefulness – how practical is the matching procedure to use?
general user channel, Ubuntu
- reference for text archive / Data preparation
Gensim
- about / How do we set up our data mining work environment?
- used, for text summarization / Text summarization using Gensim
- used, for topic modeling / Gensim for topic modeling
Gensim approach
- reference / Text summarization using Gensim
Gensim changelog
- reference / Text summarization using Gensim
Gensim documentation
- reference link / Serializing a corpus
Gensim LDA
- download link / Latent Dirichlet Allocation
- larger project / Gensim LDA for a larger project
Gensim LDA model
- applying, to documents / Applying a Gensim LDA model to new documents
Gensim LDA objects
- serializing / Serializing Gensim LDA objects
- dictionary, serializing / Serializing a dictionary
- corpus, serializing / Serializing a corpus
- model, serializing / Serializing a model
Gensim LDA passes
- about / Understanding Gensim LDA passes
Gensim LDA topics
- about / Understanding Gensim LDA topics
- example / Understanding Gensim LDA topics
GEXF format
- about / GEXF and GraphML
glosses
- about / SentiWordNet
gnueIRCsummary.txt
- reference link / GnuIRC summaries
GnuIRC summaries
- about / GnuIRC summaries
graph
- about / What is a network?
graph data
- representing / Representing graph data
graph data, representing
- adjacency matrix / Adjacency matrix
- edge list / Edge lists and adjacency lists
- adjacency list / Edge lists and adjacency lists
- graph data structures, differences / Differences between graph data structures
- data, importing into graph structure / Importing data into a graph structure
graph data format (GDF)
- about / GDF
GraphML format
- about / GEXF and GraphML
graph trail
- about / Walks, paths, and trails in a network
graph walk
- about / Walks, paths, and trails in a network
Grubbs' test
- about / Detecting outliers with modified z-scores
gzipped
- download link / The dataset

H

Hamming distance
- about / Hamming distance
Han et al. KDD process
- data cleaning / The Han et al. KDD process
- data integration / The Han et al. KDD process
- data selection / The Han et al. KDD process
- data transformation / The Han et al. KDD process
- data mining / The Han et al. KDD process
- pattern evaluation / The Han et al. KDD process
- knowledge representation / The Han et al. KDD process
hapax
- about / Important features of opinions
horizontal merge
- example / Merging datasets horizontally
hot deck imputation
- about / Use a similar value

I

2-itemsets
- about / Frequent itemset mining basics
3-itemsets
- about / Frequent itemset mining basics
implicit
- about / The structure of an opinion
impute
- about / Use a central measure
in-degree
- about / Degree of a network
InterCaps
- about / Why look for named entities?
interestingness measures for association rules
- about / Added value – fixing a flaw in the plan
isolates
- about / Components of a network

J

JavaScript Serialized Object Notation (JSON)
- about / JSON
JSON link series
- about / JSON node and link series
JSON node series
- about / JSON node and link series
JSON trees
- about / JSON trees

K

knowledge discovery in databases (KDD)
- about / What is data mining?
knowledge discovery process
- about / What is data mining?

L

Last Observation Carried Forward (LOCF)
- about / Use Last Observation Carried Forward
Latent Dirichlet Allocation (LDA)
- about / Latent Dirichlet Allocation
- reference link / Latent Dirichlet Allocation
- download link / Latent Dirichlet Allocation
Latent Semantic Analysis (LSA)
- reference / Sumy's LSA summarizer
Levenshtein distance
- about / Levenshtein distance
lexicon
- URL / Hu and Liu's sentiment analysis lexicon
link analysis problems
- about / What are the techniques used in data mining?
links
- about / What is a network?
linusrants
- about / Data analysis of e-mail messages
- URL / Data analysis of e-mail messages
Linux Kernel Mailing List (LKML)
- about / Data analysis of e-mail messages
LKML e-mails
- about / LKML e-mails
lkmlLinusAll.txt
- reference link / Gensim LDA for a larger project
logic errors
- about / Logic or semantic errors

M

machine learning
- reference link / What is topic modeling?
- outliers, detecting with / Detecting outliers with machine learning
machine learning based entity matching
- about / Machine learning-based entity matching
manually, fixing
- example / Fix the problem manually
market basket analysis
- about / What are frequent itemsets?
- market / Frequent itemset mining basics
- basket / Frequent itemset mining basics
- items / Frequent itemset mining basics
Matrix Market (MM) format
- about / Serializing a corpus
- reference link / Serializing a corpus
maximum normalized residual test
- about / Detecting outliers with modified z-scores
Message Understanding Conference (MUC)
- about / Handling partial matches
minimum support threshold
- about / Support
missing data
- about / Missing data
- locating / Locating missing data
- zero values / Zero values
missing data, fixing
- about / Fixing missing data
- rows, ignoring / Ignore the problem rows
- manually, fixing / Fix the problem manually
- fabricated value used / Use a fabricated value
- central measure used / Use a central measure
- Last Observation Carried Forward (LOCF) used / Use Last Observation Carried Forward
- similar value used / Use a similar value
- most likely value used / Use the most likely value
modified z-score
- about / Detecting outliers with modified z-scores
modified z-scores
- outliers, detecting with / Detecting outliers with modified z-scores
multi-document
- about / What is automatic text summarization?
multiple components
- about / Components of a network
multivariate data sets
- about / Statistical detection of outliers
MySQL
- URL / How do we set up our data mining work environment?

N

named entity recognition (NER)
- about / Why look for named entities?
- techniques / Techniques for named entity recognition
- part of speech (POS), tagging / Tagging parts of speech
named entity recognition (NER) project
- about / Named entity recognition project
- NER tool / A simple NER tool
named entity recognition (NER) systems
- building / Building and evaluating NER systems
- evaluating / Building and evaluating NER systems
- partial matches / NER and partial matches
- partial matches handling / Handling partial matches
named entity recognition (NER) tool
- about / A simple NER tool
- Apache board meeting minutes / Apache Board meeting minutes
- Django IRC chat / Django IRC chat
- GnuIRC summaries / GnuIRC summaries
- LKML e-mails / LKML e-mails
natural language processing (NLP)
- about / The basics of sentiment analysis
Natural Language Toolkit (NLTK)
- about / How do we set up our data mining work environment?
negation words
- about / Important features of opinions
network
- about / What is a network?
- measuring / Measuring a network
network, measuring
- degree / Degree of a network
- diameter / Diameter of a network
- graph walk / Walks, paths, and trails in a network
- graph trail / Walks, paths, and trails in a network
- path / Walks, paths, and trails in a network
- components / Components of a network
- centrality / Closeness centrality
- degree centrality / Degree centrality
- betweenness centrality / Betweenness centrality
- centrality, measures / Other measures of centrality
NetworkX
- installing / Understanding our data as a network
NetworkX file formats
- reference link / Pajek format
neutral word
- about / SentiWordNet
NLTK
- used, for naive text summarization / Naive text summarization using NLTK
NLTK documentation page
- URL / Data analysis of e-mail messages
nodes
- about / What is a network?
novelty
- about / Outliers
nullable
- about / Locating missing data
null data values
- about / Missing data
null words
- about / Sumy's Edmundson summarizer

O

objectivity score
- about / SentiWordNet
opinion mining
- about / What is sentiment analysis?
- reference / What is sentiment analysis?
opinion shifters
- about / Important features of opinions
opinion words
- about / Important features of opinions
out-degree
- about / Degree of a network
outlier
- about / What are data anomalies?, Outliers
outlier detection
- reference link / Detecting outliers with machine learning
outliers
- visual mining / Visual mining for outliers
- statistical detection / Statistical detection of outliers
outliers, statistical detection
- outliers, detecting with modified z-scores / Detecting outliers with modified z-scores
- outliers, detecting by combining statistics / Detecting outliers by combining statistics and visual mining
- outliers, detecting by combining visual mining / Detecting outliers by combining statistics and visual mining
- outliers, detecting with machine learning / Detecting outliers with machine learning
overfitting
- about / Sentiment analysis algorithms

P

Pajek format
- about / Pajek format
partial matches
- about / NER and partial matches
- strict scoring / NER and partial matches
- lenient scoring / NER and partial matches
- partial scoring / NER and partial matches
part of speech (POS)
- about / Tagging parts of speech
- tagging / Tagging parts of speech
- named entities, classes / Classes of named entities
part of speech, abbreviations
- reference link / Tagging parts of speech
parts of speech
- about / Important features of opinions
path
- about / Walks, paths, and trails in a network
pendant nodes
- about / Playing with the parameters of a network
Penn, noun abbreviations
- example / Tagging parts of speech
Penn Treebank tagger
- about / Tagging parts of speech
position of word
- about / Important features of opinions
POS tagger
- about / Tagging parts of speech
precision
- about / Effectiveness – how accurate are the matches that we generate?
profile
- about / Leveraging disjoint sets
Python pickle
- about / Python pickle

Q

question answering (QA) systems
- about / Why look for named entities?

R

real-world project, network
- about / A real project
- data, exploring / Exploring the data
- network files, generating / Generating the network files
- data, social network / Understanding our data as a network
recall
- about / Effectiveness – how accurate are the matches that we generate?
regression problems
- about / What are the techniques used in data mining?
relational database management systems (RDBMS)
- about / Locating missing data
results, entity matching project
- about / The results, How many entity matches did we find?
- entity matches / How many entity matches did we find?
- pairs, identifying / How good are the pairs we found?
rf_developer_projects table
- datasource_id / Exploring the data
- dev_loginname / Exploring the data
- proj_unixname / Exploring the data
rigid designator
- about / Techniques for named entity recognition
Rmagick on RubyForge
- about / Two examples
- references / Two examples
Rmagick on RubyGems
- about / Two examples
- references / Two examples
RubyForge
- URL / Matching on URLs
Ruby on Rails
- URL / How many entity matches did we find?

S

Scikit-learn tutorial
- URL / How do we set up our data mining work environment?
semantic errors
- about / Logic or semantic errors
- example / Logic or semantic errors
sentiment analysis
- about / What is sentiment analysis?
- reference / What is sentiment analysis?
- algorithms / Sentiment analysis algorithms
- general-purpose data collections / General-purpose data collections
sentiment analysis, basics
- about / The basics of sentiment analysis
- opinion, structure / The structure of an opinion
- document-level analysis / Document-level and sentence-level analysis
- sentence-level analysis / Document-level and sentence-level analysis
- opinions, features / Important features of opinions
sentiment intensity
- about / Vader sentiment
sentiment mining application
- about / Sentiment mining application
- project, motivating / Motivating the project
- data preparation / Data preparation
- chat messages, data analysis / Data analysis of chat messages
- e-mail messages, data analysis / Data analysis of e-mail messages
sentiment score
- URL / Data analysis of chat messages
sentiment words
- about / Important features of opinions
SentiWordNet
- URL / SentiWordNet
sequence analysis problems
- about / What are the techniques used in data mining?
set notation
- about / Frequent itemset mining basics
significant words
- about / What is automatic text summarization?
simpleTextSummaryNLTK.py
- reference / Naive text summarization using NLTK
single-document
- about / What is automatic text summarization?
Six Steps process
- problem statement / The Six Steps process
- data collection / The Six Steps process
- data storage / The Six Steps process
- data cleaning / The Six Steps process
- data mining / The Six Steps process
- representation / The Six Steps process
- visualization / The Six Steps process
- problem resolution / The Six Steps process
software project tags
- association rules, discovering / A project – discovering association rules in software project tags
Soundex
- about / Soundex
source lines of code (SLOC)
- about / Outliers
specificity
- about / Effectiveness – how accurate are the matches that we generate?
stigma words
- about / Sumy's Edmundson summarizer
stopwords
- about / Naive text summarization using NLTK
string edit distance
- about / String edit distance
subgraphs
- reference link / Analyzing subgraphs
subjectivity classification
- about / Document-level and sentence-level analysis
summarization problems
- about / What are the techniques used in data mining?
SUMMRY
- about / Tools for text summarization
- reference / Tools for text summarization
Sumy
- used, for text summarization / Text summarization using Sumy
- references / Text summarization using Sumy
Sumy's Edmundson summarizer
- reference / Sumy's Edmundson summarizer
- about / Sumy's Edmundson summarizer
Sumy's LSA summarizer
- about / Sumy's LSA summarizer
Sumy's Luhn summarizer
- about / Sumy's Luhn summarizer
Sumy's TextRank summarizer
- about / Sumy's TextRank summarizer
sustainable
- about / Usefulness – how practical is the matching procedure to use?

T

target
- about / The structure of an opinion
target data
- about / The Fayyad et al. KDD process
terms
- about / Important features of opinions
text samples
- download link / Gensim for topic modeling
text summarization
- tools / Tools for text summarization
- naive text summarization, NLTK used / Naive text summarization using NLTK
- using Gensim / Text summarization using Gensim
- Sumy used / Text summarization using Sumy
text summarization, methods
- Sumy's Luhn summarizer / Sumy's Luhn summarizer
- Sumy's TextRank summarizer / Sumy's TextRank summarizer
- Sumy's LSA summarizer / Sumy's LSA summarizer
- Sumy's Edmundson summarizer / Sumy's Edmundson summarizer
topic modeling
- about / What is topic modeling?
- Gensim used / Gensim for topic modeling
- Gensim LDA topics / Understanding Gensim LDA topics
- Gensim LDA passes / Understanding Gensim LDA passes
- Gensim LDA model, applying to documents / Applying a Gensim LDA model to new documents
- Gensim LDA objects, serializing / Serializing Gensim LDA objects
training examples
- about / Sentiment analysis algorithms
tree structure
- about / JSON trees
tripletons
- about / Frequent itemset mining basics
true positives (TP)
- about / How good are the pairs we found?
type errors
- about / Data type and character set errors

U

Ubuntu
- URL / Data preparation
undirected network
- about / What is a network?
univariate data sets
- about / Statistical detection of outliers
unsupervised
- about / What is topic modeling?
upward closure property
- about / Methods for finding frequent itemsets

V

Vader sentiment
- URL / Vader sentiment
- URL, for specific lexicon / Vader sentiment
Vapor on RubyForge
- about / Two examples
- references / Two examples
Vapor on RubyGems
- about / Two examples
- references / Two examples
vertical merge
- example / Merging datasets vertically
vertices
- about / What is a network?
visual mining
- about / Visual mining for outliers

W

weighted network
- about / What is a network?

Z

z-score
- about / Detecting outliers with modified z-scores
- reference link / Detecting outliers with modified z-scores