4. Text Classification | Practical Data Analysis


Practical Data Analysis - Second Edition

By: Hector Cuesta, Dr. Sampath Kumar


Overview of this book

Beyond buzzwords like Big Data or Data Science, there are great opportunities to innovate in many businesses by using data analysis to build data-driven products. Data analysis involves asking many questions about data in order to discover insights and generate value for a product or a service. This book explains the basic data algorithms without the theoretical jargon, and you'll get hands-on experience turning data into insights using machine learning techniques. We will perform data-driven innovation processing for several types of data such as text, images, social network graphs, documents, and time series, showing you how to implement large data processing with MongoDB and Apache Spark.

Classifier accuracy

Now we need to test our classifier with a bigger test set; in this case, we will randomly select 100 subjects: 50 spam and 50 not spam. Finally, we will count how many times the classifier chose the correct category:

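import csv

# w, c, t and tw are the training results computed earlier in the chapter
with open("test.csv") as f:
    correct = 0
    tests = csv.reader(f)
    for subject in tests:
        # subject[0] is the subject line, subject[1] is its true category
        clase = classifier(subject[0], w, c, t, tw)
        # clase[1] holds the category chosen by the classifier
        if clase[1] == subject[1]:
            correct += 1
    print("Efficiency: {0} of 100".format(correct))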

In this case, the efficiency (the share of test subjects classified correctly) is 82 percent:

>>> Efficiency: 82 of 100
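The counting step above can be generalized into a small helper. The following is a minimal sketch, not the book's code: the names `accuracy`, `toy_predict`, and `sample`, as well as the labels `"spam"` and `"nospam"`, are assumptions made for illustration, and `toy_predict` is a stand-in rule rather than the Bayesian classifier from this chapter.

```python
def accuracy(rows, predict):
    """Return the fraction of (text, label) rows that predict() labels correctly."""
    rows = list(rows)
    if not rows:
        return 0.0
    correct = sum(1 for text, label in rows if predict(text) == label)
    return correct / len(rows)


def toy_predict(subject):
    # Hypothetical stand-in classifier: flags any subject containing "free"
    return "spam" if "free" in subject.lower() else "nospam"


# Hypothetical labeled subjects, used only to exercise the helper
sample = [
    ("Free tickets inside", "spam"),
    ("Lunch on Friday?", "nospam"),
    ("Limited offer", "spam"),
    ("Quarterly report", "nospam"),
]

score = accuracy(sample, toy_predict)
print("Accuracy: {0:.0%}".format(score))
```

Returning a fraction instead of a raw count keeps the result comparable when the test set is not exactly 100 subjects.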

Tip

We can use an out-of-the-box implementation of the Naive Bayes classifier, such as the NaiveBayesClassifier class in the NLTK package for Python. NLTK is a powerful natural language toolkit, and we can download it from http://nltk.org/.
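As a rough illustration of the tip, the sketch below trains NLTK's `NaiveBayesClassifier` on a handful of subject lines. It assumes NLTK is installed; the training subjects, the labels `"spam"` and `"nospam"`, and the helper `subject_features` are made up for this example and are not the dataset or feature scheme used in this chapter.

```python
from nltk.classify import NaiveBayesClassifier


def subject_features(subject):
    # Hypothetical feature scheme: one boolean feature per lowercase word
    return {word: True for word in subject.lower().split()}


# Tiny made-up training set of (subject, label) pairs
train = [
    ("win a free prize now", "spam"),
    ("cheap pills free offer", "spam"),
    ("meeting agenda for monday", "nospam"),
    ("project status report", "nospam"),
]

# NLTK expects a list of (feature dictionary, label) pairs
train_set = [(subject_features(s), label) for s, label in train]
nb = NaiveBayesClassifier.train(train_set)

predicted = nb.classify(subject_features("free prize offer"))
print(predicted)
```

With a real dataset, the same `train` and `classify` calls apply; only the feature extraction and the labeled examples would change.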

In Chapter 1, Getting Started, we presented a more sophisticated version of the Naïve Bayes classifier to perform a sentiment analysis.

In this case, we will find an optimal size...
