Practical Data Analysis - Second Edition

By: Hector Cuesta, Dr. Sampath Kumar

Overview of this book

Beyond buzzwords like Big Data or Data Science, there are great opportunities to innovate in many businesses by using data analysis to build data-driven products. Data analysis involves asking many questions about data in order to discover insights and generate value for a product or a service. This book explains the basic data algorithms without the theoretical jargon, and you’ll get hands-on experience turning data into insights using machine learning techniques. We will perform data-driven innovation processing for several types of data, such as text, images, social network graphs, documents, and time series, and show you how to implement large-scale data processing with MongoDB and Apache Spark.

The algorithm


We use the list_words() function to get a list of the unique words in a text, converted to lower case and keeping only words with more than three characters:

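def list_words(text):
    """Return the unique lowercase words in text longer than three characters."""
    words = []
    # split() breaks on whitespace only, so punctuation stays attached to words
    words_tmp = text.lower().split()
    for w in words_tmp:
        # keep the first occurrence of each word, skipping short words
        if w not in words and len(w) > 3:
            words.append(w)
    return words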
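
As a quick check, we can call list_words() on a short string. This is only a minimal sketch: the sample sentence is made up for illustration and does not come from the book's dataset, and the function is repeated here so that the snippet runs on its own.

```python
def list_words(text):
    # Same logic as the listing above, repeated to keep this snippet self-contained.
    words = []
    for w in text.lower().split():
        if w not in words and len(w) > 3:
            words.append(w)
    return words


# Hypothetical sample text, used only to illustrate the behavior.
sample = "Buy cheap pills now, buy CHEAP pills today"

print(list_words(sample))
# ['cheap', 'pills', 'now,', 'today']
```

Note that "buy" is dropped because it has only three characters, while "now," is kept with its comma, because split() separates on whitespace and leaves punctuation attached to the word.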

Tip

For a more advanced term-document matrix, we can use the Python textmining package from: https://pypi.python.org/pypi/textmining/1.0

The training() function creates variables to store the data needed for the classification. The c_words variable is a dictionary of the unique words and their number of occurrences in the text (frequency) by category. The c_categories variable is a dictionary that maps each category to its number of texts. Finally, c_texts and c_total_words store the total count of texts and words, respectively:

def training(texts): 
    c_words = {}
    c_categories = {}
    c_texts = 0
    c_total_words = 0
    # add the classes to the categories
    for t in texts...