Mastering Python for Data Science

By: Samir Madhavan

Word and sentence tokenization

We dealt with word tokenization previously, but NLTK can perform it as well. NLTK also handles sentence tokenization, which is trickier, because English uses the period for abbreviations and other purposes, not only to end a sentence. Thankfully, NLTK's sentence tokenizer is an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module, which takes care of these cases when splitting text into sentences.
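
To see why periods make sentence tokenization tricky, here is a minimal sketch that uses only the standard library. It is not how PunktSentenceTokenizer works internally; it only contrasts a naive split on punctuation with a split that skips a small list of abbreviations. The function names, the abbreviation list, and the sample sentence are all invented for this illustration.

```python
import re

# Hypothetical, hand-picked abbreviation list for this illustration only.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "e.g.", "i.e.", "etc.", "vs."}


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def abbreviation_aware_split(text, abbreviations=ABBREVIATIONS):
    """Split on sentence-ending punctuation, except after a known abbreviation."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in ".!?"
        if ends_sentence and token.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences


# Invented sample text, not taken from the Forbes article.
sample = "Dr. Smith watched the film. Some critics, e.g. the harsh ones, disagreed."

naive_result = naive_split(sample)
aware_result = abbreviation_aware_split(sample)

print(naive_result)  # breaks after 'Dr.' and 'e.g.'
print(aware_result)  # keeps both sentences whole
```

A fixed abbreviation list like this one misses anything it was not told about, which is one reason to prefer a tokenizer such as the one NLTK provides over a hand-rolled rule.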

Let's look at word tokenization using this code:

>>> import nltk
>>> # Load the Forbes data
>>> data = open('./Data/madmax_review/forbes.txt', 'r').read()

>>> # Split the text into word tokens and show the first 15
>>> word_data = nltk.word_tokenize(data)
>>> word_data[:15]
['Pundits',
 'and',
 'critics',
 'like',
 'to',
 'blame',
 'the',
 'twin',
 'successes',
 'of',
 'Jaws',
 'and',
 'Star',
 'Wars',
 'for']

Now, let's perform the sentence tokenization of the Forbes article:

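>>> from nltk.tokenize import sent_tokenize
>>> # Split the text into sentences and show the first 5
>>> sent_tokenize(data)[:5]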

['Pundits and critics like to blame the twin successes of Jaws and Star Wars for turning Hollywood into something of...