Mastering Python for Data Science

By: Samir Madhavan

The k-means algorithm and its working

The k-means clustering algorithm works by computing the mean of each feature, that is, of each variable we use for clustering. For example, we might segment customers by their average transaction amount and the average number of products they purchase per quarter. The mean of the points in a cluster becomes the center of that cluster. K is the number of clusters: the technique computes K means, and the data is clustered around these K means.
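To make the idea of a cluster center concrete, here is a minimal sketch that averages each feature over the members of a cluster. The customer figures, the labels, and the `cluster_centers` helper are all hypothetical, made up for illustration, and the sketch assumes every cluster has at least one member.

```python
from statistics import mean

# Hypothetical customers: (average transaction amount, average products per quarter)
customers = [(20.0, 2.0), (25.0, 3.0), (210.0, 12.0), (190.0, 10.0)]
# Hypothetical cluster labels, one per customer (K = 2)
labels = [0, 0, 1, 1]


def cluster_centers(points, labels, k):
    """Return one center per cluster: the per-feature mean of its members.

    Assumes every cluster 0..k-1 has at least one member.
    """
    centers = []
    for cluster in range(k):
        members = [p for p, lab in zip(points, labels) if lab == cluster]
        # zip(*members) regroups the members feature by feature
        centers.append(tuple(mean(feature) for feature in zip(*members)))
    return centers


centers = cluster_centers(customers, labels, k=2)
print(centers)  # [(22.5, 2.5), (200.0, 11.0)]
```

Each center is simply the average customer of its group, which is why the method is named after the K means it computes.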

How do we choose K? If we have some idea of what we are looking for, or of how many clusters we expect or want, we can set K to that number before we start the engines and let the algorithm run.

If we don't know how many clusters there are, our exploration will take a little longer and involve some trial and error, for example, trying K = 3, 4, and 5.
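The trial-and-error approach can be sketched as a loop over candidate values of K. The `kmeans` helper below is a bare-bones, pure-Python version of the usual k-means loop (random starting centers, assign each point to its nearest center, re-average), included only so the example runs on its own; it is an assumption for illustration, not the book's implementation. The data is randomly generated, and scoring each K by the within-cluster sum of squares is one common choice rather than something the text prescribes.

```python
import random
from math import dist
from statistics import mean


def kmeans(points, k, max_iter=100, seed=0):
    """Bare-bones k-means: random starting centers, then assign and re-average."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # K points picked at random from the data
    labels = None
    for _ in range(max_iter):
        # Assign every point to the index of its nearest center
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(mean(f) for f in zip(*members))
    return centers, labels


def within_cluster_ss(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return sum(dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))


# Hypothetical data: three loose groups of 2-D points
data_rng = random.Random(42)
points = [(data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
          for cx, cy in [(0, 0), (5, 5), (0, 5)]
          for _ in range(30)]

scores = {}
for k in (3, 4, 5):
    centers, labels = kmeans(points, k)
    scores[k] = within_cluster_ss(points, centers, labels)
    print(f"K={k}: within-cluster sum of squares = {scores[k]:.2f}")
```

A smaller sum means tighter clusters, but the score tends to shrink as K grows, so a common heuristic is to look for the K beyond which the improvement levels off rather than simply taking the smallest score.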

The k-means algorithm is iterative. It starts by choosing K points at random from the data and uses these as cluster...