Mastering Python for Data Science

By: Samir Madhavan

Training and testing a model

Let's take the data and divide it into training and test sets:

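>>> from sklearn import linear_model, feature_selection, preprocessing
>>> from sklearn.model_selection import train_test_split
>>> import statsmodels.formula.api as sm
>>> from statsmodels.tools.eval_measures import mse
>>> from statsmodels.tools.tools import add_constant
>>> from sklearn.metrics import mean_squared_error

>>> # Copy the DataFrame's values into a NumPy array
>>> X = b_data.values.copy()
>>> # Features are every column except the last; the target is the last column
>>> X_train, X_valid, y_train, y_valid = train_test_split(
...     X[:, :-1], X[:, -1], train_size=0.80)

We first convert the data frame into an array by calling values.copy() on b_data. We then use the train_test_split function from scikit-learn to divide the data, placing 80% of the rows in the training set and the remaining 20% in the test set. In current scikit-learn releases, this function lives in the model_selection module; older releases exposed it through cross_validation, which has since been removed.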
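Because b_data is not defined in this excerpt, the split can be tried on synthetic data. The following is a minimal sketch, not the book's dataset: toy_data, its column names, and the random_state value are assumptions made for illustration, and the import assumes a recent scikit-learn release where train_test_split is found in sklearn.model_selection.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for b_data: three feature columns, target in the last column.
rng = np.random.default_rng(0)
toy_data = pd.DataFrame(
    rng.normal(size=(100, 4)),
    columns=["f1", "f2", "f3", "target"],
)

# Copy the DataFrame's values into a NumPy array.
X = toy_data.values.copy()

# Split features (all but the last column) and target (last column) 80/20.
# random_state is fixed only so that repeated runs give the same split.
X_train, X_valid, y_train, y_valid = train_test_split(
    X[:, :-1], X[:, -1], train_size=0.80, random_state=0
)

print(X_train.shape, X_valid.shape)
print(y_train.shape, y_valid.shape)
```

With 100 rows, the split leaves 80 rows for training and 20 for validation. The rows are shuffled before splitting, so without a fixed random_state the two sets differ from run to run.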

We'll learn how to build linear regression models using the following packages:

The statsmodels module
The scikit-learn package

Even...