Mastering Python for Data Science

By: Samir Madhavan

File handling with Hadoopy

Hadoopy is a Python library that provides an API for interacting with Hadoop: it lets you manage files in HDFS and run MapReduce jobs on them. Installation instructions are available at http://www.Hadoopy.com/en/latest/tutorial.html#installing-Hadoopy.

Let's use Hadoopy to put a few files into Hadoop. First, create a directory called data within HDFS:

$ hadoop fs -mkdir data

Here is the code that puts the data into HDFS:

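import os

import hadoopy

hdfs_path = 'data'  # target directory in HDFS, created above


def read_local_dir(local_path):
  # Yield the path of every regular file directly inside local_path.
  for fn in os.listdir(local_path):
    path = os.path.join(local_path, fn)
    if os.path.isfile(path):
      yield path


def main():
  local_path = './BigData/dummy_data'
  for local_file in read_local_dir(local_path):
    # Copy the local file into the HDFS directory.
    hadoopy.put(local_file, hdfs_path)
    print("The file %s has been put into hdfs" % (local_file,))


if __name__ == '__main__':
  main()

Running the script prints one line per uploaded file:

The file ./BigData/dummy_data/test9 has been put into hdfs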
The file ./BigData/dummy_data/test7 has been put into hdfs
The file ./BigData/dummy_data/test1 has been put into...
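
One way to check the directory walk without a running Hadoop cluster is to pass the upload function in as a parameter. The following is a minimal sketch of that idea, not the book's code: `upload_dir`, `fake_put`, and `demo` are hypothetical names introduced for illustration, and the sketch assumes that whatever is passed as `put` takes its arguments in the same (local file, HDFS path) order as the `put` call used in the script above.

```python
import os
import tempfile


def read_local_dir(local_path):
    """Yield the path of every regular file directly inside local_path."""
    # sorted() makes the order deterministic; os.listdir() alone does not.
    for fn in sorted(os.listdir(local_path)):
        path = os.path.join(local_path, fn)
        if os.path.isfile(path):
            yield path


def upload_dir(local_path, hdfs_path, put):
    """Call put(local_file, hdfs_path) for each file; return the files sent.

    `put` is any callable taking (local_file, hdfs_path). Injecting it is an
    illustrative design choice, so a stub can replace the real upload call.
    """
    uploaded = []
    for path in read_local_dir(local_path):
        put(path, hdfs_path)
        uploaded.append(path)
    return uploaded


def demo():
    """Dry run on a temporary directory with a stand-in upload function."""
    calls = []

    def fake_put(local, hdfs):
        # Hypothetical stub: records the call instead of touching HDFS.
        calls.append((os.path.basename(local), hdfs))

    tmp = tempfile.mkdtemp()
    for name in ("test1", "test2"):
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write("dummy\n")
    os.mkdir(os.path.join(tmp, "subdir"))  # directories are skipped
    upload_dir(tmp, "data", fake_put)
    return calls


if __name__ == "__main__":
    print(demo())
```

To upload for real, the same function could be called as `upload_dir('./BigData/dummy_data', 'data', hadoopy.put)`, assuming Hadoopy is installed and the cluster is reachable.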