Mastering Python for Data Science

By: Samir Madhavan

File handling with Hadoopy


Hadoopy is a Python library that provides an API for interacting with Hadoop: it can manage files in HDFS and run MapReduce jobs on them. Installation instructions are available at http://www.Hadoopy.com/en/latest/tutorial.html#installing-Hadoopy.

Let's use Hadoopy to put a few files into HDFS. First, create a directory called data within HDFS:

$ hadoop fs -mkdir data
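If you would rather create the directory from a script than from the shell, the same command can be issued through Python's subprocess module. The following is only a minimal sketch, assuming Python 3 and that the hadoop client is on your PATH; the helper names hdfs_mkdir_command and hdfs_mkdir are hypothetical and are not part of Hadoopy:

```python
import shutil
import subprocess


def hdfs_mkdir_command(hdfs_dir):
    """Build the argument list for `hadoop fs -mkdir <hdfs_dir>`."""
    return ["hadoop", "fs", "-mkdir", hdfs_dir]


def hdfs_mkdir(hdfs_dir):
    """Run the command if the hadoop client is on PATH.

    Returns True if the command ran, False if no client was found.
    Raises subprocess.CalledProcessError if hadoop exits with an error
    (for example, when the directory already exists).
    """
    if shutil.which("hadoop") is None:
        return False
    subprocess.check_call(hdfs_mkdir_command(hdfs_dir))
    return True
```

Calling hdfs_mkdir('data') should then have the same effect as the shell command above. Passing the arguments as a list avoids shell quoting problems when a directory name contains spaces.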

Here is the code that puts the data into HDFS:

import hadoopy
import os

hdfs_path = 'data'  # target HDFS directory created above

def read_local_dir(local_path):
  # Yield the path of every regular file directly inside local_path
  for fn in os.listdir(local_path):
    path = os.path.join(local_path, fn)
    if os.path.isfile(path):
      yield path

def main():
  local_path = './BigData/dummy_data'
  for local_file in read_local_dir(local_path):
    # Copy the local file into the HDFS directory
    hadoopy.put(local_file, hdfs_path)
    print("The file %s has been put into hdfs" % (local_file,))

if __name__ == '__main__':
  main()

Running the script prints one line for each uploaded file:

The file ./BigData/dummy_data/test9 has been put into hdfs
The file ./BigData/dummy_data/test7 has been put into hdfs
The file ./BigData/dummy_data/test1 has been put into...
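
The files in the output appear in no particular order (test9 before test7 before test1) because os.listdir returns directory entries in arbitrary order. The following is a minimal sketch, assuming Python 3, of how the upload loop could be made deterministic and tried out without a Hadoop cluster. The names upload_dir and demo are hypothetical, and the put argument stands in for a function such as hadoopy.put that takes a local path and an HDFS path:

```python
import os
import shutil
import tempfile


def read_local_dir(local_path):
    """Yield paths of regular files directly inside local_path, sorted by name."""
    for fn in sorted(os.listdir(local_path)):
        path = os.path.join(local_path, fn)
        if os.path.isfile(path):
            yield path


def upload_dir(local_path, hdfs_dir, put):
    """Call put(local_file, hdfs_dir) for each file and return the paths handled."""
    uploaded = []
    for local_file in read_local_dir(local_path):
        put(local_file, hdfs_dir)
        uploaded.append(local_file)
    return uploaded


def demo():
    """Dry run on a temporary directory: record the calls instead of uploading."""
    calls = []
    tmp = tempfile.mkdtemp()
    try:
        for name in ("test2", "test1"):
            with open(os.path.join(tmp, name), "w") as fh:
                fh.write("dummy\n")
        os.mkdir(os.path.join(tmp, "subdir"))  # directories are skipped
        uploaded = upload_dir(tmp, "data",
                              lambda src, dst: calls.append((src, dst)))
        return [os.path.basename(p) for p in uploaded], calls
    finally:
        shutil.rmtree(tmp)  # remove the temporary files
```

Against a real cluster, you would pass hadoopy.put as the put argument, for example upload_dir('./BigData/dummy_data', 'data', hadoopy.put). Keeping the upload function as a parameter lets the directory walk be checked locally before any data is sent to HDFS.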