
Storing data in Amazon S3


Amazon Simple Storage Service (S3) is a web service that can be used to store and retrieve arbitrary blobs of data. Objects stored in S3 can range from raw bytes to files of any kind, up to 5 terabytes in size (at the time of writing).
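As a sketch of what storing and fetching raw bytes looks like programmatically, the following uses the boto3 library with a hypothetical bucket name; this is only an illustration, and the chapter's own examples may use a different client or the web console:

import boto3

# Create an S3 client; this assumes AWS credentials are already configured
# in the environment or in ~/.aws/credentials.
s3 = boto3.client("s3")

# Store raw bytes under a key ("my-example-bucket" is a hypothetical name).
s3.put_object(Bucket="my-example-bucket",
              Key="blobs/sample.bin",
              Body=b"\x00\x01\x02\x03")

# Read the same object back as bytes.
data = s3.get_object(Bucket="my-example-bucket",
                     Key="blobs/sample.bin")["Body"].read()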

S3 is also significantly cheaper than EBS; however, it does not offer a filesystem layer, only a REST API. Another difference is that while an EBS volume can be attached to only one running instance at a time, an S3 object can be shared among as many instances as we want and, depending on the permission policy applied to it, can be accessed from anywhere on the Internet.
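To illustrate Internet-wide access, the sketch below generates a time-limited (presigned) URL for an object, again assuming boto3 and hypothetical bucket and key names:

import boto3

s3 = boto3.client("s3")

# Anyone holding this HTTPS URL can download the object from anywhere,
# without AWS credentials, until the link expires.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-example-bucket", "Key": "blobs/sample.bin"},
    ExpiresIn=3600,  # valid for one hour
)
print(url)

Presigned URLs are one way to share data kept in private buckets; making objects publicly readable through a bucket policy is another.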

Getting started with S3 is easy: you need to create one or more buckets (that is, data containers, in S3 parlance) in the relevant geographical regions (usually chosen to minimize access times) and then add data to them. The process is as follows: if you are not there already, log in to the AWS management console and click on the S3 icon under Storage & Content...
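The same two steps, creating a bucket in a chosen region and adding data to it, can also be scripted instead of done through the console. The following is a minimal sketch assuming boto3, with a hypothetical bucket name and local file:

import boto3

# Work against a specific region to keep the data close to our instances.
s3 = boto3.client("s3", region_name="eu-west-1")

# Bucket names are globally unique; this one is hypothetical. Note that
# us-east-1 is the one region that takes no LocationConstraint.
s3.create_bucket(
    Bucket="my-example-bucket",
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)

# Upload a local file into the new bucket under the given key.
s3.upload_file("results.csv", "my-example-bucket", "data/results.csv")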