Learning Big Data with Amazon Elastic MapReduce

By: Amarkant Singh, Vijay Rayapati

Overview of this book

Amazon Elastic MapReduce is a web service used to process and store vast amounts of data, and it is one of the largest Hadoop operators in the world. With the increase in the amount of data generated and collected by many businesses and the arrival of cost-effective cloud-based solutions for distributed computing, it has become far more feasible to crunch large amounts of data and get deep insights within a short span of time.

This book will get you started with AWS so that you can quickly create your own account and explore the services provided, many of which you might be delighted to use. This book covers the architectural details of the MapReduce framework, Apache Hadoop, various job models on EMR, how to manage clusters on EMR, and the command-line tools available with EMR. Each chapter builds on the knowledge of the previous one, leading to the final chapter where you will learn about solving a real-world use case using Apache Hadoop and EMR. This book will, therefore, get you up and running with major Big Data technologies quickly and efficiently.

Services provided by AWS


AWS provides a wide variety of global services catering to large enterprises as well as smart start-ups. As of today, AWS provides a growing set of over 60 services across various areas of cloud infrastructure. All of the services provided by AWS can be accessed via the AWS management console (a web portal) or programmatically via an API (or web services). We will learn about the most popular ones, which are also the most widely used across industries.

AWS categorizes its services into the following major groups:

  • Compute

  • Storage

  • Database

  • Network and CDN

  • Analytics

  • Application services

  • Deployment and management

Let's now discuss each group and list the services available in each of them.

Compute

The compute group of services includes the most basic service provided by AWS: Amazon EC2, which is like a virtual compute machine. AWS provides a wide range of virtual machine types; in AWS lingo, they are called instances.

Amazon EC2

EC2 stands for Elastic Compute Cloud. The key word is elastic. EC2 is a web service that provides resizable compute capacity in the AWS Cloud. Basically, using this service, you can provision instances of varied capacity in the cloud. You can launch instances within minutes and terminate them when the work is done. You can choose the computing capacity of your instance (the number of CPU cores and the amount of memory, among other things) from a pool of machine types offered by AWS.

You pay only for the number of hours your instances are in use. Note that if you run an instance for one hour and a few minutes, it will be billed as 2 hours. Each partial instance hour consumed is billed as a full hour. We will learn about EC2 in more detail in the next section.
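The rounding rule described above can be expressed as a small calculation. The following is a minimal sketch, not an official billing formula; the function names and the hourly rate in the usage lines are hypothetical and are used only for illustration.

```python
import math

def billed_hours(runtime_minutes: float) -> int:
    """Round a runtime up to whole instance hours, as described in the text."""
    if runtime_minutes <= 0:
        return 0
    return math.ceil(runtime_minutes / 60)

def usage_cost(runtime_minutes: float, hourly_rate: float) -> float:
    """Cost of one instance run; hourly_rate is a made-up example value."""
    return billed_hours(runtime_minutes) * hourly_rate

# An instance that runs for 1 hour and 5 minutes is billed for 2 hours.
print(billed_hours(65))
print(usage_cost(65, hourly_rate=0.10))
```

Because every partial hour counts as a full one, it pays to plan short jobs so that they finish just under an hour boundary rather than just over it.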

Auto Scaling

Auto scaling is one of the popular services AWS has built and offers to customers to handle spikes in application loads by adding or removing infrastructure capacity. Auto scaling allows you to define conditions; when these conditions are met, AWS automatically scales your compute capacity up or down. This service is well suited for applications whose usage depends on the time of day or has predictable spikes.
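To make the idea of scaling conditions concrete, here is a minimal sketch of such a rule in plain Python. It is only an illustration of the concept and does not use the AWS API; the CPU thresholds, the instance limits, and the function name are all assumptions chosen for the example.

```python
def desired_capacity(current: int, cpu_percent: float, *,
                     scale_out_at: float = 70.0,  # hypothetical threshold
                     scale_in_at: float = 30.0,   # hypothetical threshold
                     minimum: int = 2,
                     maximum: int = 10) -> int:
    """Return the instance count a simple threshold rule would ask for."""
    if cpu_percent > scale_out_at:
        target = current + 1      # load is high: add one instance
    elif cpu_percent < scale_in_at:
        target = current - 1      # load is low: remove one instance
    else:
        target = current          # within the comfortable band
    # Never go below the minimum or above the maximum group size.
    return max(minimum, min(maximum, target))

print(desired_capacity(3, cpu_percent=85.0))  # scales out
print(desired_capacity(3, cpu_percent=10.0))  # scales in
```

The minimum bound in this sketch plays the same role as keeping a fixed number of machines always available to your application.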

Auto scaling also helps in the scenario where you want your application infrastructure to have a fixed number of machines always available to it. You can configure this service to automatically check the health of each of the machines and add capacity as and when required if there are any issues with existing machines. This helps you to ensure that your application receives the compute capacity it requires.

Moreover, this service has no additional charge; only the EC2 capacity being used is billed.

Elastic Load Balancing

Elastic Load Balancing (ELB) is the load balancing service provided by AWS. ELB automatically distributes incoming application traffic among multiple EC2 instances. This service helps in achieving high availability and fault tolerance for applications by load balancing traffic across multiple instances in different availability zones.

ELB has the capability to automatically scale its capacity to handle requests to match the demands of the application's traffic. It also offers integration with auto scaling, wherein you may configure it to also scale the backend capacity to cater to the varying traffic levels without manual intervention.

Amazon Workspaces

The Amazon Workspaces service provides cloud-based desktops for on-demand usage by businesses. It is a fully managed desktop computing service in the cloud. It allows you to access your documents and applications from anywhere and from devices of your choice. You can choose the hardware and software as per your requirement. It allows you to choose from packages providing different amounts of CPU, memory, and storage.

Amazon Workspaces can also be securely integrated with your corporate Active Directory.

Storage

Storage is another group of essential services. AWS provides low-cost data storage services having high durability and availability. AWS offers storage choices for backup, archiving, and disaster recovery, as well as block, file, and object storage. As is the nature of most of the services on AWS, for storage too, you pay as you go.

Amazon S3

S3 stands for Simple Storage Service. S3 provides a simple web service interface with fully redundant data storage infrastructure to store and retrieve any amount of data at any time and from anywhere on the Web. Amazon uses S3 to run its own global network of websites.

As AWS states:

Amazon S3 is cloud storage for the Internet.

Amazon S3 can be used as a storage medium for various purposes. We will read about it in more detail in the next section.

Amazon EBS

EBS stands for Elastic Block Store. It is one of the most used services of AWS. It provides block-level storage volumes to be used with EC2 instances. While instance storage data cannot be persisted after the instance has been terminated, EBS volumes let you persist your data independently of the life cycle of the instance to which the volumes are attached. EBS is sometimes also termed off-instance storage.

EBS provides consistent and low-latency performance. Its reliability comes from the fact that each EBS volume is automatically replicated within its availability zone to protect you from hardware failures. It also provides the ability to copy snapshots of volumes across AWS regions, which enables you to migrate data and plan for disaster recovery.

Amazon Glacier

Amazon Glacier is an extremely low-cost storage service targeted at data archival and backup. Amazon Glacier is optimized for infrequent access of data. You can reliably store data that you do not need to read frequently at a cost as low as $0.01 per GB per month.

AWS commits to provide average annual durability of 99.999999999 percent for an archive. This is achieved by redundantly storing data in multiple locations and on multiple devices within one location. Glacier automatically performs regular data integrity checks and has automatic self-healing capability.
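To get a feel for what a durability figure like this means, we can turn it into an expected number of lost archives per year. The following is a rough back-of-the-envelope sketch; it assumes that the figure can be read as an independent per-archive annual survival probability, which is a simplification, and the archive count is a made-up example.

```python
from fractions import Fraction

# Durability figure quoted in the text, kept exact by using fractions.
DURABILITY_PERCENT = Fraction("99.999999999")
annual_loss_probability = 1 - DURABILITY_PERCENT / 100  # 1 in 10**11

def expected_annual_losses(archive_count: int) -> Fraction:
    """Expected number of archives lost per year under the stated assumption."""
    return archive_count * annual_loss_probability

# Hypothetical example: ten million archives stored in Glacier.
losses = expected_annual_losses(10_000_000)
print(losses)      # expected archives lost per year
print(1 / losses)  # average number of years per single lost archive
```

Under these assumptions, storing ten million archives would lead to the loss of a single archive about once every 10,000 years on average.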

AWS Storage Gateway

AWS Storage Gateway is a service that securely and seamlessly connects an on-premise software appliance with AWS's storage infrastructure. It provides low-latency reads by maintaining an on-premise cache of frequently accessed data while all the data is stored securely on Amazon S3 or Glacier.

In case you need low-latency access to your entire dataset, you can configure this service to store data locally and asynchronously back up point-in-time snapshots of this data to S3.

AWS Import/Export

The AWS Import/Export service accelerates moving large amounts of data into and out of AWS infrastructure using portable storage devices for transport. Transferring data over the Internet is not always a feasible way to move data to and from AWS's storage services.

Using this service, you can import data into Amazon S3, Glacier, or EBS. It is also helpful in disaster recovery scenarios wherein you might need to quickly retrieve a large data backup stored in S3 or Glacier; using this service, your data can be transferred to a portable storage device and delivered to your site.
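A quick estimate shows why shipping storage devices can beat a network transfer. The sketch below is only an illustration; the dataset size, the link speed, and the 80 percent link utilization are assumptions picked for the example, not figures from AWS.

```python
def transfer_days(data_terabytes: float, link_mbps: float,
                  utilization: float = 0.8) -> float:
    """Estimate the days needed to move data over a network link.

    utilization is the assumed fraction of the link actually available.
    """
    bits = data_terabytes * 10**12 * 8                 # decimal terabytes to bits
    seconds = bits / (link_mbps * 10**6 * utilization)
    return seconds / 86_400                            # seconds in a day

# Hypothetical example: 50 TB over a 100 Mbps connection.
print(round(transfer_days(50, 100), 1))
```

With these example numbers, the transfer would take almost two months, which is when sending a portable storage device becomes the more practical option.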

Databases

AWS provides fully managed relational and NoSQL database services. It also has a fully managed in-memory caching service and a fully managed data warehouse service. You can also use Amazon EC2 and EBS to host any database of your choice.

Amazon RDS

RDS stands for Relational Database Service. With database systems, setup, backup, and upgrades are tasks that are tedious and at the same time critical. RDS aims to free you of these responsibilities and lets you focus on your application. RDS supports all the major databases, namely, MySQL, Oracle, SQL Server, and PostgreSQL. It also provides the capability to resize the instances holding these databases as per the load. Similarly, it provides a facility to add more storage as and when required.

With Amazon RDS, it takes just a few clicks to use replication to enhance availability and reliability for production workloads. Using its Multi-AZ deployment option, you can run very critical applications with high availability and in-built automated failover. It synchronously replicates data to a secondary database. On failure of the primary database, Amazon RDS automatically starts serving further requests from the replicated secondary database.

Amazon DynamoDB

Amazon DynamoDB is a fully managed NoSQL database service mainly aimed at applications requiring single-digit millisecond latency. There is no limit to the amount of data you can store in DynamoDB. It uses SSD storage, which helps in providing very high performance.

DynamoDB is a schemaless database. Tables do not need to have fixed schemas, and each record may have a different number of columns. Unlike many other nonrelational databases, DynamoDB supports strongly consistent reads, which make sure that you always read the latest value.
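The schemaless idea is easy to picture with plain Python dictionaries. This sketch does not talk to DynamoDB at all; the table contents, the user_id key, and the helper function are hypothetical and serve only to show records of one table carrying different columns.

```python
# Hypothetical records of a single table: every record carries the key
# attribute (user_id), but the remaining columns differ from record to record.
items = [
    {"user_id": "u1", "name": "Asha"},
    {"user_id": "u2", "name": "Ben", "email": "ben@example.com"},
    {"user_id": "u3", "last_login": "2014-10-01", "visits": 42},
]

def attribute_names(item: dict) -> set:
    """Return the non-key columns present on one record."""
    return set(item) - {"user_id"}

for item in items:
    print(item["user_id"], sorted(attribute_names(item)))
```

Only the key has to be present on every record; everything else can vary, which is what distinguishes this model from a relational table with a fixed set of columns.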

DynamoDB also integrates with Amazon Elastic MapReduce (Amazon EMR). With DynamoDB, it is easy for customers to use Amazon EMR to analyze datasets stored in DynamoDB and archive the results in Amazon S3.

Amazon Redshift

Amazon Redshift is basically a modern data warehouse system. It is an enterprise-class relational query and management system. It is PostgreSQL compliant, which means you may use most of the SQL commands to query tables in Redshift.

Amazon Redshift achieves efficient storage and great query performance through a combination of various techniques. These include a massively parallel processing architecture, columnar data storage, and very efficient, targeted data compression encoding schemes chosen as per the column data type. It has the capability of automated backups and fast restores. There are in-built commands to import data directly from S3, DynamoDB, or your on-premise servers to Redshift.

You can configure Redshift to use SSL to secure data transmission. You can also set it up to encrypt data at rest, for which Redshift uses hardware-accelerated AES-256 encryption.

As we will see in Chapter 10, Use Case – Analyzing CloudFront Logs Using Amazon EMR, Redshift can be used as the data store to efficiently analyze all your data using existing business intelligence tools such as Tableau or Jaspersoft. Many of these existing business intelligence tools have in-built capabilities or plugins to work with Redshift.

Amazon ElastiCache

Amazon ElastiCache is basically an in-memory cache cluster service in the cloud. It makes life easier for developers by offloading most of the operational tasks. Using this service, your applications can fetch frequently needed information or counter-like data from fast in-memory caches.

Amazon ElastiCache supports the two most commonly used open source in-memory caching engines:

  • Memcached

  • Redis

As with other AWS services, Amazon ElastiCache is also fully managed, which means it automatically detects and replaces failed nodes.

Networking and CDN

Networking and CDN services include the networking services that let you create logically isolated networks in the cloud, set up a private network connection to the AWS cloud, and use an easy-to-use DNS service. AWS also has a content delivery network service that lets you deliver content to your users at higher speeds.

Amazon VPC

VPC stands for Virtual Private Cloud. As the name suggests, AWS allows you to set up an isolated section of the AWS cloud, which is private. You can launch resources to be available only inside that private network. It allows you to create subnets and then create resources within those subnets. For EC2 instances without VPC, one internal and one external IP address are always assigned; but with VPC, you have control over the IP of your resource, and you may choose to keep only an internal IP for a machine. In effect, that machine will only be known to other machines on that subnet, which gives you a greater level of control over the security of your cloud infrastructure.
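The relationship between a private network and its subnets can be explored with Python's standard ipaddress module. This is only an addressing sketch and makes no AWS calls; the 10.0.0.0/16 range, the /24 subnet size, the subnet names, and the host address are all hypothetical choices.

```python
import ipaddress
from itertools import islice

# Hypothetical address range for the whole private network.
vpc = ipaddress.ip_network("10.0.0.0/16")

# Carve the first two /24 subnets out of that range.
public_subnet, private_subnet = islice(vpc.subnets(new_prefix=24), 2)

# A machine that keeps only an internal IP in the second subnet.
host = ipaddress.ip_address("10.0.1.25")

print(public_subnet, private_subnet)
print(host in private_subnet)  # membership test against the subnet
print(host.is_private)         # address is not routable on the Internet
```

An address taken from such a private range is not reachable from the Internet on its own, which is why a machine with only an internal IP stays hidden from the outside world.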

You can further control the security of your cloud infrastructure by using features such as security groups and network access control lists. You can configure inbound and outbound filtering at instance level as well as at subnet level.

You can connect your entire VPC to your on-premise data center.

Amazon Route 53

Amazon Route 53 is simply a Domain Name System (DNS) service that translates names to IP addresses and provides low-latency responses to DNS queries by using its global network of DNS servers.

Amazon CloudFront

Amazon CloudFront is a CDN service provided by AWS. Amazon CloudFront has a network of delivery centers, called edge locations, all around the world. Static content is cached on the edge locations closest to the requests for that content, resulting in lower latency for further downloads of the same content. Requests for your content are automatically routed to the nearest edge location, so content is delivered with the best possible performance.

AWS Direct Connect

If you do not want to rely on the Internet to connect to AWS services, you may use this service. Using AWS Direct Connect, private connectivity can be established between your data center and AWS. You may also want to use this service to reduce your network costs and have more consistent network performance.

Analytics

Analytics is the group of services that includes Amazon EMR, among others. These services help you to process and analyze huge volumes of data.

Amazon EMR

The Amazon EMR service lets you process any amount of data by launching a cluster with the required number of instances, and this cluster will have one of the analytics engines predeployed. EMR mainly provides Hadoop and related tools such as Pig, Hive, and HBase. People who have spent hours deploying a Hadoop cluster will understand the importance of EMR. Within minutes, you can launch a Hadoop cluster having hundreds of instances. Also, you can resize your cluster on the go with a few simple commands. We will be learning more about EMR throughout this book.

Amazon Kinesis

Amazon Kinesis is a service for real-time streaming data collection and processing. It can collect and process hundreds of terabytes of data per hour from hundreds of thousands of sources, as claimed by AWS. It allows you to write applications to process data in real time from sources such as log streams, clickstreams, and many more. You can build real-time dashboards showing current trends, recent changes/improvements, failures, and errors.

AWS Data Pipeline

AWS Data Pipeline is basically a service to automate a data pipeline. That is, using this service, you can reliably move data between various AWS resources at scheduled times and when certain preconditions are met. For instance, suppose you receive daily logs in your S3 buckets and you need to process them using EMR and move the output to a Redshift table. All of this can be automated using AWS Data Pipeline, and you will get processed data moved to Redshift on a daily basis, ready to be queried by your BI tool.

Application services

Application services are services that you can use within your applications. These include search functionality, a queuing service, push notifications, and e-mail delivery, among others.

Amazon CloudSearch (Beta)

Amazon CloudSearch is a search service that allows you to easily integrate fast and highly scalable search functionality into your applications. It now supports 34 languages. It also supports popular search features such as highlighting, autocomplete, and geospatial search.

Amazon SQS

SQS stands for Simple Queue Service. It provides a hosted queue to store messages as they are transferred between computers. It ensures that no messages are lost, as all messages are stored redundantly across multiple servers and data centers.

Amazon SNS

SNS stands for Simple Notification Service. It is basically a push messaging service. It allows you to push messages to mobile devices or distributed services. You can seamlessly scale at any time from a few messages a day to thousands of messages per hour.

Amazon SES

SES stands for Simple Email Service. It is basically an e-mail service for the cloud. You can use it for sending bulk and transactional e-mails. It provides real-time access to sending statistics and also provides alerts on delivery failures.

Amazon AppStream

Amazon AppStream is a service that helps you to stream heavy applications such as games or videos to your customers.

Amazon Elastic Transcoder

Amazon Elastic Transcoder is a service that lets you transcode media. It is a fully managed service that makes it easy to convert media files in the cloud with scalability and at a low cost.

Amazon SWF

SWF stands for Simple Workflow Service. It is a task coordination and state management service for various applications running on AWS.

Deployment and Management

The Deployment and Management group contains the services that AWS provides to help with the deployment and management of your applications on AWS cloud infrastructure. It also includes services to monitor your applications and keep track of your AWS API activities.

AWS Identity and Access Management

The AWS Identity and Access Management (IAM) service enables you to define fine-grained access control to AWS services and resources for your users.
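Fine-grained access in IAM is expressed through JSON policy documents. The following sketch builds one such document in Python and prints it as JSON; the bucket name example-log-bucket and the choice of read-only S3 actions are assumptions made for illustration, so adapt them to your own resources.

```python
import json

# A sketch of a policy that allows read-only access to one S3 bucket.
# The bucket name is hypothetical.
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-log-bucket",    # the bucket itself
                "arn:aws:s3:::example-log-bucket/*",  # objects inside it
            ],
        }
    ],
}

policy_json = json.dumps(read_only_policy, indent=2)
print(policy_json)
```

A user who has only this policy attached can list and read objects in that one bucket and nothing else.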

Amazon CloudWatch

Amazon CloudWatch is a web service that provides monitoring for various AWS cloud resources. It collects metrics specific to the resource. It also allows you to programmatically access your monitoring data and build graphs or set alarms to help you better manage your infrastructure. Basic monitoring metrics (at 5-minute frequency) for Amazon EC2 instances are free of charge. It will cost you if you opt for detailed monitoring. For pricing, you can refer to http://aws.amazon.com/cloudwatch/pricing/.
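The idea of setting an alarm on a metric can be sketched in a few lines of plain Python. This is a simplified illustration of the concept and not the CloudWatch API; the function name, the threshold, the number of evaluation periods, and the sample CPU values are all hypothetical.

```python
def alarm_state(datapoints: list, threshold: float,
                evaluation_periods: int) -> str:
    """Evaluate a simple 'greater than threshold' alarm.

    datapoints holds one metric value per monitoring period (for example,
    one value every 5 minutes), oldest first.
    """
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"  # not enough periods collected yet
    if all(value > threshold for value in recent):
        return "ALARM"              # every recent period breached the threshold
    return "OK"

# Hypothetical CPU utilization samples for one EC2 instance.
cpu_samples = [40.0, 85.0, 90.0, 95.0]
print(alarm_state(cpu_samples, threshold=80.0, evaluation_periods=3))
```

Requiring several consecutive breaching periods, as this sketch does, keeps a single short spike from triggering the alarm.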

AWS Elastic Beanstalk

AWS Elastic Beanstalk is a service that helps you to easily deploy web applications and services built on popular programming languages such as Java, .NET, PHP, Node.js, Python, and Ruby. There is no additional charge for this service; you only pay for the underlying AWS infrastructure that you create for your application.

AWS CloudFormation

AWS CloudFormation is a service that provides you with an easy way to create a set of related AWS resources and provision them in an orderly and predictable fashion. This service makes it easier to replicate a working cloud infrastructure. There are various templates provided by AWS; you may use any one of them as it is or you can create your own.
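A CloudFormation template is a JSON document that lists the resources to create. As a minimal sketch, the following Python snippet builds a template containing a single S3 bucket and prints it as JSON; the logical name LogBucket and the description text are hypothetical.

```python
import json

# A minimal template sketch with one resource. The logical resource name
# (LogBucket) and the description are made up for this example.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Example stack with a single S3 bucket",
    "Resources": {
        "LogBucket": {
            "Type": "AWS::S3::Bucket",
        },
    },
}

template_json = json.dumps(template, indent=2)
print(template_json)
```

Since the whole infrastructure is described in one document, creating another stack from the same template gives you a replica of the setup.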

AWS OpsWorks

AWS OpsWorks is a service built for DevOps. It is an application management service that makes it easy to manage an entire application stack from load balancers to databases.

AWS CloudHSM

The AWS CloudHSM service allows you to use dedicated Hardware Security Module (HSM) appliances within the AWS Cloud. You may need to meet some corporate, contractual, or regulatory compliance requirements for data security, which you can achieve by using CloudHSM.

AWS CloudTrail

AWS CloudTrail is simply a service that logs API requests to AWS from your account. It logs API requests to AWS from all the available sources such as AWS Management Console, various AWS SDKs, and command-line tools.

AWS keeps adding useful and innovative products to its already vast set of services. AWS is clearly the leader among the cloud infrastructure providers.

AWS Pricing

Amazon provides a Free Tier across AWS products and services in order to help you get started and gain hands-on experience before you build your solutions on top of them. Using the Free Tier, you can test your applications and gain the confidence required before full-fledged use.

The following table lists some of the common services and what you can get in the Free Tier for them:

Service | Free Tier limit
Amazon EC2 | 750 hours per month of Linux, RHEL, or SLES t2.micro instance usage; 750 hours per month of Windows t2.micro instance usage
Amazon S3 | 5 GB of standard storage, 20,000 Get requests, and 2,000 Put requests
Amazon EBS | 30 GB of Amazon EBS storage in any combination of general purpose (SSD) or magnetic; 2,000,000 I/Os (with EBS magnetic) and 1 GB of snapshot storage
Amazon RDS | 750 hours per month of micro DB instance usage; 20 GB of DB storage, 20 GB for backups, and 10,000,000 I/Os

The Free Tier is available to new customers only for the first 12 months after sign-up. When your 12 months expire or if your usage exceeds the Free Tier limits, you will need to pay standard rates, which AWS calls pay-as-you-go service rates. You can refer to each service's page for pricing details. For example, in order to get the pricing details for EC2, you may refer to http://aws.amazon.com/ec2/pricing/.
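It helps to check how far the 750-hour EC2 allowance goes. The following is a small sketch under a simplifying assumption: all instances are of the same Free Tier eligible type and run around the clock for the whole month. The function name is hypothetical.

```python
FREE_TIER_HOURS_PER_MONTH = 750  # EC2 limit from the Free Tier table

def hours_beyond_free_tier(instance_count: int, days_in_month: int) -> int:
    """Hours that would be charged at standard rates in one month."""
    used_hours = instance_count * days_in_month * 24
    return max(0, used_hours - FREE_TIER_HOURS_PER_MONTH)

# One instance running all month uses at most 31 * 24 = 744 hours.
print(hours_beyond_free_tier(1, 31))
# Two instances running all of a 30-day month exceed the allowance.
print(hours_beyond_free_tier(2, 30))
```

In other words, the allowance covers one instance running continuously, while a second always-on instance of the same type would be billed at standard rates.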

Tip

You should keep tabs on your usage and use a service only after you know that its pricing and your expected usage match your budget. In order to track your AWS usage, sign in to the AWS management console and open the Billing and Cost Management console at https://console.aws.amazon.com/billing/home#/.