Programming MapReduce with Scalding

By: Antonios Chalkiopoulos

Running our first Scalding job

After adding Scalding as a project dependency, we can create our first Scalding job in the file src/main/scala/WordCountJob.scala:

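import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  // Read the file given by --input; each line arrives in the 'line field
  TextLine( args("input") )
  // Emit one lowercase 'word per whitespace-separated token
  .flatMap('line -> 'word) { line : String => 
    line.toLowerCase.split("\\s+") }
  // Count the occurrences of each distinct word
  .groupBy('word) { _.size }
  // Write the results as tab-separated values to the path given by --output
  .write( Tsv( args("output") ) )
}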

The Scalding code above implements a Cascading flow that reads an input file as its source tap and writes the results to another file, which acts as the sink (output) tap. The pipeline converts each line to lowercase, splits it into words on whitespace, and counts how many times each word appears in the input text.
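To make the tokenize-and-count logic easier to follow, here is a minimal sketch of the same two steps using plain Scala collections. It does not use Scalding, Cascading, or Hadoop. The object name WordCountSketch, the helper names tokenize and wordCount, and the sample sentence are assumptions made for illustration and are not part of the book's project.

```scala
// Sketch only: plain Scala collections, no Scalding or Hadoop involved.
// All names here (WordCountSketch, tokenize, wordCount) are hypothetical.
object WordCountSketch {

  // Mirrors the flatMap step: lower-case the line and split it on whitespace.
  def tokenize(line: String): Seq[String] =
    line.toLowerCase.split("\\s+").toSeq

  // Mirrors groupBy('word) { _.size }: count the occurrences of each word.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(tokenize)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("This is a happy day. A day to remember"))
    // Print one "word<TAB>count" line per word, sorted by word for readability.
    counts.toSeq.sortBy(_._1).foreach { case (word, n) => println(s"$word\t$n") }
  }
}
```

Because the split is on whitespace only, punctuation stays attached to the token: in the sample sentence, "day." and "day" are counted as two different words, while "a" is counted twice thanks to the lowercase conversion.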

Note

Find complete project files in the code accompanying this book at http://github.com/scalding-io/ProgrammingWithScalding.

We can create a dummy file to use as input with the following command:

$ echo "This is a happy day. A day to remember" > input.txt

Scalding supports two types of execution modes: local mode and...