Programming MapReduce with Scalding

By: Antonios Chalkiopoulos

Overview of this book

Credits

About the Author

About the Reviewers

www.PacktPub.com

Preface

Introduction to MapReduce

The Hadoop platform

MapReduce abstractions

Introducing Cascading

Get Ready for Scalding

Scala build tools

Hello World in Scala

Development editors

Installing Hadoop in five minutes

Running our first Scalding job

Submitting a Scalding job in Hadoop

Scalding by Example

Reading and writing files

Understanding the core capabilities of Scalding

Operations on groups

A simple example

Intermediate Examples

Logfile analysis

Exploring ad targeting

Scalding Design Patterns

The external operations pattern

The dependency injection pattern

The late bound dependency pattern

Testing and TDD

Introduction to testing

MapReduce testing challenges

Development lifecycle with testing strategy

TDD for Scalding developers

Black box testing

Running Scalding in Production

Executing Scalding in a Hadoop cluster

Scheduling execution

Coordinating job execution

Configuring using a property file

Configuring using Hadoop parameters

Monitoring Scalding jobs

Using slim JAR files

Scalding execution throttling

Using External Data Stores

Interacting with external systems

NoSQL databases

Search platforms

Matrix Calculations and Machine Learning

Text similarity using TF-IDF

Setting a similarity using the Jaccard index

K-Means using Mahout

Other libraries

Index

Index

A

acceptance testing / Introduction to testing
acceptance tests, TDD
- decomposing / Defining acceptance tests
access patterns, SQL
- SELECT / SQL databases
- INSERT / SQL databases
- UPDATE / SQL databases
- UPSERT / SQL databases
- DELETE / SQL databases
addTrap operation
- about / Pipe operations
Ad targeting
- about / Exploring ad targeting
- daily points calculation / Calculating daily points
- historic points calculation / Calculating historic points
- targeted ads generation / Generating targeted ads
advanced serialization files
- about / Reading and writing files
Algebird abstract algebra library
- using / Other libraries
algorithm, TDD
- decomposing / Decomposing the algorithm
Apache Lucene library
- about / Search platforms
Azkaban
- about / Scheduling execution

B

black box testing
- about / Black box testing
- benefit / Black box testing

C

Cascading
- about / MapReduce abstractions
- working with / Introducing Cascading
- pipe / What happens inside a pipe
- pipe assemblies / Pipe assemblies
- extensions / Cascading extensions
Cassandra
- about / NoSQL databases
ClusterWritable object / K-Means using Mahout
Comma Separated Values (CSV)
- about / Reading and writing files
composite operations
- unique operation / Composite operations
- crossWithTiny operation / Composite operations
- normalize operation / Composite operations
- partition operation / Composite operations
Concurrent
- about / Monitoring Scalding jobs
configuration
- performing, Hadoop parameters used / Configuring using Hadoop parameters
configuration data
- reading, from property file / Configuring using a property file
cron
- about / Scheduling execution
crossWithTiny operation
- about / Composite operations

D

daily points
- calculating / Calculating daily points
DataNode nodes
- about / The Hadoop platform
debug operation
- about / Pipe operations
delimited files
- about / Reading and writing files
Dependency Injection pattern
- about / The dependency injection pattern
- implementing / The dependency injection pattern
development editors, Scala
- about / Development editors
discard operation
- about / Map-like operations
domain-specific language (DSL) / Pipe assemblies
dot group operation
- about / Operations on groups
Driven
- about / Monitoring Scalding jobs
- URL / Monitoring Scalding jobs

E

ElephantDB
- about / NoSQL databases
ElasticSearch
- about / Search platforms, Elastic search
- URL / Elastic search
- Scalding wrapper, implementing for / Elastic search
- advanced search tap, URL / Elastic search
Euclidean
- using / K-Means using Mahout
execution
- scheduling / Scheduling execution
execution throttling
- scalding / Scalding execution throttling
Extension Methods
- working, URL / The external operations pattern
external operations pattern
- about / The external operations pattern
- LogsSchemas object, creating / The external operations pattern
- implementing / The external operations pattern
- Scalding job responsibilities / The external operations pattern
external systems
- interacting with / Interacting with external systems

F

file formats, Scalding
- TextLine format / Reading and writing files
- delimited files / Reading and writing files
- advanced serialization files / Reading and writing files
files
- reading, with Scalding / Reading and writing files, TextLine parsing, Executing in the local and Hadoop modes
- writing, with Scalding / Reading and writing files, TextLine parsing, Executing in the local and Hadoop modes
- reading, best practices / Best practices to read and write files
- writing, best practices / Best practices to read and write files
filter operation
- about / Map-like operations
finalized job scalability
- analyzing / Completing the implementation, Exploring ad targeting
flatMap function
- about / Scala basics
flatMap operation
- about / Map-like operations
flatMapTo operation
- about / Map-like operations
foldLeft group operation
- about / Operations on groups
function literal, Scala
- about / Scala basics

G

Ganitha
- about / Other libraries
- URL / Other libraries
groupAll operation
- about / Grouping/reducing functions
groupBy function
- about / Scala basics
groupBy operation
- about / Grouping/reducing functions
grouping operations
- about / Grouping/reducing functions
- groupBy operation / Grouping/reducing functions
- groupAll operation / Grouping/reducing functions
group operations
- sizeAveStdev / Operations on groups
- toList / Operations on groups
- sortBy / Operations on groups
- last / Operations on groups
- take / Operations on groups
- takeWhile / Operations on groups
- drop / Operations on groups
- sortWithTake / Operations on groups
- sortedReverseTake / Operations on groups
- pivot / Operations on groups
- reducers / Operations on groups
- reduce / Operations on groups
- foldLeft / Operations on groups
- dot / Operations on groups
- histogram / Operations on groups
- hyperLogLog / Operations on groups
groups
- operations, performing on / Operations on groups
- composite operations, performing on / Composite operations

H

Hadoop
- installing / Installing Hadoop in five minutes
- Scalding job, submitting into / Submitting a Scalding job in Hadoop
Hadoop cluster
- Scalding, executing in / Executing Scalding in a Hadoop cluster
Hadoop parameters
- used, for configuration / Configuring using Hadoop parameters
Hadoop platform
- about / The Hadoop platform
HBase
- about / NoSQL databases, Understanding HBase
- reading from / Reading from HBase
- writing in / Writing in HBase
- advanced features, using / Using advanced HBase features
HDFS
- about / The Hadoop platform
HDFS mode
- used, for executing Scalding application / Executing in the local and Hadoop modes
head group operation
- about / Operations on groups
Hello World application
- executing, in Scala / Hello World in Scala
histogram group operation
- about / Operations on groups
historic points
- calculating / Calculating historic points
hyperLogLog group operation
- about / Operations on groups

I

insert operation
- about / Map-like operations
integration testing / Introduction to testing
integration tests, TDD
- implementing / Implementing integration tests

J

Jaccard index
- used, for setting text similarity / Setting a similarity using the Jaccard index, K-Means using Mahout
JDBC (Java Database Connectivity) / SQL databases
Jenkins
- about / Scheduling execution
job execution
- coordinating / Coordinating job execution
JobLibLoader class
- about / Using slim JAR files
JobRunner class
- about / Using slim JAR files
job scheduling
- tools / Scheduling execution
JobTracker
- about / The Hadoop platform
- used, for submitting Scalding job / Submitting a Scalding job in Hadoop
- URL / Scalding execution throttling
join operations
- about / Join operations
- joinWithSmaller / Join operations
- joinWithLarger / Join operations
- joinWithTiny / Join operations
joinWithLarger operation
- about / Join operations
joinWithSmaller operation
- about / Join operations
- syntax / Join operations
joinWithTiny operation
- about / Join operations

K

K-Means
- about / K-Means using Mahout
- implementing, Mahout used / K-Means using Mahout
- URL / K-Means using Mahout

L

last group operation
- about / Operations on groups
Late Bound Dependency pattern
- about / The late bound dependency pattern
- implementing / The late bound dependency pattern
left join
- about / Join operations
limit operation
- about / Map-like operations
lists
- about / Scala basics
- higher-order functions / Scala basics
Locality Sensitive Hashing (LSH) / Other libraries
logfile analysis
- about / Logfile analysis
- data-transformation jobs, implementing / Logfile analysis
- data-transformation jobs, executing / Logfile analysis
- bucketing and binning / Logfile analysis
- data-processing job, completing / Logfile analysis
- implementation, completing / Completing the implementation, Exploring ad targeting
logsAddDayColumn operation
- defining / The external operations pattern
logsCountVisits operation
- defining / The external operations pattern

M

Mahout
- used, for K-Means implementation / K-Means using Mahout
map-like operations
- about / Map-like operations
- map operation / Map-like operations
- mapTo operation / Map-like operations
- flatMap operation / Map-like operations
- flatMapTo operation / Map-like operations
- unpivot operation / Map-like operations
- pivot operation / Map-like operations
- project operation / Map-like operations
- discard operation / Map-like operations
- insert operation / Map-like operations
- limit operation / Map-like operations
- filter operation / Map-like operations
- sample operation / Map-like operations
- pack operation / Map-like operations
- unpack operation / Map-like operations
map operation
- about / Map-like operations
MapReduce
- about / The Hadoop platform
- shared nothing architecture / MapReduce
- working, example / A MapReduce example
- testing challenges / MapReduce testing challenges
MapReduce abstractions
- about / MapReduce abstractions
- Cascading / MapReduce abstractions
MapReduce logic, TDD
- implementing / Implementing the MapReduce logic
mapTo operation
- about / Map-like operations
maven-assembly-plugin / Using slim JAR files
mkString operation
- about / Operations on groups
MongoDB
- about / NoSQL databases
mvn package / Using slim JAR files

N

NameNode
- about / The Hadoop platform
NameNode service / Monitoring Scalding jobs
NameNode web interface / Submitting a Scalding job in Hadoop
name operation
- about / Pipe operations
normalize operation
- about / Composite operations
NoSQL databases
- about / NoSQL databases
- MongoDB / NoSQL databases
- Cassandra / NoSQL databases
- ElephantDB / NoSQL databases
- HBase / NoSQL databases

O

One Separated Values (OSV)
- about / Reading and writing files
Oozie
- about / Scheduling execution
outer join
- about / Join operations
outlier detection
- about / K-Means using Mahout

P

pack operation
- about / Map-like operations
partition operation
- about / Composite operations
Pig
- about / MapReduce abstractions
- using / MapReduce abstractions
pipe assemblies
- about / Pipe assemblies
- Each / Pipe assemblies
- GroupBy / Pipe assemblies
- Every / Pipe assemblies
- CoGroup / Pipe assemblies
- SubAssembly / Pipe assemblies
pipe operations
- about / Pipe operations
pipes
- about / Introducing Cascading, What happens inside a pipe
- implementing / Pipe assemblies
- reusing / A simple example
pivot group operation
- about / Operations on groups
POJO
- about / Map-like operations
project operation
- about / Map-like operations
property file
- configuration data, reading from / Configuring using a property file

R

reduce group operation
- about / Operations on groups
reducers group operation
- about / Operations on groups
rename operation
- about / Pipe operations
right join
- about / Join operations

S

sample operation
- about / Map-like operations
Scala
- about / Why Scala?
- significance / Why Scala?
- basics / Scala basics
- trait / Scala basics
- lists / Scala basics
- tuples / Scala basics
- methods / Scala basics
- function literals / Scala basics
- Hello World application, executing in / Hello World in Scala
Scala build tools
- about / Scala build tools
Scala functions
- flatMap / Scala basics
- groupBy / Scala basics
Scala IDE
- URL / Development editors
Scalding
- used, for reading files / Reading and writing files
- used, for writing files / Reading and writing files, Best practices to read and write files
- TextLine parsing / TextLine parsing
- executing, in HDFS mode / Executing in the local and Hadoop modes
- executing, in local mode / Executing in the local and Hadoop modes
- core capabilities / Understanding the core capabilities of Scalding, Map-like operations, Join operations, Grouping/reducing functions
- executing, in Hadoop cluster / Executing Scalding in a Hadoop cluster
Scalding core capabilities
- map-like operations / Map-like operations
- join operations / Join operations
- pipe operations / Pipe operations
- grouping/reducing functions / Grouping/reducing functions
Scalding job
- running / Running our first Scalding job
- submitting, into Hadoop / Submitting a Scalding job in Hadoop
Scalding jobs
- monitoring / Monitoring Scalding jobs
ScaldingUnit framework / Implementing unit tests
scanLeft operation
- about / Calculating daily points
- running / Calculating daily points
search platforms
- about / Search platforms
- Elasticsearch / Elastic search
shared nothing architecture, MapReduce
- about / MapReduce
Shuffle
- about / A MapReduce example
Simple Build Tool (sbt)
- about / Scala build tools
sizeAveStdev group operation
- about / Operations on groups
slim JAR files
- using / Using slim JAR files
software testing / Introduction to testing
Solr
- about / Search platforms
sortBy group operation
- about / Operations on groups
sortedReverseTake group operation
- about / Operations on groups
sortWithTake group operation
- about / Operations on groups
SpyGlass
- URL / Reading from HBase
- used, for reading data from HBase / Reading from HBase
- used, for writing data to HBase / Writing in HBase
SQL databases
- using / SQL databases
- access patterns / SQL databases
SQL dialects
- using / SQL databases
system testing / Introduction to testing
system tests, TDD
- defining / Defining and performing system tests
- performing / Defining and performing system tests

T

Tab Separated Values (TSV)
- about / Reading and writing files
take group operation
- about / Operations on groups
takeWhile group operation
- about / Operations on groups
targeted ads
- generating / Generating targeted ads
TaskTracker nodes
- about / The Hadoop platform
TDD
- implementing / Implementing the TDD methodology
- for Scalding developers / Implementing the TDD methodology, Implementing integration tests
- algorithm, decomposing / Decomposing the algorithm
- acceptance tests, defining / Defining acceptance tests
- integration tests, defining / Implementing integration tests
- unit tests, implementing / Implementing unit tests
- MapReduce logic, implementing / Implementing the MapReduce logic
- system tests, defining / Defining and performing system tests
testing strategy
- data science phase, data exploration / Development lifecycle with testing strategy
- data science phase, whiteboard design / Development lifecycle with testing strategy
- development tasks, TDD implementation / Development lifecycle with testing strategy
- development tasks, production deployment and monitoring / Development lifecycle with testing strategy
TextLine format
- about / Reading and writing files
TextLine parsing
- about / TextLine parsing
- example / TextLine parsing
text similarity
- computing, TF-IDF used / Text similarity using TF-IDF
- setting, Jaccard index used / Setting a similarity using the Jaccard index, K-Means using Mahout
TF-IDF
- about / Text similarity using TF-IDF
- used, for text similarity / Text similarity using TF-IDF
toList group operation
- about / Operations on groups
toList operation
- about / Calculating daily points
tools, for job scheduling
- cron / Scheduling execution
- Jenkins / Scheduling execution
- Oozie / Scheduling execution
- Azkaban / Scheduling execution
trait
- about / Scala basics
tuples, Scala
- about / Scala basics
Typed API
- about / Typed API

U

unique operation
- about / Composite operations
unit/component testing / Introduction to testing
unit tests, TDD
- implementing / Implementing unit tests
unpack operation
- about / Map-like operations
unpivot group operation
- about / Operations on groups
user-defined functions (UDF)
- about / MapReduce abstractions

W

WritableSequenceFile object / K-Means using Mahout

Z

ZooKeeper
- about / The Hadoop platform