Learning Apache Apex

By: Thomas Weise, Ananth Gundabattula, Munagala V. Ramanath, David Yan, Kenneth Knowles

Overview of this book

Apache Apex is a next-generation stream processing framework designed to operate on data at large scale, with minimum latency, maximum reliability, and strict correctness guarantees. Half of the book consists of Apex applications, showing you key aspects of data processing pipelines such as connectors for sources and sinks, and common data transformations. The other half of the book is evenly split into explaining the Apex framework, and tuning, testing, and scaling Apex applications. Much of our economic world depends on growing streams of data, such as social media feeds, financial records, and data from mobile devices, sensors, and machines (the Internet of Things, or IoT). The projects in the book show how to process such streams to gain valuable, timely, and actionable insights. Traditional use cases, such as ETL, that currently consume a significant chunk of data engineering resources are also covered. The final chapter shows you future possibilities emerging in the streaming space, and how Apache Apex can contribute to it.

Building and running the application


You can build the application using the usual Maven command:

mvn clean package -DskipTests

The first time you do this, it may take a few minutes to complete, as it downloads all the dependency artifacts, but subsequent builds should go much faster. When the build completes, you should see a directory called target and a file called etl-1.0-SNAPSHOT.apa within it.

This is the application archive file that needs to be deployed to run the application on an actual Hadoop cluster.
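
As a quick sanity check after the build, the following is a minimal sketch of a shell helper that confirms an archive was actually produced. The function name find_apa is hypothetical and not part of Apex or Maven; it only assumes that the build writes one or more .apa files into the given directory (target by default).

```shell
#!/bin/sh
# Hypothetical helper: list Apex application archives (*.apa) in a build directory.
# Prints each archive path; returns 0 if at least one was found, 1 otherwise.
find_apa() {
    dir="${1:-target}"
    found=1
    for f in "$dir"/*.apa; do
        # If the glob matched nothing, $f is the literal pattern; skip it.
        [ -f "$f" ] || continue
        echo "$f"
        found=0
    done
    return "$found"
}
```

After running mvn clean package -DskipTests, calling find_apa target from the project directory should print the path of the etl-1.0-SNAPSHOT.apa file mentioned above; a non-zero return status suggests the build did not produce an archive.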

The application includes a test file that can be used to run the entire application in your favorite IDE (such as Eclipse or IntelliJ) without the need for an external cluster, as described in Chapter 2, Getting Started with Application Development. You can also run the test from the command line using the following command:

mvn -Dtest=SampleApplicationTest#test test

On modern machines, this should complete successfully in about 40 seconds; if something is wrong, it will fail with a timeout...
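
To compare your own run against that rough figure, one option is a small wrapper that reports the exit status and wall-clock time of any command. This is only a sketch: the function name run_timed is hypothetical, and it assumes a date command that supports the +%s format (seconds since the epoch), which is common but not guaranteed on every system.

```shell
#!/bin/sh
# Hypothetical helper: run a command, then report its exit status and
# wall-clock duration in whole seconds. Returns the command's own status.
run_timed() {
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "exit=$status elapsed=$((end - start))s"
    return "$status"
}
```

For example, run_timed mvn -Dtest=SampleApplicationTest#test test runs the test as before and appends one summary line, so a failure and an unusually long run are both easy to spot.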