Book Image

Mastering Hadoop

By : Sandeep Karanth
Book Image

Mastering Hadoop

By: Sandeep Karanth

Overview of this book

Table of Contents (21 chapters)
Mastering Hadoop
Credits
About the Author
Acknowledgments
About the Reviewers
www.PacktPub.com
Preface
Index

Filtering inputs


Filtering inputs to a job based on certain attributes is often required. Data-level filtering can be done within the Maps, but it is more efficient to filter at the file level before the Map task is spawned. Filtering enables only interesting files to be processed by Map tasks and can have a positive effect on the runtime of the Map by eliminating unnecessary file fetch. For example, files generated only within a certain time period might be required for analysis.

Let's use the 441-grant proposal file corpus subset to illustrate filtering. Let's process those files whose names match a particular regular expression and have a minimum file size. Both of these are specified as job parameters—filter.name and filter.min.size, respectively. Implementation entails extending the Configured class and implementing the PathFilter interface as shown in the following snippet. The Configured class is the base class for things that can be configured using Configuration. The PathFilter interface...