Learning Hadoop 2

What data lifecycle management is


Data doesn't exist only at a point in time. Particularly for long-running production workflows, you are likely to acquire a significant quantity of data in a Hadoop cluster. Requirements rarely stay static for long, so alongside new logic you might also see the format of that data change, or find that multiple data sources are needed to build the dataset your application processes. We use the term data lifecycle management to describe an approach to handling the collection, storage, and transformation of data that ensures the data is where it needs to be, in the format it needs to be in, and managed in a way that allows both the data and the system to evolve over time.

Importance of data lifecycle management

If you build data processing applications, you are by definition reliant on the data that is processed. Just as we consider the reliability of applications and systems, we also need to ensure that the data itself is production-ready.

Data at some point needs to be ingested...