Introducing the Big Data ecosystem
Data management is particularly important when the data is in flux, either constantly changing or being routinely produced and updated. What is needed in these cases is a way of storing, structuring, and auditing data that allows for the continuous processing and refinement of models and results.
Here, we describe how best to hold and organize your data so that it integrates with Apache Spark and related tools, within the context of a data architecture that is broad enough to fit everyday requirements.
Data management
Even if, in the medium term, you only intend to play around with a bit of data at home, without proper data management your efforts will, more often than not, escalate to the point where it is easy to lose track of where you are, and mistakes will happen. Taking the time to think about the organization of your data, and in particular its ingestion, is crucial. There's nothing worse than waiting for a long-running analytic to complete, collating the results, and producing a report, only to discover that you used the wrong version of the data, that the data is incomplete or has missing fields, or, even worse, that you deleted your results!
The bad news is that, despite its importance, data management is an area that is consistently overlooked in both commercial and non-commercial ventures, with precious few off-the-shelf solutions available. The good news is that it is much easier to do great data science using the fundamental building blocks that this chapter describes.
Data management responsibilities
When we think about data, it is easy to overlook how many areas we need to consider. Indeed, most data "newbies" think about the scope in this way:
Obtain data
Place the data somewhere (anywhere)
Use the data
Throw the data away
In reality, there are a large number of other considerations, and it is our combined responsibility to determine which ones apply to a given piece of work. The following data management building blocks assist in answering or tracking some important questions about the data:
File integrity
Is the data file complete?
How do you know?
Was it part of a set?
Is the data file correct?
Was it tampered with in transit?
Data integrity
Is the data as expected?
Are all of the fields present?
Is there sufficient metadata?
Is the data quality sufficient?
Has there been any data drift?
Scheduling
Is the data routinely transmitted?
How often does the data arrive?
Was the data received on time?
Can you prove when the data was received?
Does it require acknowledgement?
Schema management
Is the data structured or unstructured?
How should the data be interpreted?
Can the schema be inferred?
Has the data changed over time?
Can the schema be evolved from the previous version?
Version management
What is the version of the data?
Is the version correct?
How do you handle different versions of the data?
How do you know which version you're using?
Security
Is the data sensitive?
Does it contain personally identifiable information (PII)?
Does it contain protected health information (PHI)?
Does it contain payment card data covered by the Payment Card Industry (PCI) standards?
How should I protect the data?
Who is entitled to read/write the data?
Does it require anonymization/sanitization/obfuscation/encryption?
Disposal
How do we dispose of the data?
When do we dispose of the data?
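To make two of these questions concrete, the following is a minimal sketch, using only the Python standard library, of a file integrity check (does the payload match a published checksum?) and a data integrity check (are all of the expected fields present?). The payload, the field names, and the helper functions sha256_of, file_is_intact, and missing_fields are hypothetical and chosen purely for illustration; SHA-256 is assumed as the checksum, and in practice the digest would be supplied by the data provider rather than computed locally.

```python
import csv
import hashlib
import io


def sha256_of(data):
    """Return the hex SHA-256 digest of a bytes payload."""
    return hashlib.sha256(data).hexdigest()


def file_is_intact(data, expected_digest):
    """File integrity: does the payload match the digest we were given?"""
    return sha256_of(data) == expected_digest


def missing_fields(data, required):
    """Data integrity: which required columns are absent from the CSV header?"""
    reader = csv.DictReader(io.StringIO(data.decode("utf-8")))
    return set(required) - set(reader.fieldnames or [])


# Hypothetical payload and field names, for illustration only.
payload = b"id,name,email\n1,Ada,ada@example.com\n"

# Assumption: computed locally here so the sketch is self-contained;
# normally this value arrives alongside the file from the sender.
published_digest = sha256_of(payload)

print(file_is_intact(payload, published_digest))
print(file_is_intact(payload + b"2,Bob\n", published_digest))
print(missing_fields(payload, {"id", "email", "created_at"}))
```

The first check passes for the untouched payload and fails once an extra row is appended, while the field check reports created_at as missing. Checks of this kind answer only a small part of the list above; scheduling, versioning, security, and disposal each need their own treatment.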
If, after all that, you are still not convinced, then before you go ahead and write that bash script using the gawk and crontab commands, keep reading: you will soon see that there is a far quicker, more flexible, and safer method that allows you to start small and incrementally create commercial-grade ingestion pipelines!
The right tool for the job
Apache Spark is the emerging de facto standard for scalable data processing. At the time of writing this book, it is the most active Apache Software Foundation (ASF) project and has a rich variety of companion tools available. There are new projects appearing every day, many of which overlap in functionality, so it takes time to learn what they do and decide whether they are appropriate to use. Unfortunately, there's no quick way around this. Usually, specific trade-offs must be made on a case-by-case basis; there is rarely a one-size-fits-all solution. Therefore, the reader is encouraged to explore the available tools and choose wisely!
Various technologies are introduced throughout this book, and the hope is that they will give the reader a taste of some of the more useful and practical ones, to a level where they may start using them in their own projects. Further, we hope to show that if the code is written carefully, technologies may be interchanged through clever use of Application Programming Interfaces (APIs) (or higher-order functions in Spark Scala), even when an earlier technology choice proves to be incorrect.
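As a rough illustration of that last point, the sketch below passes the reader, the transformation, and the writer into a pipeline as functions, so that any one of them can be swapped without touching the others. It is written in plain Python rather than Spark Scala, and every name in it (run_pipeline, read_sample, tidy_name, make_list_sink, print_sink) is hypothetical; it shows the general higher-order function idea, not any particular library's API.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]


def run_pipeline(read: Callable[[], Iterable[Record]],
                 transform: Callable[[Record], Record],
                 write: Callable[[Iterable[Record]], None]) -> None:
    """Wire a source, a transformation, and a sink together.

    The pipeline knows nothing about where records come from or go to;
    it only relies on the shape of the three functions it is given.
    """
    write(transform(record) for record in read())


def read_sample() -> List[Record]:
    """Hypothetical source: stands in for a file, queue, or table."""
    return [{"name": " ada "}, {"name": "bob"}]


def tidy_name(record: Record) -> Record:
    """Hypothetical transformation: trim and title-case the name field."""
    return {**record, "name": record["name"].strip().title()}


def make_list_sink(store: List[Record]) -> Callable[[Iterable[Record]], None]:
    """Build a sink that appends records to an in-memory list."""
    def write(records: Iterable[Record]) -> None:
        store.extend(records)
    return write


def print_sink(records: Iterable[Record]) -> None:
    """Alternative sink: print each record instead of storing it."""
    for record in records:
        print(record)


in_memory: List[Record] = []
run_pipeline(read_sample, tidy_name, make_list_sink(in_memory))

# Swapping the sink requires no change to the pipeline itself.
run_pipeline(read_sample, tidy_name, print_sink)
```

If the in-memory sink later turns out to be the wrong choice, only the function passed as write needs to change; the reading and transformation logic stay as they are.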