In this section, we will review the tools that transform data in the Data Lake. We will look in detail at HCatalog, Hive, and Pig, which are popular ways to transform data in the Data Lake. Next, we will see how Azure PowerShell makes it easy to assemble these scripts into a single procedure.
Apache HCatalog manages metadata about the structure of files in Hadoop. In Chapter 5, Ingest and Organize Data Lake, we registered stage tables with HCatalog; in this chapter, we will use that information for transformation.
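To make the idea of "metadata about the structure of files" concrete, here is a minimal Python sketch of what such a catalog records: a table name mapped to a storage location, a file format, and a list of typed columns. This is a toy in-memory model, not the HCatalog API; the class names, the table name `stage_orders`, its path, and its columns are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Column:
    name: str
    type: str


@dataclass
class TableMetadata:
    name: str
    location: str      # directory that holds the table's data files
    file_format: str   # how the files are encoded, e.g. plain text
    columns: list[Column] = field(default_factory=list)


class ToyCatalog:
    """In-memory stand-in for a metadata catalog (not the HCatalog API)."""

    def __init__(self) -> None:
        self._tables: dict[str, TableMetadata] = {}

    def register(self, table: TableMetadata) -> None:
        # Record the table's structure under its name.
        self._tables[table.name] = table

    def describe(self, name: str) -> TableMetadata:
        # Look up the structure that was registered earlier.
        return self._tables[name]


catalog = ToyCatalog()
catalog.register(
    TableMetadata(
        name="stage_orders",              # hypothetical stage table
        location="/data/stage/orders",    # hypothetical path
        file_format="TEXTFILE",
        columns=[Column("order_id", "int"), Column("amount", "double")],
    )
)

# A transformation step can now ask the catalog where the data lives
# and what its columns are, instead of hard-coding that knowledge.
table = catalog.describe("stage_orders")
column_names = [c.name for c in table.columns]
print(table.location, column_names)
```

The point of the sketch is the separation of concerns: the step that registers a table and the step that transforms it share only the table name, which is roughly the role the registered stage tables play for the Hive and Pig scripts in this chapter.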
With Azure HDInsight, the metastore can be hosted in embedded mode in Apache Derby, which comes with standard Hadoop. However, one issue with this approach is that the metastore information is lost every time you shut down HDInsight. An alternative is to store HCatalog information in a separately managed SQL database. Perform the following...