This recipe shows how to use partitioned tables to store data in Hive. A partitioned table stores a dataset split by the values of one or more columns, which allows more efficient querying. The data resides in separate directories, one per partition, and the directory names carry the values of the partition columns. Partitioned tables can improve the performance of some queries: when a query has an appropriate where predicate on the partition columns, Hive reads only the selected partitions and so has less data to process. A common example is a transactional dataset (or any other dataset with timestamps, such as web logs) partitioned by date. When the Hive table is partitioned by date, we can query a single day or a date range and read only the data that belongs to those dates. In a non-partitioned table, the same query would require a full table scan, reading all the data in the table, which can be very inefficient when you have terabytes of...
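The directory-per-partition layout and the effect of a partition predicate can be modeled in a few lines of plain Python. This is only a sketch of the idea, not Hive itself: the column name `log_date`, the file name `part-00000`, the sample rows, and the helpers `write_partition`, `read_table`, and `demo` are all hypothetical and chosen for illustration. As in Hive, the partition value is taken from the directory name and is not stored inside the data files.

```python
import os
import tempfile


def write_partition(table_dir, log_date, rows):
    """Store rows under <table_dir>/log_date=<value>/, one directory per partition."""
    part_dir = os.path.join(table_dir, f"log_date={log_date}")
    os.makedirs(part_dir, exist_ok=True)
    # The data file holds only the non-partition columns (tab-separated).
    with open(os.path.join(part_dir, "part-00000"), "w") as f:
        for row in rows:
            f.write("\t".join(row) + "\n")


def read_table(table_dir, wanted_dates=None):
    """Return (rows, partitions_read).

    When wanted_dates is given, directories for other dates are skipped
    without opening any of their files; this models partition pruning.
    When it is None, every partition is read (a full table scan).
    """
    rows, partitions_read = [], 0
    for name in sorted(os.listdir(table_dir)):
        key, _, value = name.partition("=")
        if key != "log_date":
            continue  # not a partition directory of this table
        if wanted_dates is not None and value not in wanted_dates:
            continue  # pruned: this partition cannot match the predicate
        partitions_read += 1
        part_dir = os.path.join(table_dir, name)
        for fname in sorted(os.listdir(part_dir)):
            with open(os.path.join(part_dir, fname)) as f:
                for line in f:
                    # Append the partition value recovered from the directory name.
                    rows.append(line.rstrip("\n").split("\t") + [value])
    return rows, partitions_read


def demo():
    """Build a three-day table, then read one day and the whole table."""
    with tempfile.TemporaryDirectory() as table_dir:
        write_partition(table_dir, "2014-01-01", [("10.0.0.1", "/index.html")])
        write_partition(table_dir, "2014-01-02",
                        [("10.0.0.2", "/about.html"), ("10.0.0.3", "/index.html")])
        write_partition(table_dir, "2014-01-03", [("10.0.0.4", "/contact.html")])
        one_day = read_table(table_dir, {"2014-01-02"})
        everything = read_table(table_dir)
    return one_day, everything
```

Calling `demo()` reads a single partition directory for the one-day query and all three for the unfiltered read, which is the difference between a pruned query and a full table scan described above.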
Hadoop MapReduce v2 Cookbook - Second Edition: RAW