Apache Hive provides a SQL-like interface to run queries on Big Data frameworks. Apache Hive Cookbook introduces Hive, an essential tool in the Hadoop ecosystem, alongside related technologies such as Pig, Spark, Sqoop, Flume, and Kafka.
Apache Hive Cookbook, by Shrey Mehrotra and co-authors, offers easy, hands-on recipes to help you understand Hive.
The output of the first statement is shown in the following screenshot. The highlighted statement shows that there are no reducers used while processing this query.
The total time taken by this query is 40 seconds. In this case, the property hive. is in effect. Now, let us run the set hive. command. The output of the second command is shown next:
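The behavior described above can be sketched with standard Hive settings; the exact property truncated in the text is not recoverable, so the names below (hive.auto.convert.join, and the orders/sales tables) are illustrative assumptions:

```sql
-- Standard Hive property that lets the optimizer convert a common
-- join into a map-side join (no reduce stage); illustrative sketch
SET hive.auto.convert.join=true;

-- Hypothetical tables; EXPLAIN shows whether the plan uses a map join
EXPLAIN
SELECT o.id, s.name
FROM orders o
JOIN sales s ON o.id = s.id;
```

If the plan contains only map stages, no reducers run, which matches the highlighted output described above.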
Joins and Join Optimization

There are a few restrictions while using a map-side join.
The following are not supported: In this case, the join is automatically converted into a bucket map join or a bucket sort merge map join, which are discussed later in this chapter.
So use the set hive. command accordingly.

Using a bucket map join

In this recipe, you will learn how to use a bucket map join in Hive.
A bucket map join is used when the tables are large and all the tables used in the join are bucketed on the join columns. In this type of join, one table should have buckets in multiples of the number of buckets in another table. For example, if one table has 2 buckets, then the other table must have either 2 buckets or a multiple of 2 buckets (2, 4, 6, and so on). If the preceding condition is satisfied, then the join can be done at the mapper side only; otherwise, a normal inner join is performed.
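The bucket-count rule above can be sketched with hypothetical tables (the names and bucket counts are illustrative, not from the book):

```sql
-- Both tables are bucketed on the join column (id), and the bucket
-- counts (2 and 4) satisfy the multiple-of requirement described above
CREATE TABLE orders (id INT, amount DOUBLE)
CLUSTERED BY (id) INTO 2 BUCKETS;

CREATE TABLE sales (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
```

With this layout, each mapper can fetch only the matching buckets of the smaller table rather than the whole table.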
This means that only the required buckets are fetched on the mapper side and not the complete table. That is, only the matching buckets of all small tables are replicated onto each mapper.
Doing this improves the efficiency of the query drastically. In a bucket map join, data is not sorted. Hive does not support a bucket map join by default; the following property needs to be set to true for the query to work as a bucket map join. In this type of join, not only must the tables be bucketed, but the data must also be bucketed while it is inserted.
For this, the following property needs to be set before inserting the data. The general syntax for a bucket map join is as follows: only the matching buckets are replicated onto each mapper. The second statement works in the same manner as the first one; the only difference is that in the preceding statement there is a join on more than two tables.

Using a bucket sort merge map join

In this recipe, you will learn how to use a bucket sort merge map join in Hive. A bucket sort merge map join is an advanced version of a bucket map join.
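The properties referenced above typically correspond, in standard Hive, to hive.enforce.bucketing (so inserts honor the declared buckets) and hive.optimize.bucketmapjoin (to enable the join). Treat this as an illustrative sketch with hypothetical table names; some of these settings vary by Hive version:

```sql
-- Make INSERT statements write data into the declared buckets
-- (needed on older Hive versions; always on in Hive 2.x+)
SET hive.enforce.bucketing=true;
-- Enable the bucket map join optimization
SET hive.optimize.bucketmapjoin=true;

-- Hypothetical bucketed tables joined on their bucketing column
SELECT /*+ MAPJOIN(s) */ o.id, o.amount, s.name
FROM orders o
JOIN sales s ON o.id = s.id;
```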
If the data in the tables is sorted and bucketed on the join columns at the same time then a bucket sort merge map join comes into the picture. In this type of join, all the tables must have an equal number of buckets as each mapper will read a bucket from each table and will perform a bucket sort merge map join.
It is mandatory for the data to be sorted in this join condition. The following parameter needs to be set to true for sorting the data, or the data can be sorted manually: set hive. If the data in the buckets is not sorted, then there is a possibility that a wrong result or output is generated, as Hive does not check whether the buckets are sorted or not. The following parameters need to be set: BucketizedHiveInputFormat; set hive.
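In standard Hive, the truncated settings above usually correspond to the following properties (an illustrative sketch; verify the names against your Hive version):

```sql
-- Sort rows within each bucket on insert
-- (needed on older Hive versions; always on in Hive 2.x+)
SET hive.enforce.sorting=true;
-- Enable the sort-merge variant of the bucket map join
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
-- Read input bucket by bucket, as the text above references
SET hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
```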
The general syntax for a bucket sort merge map join is the same as for a bucket map join.

Using a skew join

In this recipe, you will learn how to use a skew join in Hive.
A skew join is used when there is a table with skewed data in the join column. A skewed table is a table in which some values of a column appear in much larger numbers than the other values. The skewed data is stored in a separate file, while the rest of the data is stored in another file.
If there is a need to perform a join on a column of a table in which certain values appear very frequently, all the data for those values will go to a single reducer, which becomes a bottleneck while performing the join.
To avoid this bottleneck, a skew join is used. The following parameter needs to be set for a skew join:

How to do it...

Run the following command to use a skew join in Hive. There is a join that needs to be performed on the ID column that is present in both tables.
The Sales table has a column ID that is highly skewed on the value 10; that is, the value 10 for the ID column appears in large numbers compared to the other values of the same column. The skewed keys in Sales are read and processed by the mapper only, and are not sent to the reducer.
So these rows can be loaded into memory.
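In standard Hive, a skew join is typically enabled with the properties below; the threshold value and the orders table are illustrative assumptions, while the Sales table and its skewed ID value follow the text above:

```sql
-- Enable the skew join optimization
SET hive.optimize.skewjoin=true;
-- Keys with more rows than this threshold are treated as skewed
SET hive.skewjoin.key=100000;

-- Join on the skewed ID column; skewed keys (such as 10) are handled
-- in a follow-up map-side job instead of a single overloaded reducer
SELECT o.id, o.amount, s.name
FROM orders o
JOIN Sales s ON o.id = s.id;
```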
What this book will do for you: teach the different features and offerings of the latest Hive, and explain the workings and structure of Hive. Hive was developed by Facebook and later open sourced in the Apache community.
The sample chapter, Joins and Join Optimization, gives an insight into Hive internals and covers inner joins, right outer joins, full outer joins, and cross joins, with example outputs shown as screenshots.

Style and approach: starting with the basics and covering the core concepts with practical usage, this book is a complete guide to learning and exploring Hive's offerings.
About the authors: Hanish Bansal is a software engineer with over 4 years of experience in developing big data applications.
He loves to study emerging solutions and applications, mainly related to big data processing, NoSQL, natural language processing, and neural networks. He was also the technical reviewer of the book Apache Zookeeper Essentials. In his spare time, he loves to travel and listen to music. Saurabh Chauhan is a module lead with close to 8 years of experience in data warehousing and big data applications.

The book also teaches how to query, process, and analyze data using Hive.
One of the best parts of this book is that it helps you install and configure Hive in your environment, with any of the supported types of Hive metastore, and it also covers configuring Hive clients and services. It depicts how Hive queries get converted into MapReduce jobs internally, along with other operations. For beginners, this can be an ideal book for starting with Apache Hive from scratch; however, before reading it, we recommend some basic knowledge of SQL for a better understanding of Hive.

Along with Hive partitions and Hive bucketing, the book explains the different Hive optimization techniques, and a highlight is its coverage of integrating Hive with other frameworks, including Spark. Following a practical approach to coding in Hive, it helps you write your first lines of Hive code and explains how that code gets converted into MapReduce programs internally.