Saturday, May 21, 2016

Partitioning and Bucketing in Hive

Partitioning data is often used for distributing load horizontally, this has performance benefit, and helps in organizing data in a logical fashion. Example like if we are dealing with large employee table and often run queries with WHERE clauses that restrict the results to a particular country or department . For a faster query response Hive table can be PARTITIONED BY (country STRING, DEPT STRING), Partitioning tables changes how Hive structures the data storage and Hive will now create subdirectories reflecting the partitioning structure like . .../employees/country=ABC/DEPT=XYZ. If query limits for employee from country ABC t will only scan the contents of one directory ABC. This can dramatically improve query performance, but only if the partitioning scheme reflects common filtering. Partitioning feature is very useful in Hive, however, a design that creates too many partitions may optimize some queries, but be detrimental for other important queries. Other drawback is having too many partitions is the large number of Hadoop files and directories that are created unnecessarily and overhead to NameNode since it must keep all metadata for the file system in memory.
Bucketing is another technique for decomposing data sets into more manageable parts. For example, suppose a table using the date as the top-level partition and the employee_id as the second-level partition leads to too many small partitions. Instead, if we bucket the employee table and use employee_id as the bucketing column, the value of this column will be hashed by a user-defined number into buckets. Records with the same employee_id will always be stored in the same bucket. Assuming the number of employee_id is much greater than the number of buckets, each bucket will have many employee_id. While creating table you can specify like CLUSTERED BY (employee_id) INTO XX BUCKETS ; where XX is the number of buckets . Bucketing has several advantages. The number of buckets is fixed so it does not fluctuate with data. If two tables are bucketed by employee_id, Hive can create a logically correct sampling. Bucketing also aids in doing efficient map-side joins etc.

The below given questions and answers will give a clear idea about partitioning and bucketing together.

Question : Suppose if we specify 32 buckets in the CLUSED BY clause and the CREATE TABLE statement also contains the Partitioning clause, how partitions and buckets will be managed together ? Does number of partitions will be limited to 32 ? OR for each partition, 32 buckets will be created ? Is every bucket an HDFS file ?

Answer : A hive table can have both partitioning and bucketing. Based on your partition clause , for each partition will have 32 buckets created. Yes HDFS file.

1 comment: