
[hive] __HIVE_DEFAULT_PARTITION__ in dynamic partitions

by hs_seo 2017. 2. 2.

When Hive writes data with dynamic partitioning, any row whose dynamic-partition column value is NULL or an empty string is placed into a partition whose name is taken from the hive.exec.default.partition.name parameter.


The default name configured in hive-default.xml is __HIVE_DEFAULT_PARTITION__.
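The name can also be overridden per session with SET. A minimal sketch; the replacement name __UNKNOWN_COUNTRY__ is only an illustration, not a Hive default:

```sql
-- Dynamic partitioning must be enabled before the INSERT below will work
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Override the fallback partition name for this session (example value)
SET hive.exec.default.partition.name=__UNKNOWN_COUNTRY__;
```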


In the code below, if the country_code column contains an empty string or a NULL, a partition with the default name is created.


insert into table partition_sample partition (country_code)
select country,
       country_code
  from world_name;



show partitions partition_sample;
...
country_code=__HIVE_DEFAULT_PARTITION__
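Rows that fell into the default partition can be inspected or removed by referring to the partition by its literal name; a sketch assuming the partition_sample table above:

```sql
-- Inspect the rows whose country_code was NULL or empty
SELECT country
  FROM partition_sample
 WHERE country_code = '__HIVE_DEFAULT_PARTITION__';

-- Drop the partition once the bad rows have been handled
ALTER TABLE partition_sample
  DROP PARTITION (country_code='__HIVE_DEFAULT_PARTITION__');
```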


  • If the input column value is NULL or an empty string, the row will be put into a special partition whose name is controlled by the Hive parameter hive.exec.default.partition.name. The default value is __HIVE_DEFAULT_PARTITION__. Basically this partition will contain all "bad" rows whose values are not valid partition names. The caveat of this approach is that the bad value is lost and replaced by __HIVE_DEFAULT_PARTITION__ if you select it in Hive. JIRA HIVE-1309 is a proposal to let the user specify a "bad file" to retain the input partition column values as well.
  • Dynamic partition insert can be a resource hog in that it can generate a large number of partitions in a short time. To guard against this, three parameters are defined:
    • hive.exec.max.dynamic.partitions.pernode (default value 100) is the maximum number of dynamic partitions that can be created by each mapper or reducer. If one mapper or reducer creates more than the threshold, a fatal error is raised from the mapper/reducer (through a counter) and the whole job is killed.
    • hive.exec.max.dynamic.partitions (default value 1000) is the total number of dynamic partitions that can be created by one DML statement. If no individual mapper/reducer exceeds its limit but the total number of dynamic partitions does, an exception is raised at the end of the job, before the intermediate data are moved to the final destination.
    • hive.exec.max.created.files (default value 100000) is the maximum total number of files created by all mappers and reducers. This is implemented by each mapper/reducer updating a Hadoop counter whenever a new file is created. If the total exceeds hive.exec.max.created.files, a fatal error is thrown and the job is killed.
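If a legitimate job genuinely needs more partitions than the defaults allow, the three limits above can be raised per session; the values here are arbitrary examples, not recommendations:

```sql
-- Raise the dynamic-partition limits for this session only (example values)
SET hive.exec.max.dynamic.partitions.pernode=1000;
SET hive.exec.max.dynamic.partitions=10000;
SET hive.exec.max.created.files=500000;
```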



