[hive] Limiting the number of partitions when querying a Hive table

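When you query a table in Hive and filter on partition columns in the WHERE clause, the Hive metastore looks up the table's partition information to determine which locations to read the data from.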

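At this point MetaStoreDirectSql.java fetches the partition information with a query like the following. Each partition name appears to be passed as one bind parameter in the IN list, so when the query covers a wide partition range and touches many partitions, the statement grows large and a buffer overflow error occurs.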

select PARTITIONS.PART_ID
  from PARTITIONS
  inner join TBLS on PARTITIONS.TBL_ID = TBLS.TBL_ID and TBLS.TBL_NAME = "table_name"
  inner join DBS on TBLS.DB_ID = DBS.DB_ID and DBS.NAME = "database_name"
  where PARTITIONS.PART_NAME in (?, ?)
;

When many partitions are requested, the following buffer error occurs.

java.nio.BufferOverflowException
        at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:189) ~[?:1.8.0_121]
        at java.nio.ByteBuffer.put(ByteBuffer.java:859) ~[?:1.8.0_121]
        at org.mariadb.jdbc.internal.packet.send.SendExecutePrepareStatementPacket.send(SendExecutePrepareStatementPacket.java:105) ~[mariadb-java-client-1.3.6.jar:?]

To prevent this problem, Hive can limit the number of partitions a query is allowed to use.

The following setting controls the maximum number of partitions that a query may scan per table.

hive.limit.query.max.table.partition

Default Value: -1
Added In: Hive 0.13.0 with HIVE-6492
Deprecated In: Hive 2.2.0 with HIVE-13884 (See hive.metastore.limit.partition.request.)
Removed In: Hive 3.0.0 with HIVE-17965

To protect the cluster, this controls how many partitions can be scanned for each partitioned table. The default value "-1" means no limit. The limit on partitions does not affect metadata-only queries.

hive> select yymmddval, count(*) 
    >   from p_table
    >  where yymmddval between 20180625 and 20190831 
    >  group by yymmddval 
    >  order by yymmddval;
FAILED: SemanticException Number of partitions scanned (=27) on table p_table exceeds limit (=15). This is controlled by hive.limit.query.max.table.partition.
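
The limit of 15 shown in the error above could have been applied with a session-level setting like the one below. This is only a minimal sketch: it assumes a Hive version in which the property still exists (0.13.0 up to 2.x, according to the documentation quoted above) and that the property may be changed per session in your environment. The value 15 is taken from the error message.

```sql
-- Assumption: Hive 0.13.0 to 2.x; the property was removed in Hive 3.0.0.
-- Allow at most 15 partitions to be scanned per partitioned table in a query.
set hive.limit.query.max.table.partition=15;

-- Restore the default value (-1 means no limit).
set hive.limit.query.max.table.partition=-1;
```

On Hive 2.2.0 and later, the documentation quoted above points to hive.metastore.limit.partition.request as the replacement for this setting.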
