A Study of the ORC Design

Preface

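While working on storage governance, we found that switching Hive tables to the ORC storage format reduced storage usage by as much as 50x in the best case. How is that possible? This time we open that door by studying the design of ORC.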

Background

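ORC stands for Optimized Row Columnar. It was created to overcome the limitations of RCFile, and using ORC as the Hive storage format can significantly improve the efficiency of reading, writing, and processing large datasets. There are currently three major versions, of which ORC v2 is still being designed and developed.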

  • ORC v0 was released in Hive 0.11.
  • ORC v1 was released in Hive 0.12 and ORC 1.x.
  • ORC v2 is a work in progress and is rapidly evolving.

RCFile has limitations because it treats each column as a binary blob without semantics.

Improvements

  • a single file as the output of each task, which reduces the NameNode's load

  • Hive type support including datetime, decimal, and the complex types (struct, list, map, and union)

  • light-weight indexes stored within the file

  • skip row groups that don't pass predicate filtering (pruning)

  • seek to a given row

  • block-mode compression based on data type (block compression encoding)

  • run-length encoding for integer columns

  • dictionary encoding for string columns

  • concurrent reads of the same file using separate RecordReaders

  • ability to split files without scanning for markers

  • bound the amount of memory needed for reading or writing

  • metadata stored using Protocol Buffers, which allows addition and removal of fields
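
To make two of the items above concrete, here is a minimal sketch of run-length encoding for an integer column and dictionary encoding for a string column. It is a toy illustration only: the function names and sample data are hypothetical, and ORC's actual on-disk encodings are more elaborate than this.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse runs of equal integers into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


def dict_encode(strings):
    """Replace each string with an index into a list of distinct values."""
    dictionary = []   # distinct strings, in first-seen order
    positions = {}    # string -> its index in `dictionary`
    ids = []
    for s in strings:
        if s not in positions:
            positions[s] = len(dictionary)
            dictionary.append(s)
        ids.append(positions[s])
    return dictionary, ids


def dict_decode(dictionary, ids):
    """Map dictionary indexes back to the original strings."""
    return [dictionary[i] for i in ids]


# Hypothetical sample columns.
status_codes = [1, 1, 1, 1, 2, 2, 3]
runs = rle_encode(status_codes)         # [(1, 4), (2, 2), (3, 1)]

cities = ["bj", "sh", "bj", "bj", "sh"]
dictionary, ids = dict_encode(cities)   # ["bj", "sh"], [0, 1, 0, 0, 1]
```

Both techniques pay off when a column has few distinct values or long runs, which is common in fact tables and is one plausible reason a columnar layout compresses much better than a row-oriented one.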

File Structure

This diagram illustrates the ORC file structure:

[Figure: ORC file structure]

Stripe Structure

As shown in the diagram, each stripe in an ORC file holds index data, row data, and a stripe footer.

The stripe footer contains a directory of stream locations.

Row data is used in table scans.

Index data includes min and max values for each column and the row positions within each column. (A bit field or bloom filter could also be included.) Row index entries provide offsets that enable seeking to the right compression block and byte within a decompressed block. Note that ORC indexes are used only for the selection of stripes and row groups and not for answering queries.
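
The role of the min/max index can be sketched as follows. This is a simplified model, not ORC's reader: the class and function names are hypothetical, the column is a plain Python list, and a stride of 10 rows is chosen only for readability. It shows that the index merely decides which row groups to read; the predicate is still evaluated on the rows that are actually read.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    start_row: int   # first row covered by this index entry
    row_count: int   # number of rows in the group
    min_value: int
    max_value: int


def build_index(column, stride):
    """Record min/max for every `stride` rows of one column."""
    index = []
    for start in range(0, len(column), stride):
        chunk = column[start:start + stride]
        index.append(RowGroupStats(start, len(chunk), min(chunk), max(chunk)))
    return index


def groups_to_read(index, lo, hi):
    """Keep only row groups whose [min, max] range overlaps [lo, hi]."""
    return [g for g in index if g.max_value >= lo and g.min_value <= hi]


def scan(column, index, lo, hi):
    """Return matching (row number, value) pairs, reading only surviving groups."""
    hits = []
    for g in groups_to_read(index, lo, hi):
        rows = column[g.start_row:g.start_row + g.row_count]
        for offset, value in enumerate(rows):
            if lo <= value <= hi:   # the filter itself still runs on real rows
                hits.append((g.start_row + offset, value))
    return hits


# Hypothetical column of 100 sorted values, indexed every 10 rows.
column = list(range(100))
index = build_index(column, stride=10)
survivors = groups_to_read(index, 42, 45)   # only the group starting at row 40
matches = scan(column, index, 42, 45)
```

In this example nine of the ten row groups are skipped without being read. The benefit depends heavily on how the data is ordered: if the filtered column is randomly distributed, most row groups will overlap the predicate range and little can be skipped.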

For reference, the following command exports an ORC file dump as JSON; a sample of the exported output is linked after it.

/usr/local/service/hive/bin/hive --orcfiledump -j -p /apps/hive/warehouse/adm.db/adm_usr_base_info_df/pt=20210422/000279_0 > orcfile.json

https://oss.dataown.cn/data/2021/4/190cce2683fda30c.json

Default Storage Format Parameters

[Figure: default ORC storage format parameters]

Query Tests

[Figure: query test results]

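The figure above shows query performance tests run by Hortonworks, which went on to optimize ORC further in its HDP releases. In addition, the gains are larger when most of our queries follow these patterns: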

  1. Frequently filtering a large fact table on specific columns with deterministic predicates
  2. Selecting only the necessary fields
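
Why these two query patterns benefit can be sketched with a toy column store. The table, column names, and helper functions below are hypothetical and stand in for a real ORC reader; the point is only that a query filtering on one column and selecting one other column never touches the remaining columns.

```python
def to_columns(rows):
    """Pivot a list of row dicts (all with the same keys) into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def select_where(columns, wanted, filter_col, predicate):
    """Read only `filter_col` and the `wanted` columns; report which were touched."""
    keep = [i for i, v in enumerate(columns[filter_col]) if predicate(v)]
    touched = {filter_col, *wanted}
    result = [{name: columns[name][i] for name in wanted} for i in keep]
    return result, touched


# Hypothetical fact table with four columns.
rows = [
    {"user_id": 1, "city": "bj", "age": 30, "note": "a"},
    {"user_id": 2, "city": "sh", "age": 25, "note": "b"},
    {"user_id": 3, "city": "sh", "age": 41, "note": "c"},
]
columns = to_columns(rows)

# Roughly: SELECT user_id FROM t WHERE city = 'sh'
result, touched = select_where(columns, ["user_id"], "city", lambda c: c == "sh")
```

Here only two of the four columns are read. In a row-oriented layout the same query would have to read every field of every row, which is why `SELECT *` style queries gain much less from a columnar format.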

References

  1. https://orc.apache.org/docs/
  2. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
  3. https://blog.cloudera.com/orcfile-in-hdp-2-better-compression-better-performance/
  4. orc_proto.proto