A Detailed Guide to Using the ORC Storage Format for Hive Tables

Updated: 2023-07-04 11:43:04  Author: longshenlmj

This article covers the ORC storage format for Hive tables in detail. Hive is a data warehouse tool built on Hadoop that maps structured data files to tables and provides SQL-like queries. Readers who need it can use this as a reference.

Source file storage formats for Hive tables:

1. TEXTFILE

The default format: a table created without an explicit format uses it. When data is loaded, the data file is copied to HDFS as-is, with no processing. The source file can be viewed directly with hadoop fs -cat.
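As a minimal sketch of the behavior described above, the statements below create a TEXTFILE table, load a local file into it, and read the stored file back as plain text. The table name test_text, the local file /tmp/sample.txt, and the warehouse path are assumptions made for illustration; the actual path is whatever LOCATION reports in SHOW CREATE TABLE.

```sql
-- Hypothetical table and file names, for illustration only.
CREATE TABLE IF NOT EXISTS test_text (str STRING)
STORED AS TEXTFILE;

-- LOAD DATA copies the local file into the table directory unchanged.
LOAD DATA LOCAL INPATH '/tmp/sample.txt' INTO TABLE test_text;

-- From the Hive CLI, the stored file can be read back as plain text.
-- The warehouse path is an assumption; check LOCATION in SHOW CREATE TABLE.
dfs -cat /user/hive/warehouse/test_text/sample.txt;
```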

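2. SEQUENCEFILE

A binary file format provided by the Hadoop API. It is easy to use, splittable, and compressible. SEQUENCEFILE serializes data to the file as <key,value> pairs, and serialization and deserialization go through Hadoop's standard Writable interface. The key is left empty and the value holds the actual data, which avoids the sort step in the map phase.

There are three compression options: NONE, RECORD, and BLOCK. RECORD gives a low compression ratio, so BLOCK is generally recommended. Set the following parameters when using it: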

SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK
create table test2(str STRING) STORED AS SEQUENCEFILE;

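3. RCFILE

A storage layout that combines row and column storage. First, it partitions the data into blocks by row, which guarantees that a whole record sits in one block and avoids reading several blocks to fetch a single record. Second, the data inside each block is stored by column, which helps compression and fast column access. In theory this gives high query efficiency (although the Hive project itself reportedly says the gain is not significant and that only about 10% of storage space is saved, so the format is of limited use and can be skipped). RCFile combines the fast queries of row storage with the space savings of column storage:

1) All data of a row is located on the same node, so the cost of tuple reconstruction is low;

2) Columns are stored together inside each block, so data can be compressed per column and unneeded columns can be skipped when reading.

During a query, columns that are not needed are skipped at the I/O level. In practice, the map phase still copies the entire data block from the remote node to a local directory, and columns are not skipped directly; skipping works by scanning the header of each row group.

However, the header at the level of the whole HDFS block does not record at which row group each column starts and ends. As a result, when all columns are read, RCFile performs worse than SequenceFile.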
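The column-skipping behavior described above is easiest to see with a query that projects a single column. The following is a minimal sketch, assuming a hypothetical table test_rc with three columns; it illustrates the access pattern only and makes no claim about actual speedup.

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE IF NOT EXISTS test_rc (
  id      BIGINT,
  name    STRING,
  payload STRING
)
STORED AS RCFILE;

-- Only `id` is projected, so within each row group the reader
-- can skip the `name` and `payload` column data.
SELECT id FROM test_rc;

-- All columns are read here; this is the case in which the text
-- above says RCFile does no better than SequenceFile.
SELECT id, name, payload FROM test_rc;
```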

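4. ORC

A newer format introduced by Hive; it is an upgraded version of RCFILE.

5. Custom formats

When a user's data file format is not recognized by the current Hive, a custom input and output format can be defined by implementing InputFormat and OutputFormat.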
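Once such classes exist, they are attached to a table in the STORED AS clause. The sketch below assumes two hypothetical classes, com.example.hive.CustomInputFormat and com.example.hive.CustomOutputFormat, packaged in a hypothetical jar; none of these names come from the original article.

```sql
-- The jar path and the com.example.* classes are hypothetical placeholders.
ADD JAR /path/to/custom-format.jar;

-- Point the table at the custom classes instead of a built-in format name.
CREATE TABLE IF NOT EXISTS test_custom (str STRING)
STORED AS
  INPUTFORMAT  'com.example.hive.CustomInputFormat'
  OUTPUTFORMAT 'com.example.hive.CustomOutputFormat';
```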

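Note:

Only TEXTFILE tables can load plain-text data files directly. Both LOAD DATA from a local file and an external table pointed at a directory of such files must use a TEXTFILE table, because loading moves the file into place without converting it. Going one step further, compressed files that Hive supports by default (the compression formats Hadoop supports by default) can also only be read directly through a TEXTFILE table, not through the other formats. To get such data into another format, load it into a TEXTFILE table and then INSERT it into the target table.

In other words, SequenceFile and RCFile tables cannot load this data directly. The data must first be loaded into a TEXTFILE table and then moved into the SequenceFile or RCFile table with INSERT ... SELECT ... FROM. The source files of SequenceFile and RCFile tables cannot be viewed directly; use SELECT in Hive to inspect them.

An RCFile source file can be viewed with hive --service rcfilecat /xxxxxxxxxxxxxxxxxxxxxxxxxxx/000000_0, but the output is formatted differently and is hard to read.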
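The two-step load described here can be sketched as follows. The staging table stage_text and the file /tmp/sample.txt are hypothetical names chosen for illustration; test2 mirrors the SEQUENCEFILE table defined earlier in the article.

```sql
-- Step 1: load the raw text file into a TEXTFILE staging table.
-- stage_text and /tmp/sample.txt are hypothetical names.
CREATE TABLE IF NOT EXISTS stage_text (str STRING)
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/tmp/sample.txt' INTO TABLE stage_text;

-- Step 2: let Hive rewrite the rows into the target format.
CREATE TABLE IF NOT EXISTS test2 (str STRING)
STORED AS SEQUENCEFILE;
INSERT OVERWRITE TABLE test2
SELECT str FROM stage_text;

-- The binary files are not readable with cat, so inspect them through Hive.
SELECT str FROM test2 LIMIT 10;
```

The same pattern applies to an RCFILE or ORC target: only the STORED AS clause of the target table changes.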

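ORC format

ORC is an upgraded version of RCFile with substantially better performance. Data can be stored compressed; the compression ratio is comparable to LZO, and it can save up to 70% of the space compared with a text file. Read performance is also very high, which makes efficient queries possible. For details, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC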
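Compression for an ORC table is chosen per table through the orc.compress table property, which the page linked above documents with the values NONE, ZLIB, and SNAPPY (ZLIB being the default). A minimal sketch, assuming a hypothetical table name test_orc_snappy:

```sql
-- test_orc_snappy is a hypothetical table name.
CREATE TABLE IF NOT EXISTS test_orc_snappy (
  advertiser_id STRING,
  cnt           BIGINT
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");
```

To see which compression a written file actually uses, the same page describes the hive --orcfiledump utility, which takes the HDFS path of an ORC file and prints its metadata.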

The CREATE TABLE statements are shown below. Each one also changes the NULL representation of the ORC table from the default \N to ''.

Method 1:

create table if not exists test_orc(
  advertiser_id string,
  ad_plan_id string,
  cnt BIGINT
) partitioned by (day string, type TINYINT COMMENT '0 as bid, 1 as win, 2 as ck', hour TINYINT)
STORED AS ORC;
alter table test_orc set serdeproperties('serialization.null.format' = '');

Check the result:

hive> show create table test_orc;
CREATE  TABLE `test_orc`(
  `advertiser_id` string,
  `ad_plan_id` string,
  `cnt` bigint)
PARTITIONED BY (
  `day` string,
  `type` tinyint COMMENT '0 as bid, 1 as win, 2 as ck',
  `hour` tinyint)
ROW FORMAT DELIMITED
  NULL DEFINED AS ''
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://namenode/hivedata/warehouse/pmp.db/test_orc'
TBLPROPERTIES (
  'last_modified_by'='pmp_bi',
  'last_modified_time'='1465992624',
  'transient_lastDdlTime'='1465992624')

Method 2:

drop table test_orc;
create table if not exists test_orc(
  advertiser_id string,
  ad_plan_id string,
  cnt BIGINT
) partitioned by (day string, type TINYINT COMMENT '0 as bid, 1 as win, 2 as ck', hour TINYINT)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
with serdeproperties('serialization.null.format' = '')
STORED AS ORC;

Check the result:

hive> show create table test_orc;
CREATE  TABLE `test_orc`(
  `advertiser_id` string, 
  `ad_plan_id` string, 
  `cnt` bigint)
PARTITIONED BY ( 
  `day` string, 
  `type` tinyint COMMENT '0 as bid, 1 as win, 2 as ck', 
  `hour` tinyint)
ROW FORMAT DELIMITED 
  NULL DEFINED AS '' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://namenode/hivedata/warehouse/pmp.db/test_orc'
TBLPROPERTIES (
  'transient_lastDdlTime'='1465992726')

Method 3:

drop table test_orc;
create table if not exists test_orc(
  advertiser_id string,
  ad_plan_id string,
  cnt BIGINT
) partitioned by (day string, type TINYINT COMMENT '0 as bid, 1 as win, 2 as ck', hour TINYINT)
ROW FORMAT DELIMITED 
  NULL DEFINED AS '' 
STORED AS ORC;

Check the result:

hive> show create table test_orc;
CREATE  TABLE `test_orc`(
  `advertiser_id` string, 
  `ad_plan_id` string, 
  `cnt` bigint)
PARTITIONED BY ( 
  `day` string, 
  `type` tinyint COMMENT '0 as bid, 1 as win, 2 as ck', 
  `hour` tinyint)
ROW FORMAT DELIMITED 
  NULL DEFINED AS '' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://namenode/hivedata/warehouse/pmp.db/test_orc'
TBLPROPERTIES (
  'transient_lastDdlTime'='1465992916')
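With the table created by any of the methods above, data is written through INSERT rather than LOAD. The sketch below assumes a hypothetical TEXTFILE staging table test_orc_stage holding the same three data columns, and uses made-up partition values. The partition column names are backtick-quoted as a precaution, since day, type, and hour are also Hive keywords.

```sql
-- test_orc_stage and the partition values are hypothetical.
INSERT OVERWRITE TABLE test_orc
PARTITION (`day` = '2016-06-15', `type` = 0, `hour` = 12)
SELECT advertiser_id, ad_plan_id, cnt
FROM test_orc_stage;

-- Read the partition back to inspect the rows, including any NULL values.
SELECT advertiser_id, ad_plan_id, cnt
FROM test_orc
WHERE `day` = '2016-06-15' AND `type` = 0 AND `hour` = 12
LIMIT 10;
```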
