ClickHouse Series: Integrating the Hive Data Warehouse, with Detailed Examples

Updated: 2022-10-13 14:55:44  Author: 阿杰筆記

This article walks through integrating ClickHouse with a Hive data warehouse, using worked examples for each supported file format.

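Preface

What is Hive? The Apache Hive data warehouse software makes it easy to read, write, and manage large datasets residing in distributed storage using SQL. Structure can be projected onto data that is already in storage. A command-line tool and a JDBC driver are provided to connect users to Hive.

The Hive engine lets you run SELECT queries against Hive tables stored on HDFS. The following input formats are currently supported:

Text: supports only simple scalar column types, except binary;
ORC: supports simple scalar column types except char; of the complex types, only array-like types are supported;
Parquet: supports all simple scalar column types; of the complex types, only array-like types are supported.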

Main text

Creating a Hive engine table: syntax and parameters

CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [ALIAS expr1],
    name2 [type2] [ALIAS expr2],
    ...
) ENGINE = Hive('thrift://host:port', 'database', 'table')
PARTITION BY expr

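The table structure may differ from that of the original Hive table:

Column names should match those in the original Hive table, but you may declare only some of the columns, in any order, and you may add alias columns computed from other columns.
Column types should match the column types in the original Hive table.
The PARTITION BY expression should be consistent with the original Hive table, and every column used in it must be part of the table structure.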
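The rules above can be illustrated with a small sketch. It assumes the Hive table test.test_orc from the ORC example later in this article; the table name test_orc_subset, the alias column f_int_doubled, and the Metastore address are hypothetical placeholders, not part of the original setup.

```sql
-- Sketch only: test_orc_subset and f_int_doubled are hypothetical names.
-- The engine points at the Hive table test.test_orc used in the ORC example.
CREATE TABLE test.test_orc_subset
(
    `f_string` String,                      -- columns may be declared in any order
    `f_int` Int32,                          -- only a subset of the Hive columns is listed
    `f_int_doubled` Int64 ALIAS f_int * 2,  -- alias column computed from another column
    `day` String                            -- the partition column must be in the structure
)
ENGINE = Hive('thrift://localhost:9083', 'test', 'test_orc')
PARTITION BY day;

-- Alias columns are not returned by SELECT *, so name them explicitly.
SELECT f_string, f_int, f_int_doubled
FROM test.test_orc_subset
WHERE day = '2021-09-18'
SETTINGS input_format_orc_allow_missing_columns = 1;
```

Declaring only the columns a query needs keeps the ClickHouse-side definition short, at the cost of maintaining one more table definition per view of the Hive data.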

Engine parameters:

thrift://host:port: the Hive Metastore address.
database: the remote database name.
table: the remote table name.

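Hands-on examples

Enable the local cache for the remote file system. According to the official benchmarks, queries run almost twice as fast with the cache. Before using the cache, add the following to config.xml: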

<local_cache_for_remote_fs>
    <enable>true</enable>
    <root_dir>local_cache</root_dir>
    <limit_size>559096952</limit_size>
    <bytes_read_before_flush>1048576</bytes_read_before_flush>
</local_cache_for_remote_fs>

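Parameter details:

enable: if true, ClickHouse maintains a local cache for the remote file system (HDFS) after startup.
root_dir: required. The root directory in which local cache files for the remote file system are stored.
limit_size: required. The maximum total size (in bytes) of the local cache files.
bytes_read_before_flush: the number of bytes read while downloading a file from the remote file system before they are flushed to the local file system. The default value is 1 MB.

Even when ClickHouse is started with the local cache for the remote file system enabled, an individual query can still bypass the cache by setting use_local_cache_for_remote_fs = 0. According to the official ClickHouse documentation, use_local_cache_for_remote_fs defaults to 1 (enabled), which is why opting out requires setting it to 0 explicitly.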
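As a minimal sketch of the per-query opt-out, the query below bypasses the local cache for a single statement. It assumes the test.test_orc table created in the ORC example in this article and reuses the ORC setting from that example.

```sql
-- Bypass the local cache for this query only; other queries keep using it.
SELECT count()
FROM test.test_orc
WHERE day = '2021-09-18'
SETTINGS use_local_cache_for_remote_fs = 0,
         input_format_orc_allow_missing_columns = 1;
```

Running the same query with and without the setting is a simple way to check how much the cache helps on your own data.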

ORC data format

Creating the ORC table in Hive

CREATE TABLE `test`.`test_orc`(  
`f_tinyint` tinyint,  
`f_smallint` smallint,  
`f_int` int,  
`f_integer` int,  
`f_bigint` bigint,  
`f_float` float,  
`f_double` double,  
`f_decimal` decimal(10,0),  
`f_timestamp` timestamp,  
`f_date` date,  
`f_string` string,  
`f_varchar` varchar(100),  
`f_bool` boolean,  
`f_binary` binary,  
`f_array_int` array<int>,  
`f_array_string` array<string>,  
`f_array_float` array<float>,  
`f_array_array_int` array<array<int>>,  
`f_array_array_string` array<array<string>>,  
`f_array_array_float` array<array<float>>)  
PARTITIONED BY (  
`day` string)  
ROW FORMAT SERDE  
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'  
STORED AS INPUTFORMAT  
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'  
OUTPUTFORMAT  
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'  
LOCATION  
'hdfs://testcluster/data/hive/test.db/test_orc'

-- test_orc has no f_char column (ORC input does not support char), so only two string literals precede the boolean.
insert into test.test_orc partition(day='2021-09-18') select 1, 2, 3, 4, 5, 6.11, 7.22, 8.333, current_timestamp(), current_date(), 'hello world', 'hello world', true, 'hello world', array(1, 2, 3), array('hello world', 'hello world'), array(float(1.1), float(1.2)), array(array(1, 2), array(3, 4)), array(array('a', 'b'), array('c', 'd')), array(array(float(1.11), float(2.22)), array(float(3.33), float(4.44)));

Creating the Hive engine table in ClickHouse

CREATE TABLE test.test_orc
(
    `f_tinyint` Int8,
    `f_smallint` Int16,
    `f_int` Int32,
    `f_integer` Int32,
    `f_bigint` Int64,
    `f_float` Float32,
    `f_double` Float64,
    `f_decimal` Float64,
    `f_timestamp` DateTime,
    `f_date` Date,
    `f_string` String,
    `f_varchar` String,
    `f_bool` Bool,
    `f_binary` String,
    `f_array_int` Array(Int32),
    `f_array_string` Array(String),
    `f_array_float` Array(Float32),
    `f_array_array_int` Array(Array(Int32)),
    `f_array_array_string` Array(Array(String)),
    `f_array_array_float` Array(Array(Float32)),
    `day` String
)
ENGINE = Hive('thrift://202.168.117.26:9083', 'test', 'test_orc')
PARTITION BY day

Querying the Hive data from ClickHouse

SELECT * FROM test.test_orc settings input_format_orc_allow_missing_columns = 1\G

Parquet data format

Creating the Parquet table in Hive

CREATE TABLE `test`.`test_parquet`(  
`f_tinyint` tinyint,  
`f_smallint` smallint,  
`f_int` int,  
`f_integer` int,  
`f_bigint` bigint,  
`f_float` float,  
`f_double` double,  
`f_decimal` decimal(10,0),  
`f_timestamp` timestamp,  
`f_date` date,  
`f_string` string,  
`f_varchar` varchar(100),  
`f_char` char(100),  
`f_bool` boolean,  
`f_binary` binary,  
`f_array_int` array<int>,  
`f_array_string` array<string>,  
`f_array_float` array<float>,  
`f_array_array_int` array<array<int>>,  
`f_array_array_string` array<array<string>>,  
`f_array_array_float` array<array<float>>)  
PARTITIONED BY (  
`day` string)  
ROW FORMAT SERDE  
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'  
STORED AS INPUTFORMAT  
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'  
OUTPUTFORMAT  
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'  
LOCATION  
'hdfs://testcluster/data/hive/test.db/test_parquet'

insert into test.test_parquet partition(day='2021-09-18') select 1, 2, 3, 4, 5, 6.11, 7.22, 8.333, current_timestamp(), current_date(), 'hello world', 'hello world', 'hello world', true, 'hello world', array(1, 2, 3), array('hello world', 'hello world'), array(float(1.1), float(1.2)), array(array(1, 2), array(3, 4)), array(array('a', 'b'), array('c', 'd')), array(array(float(1.11), float(2.22)), array(float(3.33), float(4.44)));

Creating the Hive engine table in ClickHouse

CREATE TABLE test.test_parquet  
(  
`f_tinyint` Int8,  
`f_smallint` Int16,  
`f_int` Int32,  
`f_integer` Int32,  
`f_bigint` Int64,  
`f_float` Float32,  
`f_double` Float64,  
`f_decimal` Float64,  
`f_timestamp` DateTime,  
`f_date` Date,  
`f_string` String,  
`f_varchar` String,  
`f_char` String,  
`f_bool` Bool,  
`f_binary` String,  
`f_array_int` Array(Int32),  
`f_array_string` Array(String),  
`f_array_float` Array(Float32),  
`f_array_array_int` Array(Array(Int32)),  
`f_array_array_string` Array(Array(String)),  
`f_array_array_float` Array(Array(Float32)),  
`day` String  
)  
ENGINE = Hive('thrift://localhost:9083', 'test', 'test_parquet')  
PARTITION BY day

Querying the Hive data from ClickHouse

SELECT * FROM test.test_parquet settings input_format_parquet_allow_missing_columns = 1\G

TextFile data format

Creating the TextFile table in Hive

CREATE TABLE `test`.`test_text`(  
`f_tinyint` tinyint,  
`f_smallint` smallint,  
`f_int` int,  
`f_integer` int,  
`f_bigint` bigint,  
`f_float` float,  
`f_double` double,  
`f_decimal` decimal(10,0),  
`f_timestamp` timestamp,  
`f_date` date,  
`f_string` string,  
`f_varchar` varchar(100),  
`f_char` char(100),  
`f_bool` boolean,  
`f_binary` binary,  
`f_array_int` array<int>,  
`f_array_string` array<string>,  
`f_array_float` array<float>,  
`f_array_array_int` array<array<int>>,  
`f_array_array_string` array<array<string>>,  
`f_array_array_float` array<array<float>>)  
PARTITIONED BY (  
`day` string)  
ROW FORMAT SERDE  
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  
STORED AS INPUTFORMAT  
'org.apache.hadoop.mapred.TextInputFormat'  
OUTPUTFORMAT  
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'  
LOCATION  
'hdfs://testcluster/data/hive/test.db/test_text'

insert into test.test_text partition(day='2021-09-18') select 1, 2, 3, 4, 5, 6.11, 7.22, 8.333, current_timestamp(), current_date(), 'hello world', 'hello world', 'hello world', true, 'hello world', array(1, 2, 3), array('hello world', 'hello world'), array(float(1.1), float(1.2)), array(array(1, 2), array(3, 4)), array(array('a', 'b'), array('c', 'd')), array(array(float(1.11), float(2.22)), array(float(3.33), float(4.44)));

Creating the Hive engine table in ClickHouse

CREATE TABLE test.test_text  
(  
`f_tinyint` Int8,  
`f_smallint` Int16,  
`f_int` Int32,  
`f_integer` Int32,  
`f_bigint` Int64,  
`f_float` Float32,  
`f_double` Float64,  
`f_decimal` Float64,  
`f_timestamp` DateTime,  
`f_date` Date,  
`f_string` String,  
`f_varchar` String,  
`f_char` String,  
`f_bool` Bool,  
`day` String  
)  
ENGINE = Hive('thrift://localhost:9083', 'test', 'test_text')  
PARTITION BY day

Querying the Hive data from ClickHouse

SELECT * FROM test.test_text settings input_format_skip_unknown_fields = 1, input_format_with_names_use_header = 1, date_time_input_format = 'best_effort'\G
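Of the settings in the query above, date_time_input_format = 'best_effort' switches text date-time parsing to a more lenient parser that accepts ISO 8601 variants in addition to the basic YYYY-MM-DD HH:MM:SS form. As a rough illustration only, the built-in function parseDateTimeBestEffort applies similarly lenient parsing and can be tried without any Hive table; the sample strings and the 'UTC' time zone are assumptions chosen for the sketch.

```sql
-- Illustration only: both strings parse with the lenient parser,
-- including the ISO 8601 form with a 'T' separator and 'Z' suffix.
SELECT
    parseDateTimeBestEffort('2021-09-18 12:34:56', 'UTC')  AS parsed_basic,
    parseDateTimeBestEffort('2021-09-18T12:34:56Z', 'UTC') AS parsed_iso8601;
```

This matters for TextFile tables because timestamps arrive as plain text, so their exact layout depends on how Hive serialized them.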

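Summary

This article covered integrating ClickHouse with a Hive data warehouse through the Hive table engine, which connects to the Hive Metastore over Thrift; pay attention to the engine parameters and what each one means. Enabling the local cache is recommended, since it makes queries considerably faster. The article also worked through hands-on examples for three data formats commonly used with Hive: ORC, Parquet, and TextFile.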
