How to write a pandas DataFrame to Hive

Updated: 2023-08-21 08:34:58 Author: taiguangxing

This article shows how to write a pandas DataFrame to a Hive table. If you spot a mistake or a case that has not been considered, corrections are welcome.

Writing a pandas DataFrame to a Hive table

The process has two main steps:

1. Convert the pandas DataFrame to a Spark DataFrame

This step uses Spark's built-in interface:

spark_df = spark.createDataFrame(pd_df)
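The conversion step can be wrapped in a small helper. The sketch below is illustrative only: `pandas_to_spark` and its `use_arrow` parameter are hypothetical names, and it assumes a Spark 3.x session, where the `spark.sql.execution.arrow.pyspark.enabled` setting controls Arrow-based transfer between pandas and Spark.

```python
def pandas_to_spark(spark, pd_df, use_arrow=True):
    """Convert a pandas DataFrame to a Spark DataFrame.

    `spark` is an existing SparkSession. The function name and the
    `use_arrow` flag are hypothetical, introduced for this illustration.
    """
    # Assumption: Spark 3.x. Arrow is an optional optimization that can
    # speed up the pandas -> Spark transfer for larger frames.
    spark.conf.set(
        "spark.sql.execution.arrow.pyspark.enabled",
        "true" if use_arrow else "false",
    )
    # This is the same built-in call used in the article.
    return spark.createDataFrame(pd_df)
```

For a frame of a few rows, as in the demo, the Arrow setting makes little difference and calling `spark.createDataFrame(pd_df)` directly is enough.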

2. Ways to write spark_df to Hive

spark_df.write.mode('overwrite').format("hive").saveAsTable("dbname.tablename")
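The argument passed to `mode()` decides what happens when the target table already holds data: `append` adds rows, `overwrite` replaces them, `error` / `errorifexists` raises, and `ignore` does nothing. A minimal sketch of a wrapper that rejects a mistyped mode before calling Spark is shown below; `save_to_hive` and `VALID_MODES` are hypothetical names, not part of PySpark.

```python
# Save modes accepted by DataFrameWriter.mode()
VALID_MODES = {"append", "overwrite", "error", "errorifexists", "ignore"}


def save_to_hive(spark_df, table, mode="overwrite"):
    """Write a Spark DataFrame to a Hive table with an explicit save mode.

    `save_to_hive` is a hypothetical helper for illustration; `table` is a
    name such as "dbname.tablename".
    """
    if mode not in VALID_MODES:
        raise ValueError(
            f"unknown save mode {mode!r}; expected one of {sorted(VALID_MODES)}"
        )
    # Same call chain as in the article, with the mode made configurable.
    spark_df.write.mode(mode).format("hive").saveAsTable(table)
```

Calling `save_to_hive(spark_df, "dbname.tablename", mode="append")` would then add the new rows to the existing table instead of replacing them.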

The complete code for a demo follows:

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

# 3 rows x 4 columns of random integers; the column list must name all 4 columns
pd_df = pd.DataFrame(np.random.randint(0, 10, (3, 4)), columns=['a', 'b', 'c', 'd'])
spark = SparkSession.builder.appName('pd_2_hive').master('local').enableHiveSupport().getOrCreate()
spark_df = spark.createDataFrame(pd_df)

# A Spark DataFrame can be written to Hive directly (the database dbname must already exist)
spark_df.write.mode('overwrite').format("hive").saveAsTable("dbname.tablename")
# 'overwrite' means new data replaces any data already in the table. The available modes are:
#   * `append`: Append contents of this DataFrame to existing data.
#   * `overwrite`: Overwrite existing data.
#   * `error` or `errorifexists`: Throw an exception if data already exists.
#   * `ignore`: Silently ignore this operation if data already exists.

# Alternatively, register spark_df as a temporary view and write it to Hive with SQL.
# CREATE TABLE ... AS SELECT fails if dbname.tablename already exists, so use this
# in place of the saveAsTable call above rather than after it.
spark_df.createOrReplaceTempView('tmp_table')
tmp_sql = '''create table dbname.tablename as select * from tmp_table'''
spark.sql(tmp_sql)
spark.stop()

This completes the process of writing a pandas DataFrame to a Hive table.

How do you save a DataFrame directly to a Hive table?

There are several ways to save a DataFrame to a Hive table:

1. Write the DataFrame's contents directly to the target Hive table

df.write().mode("overwrite").saveAsTable("tableName");
or
df.select(df.col("col1"), df.col("col2")).write().mode("overwrite").saveAsTable("schemaName.tableName");
or
df.write().mode(SaveMode.Overwrite).saveAsTable("dbName.tableName");

2. Register a temporary view, then insert into the target table with a SQL statement

df.createOrReplaceTempView("$tempTableName")
spark.sql("insert into table dbName.$hive_table_name PARTITION($partition_column) select * from $tempTableName")

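Note:

The second approach lets you specify the partition to write to. The temporary view is removed automatically when the job finishes, but it is better to drop it explicitly once it is no longer needed.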
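The second approach, including the explicit cleanup recommended in the note, can be sketched in PySpark as follows. This is an illustration under stated assumptions rather than a definitive implementation: `insert_via_temp_view` and the default view name `tmp_insert_view` are hypothetical, the target table is assumed to exist already and to be partitioned by `partition_column`, and depending on the cluster configuration a fully dynamic partition insert may also require Hive's dynamic partition settings to be enabled.

```python
def insert_via_temp_view(spark, df, target_table, partition_column,
                         view_name="tmp_insert_view"):
    """Insert `df` into a partitioned Hive table through a temporary view.

    All names here are hypothetical and used for illustration only.
    Assumption: with a dynamic partition spec, the partition column must be
    the last column returned by the SELECT.
    """
    df.createOrReplaceTempView(view_name)
    try:
        spark.sql(
            f"insert into table {target_table} "
            f"partition({partition_column}) select * from {view_name}"
        )
    finally:
        # Drop the view even if the insert fails, instead of waiting for
        # the session to end.
        spark.catalog.dropTempView(view_name)
```

The `try`/`finally` block keeps the cleanup next to the registration, so the view does not linger in a long-running session when the insert raises.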