
Pitfalls When Connecting Python to Hadoop Data (A Summary)

Updated: 2020-04-14 11:03:26  Author: wx0628

This article walks through the pitfalls of connecting Python to Hadoop data, with detailed example code. It should be a useful reference for study or work.

Recently I set out to use Python + Hadoop + Pandas for some in-depth analysis and machine-learning work. (As my learning progressed, I now plan to build the later working environment on Python + Spark + Hadoop, but that is a story for another time.)
The first prerequisite is getting Python to talk to Hadoop. I expected this to be easy, but I ran into one pitfall after another and lost a full half day. Later I saw other people asking about the same problems online, yet almost no post actually solves them. So here I collect the pitfalls I hit while connecting Python to a Hadoop database and share them, hoping to save you that time.

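(A note: solving these pitfalls meant reading through countless posts online. The last fix came from a single sentence tucked away in the corner of a thread on Git, without which I would have been sunk. There were too many posts to list them all here.)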

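First, the choice of components: I use impala (the impyla package, imported as impala) with Python 3.7 to connect to the Hadoop database. If your setup is different, there is no need to spend your time reading further.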

The code being run is as follows:

import impala.dbapi as ipdb

conn = ipdb.connect(host="192.168.XX.XXX", port=10000, user="xxx", password="xxxxxx",
                    database="xxx", auth_mechanism='PLAIN')
cursor = conn.cursor()
# xxxx is the table name; it is masked to keep company information private,
# so substitute the name of a table in your own database
cursor.execute('SELECT * FROM xxxx')
print(cursor.description)  # prints the result set's schema
for row_data in cursor.fetchall():
    print(row_data)
conn.close()
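Since the stated goal is analysis with Pandas, it can help to pair each row with its column names. The following is a minimal sketch, assuming only that the cursor follows the Python DB-API (PEP 249), where the first item of each cursor.description entry is the column name. The helper name rows_as_dicts and the demo table are hypothetical, and sqlite3 is used purely as a stand-in so the sketch runs without a cluster.

```python
import sqlite3


def rows_as_dicts(cursor):
    """Pair each fetched row with the column names from cursor.description."""
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# sqlite3 is only a stand-in here; the demo table and its rows are made up.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE demo (id INTEGER, name TEXT)")
cursor.executemany("INSERT INTO demo VALUES (?, ?)", [(1, "a"), (2, "b")])
cursor.execute("SELECT id, name FROM demo ORDER BY id")
records = rows_as_dicts(cursor)
print(records)
conn.close()
```

A list of dictionaries like this can be handed to pandas.DataFrame directly if Pandas is installed.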

Pitfall 1: a syntax error is reported

Symptom:

/Users/wangxxin/miniconda3/bin/python3.7 /Users/wangxxin/Documents/Python/PythonDataAnalyze/project/knDt/pyHiveTest.py
Traceback (most recent call last):
File "/Users/wangxxin/Documents/Python/PythonDataAnalyze/project/knDt/pyHiveTest.py", line 1, in <module>
    import impala.dbapi as ipdb
File "/Users/wangxxin/miniconda3/lib/python3.7/site-packages/impala/dbapi.py", line 28, in <module>
    import impala.hiveserver2 as hs2
File "/Users/wangxxin/miniconda3/lib/python3.7/site-packages/impala/hiveserver2.py", line 340
    async=True)

Fix: rename every async parameter to async_ (any name works, as long as it is used consistently and is not a keyword). Cause: async has been a reserved keyword since Python 3.7, so using it as a parameter name is a syntax error. The places to change should include the following:

# hiveserver2.py, around line 338
op = self.session.execute(self._last_operation_string,
                 configuration,
                 async_=True)
# hiveserver2.py, around line 1022
def execute(self, statement, configuration=None, async_=False):
  req = TExecuteStatementReq(sessionHandle=self.handle,
                statement=statement,
                confOverlay=configuration,
                runAsync=async_)
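To confirm on your own interpreter why the rename is needed, a small check like the one below can be used. It is a sketch only: the helper name is_valid_keyword_argument is hypothetical, and the comments assume Python 3.7 or later.

```python
import keyword


def is_valid_keyword_argument(name):
    """Return True if a call such as f(name=True) compiles on this interpreter."""
    try:
        compile("f({}=True)".format(name), "<check>", "exec")
    except SyntaxError:
        return False
    return True


# On Python 3.7+ "async" is a reserved keyword, while "async_" is an ordinary name.
print(keyword.iskeyword("async"))
print(is_valid_keyword_argument("async"))
print(is_valid_keyword_argument("async_"))
```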

Pitfall 2: the bundled Parser.py is faulty and raises an error when it is loaded

Fix:

# Original code adjusted following advice found online
elif url_scheme in ('c', 'd', 'e', 'f'):
  with open(path) as fh:
    data = fh.read()
elif url_scheme in ('http', 'https'):
  data = urlopen(path).read()
else:
  raise ThriftParserError('ThriftPy does not support generating module '
              'with path in protocol \'{}\''.format(
                url_scheme))
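The single-letter schemes in the patch look like Windows drive letters. The post does not state the cause, so treat the following as an assumption: if the parser obtains url_scheme from urllib.parse.urlparse, a Windows path such as C:\... is reported with the scheme "c". The sketch below illustrates that behaviour; the paths and the helper name scheme_of are hypothetical.

```python
from urllib.parse import urlparse


def scheme_of(path):
    """Return the URL scheme that urlparse extracts from a path or URL."""
    return urlparse(path).scheme


# Hypothetical example paths, not taken from the original post.
windows_path = r"C:\thrift\hive.thrift"
posix_path = "/opt/thrift/hive.thrift"

print(repr(scheme_of(windows_path)))  # the drive letter is read as a scheme
print(repr(scheme_of(posix_path)))    # a plain POSIX path has an empty scheme
print(repr(scheme_of("https://example.com/hive.thrift")))
```

Under that assumption, the extra branch only matters when the .thrift file sits on a drive named in the tuple; a path on another drive letter would still fall through to the error branch.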

I recommend applying the fixes for Pitfalls 1 and 2 straight away; both changes are definitely required.

Pitfall 3: with the two problems above fixed, running again produces the following error:

TProtocolException: TProtocolException(type=4)

Fix:

The cause is that the connect call was missing the parameter auth_mechanism='PLAIN'. Change it as follows:

import impala.dbapi as ipdb
conn = ipdb.connect(host="192.168.XX.XXX", port=10000, user="xxx", password="xxxxxx",
                    database="xxx", auth_mechanism='PLAIN')

Pitfall 4: with problem 3 fixed, running the program again raises yet another error:

AttributeError: 'TSocket' object has no attribute 'isOpen'

Fix:

The cause is that the installed thrift-sasl version is too new (0.3.0), so downgrade thrift-sasl to 0.2.1:

pip uninstall thrift-sasl
pip install thrift-sasl==0.2.1

Pitfall 5: with that fixed, running again raises yet another error (by this point you may feel close to giving up, but hang in there; you are actually very near the end):

thriftpy.transport.TTransportException: TTransportException(type=1, message="Could not start SASL: b'Error in sasl_client_start (-4) SASL(-4): no mechanism available: Unable to find a callback: 2'")

Fix: this was the most troublesome one, and currently the hardest to find a solution for. The answer came from this comment:

I solved the issue, had to uninstall the package SASL and install PURE-SASL, when impyla can´t find the sasl package it works with pure-sasl and then everything goes well.

The main cause is a conflict between sasl and pure-sasl. In this situation, simply uninstalling the sasl package is enough.

pip uninstall SASL
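Before and after uninstalling, it can be useful to see which of the two packages the interpreter can actually find. The sketch below assumes the import names are sasl for the sasl package and puresasl for pure-sasl; the helper name is_importable is hypothetical.

```python
from importlib.util import find_spec


def is_importable(module_name):
    """Return True if a top-level module can be located, without importing it."""
    return find_spec(module_name) is not None


# Assumed import names: "sasl" for the sasl package, "puresasl" for pure-sasl.
for module_name in ("sasl", "puresasl"):
    status = "installed" if is_importable(module_name) else "not installed"
    print(module_name, status)
```

For the setup described here, the desired state is that sasl is reported as not installed and puresasl as installed.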

Pitfall 6: after that, running again may still raise an error:

TypeError: can't concat str to bytes

The last frame of the traceback points to line 94 of __init__.py (the part highlighted in the original post):

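# Following advice found online, handle a str body: encode it to bytes first,
# so that the length written into the header is counted in bytes
if isinstance(body, str):
  body = body.encode()
header = struct.pack(">BI", status, len(body))
self._trans.write(header + body)
self._trans.flush()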
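The error and the patch can be reproduced in isolation. This is a minimal sketch rather than the library's code: frame_message is a hypothetical helper, the status value 1 and the body "PLAIN" are made-up inputs, and the body is encoded before the header is packed so that the length field counts bytes.

```python
import struct


def frame_message(status, body):
    """Build a frame: 1-byte status, 4-byte big-endian length, then the body."""
    if isinstance(body, str):
        body = body.encode()
    header = struct.pack(">BI", status, len(body))
    return header + body


# Without the encode step, concatenating the bytes header with a str fails.
try:
    struct.pack(">BI", 1, 5) + "PLAIN"
except TypeError as exc:
    print("unpatched:", exc)

print("patched:", frame_message(1, "PLAIN"))
```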

After the steps above, you should be able to connect to the Hive database and query data without further problems.

To sum up: connecting to a Hadoop database depends on several packages, so check yours carefully against the list below. Ideally the related packages should match exactly, no more and no fewer, which really does avoid many problems.

No.	Package	Version	Install command
1	pure_sasl	0.5.1	pip install pure_sasl==0.5.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
2	thrift	0.9.3	pip install thrift==0.9.3 -i https://pypi.tuna.tsinghua.edu.cn/simple
3	bitarray	0.8.3	pip install bitarray==0.8.3 -i https://pypi.tuna.tsinghua.edu.cn/simple
4	thrift_sasl	0.2.1	pip install thrift_sasl==0.2.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
5	thriftpy	0.3.9	pip install thriftpy==0.3.9 -i https://pypi.tuna.tsinghua.edu.cn/simple
6	impyla	0.14.1	pip install impyla==0.14.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
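The versions in the table can be compared with what is actually installed. The following is a sketch under stated assumptions: it uses importlib.metadata, which needs Python 3.8 or later (newer than the 3.7 used in this article), the hyphenated distribution names are assumed to resolve to the packages in the table, and the names PINNED, installed_version and find_mismatches are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the table above.
PINNED = {
    "pure-sasl": "0.5.1",
    "thrift": "0.9.3",
    "bitarray": "0.8.3",
    "thrift-sasl": "0.2.1",
    "thriftpy": "0.3.9",
    "impyla": "0.14.1",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose version differs to a (wanted, found) pair."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


for name, (wanted, found) in find_mismatches(PINNED).items():
    print("{}: wanted {}, found {}".format(name, wanted, found))
```

The lookup parameter exists only so the comparison can be exercised with a fake lookup instead of the real environment.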

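I suggest installing in this order. I had dependency problems along the way and in the end installed everything through conda.
Installing thriftpy, thrift_sasl and impyla raised errors; since I had conda available, I used conda install directly, which downloads the required dependencies automatically, as shown below (for the reference of readers without a conda environment).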