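Scraping Baidu Translate with Python to Translate Between Chinese and English
I signed up for an introductory Python course next semester,
so I have spent the winter break teaching myself. I can't afford to fail it, and it is also an easy way to pick up credits.
Recently, on a whim, I decided to try scraping Baidu Translate.
After grinding away for a whole day, I finally got it working.
Without further ado, let's get started (environment: Python 3.8, PyCharm Community Edition 2021.3.1).
Basic steps
Baidu Translate detects crawlers, so the requests have to be disguised with headers.
Taking the Chrome browser as an example:
Right-click on the Baidu Translate page and choose "Inspect" (or just press F12).
The developer tools panel appears.
Select Network, then Fetch/XHR, then Headers.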

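Now we can see the request headers we need.
The ones we want are Cookie and User-Agent, which tell the site that a specific user is opening it through a browser.
In other words, they disguise the crawler.
Then we just copy them into PyCharm.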
headers = {"User-Agent": "<your User-Agent>", "Cookie": "<your Cookie>"}
# Replace the placeholders with the User-Agent and Cookie you obtained
Submitting the form
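With the disguise in place, the crawler needs to submit a form to the site.
Before submitting, though, we need to see which data has to be sent.
Keep inspecting the site.
We can see a block of form data:
from: en
to: zh  # from English to Chinese
query: ant  # look up the word "ant"
transtype: realtime  # possibly means real-time lookup?
simple_means_flag: 3
sign: 210056.513977
token: 97c823341ff704dea2625870404fcec4  # sign and token are the key fields Baidu Translate uses for verification
domain: common
This is the data we need to submit.
But the form we submit is dynamic, so data has to be rewritten,
that is: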
data = {
    "from": "en",
    "to": "zh",
    "query": custom_input,
    "transtype": "translang",
    "simple_means_flag": "3",
    "token": '97c823341ff704dea2625870404fcec4'
}
Getting the response and processing the result
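After submitting the data, we need to receive the page's response,
so let's see where the returned translation is.
It turns out that what we get differs a little from what we wanted.
The result is there, but it shows up as Unicode escape sequences rather than Chinese characters.
There is always a way, though.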
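import re
import requests

response = requests.post(url='https://fanyi.baidu.com/v2transapi', headers=headers, timeout=1, data=data)
response.encoding = 'utf-8'
print(response.status_code)  # print the status code
# Decode the \u escapes, then use a regular expression to find the first run of Chinese characters
print(re.search("[\\u4e00-\\u9fa5]+", response.content.decode('unicode_escape'), flags=re.S)[0])
This way the printed output is in Chinese.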
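The `\uXXXX` sequences in the response are ordinary JSON string escapes, so a JSON parser can also turn them back into characters. A minimal sketch, using a made-up sample string rather than a real Baidu response:

```python
import json

# Hypothetical sample: a JSON string literal containing \u escapes,
# similar in form to what appears in the raw response body.
raw = r'"\u82f9\u679c"'

decoded = json.loads(raw)  # the JSON parser resolves the \u escapes
print(decoded)
```

With requests, `response.json()` applies the same parsing to the whole response body.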
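Quite a surprise
That looked about ready to submit!
So I excitedly went off to submit the data,
and Baidu Translate gave me a slap in the face and a hard lesson.
Please choose a translation mode
[1] English to Chinese
[2] Chinese to English
1
Enter the English text to translate
apple
200
未知錯誤
Process finished with exit code 0
What happened? The translation of apple should be 蘋果, not 未知錯誤 ("unknown error").
Then I noticed that the data above was missing sign.
sign is computed from the query, so it differs from word to word, but for a given word it is fixed.
Luckily the internet is full of experts, and I found the algorithm for sign.
If you are interested, you can read up on how the sign algorithm was obtained.
Finally I added sign to the data, and it worked!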
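The two translation directions differ only in the from and to fields, so the form can be assembled by one small function. This is a sketch rather than the article's code: `build_payload` and its parameter names are chosen for illustration, and the token and sign are passed in by the caller instead of being computed here. The sample values are the ones captured in the browser for the word "ant".

```python
def build_payload(query: str, src: str, dst: str, token: str, sign: str) -> dict:
    """Assemble the form fields for one translation request."""
    return {
        "from": src,
        "to": dst,
        "query": query,
        "transtype": "translang",
        "simple_means_flag": "3",
        "sign": sign,    # depends on the query text
        "token": token,  # copied from the browser session
    }


payload = build_payload(
    "ant", "en", "zh",
    token="97c823341ff704dea2625870404fcec4",
    sign="210056.513977",
)
print(payload["from"], payload["to"], payload["sign"])
```

Swapping the second and third arguments gives the Chinese-to-English form, with a sign computed for the new query.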
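Suppressing the warning
However, a Warning shows up:
Please choose a translation mode
[1] English to Chinese
[2] Chinese to English
1
Enter the English text to translate
apple
200
蘋果
F:/Python/New/main.py:40: DeprecationWarning: invalid escape sequence '\/'
print(re.search("[\\u4e00-\\u9fa5]+", response.content.decode('unicode_escape'), flags=re.S)[0])
Process finished with exit code 0
A warning appears below the translation result, which does not look nice.
So I looked for a fix and added this: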
import warnings
warnings.filterwarnings("ignore", category=Warning)  # silence all warnings, including the deprecation warning
After that the warning no longer appears.
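Filtering on `Warning` silences every warning in the program, not just this one. A narrower variant, sketched here under the assumption that only the `unicode_escape` decoding step triggers the message, limits the filter to `DeprecationWarning` and to that single call. The helper name `decode_quietly` is made up for this example.

```python
import warnings


def decode_quietly(raw: bytes) -> str:
    """Decode \\u escapes while ignoring DeprecationWarning for this call only."""
    with warnings.catch_warnings():
        # The filter is restored automatically when the with-block exits.
        warnings.simplefilter("ignore", category=DeprecationWarning)
        return raw.decode("unicode_escape")


print(decode_quietly(b"\\u82f9\\u679c"))
```

Warnings raised anywhere else in the script still show up as usual.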
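At this point the English-to-Chinese feature is more or less done.
Chinese-to-English works in basically the same way, but the response contains a lot more, so this statement can be used to filter it: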
print(re.findall(pattern='[a-zA-Z]+', string=response.content.decode('unicode_escape'), flags=re.S)[4])
That's about it.
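Picking the fifth alphabetic match depends on the exact layout of the response and breaks if that layout shifts. If the body is parsed as JSON, the translation can be looked up by key instead. The sketch below assumes the result sits under `trans_result`, then `data`, then the first item's `dst` field. That layout is an assumption here, not something the article confirms, and the sample dictionaries are invented for illustration.

```python
def extract_translation(payload: dict) -> str:
    """Return the translated text, or an empty string if the assumed keys are missing."""
    try:
        return payload["trans_result"]["data"][0]["dst"]
    except (KeyError, IndexError, TypeError):
        return ""


# Hypothetical parsed responses, shaped like the assumed layout.
sample_ok = {"trans_result": {"data": [{"src": "\u82f9\u679c", "dst": "apple"}]}}
sample_error = {"error": 1}

print(extract_translation(sample_ok))
print(repr(extract_translation(sample_error)))
```

Returning an empty string on a missing key keeps an error response from crashing the script, at the cost of hiding which key was absent.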
Full code:
main.py
import requests
from sign import sign
import re
import warnings
warnings.filterwarnings("ignore", category=Warning)  # silence all warnings, including the deprecation warning
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/66.0.3359.139 Safari/537.36",
           # Replace the placeholder with the Cookie copied from your own browser session
           "Cookie": "<your Cookie>"}
if __name__ == '__main__':
    print("Please choose a translation mode")
    choose = int(input("[1] English to Chinese\n[2] Chinese to English\n"))
    while choose != 1 and choose != 2:
        print("Invalid choice! Please try again")
        choose = int(input("[1] English to Chinese\n[2] Chinese to English\n"))
    data = {}
    if choose == 1:
        custom_input = input('Enter the English text to translate\n')
        data = {
            "from": "en",
            "to": "zh",
            "query": custom_input,
            "transtype": "translang",
            "simple_means_flag": "3",
            "token": '97c823341ff704dea2625870404fcec4',
            "sign": sign(custom_input)
        }
        response = requests.post(url='https://fanyi.baidu.com/v2transapi', headers=headers, timeout=1, data=data)
        response.encoding = 'utf-8'
        # Print the first run of Chinese characters in the decoded response
        print(re.search("[\\u4e00-\\u9fa5]+", response.content.decode('unicode_escape'), flags=re.S)[0])
    elif choose == 2:
        custom_input = input('Enter the Chinese text to translate into English\n')
        data = {
            "from": "zh",
            "to": "en",
            "query": custom_input,
            "transtype": "translang",
            "simple_means_flag": "3",
            "token": '97c823341ff704dea2625870404fcec4',
            "sign": sign(custom_input)
        }
        response = requests.post(url='https://fanyi.baidu.com/v2transapi', headers=headers, timeout=1, data=data)
        response.encoding = 'utf-8'
        # Print the fifth alphabetic match in the decoded response
        print(re.findall(pattern='[a-zA-Z]+', string=response.content.decode('unicode_escape'), flags=re.S)[4])
sign.py
import js2py
import requests
import re
def sign(word):
    session = requests.Session()
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36"}
    session.headers = headers
    # Fetch the home page and pull the gtk value out of the embedded script
    response = session.get("http://fanyi.baidu.com/")
    gtk = re.findall(";window.gtk = ('.*?');", response.content.decode())[0]
    context = js2py.EvalJs()
    js = r'''
function a(r) {
if (Array.isArray(r)) {
for (var o = 0, t = Array(r.length); o < r.length; o++)
t[o] = r[o];
return t
}
return Array.from(r)
}
function n(r, o) {
for (var t = 0; t < o.length - 2; t += 3) {
var a = o.charAt(t + 2);
a = a >= "a" ? a.charCodeAt(0) - 87 : Number(a),
a = "+" === o.charAt(t + 1) ? r >>> a : r << a,
r = "+" === o.charAt(t) ? r + a & 4294967295 : r ^ a
}
return r
}
function e(r) {
var o = r.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g);
if (null === o) {
var t = r.length;
t > 30 && (r = "" + r.substr(0, 10) + r.substr(Math.floor(t / 2) - 5, 10) + r.substr(-10, 10))
} else {
for (var e = r.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/), C = 0, h = e.length, f = []; h > C; C++)
"" !== e[C] && f.push.apply(f, a(e[C].split(""))),
C !== h - 1 && f.push(o[C]);
var g = f.length;
g > 30 && (r = f.slice(0, 10).join("") + f.slice(Math.floor(g / 2) - 5, Math.floor(g / 2) + 5).join("") + f.slice(-10).join(""))
}
var u = void 0
, l = "" + String.fromCharCode(103) + String.fromCharCode(116) + String.fromCharCode(107);
u = 'null !== i ? i : (i = window[l] || "") || ""';
for (var d = u.split("."), m = Number(d[0]) || 0, s = Number(d[1]) || 0, S = [], c = 0, v = 0; v < r.length; v++) {
var A = r.charCodeAt(v);
128 > A ? S[c++] = A : (2048 > A ? S[c++] = A >> 6 | 192 : (55296 === (64512 & A) && v + 1 < r.length && 56320 === (64512 & r.charCodeAt(v + 1)) ? (A = 65536 + ((1023 & A) << 10) + (1023 & r.charCodeAt(++v)),
S[c++] = A >> 18 | 240,
S[c++] = A >> 12 & 63 | 128) : S[c++] = A >> 12 | 224,
S[c++] = A >> 6 & 63 | 128),
S[c++] = 63 & A | 128)
}
for (var p = m, F = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(97) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(54)), D = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(51) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(98)) + ("" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(102)), b = 0; b < S.length; b++)
p += S[b],
p = n(p, F);
return p = n(p, D),
p ^= s,
0 > p && (p = (2147483647 & p) + 2147483648),
p %= 1e6,
p.toString() + "." + (p ^ m)
}
'''
    # Substitute the quoted gtk value for the placeholder expression in the script
    js = js.replace('\'null !== i ? i : (i = window[l] || "") || ""\'', gtk)
    # Execute the JavaScript
    context.execute(js)
    # Call the function to get the sign
    sign = context.e(word)
    return sign


