
A small Python script for URL decoding and parsing Chinese text (python url decoder)

Updated: 2013-08-11 13:40:20  Author:

This article presents Python code for URL decoding and for recovering Chinese text. Readers who need it may find it a useful reference.

 
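#! python
# -*- coding: utf8 -*-
# Python 2: decode the UTF-8 byte string to unicode, re-encode it as GBK,
# then turn the \xNN escapes in its repr into %NN percent-escapes.
print(repr("測試報警，xxxx是大豬頭".decode("UTF8").encode("GBK")).replace("\\x","%"))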

Note that the codec passed to the first decode("UTF8") must match the encoding declared at the top of the file.
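For readers on Python 3, the same idea can be sketched with urllib.parse.quote, which accepts an encoding argument. This is only a rough equivalent under that assumption, not the original script; the helper names percent_encode and percent_decode are made up for illustration.

```python
from urllib.parse import quote, unquote


def percent_encode(text, encoding="gbk"):
    """Encode `text` with `encoding`, then percent-escape the resulting bytes.

    Unreserved ASCII characters (letters, digits, "_.-~") are left as-is,
    much like the repr-based one-liner leaves printable ASCII untouched.
    """
    # safe="" asks quote() to escape "/" as well
    return quote(text, safe="", encoding=encoding)


def percent_decode(escaped, encoding="gbk"):
    """Reverse of percent_encode: undo the escapes and decode the bytes."""
    return unquote(escaped, encoding=encoding)


if __name__ == "__main__":
    escaped = percent_encode("測試報警，xxxx是大豬頭")
    print(escaped)
    print(percent_decode(escaped))
```

The encoding passed to percent_decode has to be the one used for encoding, which mirrors the note above about keeping decode("UTF8") in step with the file's declared encoding.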

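I first ran into this problem in a JavaScript puzzle game, where one level gave the following hint:

The first few levels are all very, very easy~~ This level is just a simple string transformation...

It was followed by a very long string beginning with %5Cu4e0b%5Cu4e00%5Cu5173%5Cu7684.
I had often seen this kind of thing in the browser address bar, but never knew how to turn it into something readable.
After some googling, I combined Python's URL decoding with its unicode-escape decoding and solved it as follows: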


# Python 2
import urllib

escaped_str="%5Cu4e0b%5Cu4e00%5Cu5173%5Cu7684%5Cu9875%5Cu9762%5Cu540d%5Cu5b57%5Cu662f%5Cx20%5Cx69%5Cx32%5Cx6a%5Cx62%5Cx6a%5Cx33%5Cx69%5Cx34%5Cx62%5Cx62%5Cx35%5Cx34%5Cx62%5Cx35%5Cx32%5Cx69%5Cx62%5Cx33%5Cx2e%5Cx68%5Cx74%5Cx6d"
# unquote turns %5C into a backslash; unicode-escape then interprets \uXXXX and \xXX
print urllib.unquote(escaped_str).decode('unicode-escape')
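The two steps (percent-decoding, then interpreting the backslash escapes) can be sketched in Python 3 as follows. The function name decode_js_escaped is hypothetical, and the sketch assumes the text contains only ASCII once the percent-escapes are removed, as is the case for the puzzle string.

```python
from urllib.parse import unquote


def decode_js_escaped(escaped):
    """Turn a string such as "%5Cu4e0b%5Cx69" into readable text."""
    # Step 1: undo the percent-encoding, so "%5Cu4e0b" becomes the six
    # literal characters \u4e0b
    literal = unquote(escaped)
    # Step 2: the unicode_escape codec works on bytes, so encode to ASCII
    # first (assumption: nothing non-ASCII is left after step 1), then let
    # the codec interpret the \uXXXX and \xXX escapes
    return literal.encode("ascii").decode("unicode_escape")


if __name__ == "__main__":
    print(decode_js_escaped("%5Cu4e0b%5Cu4e00%5Cu5173%5Cx20%5Cx69%5Cx32"))
```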

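Recently I became interested in the Chinese terms in the gfwlist used by Firefox's AutoProxy add-on (those of you who have used a proxy will know what I mean). However, those URLs are all URL-encoded, for example http://zh.wikipedia.org/wiki/%E9%97%A8, so the URL-encoded Chinese characters have to be extracted with a regular expression. I wrote the following small script: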


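# Python 2
import re
import urllib

# A URL-encoded Chinese character looks like a percent sign plus two hex
# digits, repeated three times, so look for runs of three or more groups.
pattern = re.compile(r"((?:%[0-9A-Fa-f]{2}){3,})")

with open("listfile", "r") as f:
    for url_str in f:
        # findall returns a (possibly empty) list, never None,
        # so no separate "match found" check is needed
        for trans in pattern.findall(url_str):
            # convert each extracted run back into Chinese text
            print urllib.unquote(trans),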
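As a hedged Python 3 counterpart, the extraction step might look like the sketch below. The names ENCODED_RUN and extract_encoded_text are illustrative only, and the sketch assumes the runs are UTF-8 unless another encoding is passed in; undecodable bytes are replaced rather than raising.

```python
import re
from urllib.parse import unquote

# Runs of three or more "%XX" groups (percent sign plus two hex digits)
ENCODED_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2}){3,}")


def extract_encoded_text(url, encoding="utf-8"):
    """Return the decoded text of every percent-encoded run found in `url`."""
    return [
        unquote(run, encoding=encoding, errors="replace")
        for run in ENCODED_RUN.findall(url)
    ]


if __name__ == "__main__":
    print(extract_encoded_text("http://zh.wikipedia.org/wiki/%E9%97%A8"))
```

Because the pattern has no capturing group, findall returns the whole matched run as a plain string, which avoids indexing into tuples.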

This script still has shortcomings, however: some of the Chinese characters in the list file are not decoded properly, as in the following lines of test code:


import urllib 
a="http://zh.wikipedia.org/wiki/%BD%F0%B6"
b="http://zh.wikipedia.org/wiki/%E9%97%A8"
de=urllib.unquote 
print de(a),de(b) 

The output shows that the former decodes correctly while the latter does not. My guess is that this has something to do with Big5 encoding. If anyone knows a fix, please let me know~

The following is a supplement:

de(a).decode("gbk","ignore")
de(b).decode("utf8","ignore")

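This gives you the unicode form of these strings.

The unquote you are using is not a decoder; you still need to do the necessary decode and encode yourself. I have always used utf8 as my default environment, and I suspect you are using gbk, which is why decoding the latter failed on your side. Guessing encodings is tiresome. It would be nice if everyone used utf8, but some people are used to gb.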
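To make the distinction concrete, here is a small Python 3 sketch that keeps the two steps apart: percent-decoding to raw bytes, then decoding those bytes with a charset the caller has to know or guess. The helper name url_to_text is an assumption for illustration.

```python
from urllib.parse import unquote_to_bytes


def url_to_text(escaped, charset, errors="ignore"):
    """Percent-decode `escaped`, then decode the bytes with `charset`."""
    # Step 1: percent-decoding alone only yields raw bytes
    raw = unquote_to_bytes(escaped)
    # Step 2: turning bytes into text needs a charset; with
    # errors="ignore", undecodable bytes are silently dropped
    return raw.decode(charset, errors)


if __name__ == "__main__":
    print(url_to_text("%E9%97%A8", "utf-8"))
    print(url_to_text("%BD%F0", "gbk"))
```

Whether a given run is GBK or UTF-8 cannot be read off the percent-escapes themselves, which is why the charset is an explicit argument here.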

http://yac163.svn.sourceforge.net/viewvc/yac163/trunk/yac163-nox/Pic.py?revision=198&view=markup

See lines #102-147 of this very old code of mine, and add (…,"ignore") to every decode and encode call.


# Python 2: byte string -> unicode, trying `charset` first when given.
def strdecode(string, charset=None):
    if isinstance(string, unicode):
        return string
    if charset:
        try:
            return string.decode(charset)
        except UnicodeDecodeError:
            return _strdecode(string)
    else:
        return _strdecode(string)

# Fall back through utf8, gb2312, gbk and gb18030 in turn.
def _strdecode(string):
    try:
        return string.decode('utf8')
    except UnicodeDecodeError:
        try:
            return string.decode('gb2312')
        except UnicodeDecodeError:
            try:
                return string.decode('gbk')
            except UnicodeDecodeError:
                return string.decode('gb18030')

# Python 2: unicode -> byte string, trying `charset` first when given.
def strencode(string, charset=None):
    if isinstance(string, str):
        return string
    if charset:
        try:
            return string.encode(charset)
        except UnicodeEncodeError:
            return _strencode(string)
    else:
        return _strencode(string)

# Same fallback order as _strdecode, for encoding.
def _strencode(string):
    try:
        return string.encode('utf8')
    except UnicodeEncodeError:
        try:
            return string.encode('gb2312')
        except UnicodeEncodeError:
            try:
                return string.encode('gbk')
            except UnicodeEncodeError:
                return string.encode('gb18030')
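The nested try/except chain above can also be written as a loop over candidate charsets. The following is a Python 3 sketch of that fallback idea, not a port of the original file; the name decode_with_fallback and its default charset order (taken from the code above) are assumptions.

```python
def decode_with_fallback(data, charsets=("utf-8", "gb2312", "gbk", "gb18030")):
    """Decode `data` (bytes) with the first charset that accepts it."""
    if isinstance(data, str):
        # Already text, nothing to decode
        return data
    last_error = None
    for charset in charsets:
        try:
            return data.decode(charset)
        except UnicodeDecodeError as exc:
            # Remember the failure and try the next candidate
            last_error = exc
    if last_error is None:
        raise ValueError("no charsets given")
    # Every candidate failed: re-raise the last decoding error
    raise last_error


if __name__ == "__main__":
    print(decode_with_fallback("门".encode("utf-8")))
    print(decode_with_fallback("金".encode("gbk")))
```

The order of candidates matters: some byte sequences are valid in more than one charset, so the first one that happens to succeed is not guaranteed to be the one the author intended.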
