python腳本爬取字體文件的實現(xiàn)方法

更新時間：2017年04月29日 15:44:36 作者：Myths

這篇文章主要給大家介紹了利用python腳本爬取字體文件的實現(xiàn)方法，文中分享了爬取兩個不同網(wǎng)站的示例代碼，相信對大家具有一定的參考價值，需要的朋友們下面來一起看看吧。

前言

大家應(yīng)該都有所體會，為了提高驗證碼的識別準確率，我們當然要首先得到足夠多的測試數(shù)據(jù)。驗證碼下載下來容易，但是需要人腦手工識別著實讓人受不了，于是我就想了個折衷的辦法——自己造驗證碼。

為了保證多樣性，首先當然需要不同的字模了，直接用類似ttf格式的字體文件即可，網(wǎng)上有很多ttf格式的字體包供我們下載。當然，我不會傻到手動下載解壓縮，果斷要寫個爬蟲了。

實現(xiàn)方法

網(wǎng)站一：fontsquirrel.com

這個網(wǎng)站的字體可以免費下載，但是有很多下載點都是外鏈連接到其他網(wǎng)站的，這部分得忽略掉。

#coding:utf-8
import urllib2,cookielib,sys,re,os,zipfile
import numpy as np
#網(wǎng)站登陸
cj=cookielib.CookieJar()
opener=urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders=[('User-agent','Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36))')]
urllib2.install_opener(opener)
#搜索可下載連接
def search(path):
 request=urllib2.Request(path)
 response=urllib2.urlopen(request)
 html=response.read()
 html=html.replace('\n',' ')#將所有的回車去掉，因為正則表達式是單行匹配。。。。。。
 urls=re.findall(r'<a href="(.*?)" rel="external nofollow" >(.*?)</a>',html)
 for i in urls:
  url,inner=i
  if not re.findall(r'Download ',inner)==[] and re.findall(r'offsite',inner)==[] and url not in items:
   items.append(url)
items=[]#保存下載地址
for i in xrange(15):
 host='http://www.fontsquirrel.com/fonts/list/find_fonts/'+str(i*50)+'?filter%5Bdownload%5D=local'
 search(host)
if not os.path.exists('ttf'):
 os.mkdir('ttf')
os.chdir('ttf')
def unzip(rawfile,outputdir):
 if zipfile.is_zipfile(rawfile):
  print 'yes'
  fz=zipfile.ZipFile(rawfile,'r')
  for files in fz.namelist():
   print(files) #打印zip歸檔中目錄
   fz.extract(files,outputdir)#解壓縮文件
 else:
  print 'no'
for i in items: 
 print i
 request=urllib2.Request('http://www.fontsquirrel.com'+i)
 response=urllib2.urlopen(request)
 html=response.read()
 name=i.split('/')[-1]+'.zip'
 f=open(name,'w')
 f.write(html)
 f.close()#文件記得關(guān)閉，否則下面unzip會出錯
 unzip(name,'./')
 os.remove(name)
os.listdir(os.getcwd())
os.chdir('../')
files=os.listdir('ttf/')
for i in files:#刪除無用文件
 if not (i.split('.')[-1]=='ttf' or i.split('.')[-1]=='otf'):
  if os.path.isdir(i):
   os.removedirs('ttf/'+i)
  else:
   os.remove('ttf/'+i)
print len(os.listdir('ttf/'))

搞到了2000+個字體，種類也挺多的，蠻好。

網(wǎng)站二：dafont.com

這個網(wǎng)站的字體花樣比較多，下載起來也比較方便，惡心的是他的文件名的編碼好像有點問題。

#coding:utf-8
import urllib2,cookielib,sys,re,os,zipfile
import shutil
import numpy as np
cj=cookielib.CookieJar()
opener=urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders=[('User-agent','Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36))')]
urllib2.install_opener(opener)
items=[]
def search(path):
 request=urllib2.Request(path)
 response=urllib2.urlopen(request)
 html=response.read()
 html=html.replace('\n',' ')
 urls=re.findall(r'href=\"(http://dl.dafont.com/dl/\?f=.*?)\" >',html)
 items.extend(urls)
for i in xrange(117):
 host='http://www.dafont.com/new.php?page='+str(i+1)
 search(host)
 print 'Page'+str(i+1)+'done'
 items=list(set(items))
 print len(items)
if not os.path.exists('ttf2'):
 os.mkdir('ttf2')
os.chdir('ttf2')
def unzip(rawfile,outputdir):
 if zipfile.is_zipfile(rawfile):
  print 'yes'
  fz=zipfile.ZipFile(rawfile,'r')
  for files in fz.namelist():
   print(files) #打印zip歸檔中目錄
   fz.extract(files,outputdir)
 else:
  print 'no'
for i in items: 
 print i
 request=urllib2.Request(i)
 response=urllib2.urlopen(request)
 html=response.read()
 name=i.split('=')[-1]+'.zip'
 f=open(name,'w')
 f.write(html)
 f.close()
 unzip(name,'./')
 os.remove(name)
print os.listdir(os.getcwd())
for root ,dire,fis in os.walk('./'):#遞歸遍歷文件夾
 for i in fis:
  if not (i.split('.')[-1]=='ttf' or i.split('.')[-1]=='otf'):
   os.remove(root+i)
   print i
for i in os.listdir('./'):
 if os.path.isdir(i):
  os.rmdir(i)
os.chdir('../')

總體操作跟之前的差不多，跑了幾十分鐘下了4000多的字體。

總結(jié)

以上就是這篇文章的全部內(nèi)容了，希望本文的內(nèi)容對大家學(xué)習(xí)或者使用python能帶來一定的幫助，如果有疑問大家可以留言交流，謝謝大家對腳本之家的支持。

您可能感興趣的文章: