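Simulating a Douban Login with a Python Crawler
Approach
1. Key points for logging in to Douban
Work out the real POST address and the form data it expects. Both can be found in the browser's developer tools (press F12), as shown in the figure below.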
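
To make the idea concrete, here is a minimal sketch (Python 3, standard library only) of how form data seen in the browser becomes the body of a login POST. The field names `form_email`, `form_password` and `captcha-solution` are the ones used by the spider in this article. The helper name `build_login_body` and the sample values are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlencode, parse_qs


def build_login_body(email, password, captcha_solution=None):
    """Return the URL-encoded body of a login form POST.

    Hypothetical helper: the field names follow the spider in this
    article, the captcha field is only sent when a captcha is shown.
    """
    data = {"form_email": email, "form_password": password}
    if captcha_solution is not None:
        data["captcha-solution"] = captcha_solution
    # urlencode escapes characters such as '@' and '&' in the values
    return urlencode(data)


if __name__ == "__main__":
    body = build_login_body("user@example.com", "secret", "abc")
    print(body)
    # parse_qs reverses the encoding, which is handy for checking the body
    print(parse_qs(body))
```

Comparing the output of such a helper with the form data shown in the developer tools is a quick way to check that no field is missing before wiring the request into a crawler.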

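Hands-on
- Goal: simulate a Douban login and handle the captcha. Reaching the personal home page counts as success.
- Data: nothing is scraped. This exercise is about practicing simulated login and captcha handling. If you need data, write the corresponding extraction rules and scrape from there.
A successful login is shown in the figure:

The main code of DouBan.py in the spiders folder is as follows: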
# -*- coding: utf-8 -*-
import re
import urllib  # only needed for the commented-out manual captcha path below

import scrapy
from scrapy.http import Request, FormRequest

import ruokuai


class DoubanSpider(scrapy.Spider):
    name = "DouBan"
    allowed_domains = ["douban.com"]
    # start_urls = ['http://douban.com/']
    # Headers used for the simulated login
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"}

    def start_requests(self):
        url = 'https://www.douban.com/accounts/login'
        # The cookiejar meta key takes an identifier so that several cookie
        # sessions can be used side by side; the 1 here is that identifier.
        return [Request(url=url, meta={"cookiejar": 1}, callback=self.parse)]

    def parse(self, response):
        # Link to the captcha image, if the login page shows one
        captcha = response.xpath('//*[@id="captcha_image"]/@src').extract()
        print(captcha)
        data = {
            "form_email": "weisuen007@163.com",
            "form_password": "weijc7789",
            # URL to redirect to after login; to crawl the personal page,
            # redirect there:
            # "redir": "https://www.douban.com/people/151968962/",
        }
        if len(captcha) > 0:
            # A captcha is present.
            # Option 1: type the captcha in by hand
            # urllib.urlretrieve(captcha[0], filename="C:/Users/pujinxiao/Desktop/learn/douban20170405/douban/douban/spiders/captcha.png")
            # captcha_value = raw_input('Open captcha.png and enter the captcha: ')
            # Option 2: the RuoKuai captcha-solving platform. The captcha is a
            # string of letters of arbitrary length, so the success rate is low.
            captcha_value = ruokuai.get_captcha(captcha[0])
            captcha_value = re.findall(r'<Result>(.*?)</Result>', captcha_value)[0]
            print('Captcha value: %s' % captcha_value)
            data["captcha-solution"] = captcha_value
        else:
            # No captcha this time
            print('No captcha')
        print('Logging in...')
        # Log in with FormRequest.from_response()
        return [
            FormRequest.from_response(
                response,
                meta={"cookiejar": response.meta["cookiejar"]},
                headers=self.header,
                formdata=data,
                callback=self.get_content,
            )
        ]

    def get_content(self, response):
        title = response.xpath('//title/text()').extract()[0]
        # Still on the login page ("登录豆瓣" = "Log in to Douban") means failure
        if u'登录豆瓣' in title:
            print('Login failed, please try again!')
        else:
            print('Login succeeded')
            # Further crawling can continue from here
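
The spider takes `re.findall(...)[0]` directly, which raises `IndexError` whenever the platform's reply contains no `<Result>` element. A more defensive variant might look like the sketch below (Python 3). The name `extract_result` is hypothetical, and the sample strings are made up for illustration; they are not real platform responses.

```python
import re

# Same pattern as in the spider, compiled once; re.S lets '.' span newlines
RESULT_RE = re.compile(r"<Result>(.*?)</Result>", re.S)


def extract_result(reply_text):
    """Return the text inside <Result>...</Result>, or None if it is absent."""
    match = RESULT_RE.search(reply_text)
    if match is None:
        return None
    return match.group(1).strip()


if __name__ == "__main__":
    # Made-up replies, used only to exercise both branches
    print(extract_result("<Root><Result>abcd</Result></Root>"))
    print(extract_result("<Root><Error>no balance</Error></Root>"))
```

With a helper like this, the spider could skip the login attempt (or fall back to manual input) when `None` comes back instead of crashing inside the callback.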
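The code of ruokuai.py is as follows:
I use the RuoKuai (若快) captcha-solving platform in its URL recognition mode: the link to the captcha image is passed straight to the platform, which returns the captcha value.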
# -*- coding: utf-8 -*-
# Python 2 code (urllib2, raw_input)
import hashlib
import urllib
import urllib2
from datetime import datetime

import requests


class APIClient(object):
    def http_request(self, url, paramDict):
        # URL-encode the parameters so that values containing '&' or '='
        # (an image URL, for example) are transmitted intact
        post_content = urllib.urlencode(paramDict)
        req = urllib2.Request(url, data=post_content)
        req.add_header('Content-Type', 'application/x-www-form-urlencoded')
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
        response = opener.open(req)
        return response.read()

    def http_upload_image(self, url, paramKeys, paramDict, filebytes):
        # Derive a multipart boundary from the current time
        timestr = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        boundary = '------------' + hashlib.md5(timestr).hexdigest().lower()
        boundarystr = '\r\n--%s\r\n' % (boundary)
        bs = b''
        # One form-data part per ordinary parameter
        for key in paramKeys:
            bs = bs + boundarystr.encode('ascii')
            param = "Content-Disposition: form-data; name=\"%s\"\r\n\r\n%s" % (key, paramDict[key])
            bs = bs + param.encode('utf8')
        # The image part
        bs = bs + boundarystr.encode('ascii')
        header = 'Content-Disposition: form-data; name=\"image\"; filename=\"%s\"\r\nContent-Type: image/gif\r\n\r\n' % ('sample')
        bs = bs + header.encode('utf8')
        bs = bs + filebytes
        # Closing boundary
        tailer = '\r\n--%s--\r\n' % (boundary)
        bs = bs + tailer.encode('ascii')
        headers = {
            'Content-Type': 'multipart/form-data; boundary=%s' % boundary,
            'Connection': 'Keep-Alive',
            'Expect': '100-continue',
        }
        response = requests.post(url, data=bs, headers=headers)
        return response.text


def arguments_to_dict(args):
    """Turn key=value arguments into a dict; args[0] is skipped."""
    argDict = {}
    if args is None:
        return argDict
    count = len(args)
    if count <= 1:
        print('exit:need arguments.')
        return argDict
    # Visit every argument after args[0]
    for i in range(1, count):
        pair = args[i].split('=')
        if len(pair) < 2:
            continue
        else:
            argDict[pair[0]] = pair[1]
    return argDict


def get_captcha(image_url):
    client = APIClient()
    while 1:
        paramDict = {}
        result = ''
        # Type "url" to have the platform recognize the captcha at image_url
        act = raw_input('Enter the recognition mode (url): ')
        if act == 'info':
            paramDict['username'] = raw_input('username:')
            paramDict['password'] = raw_input('password:')
            result = client.http_request('http://api.ruokuai.com/info.xml', paramDict)
        elif act == 'register':
            paramDict['username'] = raw_input('username:')
            paramDict['password'] = raw_input('password:')
            paramDict['email'] = raw_input('email:')
            result = client.http_request('http://api.ruokuai.com/register.xml', paramDict)
        elif act == 'recharge':
            paramDict['username'] = raw_input('username:')
            paramDict['id'] = raw_input('id:')
            paramDict['password'] = raw_input('password:')
            result = client.http_request('http://api.ruokuai.com/recharge.xml', paramDict)
        elif act == 'url':
            # URL recognition: the platform fetches the image itself
            paramDict['username'] = '********'
            paramDict['password'] = '********'
            paramDict['typeid'] = '2000'
            paramDict['timeout'] = '90'
            paramDict['softid'] = '76693'
            paramDict['softkey'] = 'ec2b5b2a576840619bc885a47a025ef6'
            paramDict['imageurl'] = image_url
            result = client.http_request('http://api.ruokuai.com/create.xml', paramDict)
        elif act == 'report':
            paramDict['username'] = raw_input('username:')
            paramDict['password'] = raw_input('password:')
            paramDict['id'] = raw_input('id:')
            result = client.http_request('http://api.ruokuai.com/create.xml', paramDict)
        elif act == 'upload':
            # Upload recognition: send a local image file as multipart data
            paramDict['username'] = '********'
            paramDict['password'] = '********'
            paramDict['typeid'] = '2000'
            paramDict['timeout'] = '90'
            paramDict['softid'] = '76693'
            paramDict['softkey'] = 'ec2b5b2a576840619bc885a47a025ef6'
            paramKeys = ['username',
                         'password',
                         'typeid',
                         'timeout',
                         'softid',
                         'softkey'
                         ]
            from PIL import Image
            imagePath = raw_input('Image Path:')
            img = Image.open(imagePath)
            if img is None:
                print('get file error!')
                continue
            img.save("upload.gif", format="gif")
            filebytes = open("upload.gif", "rb").read()
            result = client.http_upload_image("http://api.ruokuai.com/create.xml", paramKeys, paramDict, filebytes)
        elif act == 'help':
            # List the available modes, then ask again
            for mode in ('info', 'register', 'recharge', 'url',
                         'report', 'upload', 'help', 'exit'):
                print(mode)
            continue
        elif act == 'exit':
            break
        # Hand the platform's raw XML reply back to the caller
        return result
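
Since the spider only ever needs the `url` mode, the interactive prompt could be bypassed altogether. The sketch below (Python 3, standard library only) merely builds the request for that mode and sends nothing. The endpoint and parameter names are copied from the listing above; the function name `build_url_mode_request` and the sample arguments are hypothetical, and whether the service still accepts such requests has not been verified here.

```python
from urllib.parse import urlencode, parse_qs

# Endpoint as used by the 'url' branch of the listing
CREATE_URL = "http://api.ruokuai.com/create.xml"


def build_url_mode_request(image_url, username, password,
                           softid="76693",
                           softkey="ec2b5b2a576840619bc885a47a025ef6"):
    """Return (endpoint, body) for a URL-recognition request.

    Hypothetical helper: it mirrors the parameters of the 'url' branch
    and performs no network I/O.
    """
    params = {
        "username": username,
        "password": password,
        "typeid": "2000",   # captcha type, as in the listing
        "timeout": "90",
        "softid": softid,
        "softkey": softkey,
        "imageurl": image_url,
    }
    # urlencode escapes '&', '=' and '?' inside the image URL
    return CREATE_URL, urlencode(params).encode("ascii")


if __name__ == "__main__":
    url, body = build_url_mode_request(
        "https://example.com/captcha?id=1&size=s", "user", "pass")
    print(url)
    print(body)
```

The returned pair can be handed to `urllib.request.Request(url, data=body)` to issue the POST, which keeps the crawler fully unattended.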

