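Scraping page information and screenshots of websites with NodeJS and PhantomJS
Installing PhantomJS
First, download the build for your platform from the official PhantomJS website, or download the source code and compile it yourself. Then add PhantomJS to your PATH environment variable and run: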
$ phantomjs
If the command responds, you can move on to the next step.
Taking a simple screenshot with PhantomJS
Here we set the viewport size to 1024 * 800:
page.viewportSize = { width: 1024, height: 800 };
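Through page.clipRect we capture a 1024 * 800 image whose origin is (0, 0).
Through page.settings we disable JavaScript, allow images to load, and change the userAgent to "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) PhantomJS/1.9.0".
Then we open the page with page.open and finally render the screenshot to ./snapshot/test.png.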
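Putting these steps together, a standalone script could look roughly like the sketch below. The file name simple.js, the fallback URL, and the helpers buildSettings and buildClipRect are assumptions made for illustration. The PhantomJS-specific part is guarded by a check for the phantom global, so it only runs under phantomjs.

```javascript
// simple.js (hypothetical file name): run with `phantomjs simple.js <url>`
var VIEWPORT = { width: 1024, height: 800 };

// Settings described in the text: JavaScript off, images on, custom user agent
function buildSettings() {
    return {
        javascriptEnabled: false,
        loadImages: true,
        userAgent: 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) PhantomJS/1.9.0'
    };
}

// Clip rectangle starting at (0, 0) with the same size as the viewport
function buildClipRect() {
    return { top: 0, left: 0, width: VIEWPORT.width, height: VIEWPORT.height };
}

// The `phantom` global only exists inside PhantomJS
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var url = require('system').args[1] || 'http://www.baidu.com';
    var overrides = buildSettings();

    page.viewportSize = VIEWPORT;
    page.clipRect = buildClipRect();
    // Override individual settings and keep the remaining defaults untouched
    for (var key in overrides) {
        page.settings[key] = overrides[key];
    }

    page.open(url, function (status) {
        if (status === 'success') {
            page.render('./snapshot/test.png');
        } else {
            console.log('Failed to load ' + url);
        }
        phantom.exit();
    });
}
```

Setting the properties of page.settings one by one, rather than replacing the whole object, is a choice made in this sketch so that the other default settings stay in place.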
Communication between NodeJS and PhantomJS
Let's first look at which communication channels PhantomJS offers.
Command-line arguments, for example:
phantomjs snapshot.js http://www.baidu.com
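Command-line arguments can only be passed when PhantomJS starts; once it is running they are of no use.
Standard output can carry data from PhantomJS to NodeJS, but it cannot carry data from NodeJS to PhantomJS.
In our tests, however, standard output was the fastest of these channels, so it is worth considering when large amounts of data have to be transferred.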
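As a sketch of the standard-output channel: the PhantomJS script could print one JSON document per line with console.log, and NodeJS could reassemble complete lines from the chunks it receives on the child's stdout. The one-message-per-line convention, the helper name createLineReader, and the message fields are assumptions for illustration, not part of the code in this article.

```javascript
// Collects stdout chunks and calls onMessage once per complete line of JSON.
function createLineReader(onMessage) {
    var buffer = '';
    return function (chunk) {
        buffer += chunk.toString();
        var lines = buffer.split('\n');
        // The last element is an incomplete line (or ''): keep it for the next chunk
        buffer = lines.pop();
        lines.forEach(function (line) {
            if (line.trim() !== '') onMessage(JSON.parse(line));
        });
    };
}

// Usage with a PhantomJS child process (assumes phantomjs is on the PATH):
//   var spawn = require('child_process').spawn;
//   var child = spawn('phantomjs', ['snapshot.js', 'http://www.baidu.com']);
//   child.stdout.on('data', createLineReader(function (msg) { console.log(msg); }));

// Self-check: a message split across two chunks is still delivered exactly once
var received = [];
var feed = createLineReader(function (msg) { received.push(msg); });
feed('{"id":1,"sta');
feed('tus":1}\n{"id":2,"status":0}\n');
console.log(received.length); // prints 2
```

Buffering is needed because a 'data' event may end in the middle of a line; parsing each chunk directly would fail on partial JSON.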
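With HTTP, PhantomJS sends a request to a NodeJS service and NodeJS returns the corresponding data.
This approach is simple, but requests can only be initiated by PhantomJS.
Notably, PhantomJS 1.9.0 supports Websocket, although unfortunately only hixie-76 Websocket. Still, it does offer a way for NodeJS to initiate communication with PhantomJS.
In our tests, PhantomJS needed about 1 second to connect to a local Websocket service, so we set this approach aside for now.
phantomjs-node successfully lets PhantomJS be used as a NodeJS module, but let's look at the author's explanation of how it works: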
I will answer that question with a question. How do you communicate with a process that doesn't support shared memory, sockets, FIFOs, or standard input?
Well, there's one thing PhantomJS does support, and that's opening webpages. In fact, it's really good at opening web pages. So we communicate with PhantomJS by spinning up an instance of ExpressJS, opening Phantom in a subprocess, and pointing it at a special webpage that turns socket.io messages into alert() calls. Those alert() calls are picked up by Phantom and there you go!
The communication itself happens via James Halliday's fantastic dnode library, which fortunately works well enough when combined with browserify to run straight out of PhantomJS's pidgin Javascript environment.
In fact phantomjs-node also communicates over HTTP or Websocket, but its dependencies are heavy. We only want to build something simple, so we set it aside for now.
Design diagram
Let's get started
For the first version we implement the communication over HTTP.
First, use cluster as a simple process supervisor (index.js):
module.exports = (function () {
    "use strict";
    var cluster = require('cluster')
      , fs = require('fs');

    // Make sure the screenshot output directory exists
    if (!fs.existsSync('./snapshot')) {
        fs.mkdirSync('./snapshot');
    }

    if (cluster.isMaster) {
        // Master: start one worker and restart it whenever it dies
        cluster.fork();
        cluster.on('exit', function (worker) {
            console.log('Worker ' + worker.id + ' died :(');
            process.nextTick(function () {
                cluster.fork();
            });
        });
    } else {
        // Worker: run the HTTP API
        require('./extract.js');
    }
})();
Then use connect to build our public API (extract.js):
module.exports = (function () {
    "use strict";
    var connect = require('connect')
      , fs = require('fs')
      , spawn = require('child_process').spawn
      , jobMan = require('./lib/jobMan.js')
      , bridge = require('./lib/bridge.js')
      , pkg = JSON.parse(fs.readFileSync('./package.json'));

    var app = connect()
        .use(connect.logger('dev'))
        .use('/snapshot', connect.static(__dirname + '/snapshot', { maxAge: pkg.maxAge }))
        .use(connect.bodyParser())
        .use('/bridge', bridge)
        .use('/api', function (req, res, next) {
            if (req.method !== "POST" || !req.body.campaignId) return next();
            // No urls in the body: the client is polling an existing job
            if (!req.body.urls || !req.body.urls.length) return jobMan.watch(req.body.campaignId, req, res, next);

            var campaignId = req.body.campaignId
              , imagesPath = './snapshot/' + campaignId + '/'
              , urls = []
              , url
              , imagePath;

            function _deal(id, url, imagePath) {
                // just push into urls list
                urls.push({
                    id: id,
                    url: url,
                    imagePath: imagePath
                });
            }

            // Each url gets an id (its index) and a target image path
            for (var i = req.body.urls.length; i--;) {
                url = req.body.urls[i];
                imagePath = imagesPath + i + '.png';
                _deal(i, url, imagePath);
            }

            // Hand the job and the response to jobMan, then start PhantomJS
            jobMan.register(campaignId, urls, req, res, next);
            var snapshot = spawn('phantomjs', ['snapshot.js', campaignId]);
            snapshot.stdout.on('data', function (data) {
                console.log('stdout: ' + data);
            });
            snapshot.stderr.on('data', function (data) {
                console.log('stderr: ' + data);
            });
            snapshot.on('close', function (code) {
                console.log('snapshot exited with code ' + code);
            });
        })
        .use(connect.static(__dirname + '/html', { maxAge: pkg.maxAge }))
        .listen(pkg.port, function () { console.log('listen: ' + 'http://localhost:' + pkg.port); });
})();
Here we use two modules, bridge and jobMan.
bridge is the HTTP communication bridge and jobMan is the job manager. Each campaignId identifies one job; we hand the job and its response over to jobMan, then start PhantomJS to process it.
The bridge receives or returns job information and hands it to jobMan (bridge.js):
module.exports = (function () {
    "use strict";
    var jobMan = require('./jobMan.js')
      , fs = require('fs')
      , pkg = JSON.parse(fs.readFileSync('./package.json'));

    return function (req, res, next) {
        // Only requests carrying the shared secret header may use the bridge
        if (req.headers.secret !== pkg.secret) return next();
        // Snapshot APP can post url information
        if (req.method === "POST") {
            var body = JSON.parse(JSON.stringify(req.body));
            jobMan.fire(body);
            res.end('');
        // Snapshot APP can get the urls should extract
        } else {
            // Guard against a missing campaignId parameter or an unknown job
            var match = req.url.match(/campaignId=([^&]*)(\s|&|$)/)
              , urls = match ? jobMan.getUrls(match[1]) : undefined;
            res.writeHead(200, {'Content-Type': 'application/json'});
            res.end(JSON.stringify({ urls: urls || [] }));
        }
    };
})();
If the request method is POST, we assume PhantomJS is pushing job results to us; if it is GET, we assume it is asking for the job's information.
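The GET branch pulls campaignId out of the URL with a regular expression. As an alternative sketch, the query string can be parsed with the WHATWG URL class, which is global in current Node versions (an assumption about the runtime; it was not available when this code was written). getCampaignId is a hypothetical helper name, and the base URL is only there because req.url is a relative path.

```javascript
// Returns the campaignId query parameter of a request URL, or null if absent.
function getCampaignId(reqUrl) {
    // req.url is relative (e.g. '/?campaignId=42'), so a dummy base is required
    var parsed = new URL(reqUrl, 'http://localhost');
    return parsed.searchParams.get('campaignId');
}

console.log(getCampaignId('/?campaignId=abc123&x=1')); // prints abc123
console.log(getCampaignId('/'));                       // prints null
```

Unlike indexing into the result of match directly, this returns null instead of throwing when the parameter is missing.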
jobMan manages jobs and returns the job information collected so far to the client through the response (jobMan.js):
module.exports = (function () {
    "use strict";
    var fs = require('fs')
      , fetch = require('./fetch.js')
      , _jobs = {};

    // Flush the results collected so far to the pending response
    function _send(campaignId) {
        var job = _jobs[campaignId];
        if (!job) return;
        // Without a pending response, keep the results buffered until the next watch()
        if (!job.waiting || !job.res) return;
        job.waiting = false;
        clearTimeout(job.timeout);
        var finished = (job.urlsNum === job.finishNum)
          , data = {
                campaignId: campaignId,
                urls: job.urls,
                finished: finished
            }
          , res = job.res;
        job.urls = [];
        // A response can only be used once
        job.res = null;
        if (finished) {
            delete _jobs[campaignId];
        }
        res.writeHead(200, {'Content-Type': 'application/json'});
        res.end(JSON.stringify(data));
    }

    function register(campaignId, urls, req, res, next) {
        _jobs[campaignId] = {
            urlsNum: urls.length,
            finishNum: 0,
            urls: [],
            cacheUrls: urls,
            res: null,
            waiting: false,
            timeout: null
        };
        watch(campaignId, req, res, next);
    }

    function watch(campaignId, req, res, next) {
        var job = _jobs[campaignId];
        // Unknown or already finished job: let the next middleware handle it
        if (!job) return next();
        clearTimeout(job.timeout);
        job.res = res;
        // Results arrived while no response was pending: send them right away
        if (job.waiting) return _send(campaignId);
        // 20s timeout: answer anyway so that the client can poll again
        job.timeout = setTimeout(function () {
            job.waiting = true;
            _send(campaignId);
        }, 20000);
    }

    function fire(opts) {
        var campaignId = opts.campaignId
          , job = _jobs[campaignId]
          , fetchObj = fetch(opts.html);
        if (job) {
            if (+opts.status && fetchObj.title) {
                job.urls.push({
                    id: opts.id,
                    url: opts.url,
                    image: opts.image,
                    title: fetchObj.title,
                    description: fetchObj.description,
                    status: +opts.status
                });
            } else {
                job.urls.push({
                    id: opts.id,
                    url: opts.url,
                    status: +opts.status
                });
            }
            // Batch results: send at most one response every 500ms
            if (!job.waiting) {
                job.waiting = true;
                setTimeout(function () {
                    _send(campaignId);
                }, 500);
            }
            job.finishNum++;
        } else {
            console.log('job not found!');
        }
    }

    function getUrls(campaignId) {
        var job = _jobs[campaignId];
        if (job) return job.cacheUrls;
    }

    return {
        register: register,
        watch: watch,
        fire: fire,
        getUrls: getUrls
    };
})();
Here fetch extracts the title and description from the HTML; its implementation is fairly simple (fetch.js):
module.exports = (function () {
    "use strict";
    return function (html) {
        if (!html) return { title: false, description: false };
        var title = html.match(/<title>(.*?)<\/title>/)
          , meta = html.match(/<meta\s(.*?)\/?>/g)
          , description
          , content;
        if (meta) {
            for (var i = meta.length; i--;) {
                if (meta[i].indexOf('name="description"') > -1 || meta[i].indexOf('name="Description"') > -1) {
                    // Guard against a description tag without a content attribute
                    content = meta[i].match(/content="(.*?)"/);
                    if (content) description = content[1];
                }
            }
        }
        title = (title && title[1] !== '') ? title[1] : 'No Title';
        description = description || 'No Description';
        return {
            title: title,
            description: description
        };
    };
})();
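The regular expressions in fetch.js expect lowercase tags, double quotes, and a title on a single line. The variant below is only a sketch of how the matching could be loosened to tolerate upper-case tags, single quotes, and multi-line titles. The name extractMeta and the sample HTML are assumptions for illustration; for arbitrary real-world pages an HTML parser would be more reliable than regular expressions.

```javascript
// Hypothetical, more tolerant variant of the extraction done in fetch.js
function extractMeta(html) {
    if (!html) return { title: false, description: false };

    // Case-insensitive, and [\s\S] lets the title span several lines
    var titleMatch = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
    var title = titleMatch && titleMatch[1].trim();

    var description;
    var metaTags = html.match(/<meta\s[^>]*>/gi) || [];
    metaTags.forEach(function (tag) {
        if (description !== undefined) return; // keep the first description found
        if (/name\s*=\s*["']?description["']?/i.test(tag)) {
            // Accept either double-quoted or single-quoted content values
            var content = tag.match(/content\s*=\s*(?:"([^"]*)"|'([^']*)')/i);
            if (content) description = content[1] !== undefined ? content[1] : content[2];
        }
    });

    return {
        title: title || 'No Title',
        description: description || 'No Description'
    };
}

var sample = "<html><head><TITLE>Hello</TITLE><meta content='A page' name='description'></head></html>";
console.log(extractMeta(sample)); // prints { title: 'Hello', description: 'A page' }
```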
Finally, here is the source code run by PhantomJS. After starting, it fetches the job information from the bridge over HTTP, then reports each URL of the job back to the bridge over HTTP as soon as that URL is done (snapshot.js):
var webpage = require('webpage')
  , args = require('system').args
  , fs = require('fs')
  , campaignId = args[1]
  , pkg = JSON.parse(fs.read('./package.json'));

function snapshot(id, url, imagePath) {
    var page = webpage.create();
    page.viewportSize = { width: 1024, height: 800 };
    page.clipRect = { top: 0, left: 0, width: 1024, height: 800 };
    page.settings = {
        javascriptEnabled: false,
        loadImages: true,
        userAgent: 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) PhantomJS/1.9.0'
    };
    page.open(url, function (status) {
        var data;
        if (status === 'fail') {
            // status=0 tells NodeJS that the page could not be loaded
            data = [
                'campaignId=', campaignId,
                '&url=', encodeURIComponent(url),
                '&id=', id,
                '&status=', 0
            ].join('');
        } else {
            page.render(imagePath);
            var html = page.content;
            // callback NodeJS, status=1 marks success
            data = [
                'campaignId=', campaignId,
                '&html=', encodeURIComponent(html),
                '&url=', encodeURIComponent(url),
                '&image=', encodeURIComponent(imagePath),
                '&id=', id,
                '&status=', 1
            ].join('');
        }
        // Both outcomes go through the same queue so the job can be counted as done
        postMan.post(data);
        // release the memory
        page.close();
    });
}

var postMan = {
    postPage: null,
    posting: false,
    datas: [],
    len: 0,
    currentNum: 0,
    init: function (snapshot) {
        var postPage = webpage.create()
          , that = this;
        postPage.customHeaders = {
            'secret': pkg.secret
        };
        postPage.open('http://localhost:' + pkg.port + '/bridge?campaignId=' + campaignId, function () {
            // `this` is not postMan inside this callback, so use `that`
            var urls = JSON.parse(postPage.plainText).urls || []
              , url;
            that.len = urls.length;
            if (!that.len) {
                // Nothing to do: exit instead of leaving the process hanging
                phantom.exit();
                return;
            }
            for (var i = that.len; i--;) {
                url = urls[i];
                snapshot(url.id, url.url, url.imagePath);
            }
        });
        this.postPage = postPage;
    },
    post: function (data) {
        // Queue the data; only one POST is in flight at a time
        this.datas.push(data);
        if (!this.posting) {
            this.posting = true;
            this.fire();
        }
    },
    fire: function () {
        if (this.datas.length) {
            var data = this.datas.shift()
              , that = this;
            this.postPage.open('http://localhost:' + pkg.port + '/bridge', 'POST', data, function () {
                that.fire();
                // kill child process once every url has been reported
                setTimeout(function () {
                    if (++that.currentNum === that.len) {
                        that.postPage.close();
                        phantom.exit();
                    }
                }, 500);
            });
        } else {
            this.posting = false;
        }
    }
};

postMan.init(snapshot);
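snapshot.js builds each POST body by joining strings by hand. As a sketch, a small helper could keep the form encoding in one place and encode every value consistently. encodeForm is a hypothetical name and the sample values are made up; the helper uses only ES5 features.

```javascript
// Builds an application/x-www-form-urlencoded body from a flat object.
function encodeForm(fields) {
    return Object.keys(fields).map(function (key) {
        return encodeURIComponent(key) + '=' + encodeURIComponent(String(fields[key]));
    }).join('&');
}

var body = encodeForm({
    campaignId: 'demo',
    url: 'http://www.baidu.com/?a=1&b=2',
    id: 0,
    status: 1
});
console.log(body);
// prints campaignId=demo&url=http%3A%2F%2Fwww.baidu.com%2F%3Fa%3D1%26b%3D2&id=0&status=1
```

Encoding the url value matters here: without it, the & inside the page URL would be read by the server as the start of a new form field.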