
Node.js Crawler Framework Spider: Basic Usage Tutorial

Updated: 2023-07-24 09:13:48  Author: GeoffZhu

This article introduces the basic usage of Spider, a Node.js crawler framework, as a reference for anyone who needs one.

gz-spider

A Node.js crawler framework built on Puppeteer and Axios (source repository).

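Why a crawler framework is needed

A crawler framework simplifies development, enforces a consistent set of conventions, and improves efficiency. A good one draws on multithreading, multiprocessing, distributed execution, IP pools, and similar capabilities to help developers quickly build maintainable, production-grade crawlers that stay useful over the long term.

Features

Configurable proxy
Task retry support
Puppeteer support
Friendly to asynchronous queue services
Friendly to multi-process setups

Installation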

npm i gz-spider --save

Usage

const spider = require('gz-spider');

// Each crawler is a function that must be registered via setProcesser
spider.setProcesser({
  ['getGoogleSearchResult']: async (fetcher, params) => {
    // fetcher.page is the raw Puppeteer page and can be used directly to open pages
    let resp = await fetcher.axios.get(`https://www.google.com/search?q=${encodeURIComponent(params)}`);
    // throw 'Retry': retry this processer
    // throw 'ChangeProxy': retry this processer with a new proxy
    // throw 'Fail': finish this processer immediately with a failure message
    if (resp.status === 200) {
      // Data processing start (placeholder: replace with real parsing)
      let result = resp.data;
      // Data processing end
      return result;
    } else {
      throw 'Retry';
    }
  }
});

// Start crawling; the second argument is passed to the processer as params
const params = 'gz-spider';
spider.getData('getGoogleSearchResult', params).then(result => {
  console.log(result);
});
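Because a processer is just an async function of `(fetcher, params)`, one way to try out its data-processing step without network access is to call it with a stub fetcher. The sketch below is illustrative only: `extractTitles`, `searchProcesser`, and `fakeFetcher` are hypothetical names that are not part of gz-spider, and the `<h3>` pattern is an assumption about the page markup.

```javascript
// Hypothetical helper: collect the text of every <h3> element in an HTML string.
// A regex is enough for a sketch; a real crawler would likely use an HTML parser.
function extractTitles(html) {
  const titles = [];
  const pattern = /<h3[^>]*>(.*?)<\/h3>/gs;
  let match;
  while ((match = pattern.exec(html)) !== null) {
    titles.push(match[1].trim());
  }
  return titles;
}

// A processer in the same shape as the one registered above.
async function searchProcesser(fetcher, params) {
  const url = `https://www.google.com/search?q=${encodeURIComponent(params)}`;
  const resp = await fetcher.axios.get(url);
  if (resp.status !== 200) throw 'Retry';
  return extractTitles(resp.data);
}

// Stub fetcher: mimics only the `axios.get` call that the processer uses.
const fakeFetcher = {
  axios: {
    async get(url) {
      return { status: 200, data: '<h3>First</h3><p>x</p><h3 class="r"> Second </h3>' };
    }
  }
};

searchProcesser(fakeFetcher, 'node crawler').then(titles => {
  console.log(titles); // [ 'First', 'Second' ]
});
```

Keeping the parsing in a plain function like this means it can be checked separately from the retry and proxy behaviour that the framework provides.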

Configuration

The framework consists of three parts: fetcher, strategy, and processer.

Fetcher

spider.setFetcher({
  axiosTimeout: 5000,
  proxyTimeout: 180 * 1000,
  proxy() {
    // May return a Promise, so the proxy config can be fetched from a remote source
    return {
      host: '127.0.0.1',
      port: '9000'
    }
  }
});

axiosTimeout: [Number] Timeout for each crawler request
proxyTimeout: [Number] Interval for refreshing the proxy IP. Use it when proxy IPs expire: the proxy function is run again and the new proxy IP is used
proxy: [Object | Function] When proxy is a [Function], it may be asynchronous, so the proxy config can be fetched from a remote source
- proxy.host [String]
- proxy.port [String]
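The interaction between `proxy` and `proxyTimeout` can be pictured with a small resolver that accepts either an object or a (possibly async) function and re-runs the function once the timeout has elapsed. This is a sketch under assumptions, not gz-spider's internal code: `createProxyResolver` and its injectable `now` clock are hypothetical names introduced for illustration.

```javascript
// Hypothetical helper, not gz-spider's internals: resolves the `proxy` option
// (object, sync function, or async function) and caches it for `proxyTimeout` ms.
function createProxyResolver({ proxy, proxyTimeout = 180 * 1000 }, now = Date.now) {
  let cached = null;
  let fetchedAt = 0;

  return async function resolveProxy() {
    // A plain object never expires; just hand it back.
    if (typeof proxy !== 'function') return proxy;

    const expired = cached === null || now() - fetchedAt >= proxyTimeout;
    if (expired) {
      // `await` accepts both plain return values and Promises.
      cached = await proxy();
      fetchedAt = now();
    }
    return cached;
  };
}

// Usage: the proxy function runs again only after proxyTimeout has elapsed.
let calls = 0;
const resolveProxy = createProxyResolver({
  proxyTimeout: 1000,
  proxy: async () => ({ host: '127.0.0.1', port: String(9000 + calls++) })
});

resolveProxy().then(p => console.log(p.host, p.port));
```

Passing the clock in as a parameter is only a convenience for testing; in normal use the default `Date.now` applies.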

Strategy

spider.setStrategy({
  retryTimes: 2
});

retryTimes: [Number] Maximum number of retries
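To make the role of `retryTimes` concrete, here is a minimal sketch of how a retry policy could react to the strings a processer throws ('Retry', 'ChangeProxy', 'Fail'). It is an assumption about the behaviour, not the framework's actual code: `runWithRetry` and `changeProxy` are hypothetical names, and the sketch assumes one initial attempt plus up to `retryTimes` retries.

```javascript
// Hypothetical sketch of a retry policy; gz-spider's real implementation may differ.
// `processer` is an async (fetcher, params) function, `retryTimes` the maximum number
// of retries, and `changeProxy` an optional callback run on a 'ChangeProxy' signal.
async function runWithRetry(processer, fetcher, params, { retryTimes = 2, changeProxy } = {}) {
  let lastSignal;
  // One initial attempt plus up to `retryTimes` retries.
  for (let attempt = 0; attempt <= retryTimes; attempt++) {
    try {
      return await processer(fetcher, params);
    } catch (signal) {
      lastSignal = signal;
      if (signal === 'Retry') continue;
      if (signal === 'ChangeProxy') {
        if (changeProxy) await changeProxy();
        continue;
      }
      // 'Fail' and any unexpected error end the task immediately.
      throw signal;
    }
  }
  throw new Error(`Gave up after ${retryTimes} retries (last signal: ${lastSignal})`);
}

// Usage: a processer that succeeds on its third attempt.
let attempts = 0;
const flaky = async () => {
  attempts += 1;
  if (attempts < 3) throw 'Retry';
  return 'ok';
};

runWithRetry(flaky, {}, null, { retryTimes: 2 }).then(result => {
  console.log(result, attempts); // ok 3
});
```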

Using it with a task queue

Flow: fetch a task -> `spider.getData(processerKey, processerIn)` -> complete the task and attach the processed data

Simulating a task queue with MySQL

Create a spider-task table with at least the columns 'id', 'status', 'processer_key', 'processer_input', 'processer_output'
Write an endpoint that pulls an unfinished task, for example GET /spider/task
Write an endpoint that completes a task, for example PUT /spider/task
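Before wiring up MySQL, the behaviour of the table and the two endpoints can be prototyped in memory. The class below is a hypothetical stand-in: `TaskStore`, its method names, and the 'pending' and 'running' status values are assumptions made for illustration, while the field names follow the columns listed above.

```javascript
// Hypothetical in-memory stand-in for the spider-task table and its two endpoints.
class TaskStore {
  constructor() {
    this.rows = [];
    this.nextId = 1;
  }

  // Insert a new pending task and return its id.
  add(processerKey, processerInput) {
    const row = {
      id: this.nextId++,
      status: 'pending',
      processer_key: processerKey,
      processer_input: processerInput,
      processer_output: null
    };
    this.rows.push(row);
    return row.id;
  }

  // Roughly what GET /spider/task could do: hand out one pending task and
  // mark it as running so that other workers do not pick it up as well.
  pull() {
    const row = this.rows.find(r => r.status === 'pending');
    if (!row) return null;
    row.status = 'running';
    return { id: row.id, processerKey: row.processer_key, processerInput: row.processer_input };
  }

  // Roughly what PUT /spider/task could do: store the result and the final status.
  complete({ id, status, processerOutput }) {
    const row = this.rows.find(r => r.id === id);
    if (!row) throw new Error(`Unknown task id: ${id}`);
    row.status = status;
    row.processer_output = processerOutput;
  }
}

const store = new TaskStore();
store.add('getGoogleSearchResult', 'node crawler');
const task = store.pull();
store.complete({ id: task.id, status: 'success', processerOutput: ['result'] });
console.log(store.rows[0].status, store.pull()); // success null
```

In a real MySQL-backed version, the pull step would need to claim the row atomically so that two workers cannot receive the same task.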

const axios = require('axios');
const spider = require('gz-spider');

// await is only valid inside an async function in a CommonJS script
async function run() {
  while (true) {
    // Fetch a task
    let resp = await axios.get('http://127.0.0.1:8080/spider/task');
    if (!resp.data.task) break;
    let { id, processerKey, processerInput } = resp.data.task;
    let processerOutput = await spider.getData(processerKey, processerInput);
    // Complete the task and attach the processed data
    await axios.put('http://127.0.0.1:8080/spider/task', {
      id, processerOutput,
      status: 'success'
    });
  }
}

run().catch(console.error);
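The loop above does not say what happens when `spider.getData` rejects. One possible approach, sketched here under assumptions, is to report the failure back to the queue and carry on with the next task. `drainQueue`, `pullTask`, `completeTask`, and the 'fail' status value are hypothetical; the three functions are injected so the sketch runs without a server or gz-spider installed.

```javascript
// Hypothetical worker loop with failure reporting.
async function drainQueue({ pullTask, completeTask, getData }) {
  let processed = 0;
  while (true) {
    const task = await pullTask();
    if (!task) break; // queue is empty

    const { id, processerKey, processerInput } = task;
    try {
      const processerOutput = await getData(processerKey, processerInput);
      await completeTask({ id, processerOutput, status: 'success' });
    } catch (err) {
      // Record the failure instead of crashing the whole loop.
      await completeTask({ id, processerOutput: String(err), status: 'fail' });
    }
    processed += 1;
  }
  return processed;
}

// Usage with in-memory stubs in place of the HTTP endpoints.
const pending = [
  { id: 1, processerKey: 'echo', processerInput: 'a' },
  { id: 2, processerKey: 'boom', processerInput: 'b' }
];
const finished = [];

drainQueue({
  pullTask: async () => pending.shift() || null,
  completeTask: async update => { finished.push(update); },
  getData: async (key, input) => {
    if (key === 'boom') throw 'Fail';
    return input.toUpperCase();
  }
}).then(count => console.log(count, finished));
```

With the real framework, `getData` would be `spider.getData`, and the other two functions would wrap the GET and PUT requests shown earlier.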

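Some thoughts on crawlers

The way a crawler runs means it cannot be stable over the long term, nor can it deliver data in real time. The design of a crawler framework therefore centers on asynchronous task queues. In engineering terms, a crawler framework provides an efficient data-processing pipeline and can be adapted to a variety of task queues.

gz-spider has three components: fetcher, strategy, and processer.

fetcher: the fetching layer. It includes the commonly used HTTP client and Puppeteer, and can be attached to various kinds of proxies.
strategy: the strategy center, responsible for configuring what happens after a crawl fails.
processer: turns the raw data structure into the target data. This is the part that users of the framework write themselves.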

License

MIT
