
A Web Crawler Tutorial in Node.js

Updated: 2020-08-25 09:57:19  Author: garfieldzf

This article shows how to build a simple crawler with Node.js's built-in http module and the HTML parsing tool cheerio. The sample code is explained in detail and should be a useful reference for interested readers.

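I. Introduction

Although this is billed as a first look at crawlers, it does not use any crawler-specific third-party library. It relies mainly on Node.js's built-in http module and the HTML parsing tool cheerio: http fetches the page at a given URL directly, and cheerio then parses the result. I typed out an example I had studied myself to deepen my understanding. While coding, I first tried to iterate over the object returned by a jQuery-style selector with forEach, and it threw an error right away. That object has no forEach method; only JavaScript arrays do.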
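
The forEach error can be reproduced without cheerio, since any array-like object that is not a real array behaves the same way. The following is a minimal sketch: `selection` is a hypothetical stand-in for the object a selector returns, not an actual cheerio object.

```javascript
// A hypothetical array-like object standing in for a selector result.
// It has indexed entries and a length, but it is not a real Array.
var selection = { 0: 'chapter 1', 1: 'chapter 2', length: 2 };

// Calling forEach directly fails because the method does not exist on the object.
var directCallError = null;
try {
  selection.forEach(function (item) { console.log(item); });
} catch (err) {
  directCallError = err; // a TypeError: forEach is not a function
}

// Converting to a real array first makes forEach available.
var titles = [];
Array.from(selection).forEach(function (item) {
  titles.push(item);
});

console.log(directCallError instanceof TypeError);
console.log(titles);
```

With a real cheerio selection, the usual options are its own `.each(function (index, element) { ... })` method, or converting with `.toArray()` before calling forEach.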

II. Key Points

①: superagent, a tool for fetching web pages. I have not used it yet.
②: cheerio, an HTML parsing tool. You can think of it as jQuery for the server side, because the syntax is largely the same.

Results

1. Fetching the whole page

2. The data after parsing. The sample shown comes from the example implemented in this article.
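
To make the shape of the parsed result concrete, here is a small sketch that formats data of that shape into text. The chapter titles, video titles and ids are made-up placeholders, not values taken from the course page, and `formatCourseInfo` is a hypothetical helper introduced only for illustration.

```javascript
// Placeholder data in the shape the crawler produces:
// an array of chapters, each holding a title and a list of videos.
// All values here are invented for illustration.
var sampleCourseData = [
  {
    chapterTitle: 'Chapter A',
    videos: [
      { title: 'First video', id: '1001' },
      { title: 'Second video', id: '1002' }
    ]
  },
  { chapterTitle: 'Chapter B', videos: [] }
];

// Hypothetical helper: turn the nested data into one printable string,
// one line per chapter followed by one indented line per video.
function formatCourseInfo(courseData) {
  var lines = [];
  courseData.forEach(function (chapter) {
    lines.push(chapter.chapterTitle);
    chapter.videos.forEach(function (video) {
      lines.push('  [' + video.id + '] ' + video.title);
    });
  });
  return lines.join('\n');
}

console.log(formatCourseInfo(sampleCourseData));
```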

Crawler source code walkthrough

var http=require('http');
var cheerio=require('cheerio');
 
var url='http://www.imooc.com/learn/348';
 
/****************************
Shape of the data that gets printed:
[{
  chapterTitle:'',
  videos:[{
    title:'',
    id:''
  }]
}]
********************************/
function printCourseInfo(courseData){
  courseData.forEach(function(item){
    var chapterTitle=item.chapterTitle;
    console.log(chapterTitle+'\n');
    item.videos.forEach(function(video){
      console.log(' 【'+video.id+'】'+video.title+'\n');
    });
  });
}
 
 
/*************
Parse the data fetched from the page
**************/
function filterChapter(html){
  var courseData=[];

  var $=cheerio.load(html);
  var chapters=$('.chapter');
  // cheerio's each() passes (index, element); `this` is the current element
  chapters.each(function(){
    var chapter=$(this);
    var chapterTitle=chapter.find('strong').text(); // chapter title
    var videos=chapter.find('.video').children('li');

    var chapterData={
      chapterTitle:chapterTitle,
      videos:[]
    };

    videos.each(function(){
      var video=$(this).find('.studyvideo');
      var title=video.text();
      // href is assumed to look like '/video/7965'; skip entries without one
      var href=video.attr('href');
      if(!href){
        return;
      }
      // split on 'video/' so the id has no leading slash
      var id=href.split('video/')[1];

      chapterData.videos.push({
        title:title,
        id:id
      });
    });

    courseData.push(chapterData);
  });

  return courseData;
}
 
http.get(url,function(res){
  var html='';

  // decode chunks as UTF-8 so characters split across chunks stay intact
  res.setEncoding('utf8');

  res.on('data',function(data){
    html+=data;
  });

  res.on('end',function(){
    var courseData=filterChapter(html);
    printCourseInfo(courseData);
  });
}).on('error',function(err){
  console.log('Failed to fetch course data: '+err.message);
});
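
One detail of the request handling worth illustrating is how the response chunks are joined. Each `data` event delivers a Buffer unless an encoding has been set on the response, and a multi-byte UTF-8 character can be split across two chunks. The sketch below simulates that with two hand-made chunks (no network involved) and compares string concatenation with `Buffer.concat`. The chunk boundary and the sample text are assumptions chosen for illustration.

```javascript
// Simulate a response body arriving in two chunks, with the split
// falling in the middle of a 3-byte UTF-8 character.
var body = Buffer.from('爬虫', 'utf8'); // 6 bytes in total
var chunks = [body.subarray(0, 2), body.subarray(2)];

// Approach 1: append each chunk to a string. Every chunk is decoded
// on its own, so the split character is replaced with U+FFFD.
var concatenated = '';
chunks.forEach(function (chunk) {
  concatenated += chunk;
});

// Approach 2: keep the raw Buffers and decode once at the end.
var decodedOnce = Buffer.concat(chunks).toString('utf8');

console.log(concatenated === '爬虫'); // false
console.log(decodedOnce === '爬虫');  // true
```

In a real handler, calling `res.setEncoding('utf8')` before attaching the `data` listener has a similar effect, because Node then decodes the stream with a decoder that holds back incomplete characters until the next chunk arrives.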

References:

https://github.com/alsotang/node-lessons/tree/master/lesson3

http://www.imooc.com/video/7965
