
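A Detailed Guide to Batch-Crawling WeChat Official Accounts with Java

Updated: 2017-12-04 10:33:07 Submitted by: laozhang

This article explains how to use Java to collect WeChat official accounts and their basic information the way a crawler would. If that is what you need, follow along and use it as a reference.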

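I recently needed to crawl article data from WeChat official accounts. After some searching I found that the hard part is that official-account article links cannot be opened on a PC: they have to be opened in WeChat's built-in browser (the WeChat client appends extra parameters, and only with those can a link be opened on other platforms), which makes life difficult for a crawler. Later I came across a WeChat official-account crawler on Zhihu that an expert had written in PHP, so I followed that approach and ported it to Java. I ran into quite a few details along the way, which I am sharing here.

The basic idea is to run WeChat in an Android emulator, point the emulator at a proxy, intercept WeChat's traffic on the proxy server, and forward the captured data to our own program for processing.

Environment needed: nodejs, the anyproxy proxy, and an Android emulator.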

nodejs download: http://nodejs.cn/download/. I downloaded the Windows version; just run the installer. After installation, running C:\Program Files\nodejs\npm.cmd configures the environment automatically.

Installing anyproxy: with nodejs installed as above, run npm install -g anyproxy in cmd.

Any Android emulator will do; there are plenty available online.

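First install a certificate for the proxy server. anyproxy does not decrypt https links by default; once the certificate is installed it can. Run anyproxy --root in cmd to install the certificate; the same certificate also has to be downloaded on the emulator later.

Then run anyproxy -i to start the proxy service. (Remember to include the -i parameter!)

Note the IP and port it reports; the Android emulator's proxy will use them later. Now open http://localhost:8002/ in a browser. This is anyproxy's web interface, which displays the HTTP traffic passing through the proxy.

Click the menu item marked by the red box above; a QR code appears. Scan it with the Android emulator and the emulator (or phone) downloads the certificate; then install it.

Now set the emulator's proxy: choose manual proxy configuration, use the IP of the machine running anyproxy as the proxy host, and 8001 as the port.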

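That completes the preparation. Open WeChat in the emulator and open any official-account article; the data captured by anyproxy shows up in the web interface you just opened:

The entry in the red box above is the link of the WeChat article; click it to see the details. If the response body is empty, the certificate was probably not installed correctly.

If everything above works, you can move on.

We rely on the proxy to capture WeChat data, but we cannot operate WeChat by hand for every item we capture; at that point we might as well copy things manually. So the WeChat client has to navigate between pages on its own. To do that, we use anyproxy to intercept the data returned by the WeChat server, inject page-redirect code into it, and return the modified data to the emulator, so that the WeChat client redirects automatically.

Open the file rule_default.js that ships with anyproxy. On Windows it is in: C:\Users\Administrator\AppData\Roaming\npm\node_modules\anyproxy\lib

The file contains a method named replaceServerResDataAsync: function(req,res,serverResData,callback), which is responsible for manipulating the data anyproxy receives. Initially it should contain only callback(serverResData); which simply returns the server's response to the client unchanged. Delete that statement and replace it with the code below, written by the original author. I made hardly any changes to this code, and its comments explain it clearly, so just follow the logic.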

 replaceServerResDataAsync: function(req,res,serverResData,callback){
     var http = require('http');
     if(/mp\/getmasssendmsg/i.test(req.url)){// the URL is an official-account history page (first page format)
       //console.log("start crawling page format 1");
       if(serverResData.toString() !== ""){
         try {// keep an exception from killing the proxy
           var reg = /msgList = (.*?);/;// regex that extracts the history message list
           var ret = reg.exec(serverResData.toString());// match against the body as a string
           HttpPost(ret[1],req.url,"/InternetSpider/getData/showBiz");// HttpPost is defined later; it sends the matched history JSON to our own server
           http.get('http://xxx/getWxHis', function(res) {// a program on our own server that returns the next URL wrapped in a JS snippet, so the page redirects automatically; getWxHis is described later
             var script = '';
             res.setEncoding('utf8');
             res.on('data', function(chunk){ script += chunk; });// the snippet may arrive in several chunks
             res.on('end', function(){
               callback(script+serverResData);// insert the returned code into the history page and return it
             });
           }).on('error', function(e){
             console.log(e);
             callback(serverResData);// our server is unreachable: return the page unchanged
           });
         }catch(e){// if the regex above did not match, the body is probably a later page of the history list: the first page is HTML, later pages are JSON
         //console.log("start crawling page format 1, scrolled-down form");
           try {
             var json = JSON.parse(serverResData.toString());
             if (json.general_msg_list != []) {
             HttpPost(json.general_msg_list,req.url,"/xxx/showBiz");// same function as above; sends the JSON of the later history page to our own server
             }
           }catch(e){
            console.log(e);// log the error
           }
           callback(serverResData);// return the JSON of the later page unchanged
         }
       }else{
         callback(serverResData);// empty body: pass it through so the client is not left waiting
       }
       //console.log("finished crawling page format 1");
     }else if(/mp\/profile_ext\?action=home/i.test(req.url)){// the URL is an official-account history page (second page format)
       try {
         var reg = /var msgList = \'(.*?)\';/;// regex that extracts the history message list (differs from the first page format)
         var ret = reg.exec(serverResData.toString());// match against the body as a string
         HttpPost(ret[1],req.url,"/xxx/showBiz");// HttpPost is defined later; it sends the matched history JSON to our own server
         http.get('http://xxx/getWxHis', function(res) {// same endpoint as above: returns the JS snippet that redirects to the next URL
             var script = '';
             res.setEncoding('utf8');
             res.on('data', function(chunk){ script += chunk; });
             res.on('end', function(){
               callback(script+serverResData);// insert the returned code into the history page and return it
             });
           }).on('error', function(e){
             callback(serverResData);
           });
       }catch(e){
         //console.log(e);
         callback(serverResData);
       }
     }else if(/mp\/profile_ext\?action=getmsg/i.test(req.url)){// JSON returned when the second page format is scrolled down
       try {
         var json = JSON.parse(serverResData.toString());
         if (json.general_msg_list != []) {
           HttpPost(json.general_msg_list,req.url,"/xxx/showBiz");// same function as above; sends the JSON of the later history page to our own server
         }
       }catch(e){
         console.log(e);
       }
       callback(serverResData);
     }else if(/mp\/getappmsgext/i.test(req.url)){// the URL returns an article's read and like counts
       try {
         HttpPost(serverResData,req.url,"/xxx/getMsgExt");// defined later; sends the read/like count JSON to the server
       }catch(e){

       }
       callback(serverResData);
     }else if(/s\?__biz/i.test(req.url) || /mp\/rumor/i.test(req.url)){// the URL is an official-account article (the rumor URL is used when an article has been flagged as a rumor)
       try {
         http.get('http://xxx/getWxPost', function(res) {// another program on our own server; it also returns the next URL wrapped in a JS snippet. getWxPost is described later.
           var script = '';
           res.setEncoding('utf8');
           res.on('data', function(chunk){ script += chunk; });
           res.on('end', function(){
             callback(script+serverResData);
           });
         }).on('error', function(e){
           callback(serverResData);
         });
       }catch(e){
         callback(serverResData);
       }
     }else{
       callback(serverResData);
     }
     //callback(serverResData);
   },

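A brief explanation: the history page of a WeChat official account comes in two URL forms, one starting with mp.weixin.qq.com/mp/getmasssendmsg and the other with mp.weixin.qq.com/mp/profile_ext. The history page can be scrolled down; scrolling triggers a JS event that sends a request and gets JSON data back (the next page of content). There are also the article links and the links that return an article's read and like counts (as JSON). These URL forms are fixed, so they can be told apart with simple conditional logic. One open question is how to crawl the entire history. My idea is to use JS to simulate scrolling down, which triggers the request that loads the next part of the list; alternatively, use anyproxy to analyze the scroll-load request and send it to the WeChat server directly. Either way, you still have to work out how to tell that no more data remains. I only crawl the latest data, so I do not need this for now, though I may later. If you need it, give it a try.

The HttpPost method used above is shown below.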
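
The URL checks described above can also be expressed on the Java side, for example to unit-test the patterns or to log what kind of page a request belongs to. The following is only a sketch: the class name `WeChatUrlClassifier` and the `Kind` values are made up for illustration, and the regular expressions are copied from the proxy rule rather than from any official WeChat documentation.

```java
import java.util.regex.Pattern;

/** Hypothetical helper that mirrors the URL checks made in the proxy rule. */
final class WeChatUrlClassifier {

    enum Kind { HISTORY_PAGE, HISTORY_NEXT_PAGE, ARTICLE_STATS, ARTICLE, OTHER }

    // Same patterns, in the same order, as the if/else chain in replaceServerResDataAsync.
    private static final Pattern OLD_HISTORY = ci("mp/getmasssendmsg");
    private static final Pattern HISTORY_HOME = ci("mp/profile_ext\\?action=home");
    private static final Pattern HISTORY_MORE = ci("mp/profile_ext\\?action=getmsg");
    private static final Pattern STATS = ci("mp/getappmsgext");
    private static final Pattern ARTICLE = ci("s\\?__biz|mp/rumor");

    private static Pattern ci(String regex) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    static Kind classify(String url) {
        // The old getmasssendmsg URL serves both the first page and later pages;
        // the proxy rule tells them apart by the response body, not by the URL.
        if (OLD_HISTORY.matcher(url).find() || HISTORY_HOME.matcher(url).find()) {
            return Kind.HISTORY_PAGE;
        }
        if (HISTORY_MORE.matcher(url).find()) {
            return Kind.HISTORY_NEXT_PAGE;
        }
        if (STATS.matcher(url).find()) {
            return Kind.ARTICLE_STATS;
        }
        if (ARTICLE.matcher(url).find()) {
            return Kind.ARTICLE;
        }
        return Kind.OTHER;
    }

    public static void main(String[] args) {
        System.out.println(classify("https://mp.weixin.qq.com/mp/getappmsgext?__biz=abc"));
    }
}
```

The order of the checks matters in the same way as in the proxy rule: the more specific profile_ext patterns are tested before the generic article pattern.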

 function HttpPost(str,url,path) {// sends JSON to the server: str is the JSON content, url is the URL of the intercepted page, path is the path of the receiving handler
     console.log("start forwarding");
     try{
     var http = require('http');
     var data = {
         str: encodeURIComponent(str),
         url: encodeURIComponent(url)
     };
     data = require('querystring').stringify(data);// stringify escapes the values again, so they end up URL-encoded twice
     var options = {
         method: "POST",
         host: "xxx",// the server's domain name; note there is no http:// prefix
         port: xxx,
         path: path,// path of the receiving handler
         headers: {
             'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
             "Content-Length": Buffer.byteLength(data)// length in bytes rather than in characters
         }
     };
     var req = http.request(options, function (res) {
         res.setEncoding('utf8');
         res.on('data', function (chunk) {
             console.log('BODY: ' + chunk);
         });
     });
     req.on('error', function (e) {
         console.log('problem with request: ' + e.message);
     });
     
     req.write(data);
     req.end();
     }catch(e){
         console.log("error: "+e);
     }
     console.log("forwarding finished");// the request has been issued; the response arrives asynchronously
 }
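
Because HttpPost applies encodeURIComponent and then querystring.stringify, the str and url fields reach the server URL-encoded twice. The article does not show the receiving controller, so the following is only a sketch of how a raw request body could be decoded; `ProxyFormDecoder` is a hypothetical name, and it assumes you have the undecoded body in hand.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: undoes the two layers of URL encoding applied by HttpPost. */
final class ProxyFormDecoder {

    static Map<String, String> decode(String rawBody) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : rawBody.split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // not a name=value pair
            }
            String name = urlDecode(pair.substring(0, eq));
            // First pass undoes querystring.stringify, second pass undoes encodeURIComponent.
            String value = urlDecode(urlDecode(pair.substring(eq + 1)));
            fields.put(name, value);
        }
        return fields;
    }

    private static String urlDecode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String body = "str=%257B%2522a%2522%253A1%257D"
                + "&url=http%253A%252F%252Fexample.com%252F%253Fid%253D1";
        System.out.println(decode(body));
    }
}
```

If a web framework has already decoded the form parameters once, only the second pass is needed; decoding a value too many times can corrupt content that legitimately contains a percent sign.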

With the above done, the next step is to write the server-side code for your own business needs. Our service receives the data sent by the proxy server, processes and persists it, and sends back to the proxy the JS code to inject into WeChat. Each kind of URL the proxy intercepts sends different data, so each needs its own handler. From the anyproxy method replaceServerResDataAsync: function(req,res,serverResData,callback) that processes the WeChat data, we can see that at least three handlers are needed: one for official-account history page data, one for article page data, and one for article like and read counts. We also need a method that generates crawl tasks so that the official accounts are polled in turn. To crawl more data, analyze the links captured by anyproxy for what you need, add a matching condition to replaceServerResDataAsync: function(req,res,serverResData,callback) so the data is intercepted and sent to your own server, and add a server-side method to handle that kind of data.

I wrote the server-side code in Java.

Method that handles official-account history page data:

public void getMsgJson(String str ,String url) throws UnsupportedEncodingException {
    // TODO Auto-generated method stub
    String biz = "";
    Map<String,String> queryStrs = HttpUrlParser.parseUrl(url);
    if(queryStrs != null){
      biz = queryStrs.get("__biz");
      biz = biz + "==";
    }
    /**
     * Check the database for this biz and insert it if it is missing;
     * a missing biz means we have just added a new official account to crawl.
     */
    List<WeiXin> results = weiXinMapper.selectByBiz(biz);
    if(results == null || results.size() == 0){
      WeiXin weiXin = new WeiXin();
      weiXin.setBiz(biz);
      weiXin.setCollect(System.currentTimeMillis());
      weiXinMapper.insert(weiXin);
    }
    //System.out.println(str);
    // Parse the str variable
    List<Object> lists = JsonPath.read(str, "['list']");
    for(Object list : lists){
      Object json = list;
      int type = JsonPath.read(json, "['comm_msg_info']['type']");
      if(type == 49){// type 49 is an article (image-and-text) message
        String content_url = JsonPath.read(json, "$.app_msg_ext_info.content_url");
        content_url = content_url.replace("\\", "").replaceAll("amp;", "");// link of the article
        int is_multi = JsonPath.read(json, "$.app_msg_ext_info.is_multi");// whether the message contains several articles
        Integer datetime = JsonPath.read(json, "$.comm_msg_info.datetime");// time the message was sent
        /**
         * Insert the article link into the crawl queue table tmplist here.
         * (The queue table is described later; its main purpose is to build a batch crawl queue
         * from which another program picks the next official account or article to crawl.)
         */
        try{
          if(content_url != null && !"".equals(content_url)){
            TmpList tmpList = new TmpList();
            tmpList.setContentUrl(content_url);
            tmpListMapper.insertSelective(tmpList);
          }
        }catch(Exception e){
          System.out.println("Already in the queue, not inserting.");
        }
        
        /**
         * Check the post table by content_url to see whether this article is already stored
         */
        List<Post> postList = postMapper.selectByContentUrl(content_url);
        boolean contentUrlExist = false;
        if(postList != null && postList.size() != 0){
          contentUrlExist = true;
        }
      
        
        if(!contentUrlExist){// no row with the same content_url exists in the post table
          Integer fileid = JsonPath.read(json, "$.app_msg_ext_info.fileid");// an id assigned by WeChat
          String title = JsonPath.read(json, "$.app_msg_ext_info.title");// article title
          String title_encode = URLEncoder.encode(title, "utf-8");
          String digest = JsonPath.read(json, "$.app_msg_ext_info.digest");// article summary
          String source_url = JsonPath.read(json, "$.app_msg_ext_info.source_url");// the "read the original" link
          source_url = source_url.replace("\\", "");
          String cover = JsonPath.read(json, "$.app_msg_ext_info.cover");// cover image
          cover = cover.replace("\\", "");
          /**
           * Save to the database
           */
//          System.out.println("Headline title: "+title);
//          System.out.println("WeChat ID: "+fileid);
//          System.out.println("Summary: "+digest);
//          System.out.println("Original link: "+source_url);
//          System.out.println("Cover image URL: "+cover);
          
          Post post = new Post();
          post.setBiz(biz);
          post.setTitle(title);
          post.setTitleEncode(title_encode);
          post.setFieldId(fileid);
          post.setDigest(digest);
          post.setSourceUrl(source_url);
          post.setCover(cover);
          post.setIsTop(1);// mark as the headline article
          post.setIsMulti(is_multi);
          post.setDatetime(datetime);
          post.setContentUrl(content_url);
          
          postMapper.insert(post);
        }
      
        if(is_multi == 1){// the message contains several articles
          List<Object> multiLists = JsonPath.read(json, "['app_msg_ext_info']['multi_app_msg_item_list']");
          for(Object multiList : multiLists){
            Object multiJson = multiList;          
            content_url = JsonPath.read(multiJson, "['content_url']").toString().replace("\\", "").replaceAll("amp;", "");// link of the article
            /**
             * Check content_url against the database again to avoid duplicates
             */
            contentUrlExist = false;
            List<Post> posts = postMapper.selectByContentUrl(content_url);
            if(posts != null && posts.size() != 0){
              contentUrlExist = true;
            }
            if(!contentUrlExist){// no row with the same content_url exists in the database
              /**
               * Insert the article link into the crawl queue table here.
               * (The queue table is described later; its main purpose is to build a batch crawl queue
               * from which another program picks the next official account or article to crawl.)
               */
              if(content_url != null && !"".equals(content_url)){
                TmpList tmpListT = new TmpList();
                tmpListT.setContentUrl(content_url);
                tmpListMapper.insertSelective(tmpListT);
              }
              
              String title = JsonPath.read(multiJson, "$.title");
              String title_encode = URLEncoder.encode(title, "utf-8");
              Integer fileid = JsonPath.read(multiJson, "$.fileid");
              String digest = JsonPath.read(multiJson, "$.digest");
              String source_url = JsonPath.read(multiJson, "$.source_url");
              source_url = source_url.replace("\\", "");
              String cover = JsonPath.read(multiJson, "$.cover");
              cover = cover.replace("\\", "");            
//              System.out.println("Title: "+title);
//              System.out.println("WeChat ID: "+fileid);
//              System.out.println("Summary: "+digest);
//              System.out.println("Original link: "+source_url);
//              System.out.println("Cover image URL: "+cover);
              Post post = new Post();
              post.setBiz(biz);
              post.setTitle(title);
              post.setTitleEncode(title_encode);
              post.setFieldId(fileid);
              post.setDigest(digest);
              post.setSourceUrl(source_url);
              post.setCover(cover);
              post.setIsTop(0);// mark as not the headline article
              post.setIsMulti(is_multi);
              post.setDatetime(datetime);
              post.setContentUrl(content_url);
              
              postMapper.insert(post);
              
            }
          }
        }      
      }    
    }
  }
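
The HttpUrlParser.parseUrl helper used above is not shown in the article. As a rough idea of what such a helper does, here is a hypothetical stand-in that splits a URL's query string into a map. It is not the author's implementation: the `biz + "=="` line suggests the original parser drops the trailing Base64 padding, whereas this sketch splits on the first `=` only and therefore keeps it. The biz value in the example is invented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the article's HttpUrlParser; not the author's code. */
final class QueryStringSketch {

    /** Returns the raw (still URL-encoded) query parameters of the given URL. */
    static Map<String, String> parseQuery(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        int questionMark = url.indexOf('?');
        if (questionMark < 0) {
            return params; // no query string at all
        }
        String query = url.substring(questionMark + 1);
        int hash = query.indexOf('#');
        if (hash >= 0) {
            query = query.substring(0, hash); // drop the fragment, e.g. #wechat_redirect
        }
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            // Split on the first '=' only, so Base64 padding such as "==" stays in the value.
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(key, value);
        }
        return params;
    }

    public static void main(String[] args) {
        String url = "https://mp.weixin.qq.com/mp/profile_ext?action=home"
                + "&__biz=MzA3MDM3NjE5NQ==#wechat_redirect";
        System.out.println(parseQuery(url));
    }
}
```

With a parser like this one, the caller would not need to append `"=="` to the biz value afterwards.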

Method that handles official-account article pages:

public String getWxPost() {
    /**
     * Called when the current page is an official-account article.
     * First delete the rows in the crawl queue table whose load flag is 1,
     * then select several rows from the queue table with "order by id asc"
     * (note that this differs from getWxHis, which selects a single row).
     */
    tmpListMapper.deleteByLoad(1);
    List<TmpList> queues = tmpListMapper.selectMany(5);
    String url = "";
    if(queues != null && queues.size() > 1){
      TmpList queue = queues.get(0);
      url = queue.getContentUrl();
      queue.setIsload(1);
      int result = tmpListMapper.updateByPrimaryKey(queue);
      System.out.println("update result:"+result);
    }else{
      // Parenthesized so that the ternary applies to queues rather than to the concatenated string
      System.out.println("getPost queue size: "+(queues == null ? null : queues.size()));
      WeiXin weiXin = weiXinMapper.selectOne();
      String biz = weiXin.getBiz();
      // The original chose at random between the first page format
      // (http://mp.weixin.qq.com/mp/getmasssendmsg?__biz=<biz>#wechat_webview_type=1&wechat_redirect)
      // and the second, then always overwrote the result with the second format,
      // so only the second format is built here.
      url = "https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=" + biz + 
          "#wechat_redirect";// build the history page URL (second page format)
      // Update the collect-time field of this account to the current timestamp.
      weiXin.setCollect(System.currentTimeMillis());
      int result = weiXinMapper.updateByPrimaryKey(weiXin);
      System.out.println("getPost weiXin updateResult:"+result);
    }
    int randomTime = new Random().nextInt(3) + 3;
    String jsCode = "<script>setTimeout(function(){window.location.href='"+url+"';},"+randomTime*1000+");</script>";
    return jsCode;
    
  }
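
The last three lines of the method build the script that anyproxy prepends to the page. A small sketch of that step in isolation is shown below; `RedirectScript` is a hypothetical name, and the quote escaping is an addition that the article's code does not have (it assumes the URL might contain a single quote or backslash).

```java
import java.util.Random;

/** Hypothetical helper that builds the redirect snippet injected into the WeChat page. */
final class RedirectScript {

    /** Same 3 to 5 second range as "new Random().nextInt(3) + 3" in the article. */
    static int randomDelaySeconds(Random random) {
        return random.nextInt(3) + 3;
    }

    static String build(String url, int delaySeconds) {
        // Escape backslashes and single quotes so the URL cannot break out of the JS string literal.
        String safeUrl = url.replace("\\", "\\\\").replace("'", "\\'");
        return "<script>setTimeout(function(){window.location.href='" + safeUrl + "';},"
                + (delaySeconds * 1000) + ");</script>";
    }

    public static void main(String[] args) {
        String url = "https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=abc#wechat_redirect";
        System.out.println(build(url, randomDelaySeconds(new Random())));
    }
}
```

Randomizing the delay makes the page visits a little less regular than a fixed 2-second timer would.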

Method that handles an article's like and read counts:

public void getMsgExt(String str,String url) {
    // TODO Auto-generated method stub
    String biz = "";
    String sn = "";
    Map<String,String> queryStrs = HttpUrlParser.parseUrl(url);
    if(queryStrs != null){
      biz = queryStrs.get("__biz");
      biz = biz + "==";
      sn = queryStrs.get("sn");
      sn = "%" + sn + "%";
    }
    /**
     * Find the matching article by biz and sn, equivalent to:
     * select * from `post table` where `biz` = '<biz>'
     * and `content_url` like '%<sn>%' limit 0,1
     */
    Post post = postMapper.selectByBizAndSn(biz, sn);
    
    if(post == null){
      System.out.println("biz:"+biz);
      System.out.println("sn:"+sn);
      tmpListMapper.deleteByLoad(1);
      return;
    }
    
//    System.out.println("json data: "+str);
    Integer read_num;
    Integer like_num;
    try{
      read_num = JsonPath.read(str, "['appmsgstat']['read_num']");// read count
      like_num = JsonPath.read(str, "['appmsgstat']['like_num']");// like count
    }catch(Exception e){
      // Parsing failed: fall back to placeholder values (note that they are then saved to the post table)
      read_num = 123;// read count (placeholder)
      like_num = 321;// like count (placeholder)
      System.out.println("read_num:"+read_num);
      System.out.println("like_num:"+like_num);
      System.out.println(e.getMessage());
    }    
    
    /**
     * Also delete the matching article from the crawl queue table by sn,
     * which means the article can leave the crawl queue. Equivalent to:
     * delete from `queue table` where `content_url` like '%<sn>%'
     */
    tmpListMapper.deleteBySn(sn);
    
    // Then write the read and like counts to the post table.
    post.setReadnum(read_num);
    post.setLikenum(like_num);
    postMapper.updateByPrimaryKey(post);
    
  }
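
When the counts cannot be parsed, the method above falls back to the fixed values 123 and 321. One alternative is to report "no value" and let the caller decide whether to skip the update. The sketch below illustrates that idea without any JSON library, using a regular expression; this is a simplification for the sake of a self-contained example (the article uses JsonPath), and `StatFieldReader` is a hypothetical name. The sample JSON shape follows the paths used in the article; the numbers are invented.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: reads an integer field and reports absence instead of a fake value. */
final class StatFieldReader {

    static OptionalInt readInt(String json, String field) {
        // Matches "field": 123 anywhere in the text; good enough for flat numeric fields only.
        Pattern pattern = Pattern.compile("\"" + Pattern.quote(field) + "\"\\s*:\\s*(\\d+)");
        Matcher matcher = pattern.matcher(json);
        if (!matcher.find()) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(matcher.group(1)));
        } catch (NumberFormatException e) {
            return OptionalInt.empty(); // value does not fit into an int
        }
    }

    public static void main(String[] args) {
        String json = "{\"appmsgstat\":{\"read_num\":1024,\"like_num\":7}}";
        System.out.println(readInt(json, "read_num"));
        System.out.println(readInt("{}", "read_num"));
    }
}
```

A caller could then update the post only when both values are present, for example by checking `isPresent()` before calling the setters.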

Method that handles redirection by generating the JS injected into WeChat:

public String getWxHis() {
    String url = "";
    // TODO Auto-generated method stub
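    /**
     * Called when the current page is an official-account history page.
     * The crawl queue table has a load field; a value of 1 means the row is being read.
     * First delete the rows in the queue table whose load is 1,
     * then select one arbitrary row from the queue table.
     */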
    tmpListMapper.deleteByLoad(1);
    TmpList queue = tmpListMapper.selectRandomOne();
    System.out.println("queue is null?"+queue);
    if(queue == null){// the queue table is empty
      /**
       * When the queue table is empty, take a biz from the table that stores official accounts.
       * That table has a collect-time field; sorting by it in ascending order
       * gives the account with the smallest timestamp, and we use its biz.
       */
      WeiXin weiXin = weiXinMapper.selectOne();
      
      String biz = weiXin.getBiz();
      url = "https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=" + biz + 
          "#wechat_redirect";// build the history page URL (second page format)
      // Update the collect-time field of this account to the current timestamp.
      weiXin.setCollect(System.currentTimeMillis());
      int result = weiXinMapper.updateByPrimaryKey(weiXin);
      System.out.println("getHis weiXin updateResult:"+result);
    }else{
      // Take the content_url field of this row
      url = queue.getContentUrl();
      // Update its load field to 1
      tmpListMapper.updateByContentUrl(url);
    }
    // Turn the next URL into a JS snippet that anyproxy injects into the WeChat page.
    // Original PHP: echo "<script>setTimeout(function(){window.location.href='".$url."';},2000);</script>";
    int randomTime = new Random().nextInt(3) + 3;
    String jsCode = "<script>setTimeout(function(){window.location.href='"+url+"';},"+randomTime*1000+");</script>";
    return jsCode;
  }
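
The decision made by this method (take a queued article if there is one, otherwise go to the history page of the account that was visited longest ago) can be modelled in memory to make the rotation easier to follow. The sketch below is a simplification under stated assumptions: it keeps state in collections instead of the MySQL tables, takes queued URLs in insertion order rather than at random, and leaves out the load flag. `CrawlScheduler` and `Account` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;

/** In-memory sketch of the "next URL" decision; the article keeps this state in database tables. */
final class CrawlScheduler {

    static final class Account {
        final String biz;
        long lastCollected; // epoch millis of the last visit to this account's history page

        Account(String biz, long lastCollected) {
            this.biz = biz;
            this.lastCollected = lastCollected;
        }
    }

    private final Deque<String> articleQueue = new ArrayDeque<>();
    private final List<Account> accounts = new ArrayList<>();

    void enqueueArticle(String contentUrl) {
        articleQueue.add(contentUrl);
    }

    void addAccount(String biz, long lastCollected) {
        accounts.add(new Account(biz, lastCollected));
    }

    /** Returns the next URL to visit, or null if there is no queued article and no account. */
    String nextUrl(long now) {
        String queued = articleQueue.poll();
        if (queued != null) {
            return queued; // a pending article takes priority
        }
        Account oldest = accounts.stream()
                .min(Comparator.comparingLong((Account a) -> a.lastCollected))
                .orElse(null);
        if (oldest == null) {
            return null;
        }
        oldest.lastCollected = now; // moves this account to the back of the rotation
        return "https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz="
                + oldest.biz + "#wechat_redirect";
    }

    public static void main(String[] args) {
        CrawlScheduler scheduler = new CrawlScheduler();
        scheduler.addAccount("AAA==", 5L);
        scheduler.addAccount("BBB==", 1L);
        System.out.println(scheduler.nextUrl(System.currentTimeMillis()));
    }
}
```

Updating the timestamp at the moment an account is chosen is what makes the accounts take turns.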

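That is the code that processes the data intercepted by the proxy server. One thing to note: the program visits every official account stored in the database in rotation, and even revisits articles that are already stored, so that their read and like counts stay up to date. If you need to crawl a large number of official accounts, modify the code that adds tasks to the queue and add some limiting conditions; otherwise, once there are many accounts, repeatedly crawling the same data will seriously hurt efficiency.

At this point all the article links of the official accounts have been collected. These links are permanently valid and can be opened in a browser. The next step is a crawler that takes the links from the database and fetches the article content and other information.

I wrote the crawler with webmagic, which is lightweight and easy to use.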
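
One possible limiting condition is the age of an article: only put it into the crawl queue if it was published recently, so that old articles are not revisited forever. The following is a sketch of such a check, not part of the original program. `EnqueuePolicy` and the 7-day window are made-up examples, and it assumes that the `datetime` field read in getMsgJson is a Unix timestamp in seconds.

```java
/** Hypothetical policy: only recent articles are (re-)queued for read/like count updates. */
final class EnqueuePolicy {

    private static final long SECONDS_PER_DAY = 86_400L;

    /**
     * @param publishedEpochSeconds publish time of the article (assumed to be Unix seconds)
     * @param nowEpochSeconds       current time in Unix seconds
     * @param maxAgeDays            articles older than this are not queued again
     */
    static boolean shouldEnqueue(long publishedEpochSeconds, long nowEpochSeconds, int maxAgeDays) {
        long ageSeconds = nowEpochSeconds - publishedEpochSeconds;
        // Reject timestamps from the future as well as articles past the age limit.
        return ageSeconds >= 0 && ageSeconds <= maxAgeDays * SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis() / 1000;
        long threeDaysAgo = now - 3 * SECONDS_PER_DAY;
        System.out.println(shouldEnqueue(threeDaysAgo, now, 7));
    }
}
```

In getMsgJson such a check would sit in front of the tmpListMapper.insertSelective calls.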

public class SpiderModel implements PageProcessor{
  
  private static PostMapper postMapper;
  
  private static List<Post> posts;
  
  // Site configuration for the crawl: encoding, crawl interval, retry count, and so on
  private Site site = Site.me().setRetryTimes(3).setSleepTime(100);
  
  public Site getSite() {
    // TODO Auto-generated method stub
    return this.site;
  }
  
  public void process(Page page) {
    // TODO Auto-generated method stub
    Post post = posts.remove(0);
    String content = page.getHtml().xpath("//div[@id='js_content']").get();
    // Some articles have been taken down; detect that here and either delete the record or set a flag marking the article as removed
    if(content == null){
      System.out.println("Article has been taken down!");
      //postMapper.deleteByPrimaryKey(post.getId());
      return;
    }
    String contentSnap = content.replaceAll("data-src", "src").replaceAll("preview.html", "player.html");// snapshot
    String contentTxt = HtmlToWord.stripHtml(content);// plain-text content
    
    Selectable metaContent = page.getHtml().xpath("//div[@id='meta_content']");
    String pubTime = null;
    String wxname = null;
    String author = null;
    if(metaContent != null){
      pubTime = metaContent.xpath("//em[@id='post-date']").get();
      if(pubTime != null){
        pubTime = HtmlToWord.stripHtml(pubTime);// article publish time
      }
      wxname = metaContent.xpath("//a[@id='post-user']").get();
      if(wxname != null){
        wxname = HtmlToWord.stripHtml(wxname);// official account name
      }
      author = metaContent.xpath("//em[@class='rich_media_meta rich_media_meta_text' and @id!='post-date']").get();
      if(author != null){
        author = HtmlToWord.stripHtml(author);// article author
      }
    }
    
//    System.out.println("Publish time: "+pubTime);
//    System.out.println("Official account name: "+wxname);
//    System.out.println("Author: "+author);
    
    String title = post.getTitle().replaceAll("&nbsp;", "");// article title
    String digest = post.getDigest();// article summary
    int likeNum = post.getLikenum();// like count
    int readNum = post.getReadnum();// read count
    String contentUrl = post.getContentUrl();// article link
    
    WechatInfoBean wechatBean = new WechatInfoBean();
    wechatBean.setTitle(title);
    wechatBean.setContent(contentTxt);// plain-text content
    wechatBean.setSourceCode(contentSnap);// snapshot
    wechatBean.setLikeCount(likeNum);
    wechatBean.setViewCount(readNum);
    wechatBean.setAbstractText(digest);// summary
    wechatBean.setUrl(contentUrl);
    wechatBean.setPublishTime(pubTime);
    wechatBean.setSiteName(wxname);// site name, here the official account name
    wechatBean.setAuthor(author);
    wechatBean.setMediaType("WeChat official account");// source media type
    
    WechatStorage.saveWechatInfo(wechatBean);
    
    // Mark the article as crawled
    post.setIsSpider(1);
    postMapper.updateByPrimaryKey(post);
    
  }  
  
  public static void startSpider(List<Post> inposts,PostMapper myPostMapper,String... urls){
    
    long startTime, endTime;
    startTime = System.currentTimeMillis();
    postMapper = myPostMapper;
    posts = inposts;
    
    HttpClientDownloader httpClientDownloader = new HttpClientDownloader();    
    SpiderModel spiderModel = new SpiderModel();
    Spider mySpider = Spider.create(spiderModel).addUrl(urls);
    mySpider.setDownloader(httpClientDownloader);
    try {
      SpiderMonitor.instance().register(mySpider);
      mySpider.thread(1).run();
    } catch (JMException e) {
      e.printStackTrace();
    }
    
    endTime = System.currentTimeMillis();
    System.out.println("Crawl time: " + ((endTime - startTime) / 1000) + " s");
    
  }
  
}
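
The HtmlToWord.stripHtml helper that turns the extracted HTML into plain text is not shown in the article. A rough, regex-based stand-in could look like the sketch below; `HtmlText` is a hypothetical name, and this is not the author's implementation. For real-world HTML a proper parser is likely to be more robust than regular expressions.

```java
/** Hypothetical stand-in for the article's HtmlToWord.stripHtml; not the author's code. */
final class HtmlText {

    static String stripHtml(String html) {
        // Remove script and style elements together with their content.
        String withoutScripts = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", "");
        // Remove all remaining tags.
        String withoutTags = withoutScripts.replaceAll("(?s)<[^>]+>", "");
        // Decode a handful of common entities; &amp; goes last to avoid double decoding.
        String decoded = withoutTags
                .replace("&nbsp;", " ")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
        // Collapse runs of whitespace into single spaces.
        return decoded.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripHtml("<em id=\"post-date\">2017-12-04</em>"));
    }
}
```

This is the kind of processing applied to the post-date, post-user and js_content fragments selected by the XPath expressions in the spider.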

I will not post the remaining storage code, which has nothing to do with the crawling logic. I store the data captured by the proxy server in MySQL and the data fetched by my own crawler in MongoDB.

Below is the official-account information I crawled: