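Java Douban Movie Crawler in Detail: How a Little Crawler Grew Up (Source Code Included)

Updated: 2016-12-12 09:51:10  Author: JackieZheng

This article describes how a Java crawler for Douban movies was built, with source code included. It may be a useful reference for anyone building something similar.

I have used crawlers before: for example, I used Nutch to crawl a set of seed URLs and built search on top of the crawled data, and I skimmed some of its source code. Nutch is, of course, very thorough and meticulous in how it handles crawling. Whenever I watched the fetched pages and processing messages scroll past on the screen, it felt like black magic. Taking the opportunity of reviewing Spring MVC, I decided to build a small crawler of my own. It could be simple, and a few small bugs would be fine; all I needed was something that could crawl the information I wanted from a given seed site. Whenever an exception came up I fixed it: sometimes an API was misused, sometimes an HTTP request returned an abnormal status, sometimes database reads or writes went wrong. Through this cycle of hitting exceptions and resolving them, JewelCrawler (named after my son's nickname) became able to crawl data on its own, and it also picked up a small extra skill: sentiment analysis based on the Word2Vec algorithm.

More unknown exceptions are probably waiting, and some performance work remains, such as the interaction with the database and data reads and writes. I don't expect to have much time for this before the end of the year, though, so today I'm writing a brief summary. The previous two posts focused on features and results; this one covers how JewelCrawler came to be. The code is on GitHub (the source link is at the end of the article) for anyone interested. (It is for learning and exchange only; please don't use it for anything else, and be considerate of Douban. A little more sincerity, a little less harm.)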

Environment

Development tool: IntelliJ IDEA 14

Database: MySQL 5.5, plus the database management tool Navicat (for connecting to and querying the database)

Language: Java

Dependency (JAR) management: Maven

Version control: Git

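Directory structure

In this layout:

    com.ansj.vec is a Java implementation of the Word2Vec algorithm

    com.jackie.crawler.doubanmovie is the crawler module, which in turn contains several packages

Some packages are empty because those modules are not used yet. Of the rest:

        constants: constant classes
        crawl: the crawler entry-point programs
        entity: entity classes mapped to database tables
        test: test classes
        utils: utility classes

The resource module holds configuration and resource files, for example:

        beans.xml: the Spring context configuration file
        seed.properties: the seed file
        stopwords.dic: the stop-word dictionary
        comment12031715.txt: crawled short-review data
        tokenizerResult.txt: the output file after tokenizing with IKAnalyzer
        vector.mod: model data trained with the Word2Vec algorithm

The test module is for writing unit tests.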

Database configuration

1. Add the required dependencies

JewelCrawler is managed with Maven, so it is enough to add the corresponding dependencies to pom.xml.

<dependency>
  <groupId>org.springframework</groupId>
  <artifactId>spring-jdbc</artifactId>
  <version>4.1.1.RELEASE</version>
</dependency>
<dependency>
  <groupId>commons-pool</groupId>
  <artifactId>commons-pool</artifactId>
  <version>1.6</version>
</dependency>
<dependency>
  <groupId>commons-dbcp</groupId>
  <artifactId>commons-dbcp</artifactId>
  <version>1.4</version>
</dependency>
<dependency>
  <groupId>mysql</groupId>
  <artifactId>mysql-connector-java</artifactId>
  <version>5.1.38</version>
</dependency>

2. Declare the data source bean

The data source bean needs to be declared in beans.xml.

<context:property-placeholder location="classpath*:*.properties"/>

<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource" destroy-method="close">
  <property name="driverClassName" value="${jdbc.driver}"/>
  <property name="url" value="${jdbc.url}"/>
  <property name="username" value="${jdbc.username}"/>
  <property name="password" value="${jdbc.password}"/>
</bean>

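Note: this binds the external configuration file jdbc.properties (the pattern classpath*:*.properties picks up the .properties files at the classpath root, jdbc.properties among them), and the actual data source parameters are read from that file.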
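The ${jdbc.*} placeholders in the bean definition imply four keys in jdbc.properties. As a rough sketch only (the key names come from beans.xml; the values, the database name crawler_db, and the class name JdbcPropertiesSketch are invented for illustration), the following parses such a file with java.util.Properties and checks that every key the placeholders need is present:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Sketch only: key names mirror the ${...} placeholders in beans.xml;
// the values below are made-up examples, not the project's real settings.
class JdbcPropertiesSketch {

    // Hypothetical contents of jdbc.properties.
    static final String SAMPLE =
            "jdbc.driver=com.mysql.jdbc.Driver\n"
          + "jdbc.url=jdbc:mysql://localhost:3306/crawler_db\n"
          + "jdbc.username=demo_user\n"
          + "jdbc.password=demo_password\n";

    // The four keys referenced by the dataSource bean.
    static final String[] REQUIRED_KEYS = {
        "jdbc.driver", "jdbc.url", "jdbc.username", "jdbc.password"
    };

    // Parses properties text and fails fast if a required key is missing.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (String key : REQUIRED_KEYS) {
            if (props.getProperty(key) == null) {
                throw new IllegalArgumentException("Missing key: " + key);
            }
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = parse(SAMPLE);
        System.out.println(props.getProperty("jdbc.url"));
    }
}
```

A missing key would otherwise only surface when Spring tries to resolve the placeholder at context startup, so a check like this can make the failure easier to read.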

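If you run into the error "SQL [insert into user(id) values(?)]; Field 'name' doesn't have a default value;", the fix I used was to make the corresponding column an auto-increment column. More generally, this error means the INSERT omits a NOT NULL column that has no default, so the column named in the message either needs a default (or must allow NULL), or its value must be supplied in the INSERT.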

Problems encountered while parsing pages

Crawled pages have to be parsed as a DOM to pull out the data you want. Along the way I ran into the following errors.

org.htmlparser.Node cannot be resolved

Fix: add the JAR dependency

<dependency>
  <groupId>org.htmlparser</groupId>
  <artifactId>htmlparser</artifactId>
  <version>1.6</version>
</dependency>

org.apache.http.HttpEntity cannot be resolved

Fix: add the JAR dependency

<dependency>
  <groupId>org.apache.httpcomponents</groupId>
  <artifactId>httpclient</artifactId>
  <version>4.5.2</version>
</dependency>

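These were problems I hit along the way; in the end, page parsing was done with Jsoup.

Slow downloads from the Maven repository

I had been using the default Maven Central repository, and JAR downloads were very slow, whether because of my network or for some other reason. I later found the Aliyun Maven mirror online; after switching, downloads were near-instant by comparison. Highly recommended.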

<mirrors>
  <mirror>
    <id>alimaven</id>
    <name>aliyun maven</name>
    <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    <mirrorOf>central</mirrorOf>
  </mirror>
</mirrors>

Find Maven's settings.xml file and add this mirror to it.

One way to read a file under the resource module

For example, to read the seed.properties file:

@Test
public void testFile() {
  // Resolve the classpath resource to a file path, then print the file size in bytes.
  File seedFile = new File(this.getClass().getResource("/seed.properties").getPath());
  System.out.print("===========" + seedFile.length() + "===========");
}
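One caveat with the approach above: getResource(...).getPath() returns a URL-encoded path, so it can break when the path contains spaces, and it does not work for resources packaged inside a JAR. A possible alternative, sketched below under assumptions, reads the resource as a stream instead. The class name SeedLoaderSketch and the key name seed are hypothetical; the actual layout of seed.properties is not shown in this article.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Sketch only: loads a properties file from a stream rather than a file path.
class SeedLoaderSketch {

    // Loads properties from a stream; a null stream means the resource was not found.
    static Properties load(InputStream in) {
        if (in == null) {
            throw new IllegalArgumentException("resource not found on classpath");
        }
        Properties props = new Properties();
        try (InputStream stream = in) {
            props.load(stream);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Looks the file up on the classpath, e.g. loadFromClasspath("/seed.properties").
    static Properties loadFromClasspath(String name) {
        return load(SeedLoaderSketch.class.getResourceAsStream(name));
    }

    public static void main(String[] args) {
        // Stand-in content so the sketch runs without the real file;
        // "seed" is a hypothetical key name.
        byte[] sample = "seed=https://movie.douban.com/\n".getBytes(StandardCharsets.UTF_8);
        Properties props = load(new ByteArrayInputStream(sample));
        System.out.println(props.getProperty("seed"));
    }
}
```

Because getResourceAsStream returns null for a missing resource, the explicit null check turns a later NullPointerException into a clearer error message.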

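About regular expressions

When using a regular expression, even if the input does match the Pattern you defined, you must first call the Matcher's find method and only then call group to get the matched substring. Calling group directly will not give you the result you want.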
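A minimal sketch of this behavior follows. The pattern and the sample URL are illustrative assumptions, not taken from JewelCrawler's code, and the class name FindBeforeGroupDemo is made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: shows that group() needs a successful find() first.
class FindBeforeGroupDemo {

    // Illustrative pattern: captures the numeric id in a ".../subject/<id>/" URL.
    static final Pattern SUBJECT_ID = Pattern.compile("/subject/(\\d+)/");

    // Calls group(1) on a fresh matcher without find(): throws IllegalStateException.
    static String groupWithoutFind(String url) {
        Matcher m = SUBJECT_ID.matcher(url);
        return m.group(1);
    }

    // Calls find() first; returns the captured id, or null when nothing matches.
    static String extractId(String url) {
        Matcher m = SUBJECT_ID.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String url = "https://movie.douban.com/subject/1292052/";
        try {
            groupWithoutFind(url);
        } catch (IllegalStateException e) {
            System.out.println("group() before find() failed: " + e.getMessage());
        }
        System.out.println(extractId(url)); // prints 1292052
    }
}
```

The same rule applies to matches() and lookingAt(): any of them counts as a match attempt, but group is only usable after one of them has succeeded.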

I took a look at the source code of the Matcher class:

package java.util.regex;

import java.util.Objects;

public final class Matcher implements MatchResult {

  /**
   * The Pattern object that created this Matcher.
   */
  Pattern parentPattern;

  /**
   * The storage used by groups. They may contain invalid values if
   * a group was skipped during the matching.
   */
  int[] groups;

  /**
   * The range within the sequence that is to be matched. Anchors
   * will match at these "hard" boundaries. Changing the region
   * changes these values.
   */
  int from, to;

  /**
   * Lookbehind uses this value to ensure that the subexpression
   * match ends at the point where the lookbehind was encountered.
   */
  int lookbehindTo;

  /**
   * The original string being matched.
   */
  CharSequence text;

  /**
   * Matcher state used by the last node. NOANCHOR is used when a
   * match does not have to consume all of the input. ENDANCHOR is
   * the mode used for matching all the input.
   */
  static final int ENDANCHOR = 1;
  static final int NOANCHOR = 0;
  int acceptMode = NOANCHOR;

  /**
   * The range of string that last matched the pattern. If the last
   * match failed then first is -1; last initially holds 0 then it
   * holds the index of the end of the last match (which is where the
   * next search starts).
   */
  int first = -1, last = 0;

  /**
   * The end index of what matched in the last match operation.
   */
  int oldLast = -1;

  /**
   * The index of the last position appended in a substitution.
   */
  int lastAppendPosition = 0;

  /**
   * Storage used by nodes to tell what repetition they are on in
   * a pattern, and where groups begin. The nodes themselves are stateless,
   * so they rely on this field to hold state during a match.
   */
  int[] locals;

  /**
   * Boolean indicating whether or not more input could change
   * the results of the last match.
   *
   * If hitEnd is true, and a match was found, then more input
   * might cause a different match to be found.
   * If hitEnd is true and a match was not found, then more
   * input could cause a match to be found.
   * If hitEnd is false and a match was found, then more input
   * will not change the match.
   * If hitEnd is false and a match was not found, then more
   * input will not cause a match to be found.
   */
  boolean hitEnd;

  /**
   * Boolean indicating whether or not more input could change
   * a positive match into a negative one.
   *
   * If requireEnd is true, and a match was found, then more
   * input could cause the match to be lost.
   * If requireEnd is false and a match was found, then more
   * input might change the match but the match won't be lost.
   * If a match was not found, then requireEnd has no meaning.
   */
  boolean requireEnd;

  /**
   * If transparentBounds is true then the boundaries of this
   * matcher's region are transparent to lookahead, lookbehind,
   * and boundary matching constructs that try to see beyond them.
   */
  boolean transparentBounds = false;

  /**
   * If anchoringBounds is true then the boundaries of this
   * matcher's region match anchors such as ^ and $.
   */
  boolean anchoringBounds = true;

  /**
   * No default constructor.
   */
  Matcher() {
  }

  /**
   * All matchers have the state used by Pattern during a match.
   */
  Matcher(Pattern parent, CharSequence text) {
    this.parentPattern = parent;
    this.text = text;

    // Allocate state storage
    int parentGroupCount = Math.max(parent.capturingGroupCount, 10);
    groups = new int[parentGroupCount * 2];
    locals = new int[parent.localCount];

    // Put fields into initial states
    reset();
  }

....

  /**
   * Returns the input subsequence matched by the previous match.
   *
   * <p> For a matcher <i>m</i> with input sequence <i>s</i>,
   * the expressions <i>m.</i><tt>group()</tt> and
   * <i>s.</i><tt>substring(</tt><i>m.</i><tt>start(),</tt> <i>m.</i><tt>end())</tt>
   * are equivalent. </p>
   *
   * <p> Note that some patterns, for example <tt>a*</tt>, match the empty
   * string. This method will return the empty string when the pattern
   * successfully matches the empty string in the input. </p>
   *
   * @return The (possibly empty) subsequence matched by the previous match,
   *     in string form
   *
   * @throws IllegalStateException
   *     If no match has yet been attempted,
   *     or if the previous match operation failed
   */
  public String group() {
    return group(0);
  }

  /**
   * Returns the input subsequence captured by the given group during the
   * previous match operation.
   *
   * <p> For a matcher <i>m</i>, input sequence <i>s</i>, and group index
   * <i>g</i>, the expressions <i>m.</i><tt>group(</tt><i>g</i><tt>)</tt> and
   * <i>s.</i><tt>substring(</tt><i>m.</i><tt>start(</tt><i>g</i><tt>),</tt> <i>m.</i><tt>end(</tt><i>g</i><tt>))</tt>
   * are equivalent. </p>
   *
   * <p> <a href="Pattern.html#cg">Capturing groups</a> are indexed from left
   * to right, starting at one. Group zero denotes the entire pattern, so
   * the expression <tt>m.group(0)</tt> is equivalent to <tt>m.group()</tt>.
   * </p>
   *
   * <p> If the match was successful but the group specified failed to match
   * any part of the input sequence, then <tt>null</tt> is returned. Note
   * that some groups, for example <tt>(a*)</tt>, match the empty string.
   * This method will return the empty string when such a group successfully
   * matches the empty string in the input. </p>
   *
   * @param group
   *     The index of a capturing group in this matcher's pattern
   *
   * @return The (possibly empty) subsequence captured by the group
   *     during the previous match, or <tt>null</tt> if the group
   *     failed to match part of the input
   *
   * @throws IllegalStateException
   *     If no match has yet been attempted,
   *     or if the previous match operation failed
   *
   * @throws IndexOutOfBoundsException
   *     If there is no capturing group in the pattern
   *     with the given index
   */
  public String group(int group) {
    if (first < 0)
      throw new IllegalStateException("No match found");
    if (group < 0 || group > groupCount())
      throw new IndexOutOfBoundsException("No group " + group);
    if ((groups[group*2] == -1) || (groups[group*2+1] == -1))
      return null;
    return getSubSequence(groups[group * 2], groups[group * 2 + 1]).toString();
  }

  /**
   * Attempts to find the next subsequence of the input sequence that matches
   * the pattern.
   *
   * <p> This method starts at the beginning of this matcher's region, or, if
   * a previous invocation of the method was successful and the matcher has
   * not since been reset, at the first character not matched by the previous
   * match.
   *
   * <p> If the match succeeds then more information can be obtained via the
   * <tt>start</tt>, <tt>end</tt>, and <tt>group</tt> methods. </p>
   *
   * @return <tt>true</tt> if, and only if, a subsequence of the input
   *     sequence matches this matcher's pattern
   */
  public boolean find() {
    int nextSearchIndex = last;
    if (nextSearchIndex == first)
      nextSearchIndex++;

    // If next search starts before region, start it at region
    if (nextSearchIndex < from)
      nextSearchIndex = from;

    // If next search starts beyond region then it fails
    if (nextSearchIndex > to) {
      for (int i = 0; i < groups.length; i++)
        groups[i] = -1;
      return false;
    }
    return search(nextSearchIndex);
  }

  /**
   * Initiates a search to find a Pattern within the given bounds.
   * The groups are filled with default values and the match of the root
   * of the state machine is called. The state machine will hold the state
   * of the match as it proceeds in this matcher.
   *
   * Matcher.from is not set here, because it is the "hard" boundary
   * of the start of the search which anchors will set to. The from param
   * is the "soft" boundary of the start of the search, meaning that the
   * regex tries to match at that index but ^ won't match there. Subsequent
   * calls to the search methods start at a new "soft" boundary which is
   * the end of the previous match.
   */
  boolean search(int from) {
    this.hitEnd = false;
    this.requireEnd = false;
    from = from < 0 ? 0 : from;
    this.first = from;
    this.oldLast = oldLast < 0 ? from : oldLast;
    for (int i = 0; i < groups.length; i++)
      groups[i] = -1;
    acceptMode = NOANCHOR;
    boolean result = parentPattern.root.match(this, from, text);
    if (!result)
      this.first = -1;
    this.oldLast = this.last;
    return result;
  }

...

}

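Here is the reason. If you call group directly without calling find first, group() calls group(int group), whose body begins with the check if (first < 0). That condition clearly holds, because first is initialized to -1, so an IllegalStateException is thrown. If you call find first, however, it ends up calling search(nextSearchIndex). Note that nextSearchIndex has been assigned from last, whose initial value is 0. Execution then moves into the search method: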

boolean search(int from) {
  this.hitEnd = false;
  this.requireEnd = false;
  from = from < 0 ? 0 : from;
  this.first = from;
  this.oldLast = oldLast < 0 ? from : oldLast;
  for (int i = 0; i < groups.length; i++)
    groups[i] = -1;
  acceptMode = NOANCHOR;
  boolean result = parentPattern.root.match(this, from, text);
  if (!result)
    this.first = -1;
  this.oldLast = this.last;
  return result;
}

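nextSearchIndex is passed in as from, and from is assigned to first in the method body. So once find has been called and a match is found, first is no longer -1 and group no longer throws. If the match fails, search sets first back to -1, so group still throws in that case.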
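In a crawler, the practical consequence of this trace is the familiar while (m.find()) loop: each successful find makes group usable for that match, and the loop ends as soon as find returns false. The sketch below is an illustration under assumptions; the pattern, the sample HTML in the comments, and the class name FindLoopSketch are made up and are not JewelCrawler's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: collects every match with a find() loop.
class FindLoopSketch {

    // Illustrative pattern for links such as movie.douban.com/subject/1292052
    static final Pattern SUBJECT_LINK =
            Pattern.compile("movie\\.douban\\.com/subject/(\\d+)");

    // Returns all captured ids in the order they appear in the text.
    static List<String> collectIds(String html) {
        List<String> ids = new ArrayList<>();
        Matcher m = SUBJECT_LINK.matcher(html);
        while (m.find()) {          // group(1) is valid only inside this loop
            ids.add(m.group(1));
        }
        return ids;
    }

    // Returns true when group() throws after find() has failed,
    // which corresponds to search() resetting first to -1.
    static boolean groupThrowsAfterFailedFind(String text) {
        Matcher m = SUBJECT_LINK.matcher(text);
        if (m.find()) {
            return false;           // there was a match, so this case does not apply
        }
        try {
            m.group();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String html = "<a href=\"https://movie.douban.com/subject/1292052/\">A</a>"
                    + "<a href=\"https://movie.douban.com/subject/1291546/\">B</a>";
        System.out.println(collectIds(html));
        System.out.println(groupThrowsAfterFailedFind("no links here"));
    }
}
```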

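The source code has been uploaded to Baidu Netdisk: http://pan.baidu.com/s/1dFwtvNz

The issues above are fairly scattered; they are notes taken while running into problems and solving them. You will hit other problems in practice, and questions or suggestions are welcome ^^.

Finally, a few screenshots of the data crawled so far

Record table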