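A Java Douban Movie Crawler in Detail: A Little Crawler Grows Up (Source Code Included)
I have used crawlers before, for example using Nutch to crawl specified seeds and building search on top of the crawled data, and I have skimmed some of its source code. Nutch, of course, handles crawling very thoroughly and carefully. Whenever I watched the crawled pages and processing messages scroll past on the screen, it felt like black magic. While reviewing Spring MVC recently, I took the chance to build a small crawler of my own. It could be simple, and a few small bugs would not matter; all I needed was something that could crawl the information I wanted from a given seed site. Whenever an exception came up, I fixed it: sometimes an API was misused, sometimes an HTTP request returned an abnormal status, sometimes a database read or write failed. Through this cycle of hitting and fixing exceptions, JewelCrawler (named after my son's nickname) can now crawl data on its own, and it has one small extra skill: sentiment analysis based on the Word2Vec algorithm.
There may still be unknown exceptions waiting to be solved, and some performance work remains, such as the database interaction and the data reads and writes. I probably won't have much time for this before the end of the year, so today I'm writing a brief summary. The previous two posts focused on features and results; this one describes how JewelCrawler came to be. The code is on GitHub (the source address is at the end of the article) for anyone interested. It is for learning and exchange only; please do not use it for other purposes, and be considerate of Douban. A little more sincerity, a little less harm.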
Environment
Development tool: IntelliJ IDEA 14
Database: MySQL 5.5, with Navicat as the database management tool (for connecting to and querying the database)
Language: Java
Dependency management: Maven
Version control: Git
Directory structure

In the project:
com.ansj.vec is a Java implementation of the Word2Vec algorithm
com.jackie.crawler.doubanmovie is the crawler module, which in turn contains several packages

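Some packages are empty because those modules are not used yet. Of the rest:
- the constants package holds constant classes
- the crawl package holds the crawler entry point
- the entity package holds the entity classes mapped to database tables
- the test package holds test classes
- the utils package holds utility classes
The resource module holds configuration and resource files, for example:
- beans.xml: the Spring context configuration file
- seed.properties: the seed file
- stopwords.dic: the stop-word dictionary
- comment12031715.txt: the crawled short reviews
- tokenizerResult.txt: the result file produced by IKAnalyzer tokenization
- vector.mod: the model data trained with the Word2Vec algorithm
The test module is for testing and holds the unit tests.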
Database configuration
1. Add the required dependencies
JewelCrawler uses Maven for dependency management, so you only need to add the relevant dependencies to pom.xml.
<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-jdbc</artifactId>
    <version>4.1.1.RELEASE</version>
</dependency>
<dependency>
    <groupId>commons-pool</groupId>
    <artifactId>commons-pool</artifactId>
    <version>1.6</version>
</dependency>
<dependency>
    <groupId>commons-dbcp</groupId>
    <artifactId>commons-dbcp</artifactId>
    <version>1.4</version>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.38</version>
</dependency>
2. Declare the data source bean
We need to declare the data source bean in beans.xml.
<context:property-placeholder location="classpath*:*.properties"/>
<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource" destroy-method="close">
<property name="driverClassName" value="${jdbc.driver}"/>
<property name="url" value="${jdbc.url}"/>
<property name="username" value="${jdbc.username}"/>
<property name="password" value="${jdbc.password}"/>
</bean>
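Note: the property placeholder loads every .properties file on the classpath, which binds the external configuration file jdbc.properties; the actual data source parameters are read from that file.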
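For reference, a jdbc.properties file matching the placeholders above might look like the following. The four keys come from the bean definition; every value is a placeholder assumption (host, port, database name, and credentials are not given in the original). The driver class shown is the one shipped with the mysql-connector-java 5.1.x line.

```properties
# Hypothetical values; replace them with your own database settings.
jdbc.driver=com.mysql.jdbc.Driver
jdbc.url=jdbc:mysql://localhost:3306/your_database?useUnicode=true&characterEncoding=UTF-8
jdbc.username=your_username
jdbc.password=your_password
```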
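If you run into the error "SQL [insert into user(id) values(?)]; Field 'name' doesn't have a default value;", it means the INSERT omitted a NOT NULL column that has no default. The fix I used was to make the relevant column auto-increment; giving the column a default value or supplying it in the INSERT also resolves it.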
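Problems encountered while parsing pages
The crawled page data has to be parsed as a DOM structure to extract the data I want. Along the way I hit the following errors.
org.htmlparser.Node cannot be resolved
Fix: add the jar dependency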
<dependency>
    <groupId>org.htmlparser</groupId>
    <artifactId>htmlparser</artifactId>
    <version>1.6</version>
</dependency>
org.apache.http.HttpEntity cannot be resolved
Fix: add the jar dependency
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>
These were only problems encountered along the way; in the end, page parsing was done with Jsoup.
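Since the final version parses pages with Jsoup, the pom.xml would also need a Jsoup dependency along these lines. The original does not say which release was used, so the version number below is an assumption.

```xml
<!-- Version is an assumption; the original does not state which Jsoup release was used. -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.9.2</version>
</dependency>
```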
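Slow downloads from the Maven repository
I had been using the default Maven Central repository, and downloading jars was very slow; I don't know whether it was my network or something else. I later found Alibaba Cloud's Maven mirror online. After switching, downloads are nearly instant by comparison. Highly recommended.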
<mirrors>
    <mirror>
        <id>alimaven</id>
        <name>aliyun maven</name>
        <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
        <mirrorOf>central</mirrorOf>
    </mirror>
</mirrors>
Find Maven's settings.xml file and add this mirror.
One way to read files under the resource module
For example, reading the seed.properties file:
@Test
public void testFile(){
File seedFile = new File(this.getClass().getResource("/seed.properties").getPath());
System.out.print("===========" + seedFile.length() + "===========" );
}
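One caveat with the approach above: getResource(...).getPath() returns a URL-encoded path, so it can fail when the path contains spaces, and it does not work once the resource is packaged inside a jar. A minimal sketch of a stream-based alternative follows. The class name SeedFileReader is hypothetical, and it assumes seed.properties uses the standard key=value properties format, which the original does not show.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

class SeedFileReader {

    // Parses key=value pairs from any input stream; kept separate so it can be
    // exercised without a real classpath resource.
    static Properties load(InputStream in) throws IOException {
        Properties props = new Properties();
        props.load(in);
        return props;
    }

    // Loads a resource by its classpath name, e.g. "/seed.properties".
    static Properties loadFromClasspath(String name) throws IOException {
        try (InputStream in = SeedFileReader.class.getResourceAsStream(name)) {
            if (in == null) {
                // getResourceAsStream returns null instead of throwing when the resource is missing
                throw new IOException("Resource not found on classpath: " + name);
            }
            return load(in);
        }
    }

    public static void main(String[] args) throws IOException {
        Properties seeds = loadFromClasspath("/seed.properties");
        System.out.println("Loaded " + seeds.size() + " entries");
    }
}
```

Reading through a stream works the same way whether the file sits in a directory on the classpath or inside a jar.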
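About regular expressions
When using a regular expression, even if the input matches the defined Pattern, you must call the matcher's find method first and only then use group to get the substring. Calling group directly will not give you the result you want.
I took a look at the source of the Matcher class: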
package java.util.regex;
import java.util.Objects;
public final class Matcher implements MatchResult {
/**
* The Pattern object that created this Matcher.
*/
Pattern parentPattern;
/**
* The storage used by groups. They may contain invalid values if
* a group was skipped during the matching.
*/
int[] groups;
/**
* The range within the sequence that is to be matched. Anchors
* will match at these "hard" boundaries. Changing the region
* changes these values.
*/
int from, to;
/**
* Lookbehind uses this value to ensure that the subexpression
* match ends at the point where the lookbehind was encountered.
*/
int lookbehindTo;
/**
* The original string being matched.
*/
CharSequence text;
/**
* Matcher state used by the last node. NOANCHOR is used when a
* match does not have to consume all of the input. ENDANCHOR is
* the mode used for matching all the input.
*/
static final int ENDANCHOR = 1;
static final int NOANCHOR = 0;
int acceptMode = NOANCHOR;
/**
* The range of string that last matched the pattern. If the last
* match failed then first is -1; last initially holds 0 then it
* holds the index of the end of the last match (which is where the
* next search starts).
*/
int first = -1, last = 0;
/**
* The end index of what matched in the last match operation.
*/
int oldLast = -1;
/**
* The index of the last position appended in a substitution.
*/
int lastAppendPosition = 0;
/**
* Storage used by nodes to tell what repetition they are on in
* a pattern, and where groups begin. The nodes themselves are stateless,
* so they rely on this field to hold state during a match.
*/
int[] locals;
/**
* Boolean indicating whether or not more input could change
* the results of the last match.
*
* If hitEnd is true, and a match was found, then more input
* might cause a different match to be found.
* If hitEnd is true and a match was not found, then more
* input could cause a match to be found.
* If hitEnd is false and a match was found, then more input
* will not change the match.
* If hitEnd is false and a match was not found, then more
* input will not cause a match to be found.
*/
boolean hitEnd;
/**
* Boolean indicating whether or not more input could change
* a positive match into a negative one.
*
* If requireEnd is true, and a match was found, then more
* input could cause the match to be lost.
* If requireEnd is false and a match was found, then more
* input might change the match but the match won't be lost.
* If a match was not found, then requireEnd has no meaning.
*/
boolean requireEnd;
/**
* If transparentBounds is true then the boundaries of this
* matcher's region are transparent to lookahead, lookbehind,
* and boundary matching constructs that try to see beyond them.
*/
boolean transparentBounds = false;
/**
* If anchoringBounds is true then the boundaries of this
* matcher's region match anchors such as ^ and $.
*/
boolean anchoringBounds = true;
/**
* No default constructor.
*/
Matcher() {
}
/**
* All matchers have the state used by Pattern during a match.
*/
Matcher(Pattern parent, CharSequence text) {
this.parentPattern = parent;
this.text = text;
// Allocate state storage
int parentGroupCount = Math.max(parent.capturingGroupCount, 10);
groups = new int[parentGroupCount * 2];
locals = new int[parent.localCount];
// Put fields into initial states
reset();
}
....
/**
* Returns the input subsequence matched by the previous match.
*
* <p> For a matcher <i>m</i> with input sequence <i>s</i>,
* the expressions <i>m.</i><tt>group()</tt> and
* <i>s.</i><tt>substring(</tt><i>m.</i><tt>start(),</tt> <i>m.</i><tt>end())</tt>
* are equivalent. </p>
*
* <p> Note that some patterns, for example <tt>a*</tt>, match the empty
* string. This method will return the empty string when the pattern
* successfully matches the empty string in the input. </p>
*
* @return The (possibly empty) subsequence matched by the previous match,
* in string form
*
* @throws IllegalStateException
* If no match has yet been attempted,
* or if the previous match operation failed
*/
public String group() {
return group(0);
}
/**
* Returns the input subsequence captured by the given group during the
* previous match operation.
*
* <p> For a matcher <i>m</i>, input sequence <i>s</i>, and group index
* <i>g</i>, the expressions <i>m.</i><tt>group(</tt><i>g</i><tt>)</tt> and
* <i>s.</i><tt>substring(</tt><i>m.</i><tt>start(</tt><i>g</i><tt>),</tt> <i>m.</i><tt>end(</tt><i>g</i><tt>))</tt>
* are equivalent. </p>
*
* <p> <a href="Pattern.html#cg">Capturing groups</a> are indexed from left
* to right, starting at one. Group zero denotes the entire pattern, so
* the expression <tt>m.group(0)</tt> is equivalent to <tt>m.group()</tt>.
* </p>
*
* <p> If the match was successful but the group specified failed to match
* any part of the input sequence, then <tt>null</tt> is returned. Note
* that some groups, for example <tt>(a*)</tt>, match the empty string.
* This method will return the empty string when such a group successfully
* matches the empty string in the input. </p>
*
* @param group
* The index of a capturing group in this matcher's pattern
*
* @return The (possibly empty) subsequence captured by the group
* during the previous match, or <tt>null</tt> if the group
* failed to match part of the input
*
* @throws IllegalStateException
* If no match has yet been attempted,
* or if the previous match operation failed
*
* @throws IndexOutOfBoundsException
* If there is no capturing group in the pattern
* with the given index
*/
public String group(int group) {
if (first < 0)
throw new IllegalStateException("No match found");
if (group < 0 || group > groupCount())
throw new IndexOutOfBoundsException("No group " + group);
if ((groups[group*2] == -1) || (groups[group*2+1] == -1))
return null;
return getSubSequence(groups[group * 2], groups[group * 2 + 1]).toString();
}
/**
* Attempts to find the next subsequence of the input sequence that matches
* the pattern.
*
* <p> This method starts at the beginning of this matcher's region, or, if
* a previous invocation of the method was successful and the matcher has
* not since been reset, at the first character not matched by the previous
* match.
*
* <p> If the match succeeds then more information can be obtained via the
* <tt>start</tt>, <tt>end</tt>, and <tt>group</tt> methods. </p>
*
* @return <tt>true</tt> if, and only if, a subsequence of the input
* sequence matches this matcher's pattern
*/
public boolean find() {
int nextSearchIndex = last;
if (nextSearchIndex == first)
nextSearchIndex++;
// If next search starts before region, start it at region
if (nextSearchIndex < from)
nextSearchIndex = from;
// If next search starts beyond region then it fails
if (nextSearchIndex > to) {
for (int i = 0; i < groups.length; i++)
groups[i] = -1;
return false;
}
return search(nextSearchIndex);
}
/**
* Initiates a search to find a Pattern within the given bounds.
* The groups are filled with default values and the match of the root
* of the state machine is called. The state machine will hold the state
* of the match as it proceeds in this matcher.
*
* Matcher.from is not set here, because it is the "hard" boundary
* of the start of the search which anchors will set to. The from param
* is the "soft" boundary of the start of the search, meaning that the
* regex tries to match at that index but ^ won't match there. Subsequent
* calls to the search methods start at a new "soft" boundary which is
* the end of the previous match.
*/
boolean search(int from) {
this.hitEnd = false;
this.requireEnd = false;
from = from < 0 ? 0 : from;
this.first = from;
this.oldLast = oldLast < 0 ? from : oldLast;
for (int i = 0; i < groups.length; i++)
groups[i] = -1;
acceptMode = NOANCHOR;
boolean result = parentPattern.root.match(this, from, text);
if (!result)
this.first = -1;
this.oldLast = this.last;
return result;
}
...
}
The reason is this: if you call group directly without calling find first, group() calls group(int group), whose body checks if (first < 0). That condition clearly holds, because first is initialized to -1, so an exception is thrown. If you call find first, however, it ends up calling search(nextSearchIndex). Note that nextSearchIndex has been assigned from last, whose value is 0. Execution then moves into the search method:
boolean search(int from) {
this.hitEnd = false;
this.requireEnd = false;
from = from < 0 ? 0 : from;
this.first = from;
this.oldLast = oldLast < 0 ? from : oldLast;
for (int i = 0; i < groups.length; i++)
groups[i] = -1;
acceptMode = NOANCHOR;
boolean result = parentPattern.root.match(this, from, text);
if (!result)
this.first = -1;
this.oldLast = this.last;
return result;
}
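This nextSearchIndex is passed in as from, and from is assigned to first in the method body. So after find has been called and the match succeeds, first is no longer -1 and group no longer throws. If the match fails, search resets first to -1, so group still throws IllegalStateException.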
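The behaviour described above can be reproduced with a small sketch. The class name, the pattern, and the sample URL are assumptions for illustration (the id is borrowed from the comment file name mentioned earlier); they are not taken from the JewelCrawler source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherFindDemo {

    // Hypothetical pattern: captures the numeric id in a ".../subject/<id>/" URL.
    private static final Pattern SUBJECT_ID = Pattern.compile("subject/(\\d+)");

    // Calls find() first, then reads capturing group 1; returns null when nothing matches.
    static String extractId(String url) {
        Matcher m = SUBJECT_ID.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    // Calls group() on a fresh matcher to show that it throws before any match attempt.
    static boolean groupBeforeFindThrows(String url) {
        Matcher m = SUBJECT_ID.matcher(url);
        try {
            m.group();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String url = "https://movie.douban.com/subject/12031715/";
        System.out.println(groupBeforeFindThrows(url)); // prints "true"
        System.out.println(extractId(url));             // prints "12031715"
    }
}
```

Guarding group with the boolean returned by find also covers the failed-match case, where group would otherwise throw.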
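The source code has been uploaded to Baidu Netdisk: http://pan.baidu.com/s/1dFwtvNz
The issues above are rather scattered; they are notes made while running into and solving problems. You will hit other issues in practice, and questions or suggestions are welcome ^^.
Finally, here are some figures for the data crawled so far.
Record table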

It holds 79,032 records, of which 48,471 are pages that have already been crawled.
movie table

So far 2,964 films and TV works have been crawled.
comments table

29,711 records have been crawled.