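Example code for building a website crawler with Spring Boot and WebMagic
A while ago a company project needed to scrape various kinds of data. My Python is not strong, so I looked into Java crawler options instead; this post summarizes what I ended up with.
Development environment:
Spring Boot 2.2.6, JDK 1.8.
1. Import the dependencies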
<!-- WebMagic core -->
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.3</version>
    <!-- Optional: exclude WebMagic's bundled logging binding (it prints a lot) -->
    <!-- <exclusions> -->
    <!--     <exclusion> -->
    <!--         <groupId>org.slf4j</groupId> -->
    <!--         <artifactId>slf4j-log4j12</artifactId> -->
    <!--     </exclusion> -->
    <!-- </exclusions> -->
</dependency>
<!-- WebMagic extension -->
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.3</version>
</dependency>
<!-- Guava, needed for WebMagic's Bloom filter support -->
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>16.0</version>
</dependency>
Without further ado, here is the code.
Basic example
The code below uses a list-style page as its example.
package com.crawler.project.proTask;

import com.alibaba.fastjson.JSONObject;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.scheduler.BloomFilterDuplicateRemover;
import us.codecraft.webmagic.scheduler.QueueScheduler;
import us.codecraft.webmagic.selector.Selectable;
import java.util.List;

// @Scheduled only fires on Spring-managed beans, so the class is registered as a component;
// scheduling also has to be switched on with @EnableScheduling on a configuration class.
@Component
public class TaskProcessor implements PageProcessor {

    /*
     * The crawler's business logic.
     */
    @Override
    public void process(Page page) {
        // 1. The crawler hands us a page; parse the list on it
        List<Selectable> list = page.getHtml().css("css selector").nodes();
        if (!list.isEmpty()) {
            // List page: extract each item's link and add it to the queue of pages to fetch
            for (Selectable selectable : list) {
                page.addTargetRequest(selectable.links().toString());
            }
            // Also queue the URL of the next page
            page.addTargetRequest("URL of the next page");
        } else {
            // Detail page of a single list item:
            // extract and process the required data in a custom method
            handle(page);
        }
    }

    private void handle(Page page) {
        // For example, say the processed data is a JSONObject
        JSONObject tmp = new JSONObject();
        // Hand tmp to the custom TaskPipeline class. If no custom Pipeline is registered
        // on the Spider, the framework prints tmp to the console by default.
        page.putField("obj", tmp);
    }

    /*
     * Parameters for the crawl.
     */
    private Site site = Site.me()
            .setCharset("UTF-8")
            .setTimeOut(60 * 1000)
            .setRetrySleepTime(60 * 1000)
            .setCycleRetryTimes(5);

    @Override
    public Site getSite() {
        return site;
    }

    /*
     * Scheduled task that runs the crawler.
     */
    @Scheduled(initialDelay = 1 * 1000, fixedDelay = 2 * 1000)
    public void process() {
        System.out.println("Starting crawl task");
        Spider.create(new TaskProcessor()) // the class name here must match the current class
                .addUrl("URL of the start page")
                .addPipeline(new TaskPipeline()) // custom data-processing class (receives what handle() puts)
                .setScheduler(new QueueScheduler().setDuplicateRemover(new BloomFilterDuplicateRemover(100000)))
                .thread(3) // thread count (not too many; ideally equal to the number of items on a list page)
                .run();
    }
}
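The scheduler above hands URL deduplication to BloomFilterDuplicateRemover(100000), which, going by the dependency list, relies on Guava. As a rough illustration of the idea only, and not WebMagic's or Guava's actual implementation, a Bloom filter can be sketched with a BitSet. The class name TinyBloomFilter, its sizing, and its hash mixing are all hypothetical choices made for this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical teaching sketch of a Bloom filter; not the WebMagic/Guava implementation.
public class TinyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    public TinyBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derives the i-th bit position from two base hashes (double hashing).
    private int position(String url, int i) {
        int h1 = 0;
        int h2 = (int) 2166136261L; // FNV-1a offset basis
        for (byte b : url.getBytes(StandardCharsets.UTF_8)) {
            h1 = 31 * h1 + b;
            h2 = (h2 ^ b) * 16777619;
        }
        return Math.floorMod(h1 + i * h2, size);
    }

    /** Returns true if the URL was possibly seen before; records it either way. */
    public boolean isDuplicate(String url) {
        boolean seen = true;
        for (int i = 0; i < hashCount; i++) {
            int pos = position(url, i);
            if (!bits.get(pos)) {
                seen = false; // at least one bit was unset, so the URL is definitely new
                bits.set(pos);
            }
        }
        return seen;
    }
}
```

The trade-off is the usual one for Bloom filters: memory stays fixed no matter how many URLs are recorded, but a new URL can occasionally be reported as a duplicate and skipped. A URL that was recorded is never reported as new.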
package com.crawler.project.proTask;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

public class TaskPipeline implements Pipeline {

    @Override
    public void process(ResultItems resultItems, Task task) {
        // "obj" is the key set in TaskProcessor.handle(); list pages put no fields, so it may be absent
        Object obj = resultItems.get("obj");
        if (obj != null) {
            JSONObject jsonObject = JSON.parseObject(obj.toString());
            // Custom business logic on the JSONObject goes here.
        }
    }
}
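Taken together, the processor and pipeline above follow a simple loop: take a URL from the queue, treat the page as a list page if it yields item links (queueing those links), otherwise treat it as a detail page and pass the extracted data on. The following is a framework-free sketch of that loop over an in-memory fake site. MiniCrawl, the site map, and the URLs are made up for illustration and are not part of WebMagic.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical, framework-free model of the list-page / detail-page crawl loop.
public class MiniCrawl {
    /**
     * Crawls a fake site given as url -> outgoing links.
     * A page with links plays the role of a list page; a page without links is a detail page.
     * Returns the detail-page URLs in the order they would reach the pipeline.
     */
    public static List<String> crawl(Map<String, List<String>> site, String startUrl) {
        Queue<String> queue = new ArrayDeque<>();  // stands in for the scheduler's queue
        Set<String> seen = new HashSet<>();        // stands in for the duplicate remover
        List<String> results = new ArrayList<>();  // stands in for the pipeline

        queue.add(startUrl);
        seen.add(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            List<String> links = site.getOrDefault(url, Collections.<String>emptyList());
            if (!links.isEmpty()) {
                // List page: queue every link that has not been seen yet.
                for (String link : links) {
                    if (seen.add(link)) {
                        queue.add(link);
                    }
                }
            } else {
                // Detail page: pass the extracted data on.
                results.add(url);
            }
        }
        return results;
    }
}
```

In this model the "next page" link is just another entry in a list page's links, which mirrors how the processor adds it with addTargetRequest alongside the item links.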
Special case 1
Downloading images or files from a link
For example, the detail page described above contains an iframe.
1. First get the iframe's src
// Get the iframe's src. Check whether it is an absolute or a relative path;
// a relative path has to be joined with the site's base URL.
// (html here is the page's Html object, i.e. page.getHtml().)
String src = html.css("css selector", "src").toString();
// Parse the iframe document with jsoup (timeout in milliseconds)
Document document = Jsoup.parse(new URL(src), 1000);
// Select the element we need
Element ele = document.select("css selector").last();
// Read the link of the file to download
String downUrl = ele.attr("href");
// Download the file from the link; returns the saved file name
String fileName = downloadFile(downUrl);

// Download a file from a URL
public String downloadFile(String fileUrl) throws FileNotFoundException {
    try {
        URL httpUrl = new URL(fileUrl);
        String fileName = UUID.randomUUID().toString() + ".mp3";
        File file = new File(this.STATIC_FILEPATH + fileName);
        System.out.println("============ downloadFile called ===============");
        FileUtils.copyURLToFile(httpUrl, file);
        return fileName;
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }
}
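The snippet above notes that the iframe's src may be absolute or relative. One way to handle both cases without manual string concatenation is java.net.URI.resolve, which returns an absolute input unchanged and resolves a relative one against the page URL. A small sketch follows; the class name SrcResolver and the example URLs are hypothetical.

```java
import java.net.URI;

// Hypothetical helper for turning an iframe src into an absolute URL.
public class SrcResolver {
    /** Resolves a src value (absolute or relative) against the URL of the page that contains it. */
    public static String toAbsolute(String pageUrl, String src) {
        return URI.create(pageUrl).resolve(src.trim()).toString();
    }

    public static void main(String[] args) {
        String page = "https://example.com/detail/1.html";
        System.out.println(toAbsolute(page, "/player/frame.html"));
        System.out.println(toAbsolute(page, "frame.html"));
        System.out.println(toAbsolute(page, "https://cdn.example.org/a.html"));
    }
}
```

Inside the processor, the page URL could be taken from page.getUrl().toString() and the result passed to new URL(...) before handing it to jsoup.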
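Special case 2
Some HTTPS sites cannot be fetched with WebMagic's default downloader. In that case the downloader can be adapted to the SSL/TLS protocol the site uses.
Create a package in the project to hold the customized downloader classes, then pass an instance of the custom downloader to the Spider with setDownloader(...).
(Note: copied from HttpClientDownloader in the WebMagic framework and modified from there.)
/*
This class relies on a custom generator (HttpClientGenerator).
*/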
package com.crawler.project.spider_download;
import org.apache.commons.io.IOUtils;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.util.EntityUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.downloader.AbstractDownloader;
import us.codecraft.webmagic.downloader.HttpClientRequestContext;
import us.codecraft.webmagic.downloader.HttpUriRequestConverter;
import us.codecraft.webmagic.proxy.Proxy;
import us.codecraft.webmagic.proxy.ProxyProvider;
import us.codecraft.webmagic.selector.PlainText;
import us.codecraft.webmagic.utils.CharsetUtils;
import us.codecraft.webmagic.utils.HttpClientUtils;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;
/**
* The http downloader based on HttpClient.
*
* @author code4crafter@gmail.com <br>
* @since 0.1.0
*/
public class HttpClientDownloader extends AbstractDownloader {
private Logger logger = LoggerFactory.getLogger(getClass());
private final Map<String, CloseableHttpClient> httpClients = new HashMap<String, CloseableHttpClient>();
// Custom generator: this must be the custom HttpClientGenerator class from this package, not the HttpClientGenerator shipped with WebMagic.
private HttpClientGenerator httpClientGenerator = new HttpClientGenerator();
private HttpUriRequestConverter httpUriRequestConverter = new HttpUriRequestConverter();
private ProxyProvider proxyProvider;
private boolean responseHeader = true;
public void setHttpUriRequestConverter(HttpUriRequestConverter httpUriRequestConverter) {
this.httpUriRequestConverter = httpUriRequestConverter;
}
public void setProxyProvider(ProxyProvider proxyProvider) {
this.proxyProvider = proxyProvider;
}
private CloseableHttpClient getHttpClient(Site site) {
if (site == null) {
return httpClientGenerator.getClient(null);
}
String domain = site.getDomain();
CloseableHttpClient httpClient = httpClients.get(domain);
if (httpClient == null) {
synchronized (this) {
httpClient = httpClients.get(domain);
if (httpClient == null) {
httpClient = httpClientGenerator.getClient(site);
httpClients.put(domain, httpClient);
}
}
}
return httpClient;
}
@Override
public Page download(Request request, Task task) {
if (task == null || task.getSite() == null) {
throw new NullPointerException("task or site can not be null");
}
CloseableHttpResponse httpResponse = null;
CloseableHttpClient httpClient = getHttpClient(task.getSite());
Proxy proxy = proxyProvider != null ? proxyProvider.getProxy(task) : null;
HttpClientRequestContext requestContext = httpUriRequestConverter.convert(request, task.getSite(), proxy);
Page page = Page.fail();
try {
httpResponse = httpClient.execute(requestContext.getHttpUriRequest(), requestContext.getHttpClientContext());
page = handleResponse(request, request.getCharset() != null ? request.getCharset() : task.getSite().getCharset(), httpResponse, task);
onSuccess(request);
logger.info("downloading page success {}", request.getUrl());
return page;
} catch (IOException e) {
logger.warn("download page {} error", request.getUrl(), e);
onError(request);
return page;
} finally {
if (httpResponse != null) {
//ensure the connection is released back to pool
EntityUtils.consumeQuietly(httpResponse.getEntity());
}
if (proxyProvider != null && proxy != null) {
proxyProvider.returnProxy(proxy, page, task);
}
}
}
@Override
public void setThread(int thread) {
httpClientGenerator.setPoolSize(thread);
}
protected Page handleResponse(Request request, String charset, HttpResponse httpResponse, Task task) throws IOException {
byte[] bytes = IOUtils.toByteArray(httpResponse.getEntity().getContent());
String contentType = httpResponse.getEntity().getContentType() == null ? "" : httpResponse.getEntity().getContentType().getValue();
Page page = new Page();
page.setBytes(bytes);
if (!request.isBinaryContent()){
if (charset == null) {
charset = getHtmlCharset(contentType, bytes);
}
page.setCharset(charset);
page.setRawText(new String(bytes, charset));
}
page.setUrl(new PlainText(request.getUrl()));
page.setRequest(request);
page.setStatusCode(httpResponse.getStatusLine().getStatusCode());
page.setDownloadSuccess(true);
if (responseHeader) {
page.setHeaders(HttpClientUtils.convertHeaders(httpResponse.getAllHeaders()));
}
return page;
}
private String getHtmlCharset(String contentType, byte[] contentBytes) throws IOException {
String charset = CharsetUtils.detectCharset(contentType, contentBytes);
if (charset == null) {
charset = Charset.defaultCharset().name();
logger.warn("Charset autodetect failed, use {} as charset. Please specify charset in Site.setCharset()", Charset.defaultCharset());
}
return charset;
}
}
Then adjust the SSL-related parameters in the custom HttpClientGenerator class.
(Note: copied from HttpClientGenerator in the WebMagic framework and modified from there.)
/*
Custom HttpClientGenerator
*/
package com.crawler.project.spider_download;
import org.apache.http.HttpException;
import org.apache.http.HttpRequest;
import org.apache.http.HttpRequestInterceptor;
import org.apache.http.client.CookieStore;
import org.apache.http.config.Registry;
import org.apache.http.config.RegistryBuilder;
import org.apache.http.config.SocketConfig;
import org.apache.http.conn.socket.ConnectionSocketFactory;
import org.apache.http.conn.socket.PlainConnectionSocketFactory;
import org.apache.http.conn.ssl.DefaultHostnameVerifier;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.impl.client.*;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.impl.cookie.BasicClientCookie;
import org.apache.http.protocol.HttpContext;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.downloader.CustomRedirectStrategy;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;
import java.io.IOException;
import java.security.KeyManagementException;
import java.security.NoSuchAlgorithmException;
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;
import java.util.Map;
/**
* @author code4crafter@gmail.com <br>
* @since 0.4.0
*/
public class HttpClientGenerator {
private transient Logger logger = LoggerFactory.getLogger(getClass());
private PoolingHttpClientConnectionManager connectionManager;
public HttpClientGenerator() {
Registry<ConnectionSocketFactory> reg = RegistryBuilder.<ConnectionSocketFactory>create()
.register("http", PlainConnectionSocketFactory.INSTANCE)
.register("https", buildSSLConnectionSocketFactory())
.build();
connectionManager = new PoolingHttpClientConnectionManager(reg);
connectionManager.setDefaultMaxPerRoute(100);
}
/*
The SSL-related parameters are set in this method.
*/
private SSLConnectionSocketFactory buildSSLConnectionSocketFactory() {
try {
return new SSLConnectionSocketFactory(createIgnoreVerifySSL(), new String[]{"SSLv3", "TLSv1", "TLSv1.1", "TLSv1.2"},
null,
new DefaultHostnameVerifier()); // prefer the SSLContext that bypasses certificate validation
} catch (KeyManagementException e) {
logger.error("ssl connection fail", e);
} catch (NoSuchAlgorithmException e) {
logger.error("ssl connection fail", e);
}
return SSLConnectionSocketFactory.getSocketFactory();
}
private SSLContext createIgnoreVerifySSL() throws NoSuchAlgorithmException, KeyManagementException {
// Implement X509TrustManager with empty methods to bypass certificate validation; the method bodies need no changes
X509TrustManager trustManager = new X509TrustManager() {
@Override
public void checkClientTrusted(X509Certificate[] chain, String authType) throws CertificateException {
}
@Override
public void checkServerTrusted(X509Certificate[] chain, String authType) throws CertificateException {
}
@Override
public X509Certificate[] getAcceptedIssuers() {
return null;
}
};
/*
The framework's default is:
SSLContext sc = SSLContext.getInstance("SSLv3");
Change it to whichever SSL/TLS protocol you need.
*/
SSLContext sc = SSLContext.getInstance("TLS");
sc.init(null, new TrustManager[] { trustManager }, null);
return sc;
}
public HttpClientGenerator setPoolSize(int poolSize) {
connectionManager.setMaxTotal(poolSize);
return this;
}
public CloseableHttpClient getClient(Site site) {
return generateClient(site);
}
private CloseableHttpClient generateClient(Site site) {
HttpClientBuilder httpClientBuilder = HttpClients.custom();
httpClientBuilder.setConnectionManager(connectionManager);
if (site.getUserAgent() != null) {
httpClientBuilder.setUserAgent(site.getUserAgent());
} else {
httpClientBuilder.setUserAgent("");
}
if (site.isUseGzip()) {
httpClientBuilder.addInterceptorFirst(new HttpRequestInterceptor() {
public void process(
final HttpRequest request,
final HttpContext context) throws HttpException, IOException {
if (!request.containsHeader("Accept-Encoding")) {
request.addHeader("Accept-Encoding", "gzip");
}
}
});
}
// Fixes the POST/redirect/POST 302 redirect problem
httpClientBuilder.setRedirectStrategy(new CustomRedirectStrategy());
SocketConfig.Builder socketConfigBuilder = SocketConfig.custom();
socketConfigBuilder.setSoKeepAlive(true).setTcpNoDelay(true);
socketConfigBuilder.setSoTimeout(site.getTimeOut());
SocketConfig socketConfig = socketConfigBuilder.build();
httpClientBuilder.setDefaultSocketConfig(socketConfig);
connectionManager.setDefaultSocketConfig(socketConfig);
httpClientBuilder.setRetryHandler(new DefaultHttpRequestRetryHandler(site.getRetryTimes(), true));
generateCookie(httpClientBuilder, site);
return httpClientBuilder.build();
}
private void generateCookie(HttpClientBuilder httpClientBuilder, Site site) {
if (site.isDisableCookieManagement()) {
httpClientBuilder.disableCookieManagement();
return;
}
CookieStore cookieStore = new BasicCookieStore();
for (Map.Entry<String, String> cookieEntry : site.getCookies().entrySet()) {
BasicClientCookie cookie = new BasicClientCookie(cookieEntry.getKey(), cookieEntry.getValue());
cookie.setDomain(site.getDomain());
cookieStore.addCookie(cookie);
}
for (Map.Entry<String, Map<String, String>> domainEntry : site.getAllCookies().entrySet()) {
for (Map.Entry<String, String> cookieEntry : domainEntry.getValue().entrySet()) {
BasicClientCookie cookie = new BasicClientCookie(cookieEntry.getKey(), cookieEntry.getValue());
cookie.setDomain(domainEntry.getKey());
cookieStore.addCookie(cookie);
}
}
httpClientBuilder.setDefaultCookieStore(cookieStore);
}
}
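Which protocol names make sense in the array passed to SSLConnectionSocketFactory depends on what the running JVM supports; asking a socket to enable a protocol name the JVM does not know results in an IllegalArgumentException. As a hedged aid for choosing the list, the JDK can be queried directly. The class TlsProbe below is a hypothetical helper written for this sketch, not part of WebMagic or HttpClient.

```java
import javax.net.ssl.SSLContext;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: filters a wish list of protocol names down to those this JVM supports.
public class TlsProbe {
    /** Keeps only those of the wanted protocol names that the JVM's default SSLContext supports. */
    public static String[] supportedOf(String... wanted) {
        List<String> supported;
        try {
            supported = Arrays.asList(
                    SSLContext.getDefault().getSupportedSSLParameters().getProtocols());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("No default SSLContext available", e);
        }
        List<String> result = new ArrayList<>();
        for (String protocol : wanted) {
            if (supported.contains(protocol)) {
                result.add(protocol);
            }
        }
        return result.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // The output depends on the JDK version and its security configuration.
        System.out.println(Arrays.toString(
                supportedOf("SSLv3", "TLSv1", "TLSv1.1", "TLSv1.2", "TLSv1.3")));
    }
}
```

Note that "supported" is not the same as "enabled": a protocol can be supported yet disabled by the JDK's security settings, in which case listing it is harmless but the handshake will not use it.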
That wraps up this summary of building a crawler with the WebMagic framework, including the use of jsoup.