如何使用Java爬蟲批量爬取圖片

更新時間：2023年04月14日 09:50:37 作者：CrazyDragon_King

這篇文章主要介紹了如何使用Java爬蟲批量爬取圖片,對于爬蟲的入門來說，圖片相對來說是比較容易獲取的，因為大部分圖片都不是敏感數(shù)據(jù)，所以不會遇到什么反爬措施，對于入門爬蟲來說是比較合適的,需要的朋友可以參考下

Java爬取圖片

現(xiàn)在開始學(xué)習(xí)爬蟲，對于爬蟲的入門來說，圖片相對來說是比較容易獲取的，因為大部分圖片都不是敏感數(shù)據(jù)，所以不會遇到什么反爬措施，對于入門爬蟲來說是比較合適的。

使用技術(shù)：Java基礎(chǔ)知識、HttpClient 4.x 、Jsoup

學(xué)習(xí)目標：下載靜態(tài)資源圖片。

爬取思路

對于這種圖片的獲取，其實本質(zhì)上就是就是文件的下載（HttpClient）。但是因為不只是獲取一張圖片，所以還會有一個頁面解析的處理過程（Jsoup）。

Jsoup：解析html頁面，獲取圖片的鏈接。

HttpClient：請求圖片的鏈接，保存圖片到本地。

具體步驟

首先進入首頁分析，主要有以下幾個分類（這里不是全部分類，但是這幾個也足夠了，這只是學(xué)習(xí)技術(shù)而已。），我們的目標就是獲取每個分類下的圖片。

這里來分析一下網(wǎng)站的結(jié)構(gòu)，我這里就簡單一點吧。下面這張圖片是大致的結(jié)構(gòu)，這里選取一個分類標簽進行說明。 一個分類標簽頁含有多個標題頁，然后每個標題頁含有多個圖片頁。（對應(yīng)標題頁的幾十張圖片）

在這里插入圖片描述

對網(wǎng)站的結(jié)構(gòu)有了大致了解之后，就可以著手開始爬取圖片了。這里還有一個需要注意，大概是前輩們做得太過了，導(dǎo)致這個網(wǎng)站已經(jīng)開始有反爬蟲機制了。不過，幸好它還不是很強大，我們還是可以繞過去的。這個網(wǎng)站的反爬蟲機制主要就是：UA、Referer。

這是從知乎上面學(xué)習(xí)到的經(jīng)驗。

具體代碼

導(dǎo)入項目依賴jar包坐標或者直接下載對應(yīng)的jar包，導(dǎo)入項目也可。

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.6</version>
</dependency>
	
<dependency>
   	<groupId>org.jsoup</groupId>
   	<artifactId>jsoup</artifactId>
   	<version>1.11.3</version>
</dependency>

實體類 Picture 和工具類 HeaderUtil

實體類：把屬性封裝成一個對象，這樣調(diào)用方便一點。

package com.picture;

public class Picture {
	private String title;
	private String url;
	
	public Picture(String title, String url) {
		this.title = title;
		this.url = url;
	}
	public String getTitle() {
		return this.title;
	}
	public String getUrl() {
		return this.url;
	}
}

工具類：不斷變換 UA（我也不知道有沒有用，不過我是使用自己的ip，估計用處不大了）

package com.picture;

public class HeaderUtil {
	public static String[] headers = {
			"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
		    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36",
		    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0",
		    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14",
		    "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Win64; x64; Trident/6.0)",
		    "Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11",
		    "Opera/9.25 (Windows NT 5.1; U; en)",
		    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
		    "Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)",
		    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12",
		    "Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9",
		    "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 Chrome/16.0.912.77 Safari/535.7",
		    "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0 "
	};
}

下載類

多線程實在是太快了，再加上我只有一個ip，沒有代理ip可以用（我也不太了解），使用多線程被封ip是很快的。

package com.picture;

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Random;

import org.apache.http.HttpEntity;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.util.EntityUtils;

import com.m3u8.HttpClientUtil;

public class SinglePictureDownloader {
	private String referer;
	private CloseableHttpClient httpClient;
	private Picture picture;
	private String filePath;
	
	
	
	public SinglePictureDownloader(Picture picture, String referer, String filePath) {
		this.httpClient = HttpClientUtil.getHttpClient();
		this.picture = picture;
		this.referer = referer;
		this.filePath = filePath;
	}
	
	public void download() {
		HttpGet get = new HttpGet(picture.getUrl());
		Random rand = new Random();
		//設(shè)置請求頭
		get.setHeader("User-Agent", HeaderUtil.headers[rand.nextInt(HeaderUtil.headers.length)]);
		get.setHeader("referer", referer);
		System.out.println(referer);
		HttpEntity entity = null;
		try (CloseableHttpResponse response = httpClient.execute(get)) {
			int statusCode = response.getStatusLine().getStatusCode();
			if (statusCode == 200) {
				entity = response.getEntity();
				if (entity != null) {
					File picFile = new File(filePath, picture.getTitle());
					try (OutputStream out = new BufferedOutputStream(new FileOutputStream(picFile))) {
						entity.writeTo(out);
						System.out.println("下載完畢：" + picFile.getAbsolutePath());
					}
				}
			}
		} catch (ClientProtocolException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		} finally {
			try {
				//關(guān)閉實體，關(guān)于 httpClient 的關(guān)閉資源，有點不太了解。
				EntityUtils.consume(entity);
			} catch (IOException e) {
				e.printStackTrace();
			}
		}
	}
}

這是獲取 HttpClient 連接的工具類，避免頻繁創(chuàng)建連接的性能消耗。（但是因為我這里是使用單線程來爬取，所以用處就不大了。我就是可以只使用一個HttpClient連接來爬取，這是因為我剛開始是使用多線程來爬取的，但是基本獲取幾張圖片就被禁掉了，所以改成單線程爬蟲。所以這個連接池也就留下來了。）

package com.m3u8;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class HttpClientUtil {
	private static final int TIME_OUT = 10 * 1000;
	private static PoolingHttpClientConnectionManager pcm;   //HttpClient 連接池管理類
	private static RequestConfig requestConfig;
	
	static {
		requestConfig = RequestConfig.custom()
				.setConnectionRequestTimeout(TIME_OUT)
				.setConnectTimeout(TIME_OUT)
				.setSocketTimeout(TIME_OUT).build();
		
		pcm = new PoolingHttpClientConnectionManager();
		pcm.setMaxTotal(50);
		pcm.setDefaultMaxPerRoute(10);  //這里可能用不到這個東西。
	}
	
	public static CloseableHttpClient getHttpClient() {
		return HttpClients.custom()
				.setConnectionManager(pcm)
				.setDefaultRequestConfig(requestConfig)
				.build();
	}
}

最重要的類：解析頁面類 PictureSpider

package com.picture;

import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.apache.http.HttpEntity;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import com.m3u8.HttpClientUtil;

/**
 * 首先從頂部分類標題開始，依次爬取每一個標題（小分頁），每一個標題（大分頁。）
 * */
public class PictureSpider {

	private CloseableHttpClient httpClient;
	private String referer;
	private String rootPath;
	private String filePath;
	
	public PictureSpider() {
		httpClient = HttpClientUtil.getHttpClient();
	}
	
	/**
	 * 開始爬蟲爬?。?
	 * 
	 * 從爬蟲隊列的第一條開始，依次爬取每一條url。
	 * 
	 * 分頁爬?。号?0頁
	 * 每個url屬于一個分類，每個分類一個文件夾
	 * */
	public void start(List<String> urlList) {
		urlList.stream().forEach(url->{
			this.referer = url;
			
			String dirName = url.substring(22, url.length()-1);  //根據(jù)標題名字去創(chuàng)建目錄
			//創(chuàng)建分類目錄
			File path = new File("D:/DragonFile/DBC/mzt/", dirName); //硬編碼路徑，需要用戶自己指定一個
			if (!path.exists()) {
				path.mkdir();
				rootPath = path.toString();
			}
			
			for (int i = 1; i <= 10; i++) {  //分頁獲取圖片數(shù)據(jù)，簡單獲取幾頁就行了
				this.page(url + "page/"+ 1);  
			}
		});
	}
	

	/**
	  * 標題分頁獲取鏈接
	 * */
	public void page(String url) {
		System.out.println("url：" + url);
		String html = this.getHtml(url);   //獲取頁面數(shù)據(jù)
		Map<String, String> picMap = this.extractTitleUrl(html);  //抽取圖片的url
		
		if (picMap == null) {
			return ;
		}
		//獲取標題對應(yīng)的圖片頁面數(shù)據(jù)
		this.getPictureHtml(picMap);
	}
	
	private String getHtml(String url) {
		String html = null;
		HttpGet get = new HttpGet(url);
		get.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3100.0 Safari/537.36");
		get.setHeader("referer", url);
		try (CloseableHttpResponse response = httpClient.execute(get)) {
			int statusCode = response.getStatusLine().getStatusCode();
			if (statusCode == 200) {
				HttpEntity entity = response.getEntity();
				if (entity != null) {
					html = EntityUtils.toString(entity, "UTf-8");   //關(guān)閉實體？
				}
			}
			else {
				System.out.println(statusCode);
			}
		} catch (ClientProtocolException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		} 
		return html;
	}
	
	private Map<String, String> extractTitleUrl(String html) {
		if (html == null) {
			return null;
		}
		Document doc = Jsoup.parse(html, "UTF-8");
		Elements pictures = doc.select("ul#pins > li");
		
		//不知為何，無法直接獲取 a[0]，我不太懂這方面的知識。
		//那我就多處理一步，這里先放下。
		Elements pictureA = pictures.stream()
				.map(pic->pic.getElementsByTag("a").first())
				.collect(Collectors.toCollection(Elements::new));
		
		return pictureA.stream().collect(Collectors.toMap(
				pic->pic.getElementsByTag("img").first().attr("alt"),
				pic->pic.attr("href")));
	}
	
	/**
	 * 進入每一個標題的鏈接，再次分頁獲取圖片的鏈接
	 * */
	private void getPictureHtml(Map<String, String> picMap) {
		//進入標題頁，在標題頁中再次分頁下載。
		picMap.forEach((title, url)->{
			//分頁下載一個系列的圖片，每個系列一個文件夾。
			File dir = new File(rootPath, title.trim());
			if (!dir.exists()) {
				dir.mkdir();
				filePath = dir.toString();  //這個 filePath 是每一個系列圖片的文件夾
			}
			
			for (int i = 1; i <= 60; i++) {
				String html = this.getHtml(url + "/" + i);
				if (html == null) {
					//每個系列的圖片一般沒有那么多，
					//如果返回的頁面數(shù)據(jù)為 null，那就退出這個系列的下載。
					return ; 
				}
				Picture picture = this.extractPictureUrl(html);
				System.out.println("開始下載");
				//多線程實在是太快了（快并不是好事，我改成單線程爬取吧）
				
				SinglePictureDownloader downloader = new SinglePictureDownloader(picture, referer, filePath);
				downloader.download();
				try {
					Thread.sleep(1500);   //不要爬的太快了，這里只是學(xué)習(xí)爬蟲的知識。不要擾亂別人的正常服務(wù)。
					System.out.println("爬取完一張圖片，休息1.5秒。");
				} catch (InterruptedException e) {
					e.printStackTrace();
				}
			}
		});
	}
	
	/**
	 * 獲取每一頁圖片的標題和鏈接
	 * */
	private Picture extractPictureUrl(String html) {
		Document doc = Jsoup.parse(html, "UTF-8");
		//獲取標題作為文件名
		String title = doc.getElementsByTag("h2")
				.first()
				.text();

		//獲取圖片的鏈接（img 標簽的 src 屬性）
		String url = doc.getElementsByAttributeValue("class", "main-image")
				.first()
				.getElementsByTag("img")
				.attr("src");
		
		//獲取圖片的文件擴展名
		title = title + url.substring(url.lastIndexOf("."));		
		return new Picture(title, url);
	}
}

啟動類 BootStrap

這里有一個爬蟲隊列，但是我最終連第一個都沒有爬取完，這是因為我計算失誤了，少算了兩個數(shù)量級。但是，程序的功能是正確的。

package com.picture;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * 爬蟲啟動類
 * */
public class BootStrap {
	public static void main(String[] args) {		
		//反爬措施：UA、refer 簡單繞過就行了。
		//refer   https://www.mzitu.com
		
		//使用數(shù)組做一個爬蟲隊列
		String[] urls = new String[] {
			"https://www.mzitu.com/xinggan/",     
			"https://www.mzitu.com/zipai/"   
		};
		
		// 添加初始隊列，啟動爬蟲
		List<String> urlList = new ArrayList<>(Arrays.asList(urls));
		PictureSpider spider = new PictureSpider();
		spider.start(urlList);
	}
}

爬取結(jié)果

在這里插入圖片描述

注意事項

這里有一個計算失誤，代碼如下：

for (int i = 1; i <= 10; i++) {  //分頁獲取圖片數(shù)據(jù)，簡單獲取幾頁就行了
	this.page(url + "page/"+ 1);  
}

這個 i 的取值過大了，因為我計算的時候失誤了。如果按照這個情況下載的話，總共會下載：4 * 10 * (30-5) * 60 = 64800 張。（每一頁是含有30個標題頁，大概5個是廣告。） 我一開始以為只有幾百張圖片！ 這是一個估計值，但是真實的下載量和這個不會差太多的（沒有數(shù)量級的差距）。所以我下載了一會發(fā)現(xiàn)只下載了第一個隊列里面的圖片。當(dāng)然了，作為一個爬蟲學(xué)習(xí)的程序，它還是很合格的。

這個程序只是用來學(xué)習(xí)的，我設(shè)置每張圖片的下載間隔時間是1.5秒，而且是單線程的程序，所以速度上會顯得很慢。但是那樣也沒有關(guān)系，只要程序的功能正確就行了，應(yīng)該沒有人會真的等到圖片下載完吧。

那估計要好久了：64800*1.5s = 97200s = 27h，這也只是一個粗略的估計值，沒有考慮程序的其他運行時間，不過其他時間可以基本忽略了。

總結(jié)

雖然說Java比較麻煩（相對于python，確實是不夠簡潔了）。但是我們學(xué)習(xí)的是技術(shù)呀，這點小困難還是能克服的，如果你只是想要下載圖片，那還是去看python的爬蟲吧。Java的HttpClient也就夠?qū)W一段時間了，發(fā)現(xiàn)這個東西真的是挺好用的，但是我也只是學(xué)習(xí)了一點點。這個類庫，不僅可以用來寫爬蟲（做爬蟲只是它功能比較強大而已），而且還能做很多有趣的事情。爬蟲，還是很有意思的，有很多有趣的地方，可以讓我們探索。但是，也要注意，本意是學(xué)習(xí)技術(shù)，不能搞破壞。

到此這篇關(guān)于如何使用Java爬蟲批量爬取圖片的文章就介紹到這了,更多相關(guān)Java爬蟲批量爬取圖片內(nèi)容請搜索腳本之家以前的文章或繼續(xù)瀏覽下面的相關(guān)文章希望大家以后多多支持腳本之家！

您可能感興趣的文章:

欧美bbbwbbbw肥妇,免费乱码人妻系列日韩,一级黄片

軟件下載

源碼下載

軟件編程

網(wǎng)絡(luò)編程

在線工具

數(shù)據(jù)庫

CMS

常用工具

如何使用Java爬蟲批量爬取圖片

目錄

Java爬取圖片

爬取思路

具體步驟

具體代碼

實體類 Picture 和工具類 HeaderUtil

下載類

最重要的類：解析頁面類 PictureSpider

啟動類 BootStrap

爬取結(jié)果

注意事項

總結(jié)

相關(guān)文章

最新評論

大家感興趣的內(nèi)容

最近更新的內(nèi)容

常用在線小工具

如何使用Java爬蟲批量爬取圖片

目錄

Java爬取圖片

爬取思路

具體步驟

具體代碼

實體類 Picture 和 工具類 HeaderUtil

下載類

最重要的類：解析頁面類 PictureSpider

啟動類 BootStrap

爬取結(jié)果

注意事項

總結(jié)

相關(guān)文章

最新評論

大家感興趣的內(nèi)容

最近更新的內(nèi)容

常用在線小工具

實體類 Picture 和工具類 HeaderUtil