Building a Website Aggregation Tool in Java

Updated: 2022-01-27 15:14:43  Author: 炒雞辣雞123

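The internet hosts an enormous number of websites, and most of them serve some specific purpose. Search engines index part of the web, but as large general-purpose systems they are poor at serving certain niche needs. Against that background, I tried writing a program that aggregates website data. This article shares how it works and the code that implements it.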

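How It Works

The websites on the internet can be viewed as one huge graph in which different sites sit in different connected components. Starting from one site and traversing its component breadth-first, following the outbound links of each site in turn, reaches every domain in that component.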
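
The traversal described above can be sketched on a small in-memory graph. This is only an illustration, not the crawler itself: the class `LinkGraphBfs`, the method `bfs`, and the example domains are hypothetical, and the adjacency map stands in for "the domains linked from a site's home page".

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch: breadth-first traversal of one connected component.
class LinkGraphBfs {

    // links maps a domain to the domains its home page links to (assumed input).
    static List<String> bfs(String start, Map<String, List<String>> links) {
        List<String> order = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Queue<String> queue = new ArrayDeque<>();
        seen.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String domain = queue.poll();
            order.add(domain);
            for (String next : links.getOrDefault(domain, List.of())) {
                // Set.add returns false for domains that were already discovered.
                if (seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
                "a.example", List.of("b.example", "c.example"),
                "b.example", List.of("a.example", "d.example"),
                "c.example", List.of("d.example"));
        // Prints [a.example, b.example, c.example, d.example]
        System.out.println(bfs("a.example", links));
    }
}
```

The `seen` set is what keeps the traversal finite: sites link back to each other, so without it the queue would never drain.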

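Next, we segment the content returned by each site into words and remove meaningless words and punctuation, which gives a keyword ranking for the site's home page. Words whose frequency falls in the range (10, 50) can be taken as keywords. These keywords are treated as the site's topics, and the site's information is stored in a markdown file named after each keyword for later use.

Likewise, we segment the title of the returned page. The title is the developer's own condensed description of what the site does, so it matters too. The resulting keywords are also treated as site topics, and the site's information is stored in the markdown file named after each of them.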
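
Filing a site under its keywords could look roughly like the sketch below. The article does not show this step, so the class `KeywordFiles`, the method `appendSite`, the output directory, and the bullet-line format are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Hypothetical sketch: one markdown file per keyword, one bullet per site.
class KeywordFiles {

    // Keywords are assumed to contain no path separators; the segmentation
    // step already discards terms that contain punctuation.
    static void appendSite(Path dir, List<String> keywords, String domain, String title)
            throws IOException {
        Files.createDirectories(dir);
        // Assumed line format: a markdown bullet linking to the site.
        String line = "- [" + title + "](http://" + domain + "/)" + System.lineSeparator();
        for (String keyword : keywords) {
            Path file = dir.resolve(keyword + ".md");
            // CREATE + APPEND: create the file on first use, then keep adding sites.
            Files.write(file, line.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }
}
```

Because every keyword gets its own file, a site with several keywords appears in several files, which is what makes the later manual filtering by topic possible.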

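Finally, we only need to filter these files by hand, or load the data into Elasticsearch and build a keyword search engine on top of it, so that the sites are at hand whenever we need them.

Note, however, that until the traversal of the connected component has converged, the collected data is still sparse, and some categories often contain only one or two sites.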

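Implementation

Page Download

For downloading pages I use httpClient. I first considered Playwright, but the performance gap between the two is too large and Playwright is too slow for this job. The price is some accuracy: httpClient cannot get data from sites built with Web 2.0 techniques, where the content is rendered in the browser. Strictly speaking, then, what I implemented is only the page download for a category search engine over Web 1.0 sites.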

// Downloads one page. SendReq wraps httpClient; defaultHeaders() supplies the request headers.
public SendReq.ResBody doRequest(String url, String method, Map<String, Object> params) {
    return SendReq.sendReq(url, method, params, defaultHeaders());
}

Here SendReq is a class I wrapped around httpClient. It does nothing more than download a page, so you can replace it with RestTemplate or any other way of issuing HTTP(S) requests.
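
As one possible replacement, here is a minimal sketch based on the JDK's built-in `java.net.http.HttpClient` (Java 11 or later). It is not the author's SendReq: the class `PageFetcher`, the methods `buildRequest` and `fetch`, the timeouts, and the User-Agent value are assumptions chosen for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical sketch of a page downloader using the JDK HTTP client.
class PageFetcher {

    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))          // assumed timeout
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();

    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))            // assumed timeout
                .header("User-Agent", "Mozilla/5.0 (compatible; demo-crawler)")
                .GET()
                .build();
    }

    // Returns the response body for a 200 response, or "" on any failure.
    static String fetch(String url) {
        try {
            HttpResponse<String> response =
                    CLIENT.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200 ? response.body() : "";
        } catch (IOException | IllegalArgumentException e) {
            // Network errors and malformed URLs are treated as "no content".
            return "";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "";
        }
    }
}
```

`BodyHandlers.ofString()` decodes the body with the charset named in the Content-Type header and falls back to UTF-8, so pages in another encoding that do not declare it in that header would need `BodyHandlers.ofByteArray()` and manual decoding.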

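Extracting All Links from the Response

Since this is a connected-component traversal, the neighbors of a site are defined as the sites whose domains appear in the outbound links on its home page. We therefore need to extract the links, which a regular expression can do directly.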

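import java.util.*;
import java.util.regex.*;

// Matches absolute http(s) URLs; compiled once and reused.
private static final Pattern URL_PATTERN =
        Pattern.compile("https?://[A-Za-z0-9_\\-+.:?&@=/%#,;]*");

// Returns the distinct domains of all absolute links found in the page.
public static List<String> getUrls(String htmlText) {
    Matcher matcher = URL_PATTERN.matcher(htmlText);
    Set<String> ans = new HashSet<>();
    while (matcher.find()) {
        ans.add(DomainUtils.getDomainWithCompleteDomain(matcher.group()));
    }
    return new ArrayList<>(ans);
}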
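
The helper `DomainUtils.getDomainWithCompleteDomain` is not shown in the article. A rough stand-in, assuming it returns the full host name of a URL, can be built on `java.net.URI`. The class `HostExtractor` and the method `hostOf` are hypothetical names.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical stand-in for the article's DomainUtils helper.
class HostExtractor {

    // Returns the lower-cased host of a URL, or "" when it cannot be parsed.
    static String hostOf(String url) {
        try {
            String host = new URI(url.trim()).getHost();
            return host == null ? "" : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return "";
        }
    }
}
```

For example, `hostOf("https://www.Example.com:8080/path?x=1#top")` returns `www.example.com`. Returning an empty string for unparsable input fits the crawler's later check that skips blank domains.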

Extracting the Title from the Response

The title is the developer's condensed description of what the site does, so it is well worth parsing out for further processing.

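// Matches the first <title> element, across line breaks and in any letter case.
private static final Pattern TITLE_PATTERN =
        Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

// Returns the page title, or "" when the page has none.
public static String getTitle(String htmlText) {
    Matcher matcher = TITLE_PATTERN.matcher(htmlText);
    return matcher.find() ? matcher.group(1).trim() : "";
}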

Stripping Tags from the Response

Later steps segment the returned page into words, so the tags and embedded code in the page have to be removed first.

// Reduces a raw HTML page to plain text for word segmentation.
public static String getContent(String html) {
    try {
        html = StringEscapeUtils.unescapeHtml4(html); // decode entities such as &amp;
        html = delHTMLTag(html);                      // drop scripts, styles and tags
        return htmlTextFormat(html);                  // further text cleanup (helper not shown)
    } catch (Exception e) {
        e.printStackTrace();
        return "";                                    // fall back to empty content on failure
    }
}

// Patterns are compiled once; [\s\S] lets a block span several lines.
private static final Pattern SCRIPT_TAG =
        Pattern.compile("<script[^>]*?>[\\s\\S]*?</script>", Pattern.CASE_INSENSITIVE);
private static final Pattern STYLE_TAG =
        Pattern.compile("<style[^>]*?>[\\s\\S]*?</style>", Pattern.CASE_INSENSITIVE);
private static final Pattern HTML_TAG = Pattern.compile("<[^>]+>");

public static String delHTMLTag(String htmlStr) {
    htmlStr = SCRIPT_TAG.matcher(htmlStr).replaceAll(""); // remove <script> blocks
    htmlStr = STYLE_TAG.matcher(htmlStr).replaceAll("");  // remove <style> blocks
    htmlStr = HTML_TAG.matcher(htmlStr).replaceAll("");   // remove all remaining tags
    return htmlStr.trim();
}
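
The `htmlTextFormat` helper called from `getContent` is not shown in the article. One plausible reading is that it tidies up the whitespace left behind once the tags are gone. The sketch below does only that; the class `TextFormat` and the method `collapseWhitespace` are hypothetical names, and the author's helper may do more.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of a whitespace clean-up step after tag removal.
class TextFormat {

    // \s covers ASCII whitespace; U+00A0 (no-break space) and U+3000
    // (ideographic space) are listed explicitly because \s does not match them.
    private static final Pattern WHITESPACE = Pattern.compile("[\\s\\u00A0\\u3000]+");

    // Collapses every run of whitespace into one space and trims the ends.
    static String collapseWhitespace(String text) {
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }
}
```

Stripped HTML tends to contain long runs of newlines and indentation, so collapsing them keeps the input to the segmenter compact.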

Word Segmentation

For segmentation, HanLP, which I mentioned in my earlier introductory NLP article, is all we need.

import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.Segment;
import com.hankcs.hanlp.seg.common.Term;

// Terms containing punctuation, symbols, whitespace or digits are skipped.
// The Unicode classes cover both ASCII and full-width punctuation.
private static final Pattern ignoreWords = Pattern.compile("[\\p{P}\\p{S}\\p{Z}\\s0-9]+");

// Segments the text with HanLP and counts how often each term occurs.
public static Set<Word> separateWordAndReturnUnit(String text) {
    Segment segment = HanLP.newSegment().enableOffset(true);
    // Keyed by the term itself, because different words can share a hash code.
    Map<String, Word> counts = new HashMap<>();
    for (Term term : segment.seg(text)) {
        if (ignoreWords.matcher(term.word).find() || term.word.length() <= 1) {
            continue;
        }
        String key = term.word.trim();
        Word unit = counts.get(key);
        if (unit == null) {
            unit = new Word();
            unit.setWord(key);
            unit.setCount(0);
            counts.put(key, unit);
        }
        unit.setCount(unit.getCount() + 1);
    }
    return new HashSet<>(counts.values());
}
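
The counting itself can be tried without HanLP. The sketch below is an illustration, with a hypothetical class `TermCounter` and a plain list of terms standing in for the segmenter's output. It keys the map by the term string, since distinct strings such as "Aa" and "BB" share the same `hashCode()`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: term-frequency counting on already segmented terms.
class TermCounter {

    static Map<String, Integer> count(List<String> terms) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : terms) {
            String word = term.trim();
            // Single-character terms carry little meaning and are skipped.
            if (word.length() > 1) {
                // merge() inserts 1 for a new word, otherwise adds 1 to the old count.
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

For the input `["Aa", "BB", "Aa", "x"]` this yields `Aa` with a count of 2 and `BB` with a count of 1, and the two words stay separate even though their hash codes are equal.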

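Taking the Top Ten Segmentation Results

To keep overly frequent words from dominating, we take only the ten most frequent words among those with a frequency below 50.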

// Returns up to cnt entries ("word count") in descending frequency,
// skipping words that occur 50 times or more.
public static List<String> print2List(List<Word> tmp, int cnt) {
    // Word.compareTo puts higher counts first, so poll() yields the most frequent word.
    PriorityQueue<Word> words = new PriorityQueue<>(tmp);
    List<String> ans = new ArrayList<>();
    while (!words.isEmpty() && ans.size() < cnt) {
        Word word = words.poll();
        if (word.getCount() < 50) {
            ans.add(word.getWord() + " " + word.getCount());
        }
    }
    return ans;
}

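The approach is to put the words into a priority queue and take them out one at a time. Java's PriorityQueue is a binary heap ordered by compareTo; because Word's compareTo puts higher counts first, the queue behaves as a max-heap on the count and the words come out in descending frequency. If you want to know more about max-heaps, see my earlier article.
Note that the elements of a priority queue must be comparable, so Word has to be comparable as well. The simplified code is: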

public class Word implements Comparable<Word> {
    private String word;
    private Integer count = 0;

    ... ...

    // Higher counts sort first, so a PriorityQueue<Word> polls the most frequent word first.
    // Integer.compare also returns 0 for equal counts, as the compareTo contract requires.
    @Override
    public int compareTo(Word other) {
        return Integer.compare(other.count, this.count);
    }
}
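
An alternative to making the element class comparable is to hand the ordering to the queue as a `Comparator`. The sketch below applies the same "most frequent first, below a cap" selection to a plain map of counts. The class `TopWords`, the method `top`, and its parameters are hypothetical, and the tie-break by word is an added assumption so that the output is deterministic.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch: top-N selection with a comparator-driven priority queue.
class TopWords {

    static List<String> top(Map<String, Integer> counts, int limit, int maxCount) {
        // Higher count first; equal counts are ordered alphabetically (assumed tie-break).
        Comparator<Map.Entry<String, Integer>> byCountDesc = (a, b) -> {
            int c = Integer.compare(b.getValue(), a.getValue());
            return c != 0 ? c : a.getKey().compareTo(b.getKey());
        };
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(byCountDesc);
        heap.addAll(counts.entrySet());

        List<String> result = new ArrayList<>();
        while (!heap.isEmpty() && result.size() < limit) {
            Map.Entry<String, Integer> entry = heap.poll();
            // Words at or above the cap are treated as noise and skipped.
            if (entry.getValue() < maxCount) {
                result.add(entry.getKey() + " " + entry.getValue());
            }
        }
        return result;
    }
}
```

With counts `{the=80, java=12, blog=12, tool=3}`, a limit of 2 and a cap of 50, the result is `[blog 12, java 12]`: "the" is dropped by the cap and "tool" by the limit.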

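That completes the groundwork. Now for the program logic itself.

Traversing the Connected Component

The component is traversed breadth-first. An earlier article of mine covered how to write a breadth-first traversal with a queue, and that approach is used here.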

public void doTask() {
    String root = "http://" + this.domain + "/";
    Queue<String> urls = new LinkedList<>();   // BFS frontier
    urls.add(root);
    Set<String> tmpDomains = new HashSet<>();  // domains that have already been queued
    tmpDomains.add(DomainUtils.getDomainWithCompleteDomain(root));
    while (!urls.isEmpty()) {
        String url = urls.poll();
        SendReq.ResBody html = doRequest(url, "GET", new HashMap<>());
        System.out.println("Current request: " + url + ", queue size: " + urls.size() + ", result: " + html.getCode());
        if (html.getCode().equals(0)) {
            // A result code of 0 puts the domain on the ignore list, which is persisted to disk.
            ignoreSet.add(DomainUtils.getDomainWithCompleteDomain(url));
            try {
                GenerateFile.createFile2("moneyframework/generate/ignore", "demo.txt", ignoreSet.toString());
            } catch (IOException e) {
                e.printStackTrace();
            }
            continue;
        }

        OnePage onePage = new OnePage();
        onePage.setUrl(url);
        onePage.setDomain(DomainUtils.getDomainWithCompleteDomain(url));
        onePage.setCode(html.getCode());
        String title = HtmlUtil.getTitle(html.getResponce()).trim();
        // Skip pages whose title is missing, longer than 100 characters, or contains '?'.
        if (!StringUtils.hasText(title) || title.length() > 100 || title.contains("?")) continue;
        onePage.setTitle(title);
        String content = HtmlUtil.getContent(html.getResponce());
        Set<Word> words = Nlp.separateWordAndReturnUnit(content);
        List<String> wordStr = Nlp.print2List(new ArrayList<>(words), 10);
        handleWord(wordStr, DomainUtils.getDomainWithCompleteDomain(url), title);
        onePage.setContent(wordStr.toString());
        if (html.getCode().equals(200)) {
            for (String domain : HtmlUtil.getUrls(html.getResponce())) {
                // Skip domains on the ignore list (matched by suffix) and blank entries.
                boolean ignored = ignoreSet.stream().anyMatch(domain::endsWith);
                if (ignored || !StringUtils.hasText(domain.trim())) continue;
                // Set.add returns true only for a domain that has not been seen before.
                if (tmpDomains.add(domain)) {
                    urls.add("http://" + domain + "/");
                }
            }
        }
    }
}

Test Run

import javax.annotation.PostConstruct; // jakarta.annotation.PostConstruct on Spring Boot 3+
import org.springframework.stereotype.Service;

@Service
public class Task {

    // Starts the crawl on a background thread once the bean has been constructed.
    @PostConstruct
    public void init() {
        new Thread(() -> {
            while (true) { // start a new pass whenever the previous one finishes or fails
                try {
                    HttpClientCrawl clientCrawl = new HttpClientCrawl("http://www.mengwa.store/");
                    clientCrawl.doTask();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }).start();
    }
}

You can also try it with your own personal blog as the starting point and see which connected component your site belongs to.
