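Converting Relative Paths to Absolute Paths in PHP with Regular Expressions: An Example
Preface
As many of you have probably experienced, when writing a web crawler you often need to process the hyperlinks it finds and convert them all to absolute paths. This article presents a set of regular expressions for processing those links. Without further ado, here are the details.
A crawler will typically come across links like the following: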
<!-- empty hyperlink -->
<a href=""></a>
<!-- whitespace -->
<a href=" " rel="external nofollow" > </a>
<!-- a tag with other attributes -->
<a href="index.html" rel="external nofollow" alt="hyperlink"> index.html </a>
<a href="/" rel="external nofollow" target="_blank"> / target="_blank" </a>
<a target="_blank" href="/" rel="external nofollow" alt="hyperlink" > target="_blank" / alt="hyperlink" </a>
<a target="_blank" title="hyperlink" href="/" rel="external nofollow" alt="hyperlink" > target="_blank" title="hyperlink" / alt="hyperlink" </a>
<!-- root directory -->
<a href="/" rel="external nofollow" > / </a>
<a href="a" rel="external nofollow" > a </a>
<!-- with parameters -->
<a href="/index.html?id=1" rel="external nofollow" > /index.html?id=1 </a>
<a href="?id=2" rel="external nofollow" > ?id=2 </a>
<!-- // -->
<a href="//index.html" rel="external nofollow" > //index.html </a>
<a href="//www.mafutian.net" rel="external nofollow" > //www.mafutian.net </a>
<!-- internal link -->
<a href="http://www.hole_1.com/index.html" rel="external nofollow" > http://www.hole_1.com/index.html </a>
<!-- external links -->
<a href="http://www.mafutian.net" rel="external nofollow" > http://www.mafutian.net </a>
<a href="http://www.numberer.net" rel="external nofollow" > http://www.numberer.net </a>
<!-- links to image and text files -->
<a href="1.jpg" rel="external nofollow" > 1.jpg </a>
<a href="1.jpeg" rel="external nofollow" > 1.jpeg </a>
<a href="1.gif" rel="external nofollow" > 1.gif </a>
<a href="1.png" rel="external nofollow" > 1.png </a>
<a href="1.txt" rel="external nofollow" > 1.txt </a>
<!-- ordinary links -->
<a href="index.html" rel="external nofollow" > index.html </a>
<a href="index.html" rel="external nofollow" > index.html </a>
<a href="./index.html" rel="external nofollow" > ./index.html </a>
<a href="../index.html" rel="external nofollow" > ../index.html </a>
<a href=".../" rel="external nofollow" > .../ </a>
<a href="..." rel="external nofollow" > ... </a>
<!-- not links, but containing a colon -->
<a href="javascript:void(0)" rel="external nofollow" > javascript:void(0) </a>
<a href="a:b" rel="external nofollow" > a:b </a>
<a href="/a#a:b" rel="external nofollow" > /a#a:b </a>
<a href="mailto:'mafutian@126.com'" rel="external nofollow" > mailto:'mafutian@126.com' </a>
<a href="/tencent://message/?uin=335134463" rel="external nofollow" > /tencent://message/?uin=335134463 </a>
<!-- relative paths -->
<a href="." rel="external nofollow" > . </a>
<a href=".." rel="external nofollow" > .. </a>
<a href="../" rel="external nofollow" > ../ </a>
<a href="/a/b/.." rel="external nofollow" > /a/b/.. </a>
<a href="/a" rel="external nofollow" > /a </a>
<a href="./b" rel="external nofollow" > ./b </a>
<a href="./././././././././b" rel="external nofollow" > ./././././././././b </a> <!-- equivalent to ./b -->
<a href="../c" rel="external nofollow" > ../c </a>
<a href="../../d" rel="external nofollow" > ../../d </a>
<a href="../a/../b/c/../d" rel="external nofollow" > ../a/../b/c/../d </a>
<a href="./../e" rel="external nofollow" > ./../e </a>
<a href="http://www.hole_1.org/./../e" rel="external nofollow" > http://www.hole_1.org/./../e </a>
<a href="./.././f" rel="external nofollow" > ./.././f </a>
<a href="http://www.hole_1.org/../a/.../../b/c/../d/.." rel="external nofollow" > http://www.hole_1.org/../a/.../../b/c/../d/.. </a>
<!-- with a port number -->
<a href=":8081/index.html" rel="external nofollow" > :8081/index.html </a>
<a href=":80/index.html" rel="external nofollow" > :80/index.html </a>
<a href="http://www.mafutian.net:8081/index.html" rel="external nofollow" > http://www.mafutian.net:8081/index.html </a>
<a href="http://www.mafutian.net:8082/index.html" rel="external nofollow" > http://www.mafutian.net:8082/index.html </a>
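Before any path handling, it can help to sort raw href values like the ones above into coarse groups, so that empty values and non-HTTP schemes such as javascript: or mailto: are set aside first. The following is only a minimal sketch of that idea and is not part of the original article: the function name classify_href and the group labels are made up for illustration, and it assumes PHP 7 or later.

```php
<?php
// Hypothetical helper (not from the article): coarse classification of a raw href value.
function classify_href(string $href): string
{
    $href = trim($href);
    if ($href === '') {
        return 'empty';                 // href="" or whitespace only
    }
    if (strpos($href, '//') === 0) {
        return 'protocol-relative';     // e.g. //www.mafutian.net
    }
    // A scheme is a letter followed by letters, digits, '+', '-' or '.', then ':'.
    if (preg_match('/^([a-z][a-z0-9+.\-]*):/i', $href, $m)) {
        $scheme = strtolower($m[1]);
        return ($scheme === 'http' || $scheme === 'https') ? 'absolute' : 'other-scheme';
    }
    if ($href[0] === '/') {
        return 'root-relative';         // e.g. /index.html?id=1 or /a#a:b
    }
    return 'relative';                  // e.g. ../c, ?id=2, :8081/index.html
}
```

Under this sketch, 'a:b' is reported as 'other-scheme' because it has the shape of a scheme, while '/a#a:b' stays 'root-relative' because the colon comes after the leading slash.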
The first step is to turn each link into an absolute path:
http:// ... / ../ ../
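The article does not show code for this first step. One way it might look is sketched below, assuming the crawler knows the URL of the page on which each link was found. The name prefix_with_base is hypothetical, the page URL is assumed to be a well-formed http or https URL, and hrefs that are fragment-only or use other schemes are assumed to have been filtered out beforehand. Dot segments are deliberately left in the result, since removing them is the job of the next step.

```php
<?php
// Hypothetical helper (not from the article): prefix an href with the URL of the
// page it was found on. './' and '../' are left in place for a later clean-up step.
function prefix_with_base(string $pageUrl, string $href): string
{
    $href = trim($href);
    if ($href === '') {
        return $pageUrl;                    // an empty href points at the page itself
    }

    $parts  = parse_url($pageUrl);
    $scheme = $parts['scheme'] ?? 'http';
    $path   = $parts['path'] ?? '/';
    $origin = $scheme . '://' . ($parts['host'] ?? '');
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }

    if (preg_match('/^https?:\/\//i', $href)) {
        return $href;                       // already absolute
    }
    if (strpos($href, '//') === 0) {
        return $scheme . ':' . $href;       // protocol-relative: reuse the page's scheme
    }
    if ($href[0] === '/') {
        return $origin . $href;             // root-relative
    }
    if ($href[0] === '?') {
        return $origin . $path . $href;     // query-only: keep the page's own path
    }
    // Anything else is taken relative to the directory of the current page.
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}
```

For instance, with the page URL 'http://www.mytest.org/a/b/index.html', the href '../c' becomes 'http://www.mytest.org/a/b/../c', which still contains a '../' to be resolved.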
Next, here is the code that removes './', '../', and '/..' from an absolute path:
function url_to_absolute($relative)
{
    // Matches '/abc/../'. The lookbehind (?<!\/) keeps the second '/' of
    // 'http://' from starting a match, so the host name is never removed.
    $parent = '/(?<!\/)\/([^\/]{1,}?)\/\.\.\//';

    // Remove every './' (but not the './' that ends a '../')
    $absolute = preg_replace('/(?<!\.)\.\//', '', $relative);

    // Repeatedly collapse every '/abc/../' into '/'
    do {
        $absolute = preg_replace($parent, '/', $absolute);
        $count = preg_match_all($parent, $absolute, $res);
    } while ($count >= 1);

    // Remove a trailing '/abc/..', then any remaining trailing '/..'
    $absolute = preg_replace('/(?<!\/)\/([^\/]{1,}?)\/\.\.$/', '/', $absolute);
    $absolute = preg_replace('/\/\.\.$/', '', $absolute);

    // Remove any '../' that is left over
    $absolute = preg_replace('/(?<!\.)\.\.\//', '', $absolute);

    return $absolute;
}

$relative = 'http://www.mytest.org/../a/.../../b/c/../d/..';
var_dump(url_to_absolute($relative));
// Output: string 'http://www.mytest.org/a/b/' (length=26)
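As a cross-check for the regex-based function, the same clean-up can be written without regular expressions for the path itself, by walking the path segments with a stack: skip '.', pop on '..', push everything else. This is a different technique from the article's, shown only as a sketch. The name normalize_path_with_stack is hypothetical, the input is assumed to be an absolute URL without a query string or fragment, and, unlike the regex version, this sketch also collapses empty segments such as a doubled '/' inside the path.

```php
<?php
// Hypothetical alternative (not from the article): resolve '.' and '..' with a stack.
function normalize_path_with_stack(string $url): string
{
    // Split off "scheme://host[:port]"; everything after it is treated as the path.
    if (!preg_match('/^([a-z][a-z0-9+.\-]*:\/\/[^\/]*)(\/.*)?$/i', $url, $m)) {
        return $url;                // not an absolute URL of the expected shape
    }
    $origin = $m[1];
    $path   = $m[2] ?? '/';

    $segments = explode('/', $path);
    $last     = end($segments);
    $stack    = [];
    foreach ($segments as $segment) {
        if ($segment === '' || $segment === '.') {
            continue;               // empty segment or "current directory"
        }
        if ($segment === '..') {
            array_pop($stack);      // go up one level; a no-op at the root
            continue;
        }
        $stack[] = $segment;        // ordinary segment, including odd ones like '...'
    }

    $result = '/' . implode('/', $stack);
    // Keep a trailing slash when the path ended in '/', '.' or '..'.
    if ($stack !== [] && in_array($last, ['', '.', '..'], true)) {
        $result .= '/';
    }
    return $origin . $result;
}
```

On the article's test URL 'http://www.mytest.org/../a/.../../b/c/../d/..' this sketch also yields 'http://www.mytest.org/a/b/', so the two approaches agree on that input.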
Summary
That is all for this article. I hope it is of some help in your study or work. If you have any questions, feel free to leave a comment. Thank you for supporting 腳本之家.