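A Web Search Program
This is a basic web search program. You pass the search criteria on the command line (the starting URL, the maximum number of URLs to process, and the string to search for),
and it then visits URLs on the Internet one by one in real time, finding and printing the pages that match the search criteria. The prototype of this program comes from the book 《Java 編程藝術》 (The Art of Java).
To make it easier to analyze, the site owner removed the GUI part and modified the code slightly so that it works with JDK 1.5. Using this program as a base, you can write "crawlers" that search the Internet
for things such as images, e-mail addresses, or web pages to download.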
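As one illustration of such an extension, the sketch below pulls image addresses out of a page's text with a regular expression. It is only a minimal sketch: the class ImageLinkExtractor and the method extractImageSources are hypothetical names that do not appear in the original program, and the pattern assumes that the src attribute is double-quoted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original SearchCrawler program.
class ImageLinkExtractor {
    // Assumption: src values are double-quoted, e.g. <img src="logo.gif">.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns the src value of every <img> tag found in the page text.
    static List<String> extractImageSources(String pageContents) {
        List<String> sources = new ArrayList<String>();
        Matcher m = IMG_SRC.matcher(pageContents);
        while (m.find()) {
            String src = m.group(1).trim();
            if (src.length() > 0) {
                sources.add(src);
            }
        }
        return sources;
    }

    public static void main(String[] args) {
        String html = "<p><IMG alt=\"logo\" src=\"/img/logo.gif\"><img src=\"a.png\"></p>";
        System.out.println(extractImageSources(html));
    }
}
```

The returned values are raw attribute text, so relative addresses would still have to be resolved against the page URL before they could be downloaded.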
First, here is what a run of the program looks like:
D:\java>javac  SearchCrawler.java    (compile)
D:\java>java   SearchCrawler http://127.0.0.1:8080/zz3zcwbwebhome/index.jsp 20 java
Start searching...
result:
searchString=java
http://127.0.0.1:8080/zz3zcwbwebhome/index.jsp
http://127.0.0.1:8080/zz3zcwbwebhome/reply.jsp
http://127.0.0.1:8080/zz3zcwbwebhome/learn.jsp
http://127.0.0.1:8080/zz3zcwbwebhome/download.jsp
http://127.0.0.1:8080/zz3zcwbwebhome/article.jsp
http://127.0.0.1:8080/zz3zcwbwebhome/myexample/jlGUIOverview.htm
http://127.0.0.1:8080/zz3zcwbwebhome/myexample/Proxooldoc/index.html
http://127.0.0.1:8080/zz3zcwbwebhome/view.jsp?id=301
http://127.0.0.1:8080/zz3zcwbwebhome/view.jsp?id=297
http://127.0.0.1:8080/zz3zcwbwebhome/view.jsp?id=291
http://127.0.0.1:8080/zz3zcwbwebhome/view.jsp?id=286
http://127.0.0.1:8080/zz3zcwbwebhome/view.jsp?id=285
http://127.0.0.1:8080/zz3zcwbwebhome/view.jsp?id=284
http://127.0.0.1:8080/zz3zcwbwebhome/view.jsp?id=276
http://127.0.0.1:8080/zz3zcwbwebhome/view.jsp?id=272
Another example:
D:\java>java    SearchCrawler http://www.sina.com  20 java
Start searching...
result:
searchString=java
http://sina.com
http://redirect.sina.com/WWW/sinaCN/www.sina.com.cn class=a2
http://redirect.sina.com/WWW/sinaCN/www.sina.com.cn class=a8
http://redirect.sina.com/WWW/sinaHK/www.sina.com.hk class=a2
http://redirect.sina.com/WWW/sinaTW/www.sina.com.tw class=a8
http://redirect.sina.com/WWW/sinaUS/home.sina.com class=a8
http://redirect.sina.com/WWW/smsCN/sms.sina.com.cn/ class=a2
http://redirect.sina.com/WWW/smsCN/sms.sina.com.cn/ class=a3
http://redirect.sina.com/WWW/sinaNet/www.sina.net/ class=a3
D:\java>
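Several addresses in the second run end with " class=a2" or " class=a8". A likely explanation (an assumption based on the link pattern in the source listing) is that those pages write href values without quotes, so a pattern that reads up to the next quote or ">" also swallows the attribute that follows. The sketch below shows one way to accept double-quoted, single-quoted, and unquoted values; HrefExtractor and extractHrefs are hypothetical names, not part of the original program.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original SearchCrawler program.
class HrefExtractor {
    // Group 1: double-quoted value, group 2: single-quoted value,
    // group 3: unquoted value, which ends at whitespace, a quote, or '>'.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>]+))",
            Pattern.CASE_INSENSITIVE);

    // Returns the href value of every <a> tag found in the page text.
    static List<String> extractHrefs(String html) {
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String link = m.group(1);
            if (link == null) link = m.group(2);
            if (link == null) link = m.group(3);
            link = link.trim();
            if (link.length() > 0) {
                links.add(link);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=http://redirect.sina.com/WWW/sinaCN/www.sina.com.cn class=a2>x</a>"
                + "<A HREF=\"index.jsp\">y</A>";
        System.out.println(extractHrefs(html));
    }
}
```

With this pattern an unquoted value stops at the first whitespace, so a trailing class attribute is no longer captured as part of the address.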
Below is the source code of the program:
import java.io.*;
import java.net.*;
import java.util.*;
import java.util.regex.*;

// A crawler that searches the Web
public class SearchCrawler implements Runnable {

    /* disallowListCache caches, per host, the paths that robots may not search.
     * The robots protocol places a robots.txt file in a site's root directory
     * to declare which pages are off limits; a crawler should skip those areas.
     * An example robots.txt:
     *
     *   # robots.txt for http://somehost.com/
     *   User-agent: *
     *   Disallow: /cgi-bin/
     *   Disallow: /registration # Disallow robots on registration page
     *   Disallow: /login
     */
    private HashMap<String, ArrayList<String>> disallowListCache =
            new HashMap<String, ArrayList<String>>();
    ArrayList<String> errorList = new ArrayList<String>(); // error messages
    ArrayList<String> result = new ArrayList<String>();    // URLs of matching pages
    String startUrl;               // URL where the search starts
    int maxUrl;                    // maximum number of URLs to process
    String searchString;           // string to search for (English)
    boolean caseSensitive = false; // whether matching is case-sensitive
    boolean limitHost = false;     // whether to stay on the starting host

    public SearchCrawler(String startUrl, int maxUrl, String searchString) {
        this.startUrl = startUrl;
        this.maxUrl = maxUrl;
        this.searchString = searchString;
    }

    public ArrayList<String> getResult() {
        return result;
    }

    // Entry point of the search thread
    public void run() {
        crawl(startUrl, maxUrl, searchString, limitHost, caseSensitive);
    }

    // Check the URL format; returns null for anything that is not a valid HTTP URL
    private URL verifyUrl(String url) {
        // Only HTTP URLs are handled.
        if (!url.toLowerCase().startsWith("http://")) {
            return null;
        }
        try {
            return new URL(url);
        } catch (Exception e) {
            return null;
        }
    }

    // Host name, plus ":port" only when the URL names a port
    // (URL.getPort() returns -1 when it does not)
    private String hostAndPort(URL url) {
        int port = url.getPort();
        return (port == -1) ? url.getHost() : url.getHost() + ":" + port;
    }

    private void closeQuietly(Reader reader) {
        if (reader != null) {
            try {
                reader.close();
            } catch (IOException ignored) {
            }
        }
    }

    // Check whether robots are allowed to access the given URL.
    // User-agent sections are not distinguished: every Disallow line applies.
    private boolean isRobotAllowed(URL urlToCheck) {
        String host = hostAndPort(urlToCheck).toLowerCase();

        // Look up the cached disallow list for this host.
        ArrayList<String> disallowList = disallowListCache.get(host);

        // If nothing is cached yet, download robots.txt and cache its rules.
        if (disallowList == null) {
            disallowList = new ArrayList<String>();
            BufferedReader reader = null;
            try {
                URL robotsFileUrl = new URL("http://" + host + "/robots.txt");
                reader = new BufferedReader(
                        new InputStreamReader(robotsFileUrl.openStream()));

                // Read the robots file and build the list of disallowed paths.
                String line;
                while ((line = reader.readLine()) != null) {
                    if (line.indexOf("Disallow:") == 0) { // line starts with "Disallow:"
                        String disallowPath = line.substring("Disallow:".length());

                        // Strip a trailing comment, if any.
                        int commentIndex = disallowPath.indexOf("#");
                        if (commentIndex != -1) {
                            disallowPath = disallowPath.substring(0, commentIndex);
                        }
                        disallowPath = disallowPath.trim();

                        // An empty Disallow value means nothing is disallowed.
                        if (disallowPath.length() > 0) {
                            disallowList.add(disallowPath);
                        }
                    }
                }

                // Cache the disallowed paths for this host.
                disallowListCache.put(host, disallowList);
            } catch (Exception e) {
                return true; // no readable robots.txt in the site root: allow
            } finally {
                closeQuietly(reader);
            }
        }

        String file = urlToCheck.getFile();
        for (int i = 0; i < disallowList.size(); i++) {
            if (file.startsWith(disallowList.get(i))) {
                return false;
            }
        }
        return true;
    }

    // Download the page at the given URL; returns null on any error
    private String downloadPage(URL pageUrl) {
        BufferedReader reader = null;
        try {
            reader = new BufferedReader(new InputStreamReader(pageUrl.openStream()));

            // Read the page into a buffer (line breaks are dropped).
            String line;
            StringBuilder pageBuffer = new StringBuilder();
            while ((line = reader.readLine()) != null) {
                pageBuffer.append(line);
            }
            return pageBuffer.toString();
        } catch (Exception e) {
            return null;
        } finally {
            closeQuietly(reader);
        }
    }

    // Remove a leading "www." from the host part of the URL
    private String removeWwwFromUrl(String url) {
        int index = url.indexOf("://www.");
        if (index != -1) {
            return url.substring(0, index + 3) + url.substring(index + 7);
        }
        return url;
    }

    // Parse the page and collect its links
    private ArrayList<String> retrieveLinks(URL pageUrl, String pageContents,
            HashSet<String> crawledList, boolean limitHost) {
        // Compile the link-matching pattern.
        Pattern p = Pattern.compile("<a\\s+href\\s*=\\s*\"?(.*?)[\">]",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(pageContents);

        ArrayList<String> linkList = new ArrayList<String>();
        while (m.find()) {
            String link = m.group(1).trim();
            if (link.length() < 1) {
                continue;
            }
            // Skip links that point into the current page.
            if (link.charAt(0) == '#') {
                continue;
            }
            if (link.indexOf("mailto:") != -1) {
                continue;
            }
            if (link.toLowerCase().indexOf("javascript") != -1) {
                continue;
            }

            if (link.indexOf("://") == -1) {
                String base = "http://" + hostAndPort(pageUrl);
                if (link.charAt(0) == '/') {
                    // Path relative to the site root
                    link = base + link;
                } else {
                    // Path relative to the current page
                    String file = pageUrl.getFile();
                    if (file.indexOf('/') == -1) {
                        link = base + "/" + link;
                    } else {
                        String path = file.substring(0, file.lastIndexOf('/') + 1);
                        link = base + path + link;
                    }
                }
            }

            // Drop the fragment part, if any.
            int index = link.indexOf('#');
            if (index != -1) {
                link = link.substring(0, index);
            }

            link = removeWwwFromUrl(link);

            URL verifiedLink = verifyUrl(link);
            if (verifiedLink == null) {
                continue;
            }

            // If the search is limited to one host, drop links to other hosts.
            if (limitHost && !pageUrl.getHost().toLowerCase().equals(
                    verifiedLink.getHost().toLowerCase())) {
                continue;
            }

            // Skip links that have already been processed.
            if (crawledList.contains(link)) {
                continue;
            }

            linkList.add(link);
        }
        return linkList;
    }

    // Check whether the downloaded page contains every term of the search string
    private boolean searchStringMatches(String pageContents, String searchString,
            boolean caseSensitive) {
        String searchContents = caseSensitive ? pageContents : pageContents.toLowerCase();

        Pattern p = Pattern.compile("[\\s]+");
        String[] terms = p.split(searchString);
        for (int i = 0; i < terms.length; i++) {
            String term = caseSensitive ? terms[i] : terms[i].toLowerCase();
            if (searchContents.indexOf(term) == -1) {
                return false;
            }
        }
        return true;
    }

    // Perform the actual search
    public ArrayList<String> crawl(String startUrl, int maxUrls, String searchString,
            boolean limitHost, boolean caseSensitive) {
        System.out.println("searchString=" + searchString);
        HashSet<String> crawledList = new HashSet<String>();
        LinkedHashSet<String> toCrawlList = new LinkedHashSet<String>();

        if (maxUrls < 1) {
            errorList.add("Invalid Max URLs value.");
            System.out.println("Invalid Max URLs value.");
        }
        if (searchString.length() < 1) {
            errorList.add("Missing Search String.");
            System.out.println("Missing search String");
        }
        if (errorList.size() > 0) {
            System.out.println("err!!!");
            return errorList;
        }

        // Remove "www" from the start URL.
        startUrl = removeWwwFromUrl(startUrl);

        toCrawlList.add(startUrl);
        while (toCrawlList.size() > 0) {
            if (crawledList.size() >= maxUrls) {
                break;
            }

            // Take the oldest URL from the to-crawl list and remove it.
            String url = toCrawlList.iterator().next();
            toCrawlList.remove(url);

            // Convert the string to a URL object; skip it if it is not valid.
            URL verifiedUrl = verifyUrl(url);
            if (verifiedUrl == null) {
                continue;
            }

            // Skip the URL if robots are not allowed to access it.
            if (!isRobotAllowed(verifiedUrl)) {
                continue;
            }

            // Record the URL as processed.
            crawledList.add(url);
            String pageContents = downloadPage(verifiedUrl);

            if (pageContents != null && pageContents.length() > 0) {
                // Collect the valid links on the page.
                ArrayList<String> links =
                        retrieveLinks(verifiedUrl, pageContents, crawledList, limitHost);
                toCrawlList.addAll(links);

                if (searchStringMatches(pageContents, searchString, caseSensitive)) {
                    result.add(url);
                    System.out.println(url);
                }
            }
        }
        return result;
    }

    // Main function
    public static void main(String[] args) {
        if (args.length != 3) {
            System.out.println("Usage:java SearchCrawler startUrl maxUrl searchString");
            return;
        }
        int max;
        try {
            max = Integer.parseInt(args[1]);
        } catch (NumberFormatException e) {
            System.out.println("Usage:java SearchCrawler startUrl maxUrl searchString");
            return;
        }
        SearchCrawler crawler = new SearchCrawler(args[0], max, args[2]);
        Thread search = new Thread(crawler);
        System.out.println("Start searching...");
        System.out.println("result:");
        search.start();
    }
}
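
The listing turns relative links into absolute addresses by concatenating strings. As an alternative sketch (not the approach used in the listing), the two-argument constructor of java.net.URL resolves a link against the URL of the page it was found on, including "../" segments and an optional port. LinkResolver and resolve are hypothetical names introduced only for this illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not part of the original SearchCrawler program.
class LinkResolver {
    // Resolves a link found on pageUrl into an absolute address.
    // Returns null when either string cannot be parsed as a URL.
    static String resolve(String pageUrl, String link) {
        try {
            URL base = new URL(pageUrl);
            return new URL(base, link).toString();
        } catch (MalformedURLException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String page = "http://127.0.0.1:8080/zz3zcwbwebhome/index.jsp";
        System.out.println(resolve(page, "view.jsp?id=301"));
        System.out.println(resolve(page, "/robots.txt"));
        System.out.println(resolve("http://sina.com/a/b.html", "../c.html"));
    }
}
```

A crawler that adopted this approach would still need its own checks for links it wants to skip, such as page-internal anchors or non-HTTP schemes.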