
Web crawler + HtmlAgilityPack + Windows service: crawling 200,000 posts from cnblogs

1. Introduction

Recently I was working on a project at my company that needed some article-style data, and I immediately thought of using a web crawler to collect posts from technical sites. The site I visit most often is cnblogs (博客園), which is how this article came about.

2. Preparation

I need to persist the data crawled from cnblogs, and the natural place for that is a database. So first let's create a database and a table to hold the data. It is all quite simple; the columns are described below.

BlogArticleId: auto-increment post ID; BlogTitle: post title; BlogUrl: post URL; BlogAuthor: post author; BlogTime: publication time; BlogMotto: the author's motto/signature; BlogDepth: the depth at which the spider found the post; IsDeleted: soft-delete flag.
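The original screenshot of the table design is not available, but the schema can be reconstructed from the column list. A sketch of the DDL follows; the column types are assumptions (the helper class below treats every field except the ID as text):

```csharp
using System;

class Schema
{
    // Reconstructed CREATE TABLE statement; lengths and nullability
    // are guesses, not taken from the original screenshot.
    const string CreateBlogArticle = @"
CREATE TABLE BlogArticle (
    BlogArticleId INT IDENTITY(1,1) PRIMARY KEY,
    BlogTitle     NVARCHAR(200) NULL,
    BlogUrl       NVARCHAR(500) NULL,
    BlogAuthor    NVARCHAR(100) NULL,
    BlogTime      NVARCHAR(100) NULL,
    BlogMotto     NVARCHAR(500) NULL,
    BlogDepth     NVARCHAR(50)  NULL,
    IsDeleted     BIT NOT NULL DEFAULT 0
);";

    static void Main()
    {
        // Print the DDL; in practice you would run it against the
        // Cnblogs database from SSMS or a SqlCommand.
        Console.WriteLine(CreateBlogArticle);
    }
}
```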

With the table created, let's write a database helper class.

/// <summary>
/// Database helper class.
/// </summary>
public class MssqlHelper
{
    #region Fields

    /// <summary>
    /// Database connection string.
    /// </summary>
    private static string conn = "Data Source=.;Initial Catalog=Cnblogs;User ID=sa;Password=123";

    #endregion

    #region Append a row to the DataTable

    public static void GetData(string title, string url, string author, string time, string motto, string depth, DataTable dt)
    {
        DataRow dr = dt.NewRow();
        dr["BlogTitle"] = title;
        dr["BlogUrl"] = url;
        dr["BlogAuthor"] = author;
        dr["BlogTime"] = time;
        dr["BlogMotto"] = motto;
        dr["BlogDepth"] = depth;

        // Append the new row to the in-memory table.
        dt.Rows.Add(dr);
    }

    #endregion

    #region Bulk-insert the DataTable into the database

    /// <summary>
    /// Bulk-insert the buffered rows into the database.
    /// </summary>
    public static void InsertDb(DataTable dt)
    {
        try
        {
            using (var copy = new System.Data.SqlClient.SqlBulkCopy(conn))
            {
                // Target table for the bulk copy.
                copy.DestinationTableName = "BlogArticle";

                // Map the in-memory columns to the BlogArticle columns.
                copy.ColumnMappings.Add("BlogTitle", "BlogTitle");
                copy.ColumnMappings.Add("BlogUrl", "BlogUrl");
                copy.ColumnMappings.Add("BlogAuthor", "BlogAuthor");
                copy.ColumnMappings.Add("BlogTime", "BlogTime");
                copy.ColumnMappings.Add("BlogMotto", "BlogMotto");
                copy.ColumnMappings.Add("BlogDepth", "BlogDepth");

                // Write all buffered rows to the server in one batch.
                copy.WriteToServer(dt);
                dt.Rows.Clear();
            }
        }
        catch (Exception)
        {
            // On failure the buffered rows are discarded so the crawl can continue.
            dt.Rows.Clear();
        }
    }

    #endregion
}
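The idea behind the helper is to buffer crawled rows in a DataTable and flush them in batches rather than inserting one row at a time. A minimal, self-contained sketch of that batching pattern (the `Flush` callback and `BatchSize` are illustrative stand-ins for `MssqlHelper.InsertDb` and whatever threshold you choose):

```csharp
using System;
using System.Data;

class BatchDemo
{
    // Hypothetical flush threshold; tune for your workload.
    const int BatchSize = 10;

    static int flushes = 0;

    static void Flush(DataTable dt)
    {
        // Stand-in for MssqlHelper.InsertDb: bulk-copy, then clear the buffer.
        flushes++;
        dt.Rows.Clear();
    }

    static void Main()
    {
        var dt = new DataTable();
        dt.Columns.Add("BlogTitle", typeof(string));
        dt.Columns.Add("BlogUrl", typeof(string));

        for (int i = 0; i < 25; i++)
        {
            DataRow dr = dt.NewRow();
            dr["BlogTitle"] = "post " + i;
            dr["BlogUrl"] = "http://www.cnblogs.com/post/" + i;
            dt.Rows.Add(dr);

            // Flush each full batch to keep memory bounded.
            if (dt.Rows.Count >= BatchSize)
                Flush(dt);
        }

        // Flush the remainder.
        if (dt.Rows.Count > 0)
            Flush(dt);

        Console.WriteLine(flushes); // 3: two full batches plus the remainder
    }
}
```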

3. Logging

Next, a logging helper so we can see what happened. The code is as follows.

/// <summary>
/// Logging helper class.
/// </summary>
public class LogHelper
{
    #region Write a log entry

    // Write a log entry (the path is hardcoded here for the demo).
    public static void WriteLog(string text)
    {
        //StreamWriter sw = new StreamWriter(AppDomain.CurrentDomain.BaseDirectory + "\\log.txt", true);
        using (StreamWriter sw = new StreamWriter("F:" + "\\log.txt", true))
        {
            sw.WriteLine(text);
        }
    }

    #endregion
}

4. The crawler

My spider is built on a third-party class library (Feng.SimpleCrawler); its source files follow.

namespace Feng.SimpleCrawler
{
    using System;

    /// <summary>The add url event handler.</summary>
    /// <param name="args">The args.</param>
    /// <returns>The <see cref="bool"/>.</returns>
    public delegate bool AddUrlEventHandler(AddUrlEventArgs args);

    /// <summary>The add url event args.</summary>
    public class AddUrlEventArgs : EventArgs
    {
        #region Public Properties

        /// <summary>Gets or sets the depth.</summary>
        public int Depth { get; set; }

        /// <summary>Gets or sets the title.</summary>
        public string Title { get; set; }

        /// <summary>Gets or sets the url.</summary>
        public string Url { get; set; }

        #endregion
    }
}
AddUrlEventArgs.cs
namespace Feng.SimpleCrawler
{
    using System;
    using System.Collections;

    /// <summary>The bloom filter.</summary>
    /// <typeparam name="T">The generic type.</typeparam>
    public class BloomFilter<T>
    {
        #region Fields

        /// <summary>The secondary hash function.</summary>
        private readonly HashFunction getHashSecondary;

        /// <summary>The hash bits.</summary>
        private readonly BitArray hashBits;

        /// <summary>The hash function count.</summary>
        private readonly int hashFunctionCount;

        #endregion

        #region Constructors and Destructors

        /// <summary>Initializes the filter with a default error rate for the capacity.</summary>
        public BloomFilter(int capacity)
            : this(capacity, null)
        {
        }

        /// <summary>Initializes the filter with the given capacity and error rate.</summary>
        public BloomFilter(int capacity, float errorRate)
            : this(capacity, errorRate, null)
        {
        }

        /// <summary>Initializes the filter with a custom secondary hash function.</summary>
        public BloomFilter(int capacity, HashFunction hashFunction)
            : this(capacity, BestErrorRate(capacity), hashFunction)
        {
        }

        /// <summary>Initializes the filter, deriving optimal m and k from capacity and error rate.</summary>
        public BloomFilter(int capacity, float errorRate, HashFunction hashFunction)
            : this(capacity, errorRate, hashFunction, BestM(capacity, errorRate), BestK(capacity, errorRate))
        {
        }

        /// <summary>Initializes a new instance of the <see cref="BloomFilter{T}"/> class.</summary>
        /// <param name="capacity">The expected number of items.</param>
        /// <param name="errorRate">The acceptable false-positive rate.</param>
        /// <param name="hashFunction">The secondary hash function.</param>
        /// <param name="m">The size of the bit array.</param>
        /// <param name="k">The number of hash functions.</param>
        public BloomFilter(int capacity, float errorRate, HashFunction hashFunction, int m, int k)
        {
            if (capacity < 1)
            {
                throw new ArgumentOutOfRangeException("capacity", capacity, "capacity must be > 0");
            }

            if (errorRate >= 1 || errorRate <= 0)
            {
                throw new ArgumentOutOfRangeException(
                    "errorRate",
                    errorRate,
                    string.Format("errorRate must be between 0 and 1, exclusive. Was {0}", errorRate));
            }

            if (m < 1)
            {
                throw new ArgumentOutOfRangeException(
                    string.Format(
                        "The provided capacity and errorRate values would result in an array of length > int.MaxValue. Please reduce either of these values. Capacity: {0}, Error rate: {1}",
                        capacity,
                        errorRate));
            }

            if (hashFunction == null)
            {
                if (typeof(T) == typeof(string))
                {
                    this.getHashSecondary = HashString;
                }
                else if (typeof(T) == typeof(int))
                {
                    this.getHashSecondary = HashInt32;
                }
                else
                {
                    throw new ArgumentNullException(
                        "hashFunction",
                        "Please provide a hash function for your type T, when T is not a string or int.");
                }
            }
            else
            {
                this.getHashSecondary = hashFunction;
            }

            this.hashFunctionCount = k;
            this.hashBits = new BitArray(m);
        }

        #endregion

        #region Delegates

        /// <summary>The hash function.</summary>
        /// <param name="input">The input.</param>
        /// <returns>The <see cref="int"/>.</returns>
        public delegate int HashFunction(T input);

        #endregion

        #region Public Properties

        /// <summary>Gets the ratio of set bits in the filter.</summary>
        public double Truthiness
        {
            get
            {
                return (double)this.TrueBits() / this.hashBits.Count;
            }
        }

        #endregion

        #region Public Methods and Operators

        /// <summary>Adds an item to the filter.</summary>
        public void Add(T item)
        {
            int primaryHash = item.GetHashCode();
            int secondaryHash = this.getHashSecondary(item);
            for (int i = 0; i < this.hashFunctionCount; i++)
            {
                int hash = this.ComputeHash(primaryHash, secondaryHash, i);
                this.hashBits[hash] = true;
            }
        }

        /// <summary>
        /// Tests whether an item may be in the filter.
        /// False positives are possible; false negatives are not.
        /// </summary>
        public bool Contains(T item)
        {
            int primaryHash = item.GetHashCode();
            int secondaryHash = this.getHashSecondary(item);
            for (int i = 0; i < this.hashFunctionCount; i++)
            {
                int hash = this.ComputeHash(primaryHash, secondaryHash, i);
                if (this.hashBits[hash] == false)
                {
                    return false;
                }
            }

            return true;
        }

        #endregion

        #region Methods

        /// <summary>Picks a default error rate for the given capacity.</summary>
        private static float BestErrorRate(int capacity)
        {
            var c = (float)(1.0 / capacity);
            if (Math.Abs(c) > 0)
            {
                return c;
            }

            double y = int.MaxValue / (double)capacity;
            return (float)Math.Pow(0.6185, y);
        }

        /// <summary>The optimal number of hash functions k.</summary>
        private static int BestK(int capacity, float errorRate)
        {
            return (int)Math.Round(Math.Log(2.0) * BestM(capacity, errorRate) / capacity);
        }

        /// <summary>The optimal bit-array size m.</summary>
        private static int BestM(int capacity, float errorRate)
        {
            return (int)Math.Ceiling(capacity * Math.Log(errorRate, 1.0 / Math.Pow(2, Math.Log(2.0))));
        }

        /// <summary>Secondary hash for int keys.</summary>
        private static int HashInt32(T input)
        {
            // "input as uint?" would always be null for a boxed int,
            // so unbox via int first.
            var x = (uint?)(int?)(object)input;
            unchecked
            {
                x = ~x + (x << 15);
                x = x ^ (x >> 12);
                x = x + (x << 2);
                x = x ^ (x >> 4);
                x = x * 2057;
                x = x ^ (x >> 16);
                return (int)x;
            }
        }

        /// <summary>Secondary hash for string keys.</summary>
        private static int HashString(T input)
        {
            var str = input as string;
            int hash = 0;

            if (str != null)
            {
                for (int i = 0; i < str.Length; i++)
                {
                    hash += str[i];
                    hash += hash << 10;
                    hash ^= hash >> 6;
                }

                hash += hash << 3;
                hash ^= hash >> 11;
                hash += hash << 15;
            }

            return hash;
        }

        /// <summary>Combines the two hashes into the i-th probe position.</summary>
        private int ComputeHash(int primaryHash, int secondaryHash, int i)
        {
            int resultingHash = (primaryHash + (i * secondaryHash)) % this.hashBits.Count;
            return Math.Abs(resultingHash);
        }

        /// <summary>Counts the set bits.</summary>
        private int TrueBits()
        {
            int output = 0;
            foreach (bool bit in this.hashBits)
            {
                if (bit)
                {
                    output++;
                }
            }

            return output;
        }

        #endregion
    }
}
BloomFilter.cs
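For reference, `BestM` and `BestK` implement the standard Bloom-filter sizing formulas. With expected item count $n$ (capacity) and target false-positive rate $p$ (errorRate), the bit-array size $m$ and hash-function count $k$ are:

```latex
m = \left\lceil \frac{-\,n \ln p}{(\ln 2)^2} \right\rceil,
\qquad
k = \operatorname{round}\!\left(\frac{m}{n}\,\ln 2\right)
```

`BestM` expresses the first formula via `Math.Log(errorRate, 1.0 / Math.Pow(2, Math.Log(2.0)))`, which is the same quantity because $\log_{2^{-\ln 2}} p = \ln p / (-(\ln 2)^2)$.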
namespace Feng.SimpleCrawler
{
    using System;

    /// <summary>The crawl error event handler.</summary>
    /// <param name="args">The args.</param>
    public delegate void CrawlErrorEventHandler(CrawlErrorEventArgs args);

    /// <summary>The crawl error event args.</summary>
    public class CrawlErrorEventArgs : EventArgs
    {
        #region Public Properties

        /// <summary>Gets or sets the exception.</summary>
        public Exception Exception { get; set; }

        /// <summary>Gets or sets the url.</summary>
        public string Url { get; set; }

        #endregion
    }
}
CrawlErrorEventArgs.cs
namespace Feng.SimpleCrawler
{
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.IO.Compression;
    using System.Linq;
    using System.Net;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.Threading;

    /// <summary>The crawl master.</summary>
    public class CrawlMaster
    {
        #region Constants

        /// <summary>The regular expression for valid web urls.</summary>
        private const string WebUrlRegulars = @"^(http|https)://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";

        #endregion

        #region Fields

        /// <summary>The cookie container.</summary>
        private readonly CookieContainer cookieContainer;

        /// <summary>The random.</summary>
        private readonly Random random;

        /// <summary>The thread status.</summary>
        private readonly bool[] threadStatus;

        /// <summary>The threads.</summary>
        private readonly Thread[] threads;

        #endregion

        #region Constructors and Destructors

        /// <summary>Initializes a new instance of the <see cref="CrawlMaster"/> class.</summary>
        /// <param name="settings">The settings.</param>
        public CrawlMaster(CrawlSettings settings)
        {
            this.cookieContainer = new CookieContainer();
            this.random = new Random();
            this.Settings = settings;
            this.threads = new Thread[settings.ThreadCount];
            this.threadStatus = new bool[settings.ThreadCount];
        }

        #endregion

        #region Public Events

        /// <summary>Raised before a url is added to the queue.</summary>
        public event AddUrlEventHandler AddUrlEvent;

        /// <summary>Raised when a crawl error occurs.</summary>
        public event CrawlErrorEventHandler CrawlErrorEvent;

        /// <summary>Raised when a page has been downloaded.</summary>
        public event DataReceivedEventHandler DataReceivedEvent;

        #endregion

        #region Public Properties

        /// <summary>Gets the settings.</summary>
        public CrawlSettings Settings { get; private set; }

        #endregion

        #region Public Methods and Operators

        /// <summary>Starts the crawl.</summary>
        public void Crawl()
        {
            this.Initialize();
            for (int i = 0; i < this.threads.Length; i++)
            {
                this.threads[i].Start(i);
                this.threadStatus[i] = false;
            }
        }

        /// <summary>Stops the crawl.</summary>
        public void Stop()
        {
            foreach (Thread thread in this.threads)
            {
                thread.Abort();
            }
        }

        #endregion

        #region Methods

        /// <summary>Configures a web request.</summary>
        private void ConfigRequest(HttpWebRequest request)
        {
            request.UserAgent = this.Settings.UserAgent;
            request.CookieContainer = this.cookieContainer;
            request.AllowAutoRedirect = true;
            request.MediaType = "text/html";
            request.Headers["Accept-Language"] = "zh-CN,zh;q=0.8";

            if (this.Settings.Timeout > 0)
            {
                request.Timeout = this.Settings.Timeout;
            }
        }

        /// <summary>The per-thread crawl loop.</summary>
        private void CrawlProcess(object threadIndex)
        {
            var currentThreadIndex = (int)threadIndex;
            while (true)
            {
                // Based on the number of queued urls and idle threads,
                // decide whether this thread sleeps or exits.
                if (UrlQueue.Instance.Count == 0)
                {
                    this.threadStatus[currentThreadIndex] = true;
                    if (!this.threadStatus.Any(t => t == false))
                    {
                        break;
                    }

                    Thread.Sleep(2000);
                    continue;
                }

                this.threadStatus[currentThreadIndex] = false;

                if (UrlQueue.Instance.Count == 0)
                {
                    continue;
                }

                UrlInfo urlInfo = UrlQueue.Instance.DeQueue();

                HttpWebRequest request = null;
                HttpWebResponse response = null;

                try
                {
                    if (urlInfo == null)
                    {
                        continue;
                    }

                    // Automatic speed limiting: a random 1-5 second pause.
                    if (this.Settings.AutoSpeedLimit)
                    {
                        int span = this.random.Next(1000, 5000);
                        Thread.Sleep(span);
                    }

                    // Create and configure the web request.
                    request = WebRequest.Create(urlInfo.UrlString) as HttpWebRequest;
                    this.ConfigRequest(request);

                    if (request != null)
                    {
                        response = request.GetResponse() as HttpWebResponse;
                    }

                    if (response != null)
                    {
                        this.PersistenceCookie(response);

                        Stream stream = null;

                        // Decompress the stream if the page is gzip-compressed.
                        if (response.ContentEncoding == "gzip")
                        {
                            Stream responseStream = response.GetResponseStream();
                            if (responseStream != null)
                            {
                                stream = new GZipStream(responseStream, CompressionMode.Decompress);
                            }
                        }
                        else
                        {
                            stream = response.GetResponseStream();
                        }

                        using (stream)
                        {
                            string html = this.ParseContent(stream, response.CharacterSet);

                            this.ParseLinks(urlInfo, html);

                            if (this.DataReceivedEvent != null)
                            {
                                this.DataReceivedEvent(
                                    new DataReceivedEventArgs
                                        {
                                            Url = urlInfo.UrlString,
                                            Depth = urlInfo.Depth,
                                            Html = html
                                        });
                            }

                            if (stream != null)
                            {
                                stream.Close();
                            }
                        }
                    }
                }
                catch (Exception exception)
                {
                    if (this.CrawlErrorEvent != null)
                    {
                        if (urlInfo != null)
                        {
                            this.CrawlErrorEvent(
                                new CrawlErrorEventArgs { Url = urlInfo.UrlString, Exception = exception });
                        }
                    }
                }
                finally
                {
                    if (request != null)
                    {
                        request.Abort();
                    }

                    if (response != null)
                    {
                        response.Close();
                    }
                }
            }
        }

        /// <summary>Seeds the queue and creates the worker threads.</summary>
        private void Initialize()
        {
            if (this.Settings.SeedsAddress != null && this.Settings.SeedsAddress.Count > 0)
            {
                foreach (string seed in this.Settings.SeedsAddress)
                {
                    if (Regex.IsMatch(seed, WebUrlRegulars, RegexOptions.IgnoreCase))
                    {
                        UrlQueue.Instance.EnQueue(new UrlInfo(seed) { Depth = 1 });
                    }
                }
            }

            for (int i = 0; i < this.Settings.ThreadCount; i++)
            {
                var threadStart = new ParameterizedThreadStart(this.CrawlProcess);
                this.threads[i] = new Thread(threadStart);
            }

            ServicePointManager.DefaultConnectionLimit = 256;
        }

        /// <summary>Checks a url against the configured regular-expression filters.</summary>
        private bool IsMatchRegular(string url)
        {
            bool result = false;

            if (this.Settings.RegularFilters != null && this.Settings.RegularFilters.Count > 0)
            {
                if (this.Settings.RegularFilters.Any(
                    pattern => Regex.IsMatch(url, pattern, RegexOptions.IgnoreCase)))
                {
                    result = true;
                }
            }
            else
            {
                result = true;
            }

            return result;
        }

        /// <summary>Decodes the response stream into a string, honoring the page's charset.</summary>
        private string ParseContent(Stream stream, string characterSet)
        {
            var memoryStream = new MemoryStream();
            stream.CopyTo(memoryStream);

            byte[] buffer = memoryStream.ToArray();

            Encoding encode = Encoding.ASCII;
            string html = encode.GetString(buffer);

            string localCharacterSet = characterSet;

            Match match = Regex.Match(html, "<meta([^<]*)charset=([^<]*)\"", RegexOptions.IgnoreCase);
            if (match.Success)
            {
                localCharacterSet = match.Groups[2].Value;

                var stringBuilder = new StringBuilder();
                foreach (char item in localCharacterSet)
                {
                    if (item == ' ')
                    {
                        break;
                    }

                    if (item != '\"')
                    {
                        stringBuilder.Append(item);
                    }
                }

                localCharacterSet = stringBuilder.ToString();
            }

            if (string.IsNullOrEmpty(localCharacterSet))
            {
                localCharacterSet = characterSet;
            }

            if (!string.IsNullOrEmpty(localCharacterSet))
            {
                encode = Encoding.GetEncoding(localCharacterSet);
            }

            memoryStream.Close();

            return encode.GetString(buffer);
        }

        /// <summary>Extracts links from the page and queues them.</summary>
        private void ParseLinks(UrlInfo urlInfo, string html)
        {
            if (this.Settings.Depth > 0 && urlInfo.Depth >= this.Settings.Depth)
            {
                return;
            }

            var urlDictionary = new Dictionary<string, string>();

            Match match = Regex.Match(html, "(?i)<a .*?href=\"([^\"]+)\"[^>]*>(.*?)</a>");
            while (match.Success)
            {
                // The href is the key...
                string urlKey = match.Groups[1].Value;

                // ...and the link text (with inner tags stripped) is the value.
                string urlValue = Regex.Replace(match.Groups[2].Value, "(?i)<.*?>", string.Empty);

                urlDictionary[urlKey] = urlValue;
                match = match.NextMatch();
            }

            foreach (var item in urlDictionary)
            {
                string href = item.Key;
                string text = item.Value;

                if (!string.IsNullOrEmpty(href))
                {
                    bool canBeAdd = true;

                    if (this.Settings.EscapeLinks != null && this.Settings.EscapeLinks.Count > 0)
                    {
                        if (this.Settings.EscapeLinks.Any(suffix => href.EndsWith(suffix, StringComparison.OrdinalIgnoreCase)))
                        {
                            canBeAdd = false;
                        }
                    }

                    if (this.Settings.HrefKeywords != null && this.Settings.HrefKeywords.Count > 0)
                    {
                        if (!this.Settings.HrefKeywords.Any(href.Contains))
                        {
                            canBeAdd = false;
                        }
                    }

                    if (canBeAdd)
                    {
                        // Decode escaped characters in the href.
                        string url = href.Replace("%3f", "?")
                            .Replace("%3d", "=")
                            .Replace("%2f", "/")
                            .Replace("&amp;", "&");

                        if (string.IsNullOrEmpty(url) || url.StartsWith("#")
                            || url.StartsWith("mailto:", StringComparison.OrdinalIgnoreCase)
                            || url.StartsWith("javascript:", StringComparison.OrdinalIgnoreCase))
                        {
                            continue;
                        }

                        var baseUri = new Uri(urlInfo.UrlString);
                        Uri currentUri = url.StartsWith("http", StringComparison.OrdinalIgnoreCase)
                                             ? new Uri(url)
                                             : new Uri(baseUri, url);
                        url = currentUri.AbsoluteUri;

                        if (this.Settings.LockHost)
                        {
                            // Strip the first host label (the subdomain) and compare the rest;
                            // equal means the same site, e.g. mail.pzcast.com and www.pzcast.com.
                            if (baseUri.Host.Split('.').Skip(1).Aggregate((a, b) => a + "." + b)
                                != currentUri.Host.Split('.').Skip(1).Aggregate((a, b) => a + "." + b))
                            {
                                continue;
                            }
                        }

                        if (!this.IsMatchRegular(url))
                        {
                            continue;
                        }

                        var addUrlEventArgs = new AddUrlEventArgs { Title = text, Depth = urlInfo.Depth + 1, Url = url };
                        if (this.AddUrlEvent != null && !this.AddUrlEvent(addUrlEventArgs))
                        {
                            continue;
                        }

                        UrlQueue.Instance.EnQueue(new UrlInfo(url) { Depth = urlInfo.Depth + 1 });
                    }
                }
            }
        }

        /// <summary>Persists response cookies when KeepCookie is enabled.</summary>
        private void PersistenceCookie(HttpWebResponse response)
        {
            if (!this.Settings.KeepCookie)
            {
                return;
            }

            string cookies = response.Headers["Set-Cookie"];
            if (!string.IsNullOrEmpty(cookies))
            {
                var cookieUri =
                    new Uri(
                        string.Format(
                            "{0}://{1}:{2}/",
                            response.ResponseUri.Scheme,
                            response.ResponseUri.Host,
                            response.ResponseUri.Port));
                this.cookieContainer.SetCookies(cookieUri, cookies);
            }
        }

        #endregion
    }
}
CrawlMaster.cs
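How the pieces fit together: you configure a `CrawlSettings`, subscribe to the three events, and call `Crawl()`. A minimal console sketch (not standalone; it assumes the Feng.SimpleCrawler sources above are compiled into the project, and the handler bodies are illustrative):

```csharp
using System;
using Feng.SimpleCrawler;

class Program
{
    static void Main()
    {
        var settings = new CrawlSettings();
        settings.SeedsAddress.Add("http://www.cnblogs.com/");
        settings.ThreadCount = 2;
        settings.Depth = 3;

        var master = new CrawlMaster(settings);

        // Return false to reject a discovered link; this is where a
        // BloomFilter<string> dedup check would go.
        master.AddUrlEvent += args =>
        {
            Console.WriteLine("queue [{0}] {1}", args.Depth, args.Url);
            return true;
        };

        // Downloaded pages arrive here; this is where HtmlAgilityPack
        // parsing and MssqlHelper buffering would happen.
        master.DataReceivedEvent += args =>
            Console.WriteLine("got {0} chars from {1}", args.Html.Length, args.Url);

        master.CrawlErrorEvent += args =>
            Console.WriteLine("error on {0}: {1}", args.Url, args.Exception.Message);

        master.Crawl();
    }
}
```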
namespace Feng.SimpleCrawler
{
    using System;
    using System.Collections.Generic;

    /// <summary>The crawl settings.</summary>
    [Serializable]
    public class CrawlSettings
    {
        #region Fields

        /// <summary>The depth.</summary>
        private byte depth = 3;

        /// <summary>The lock host.</summary>
        private bool lockHost = true;

        /// <summary>The thread count.</summary>
        private byte threadCount = 1;

        /// <summary>The timeout.</summary>
        private int timeout = 15000;

        /// <summary>The user agent.</summary>
        private string userAgent =
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11";

        #endregion

        #region Constructors and Destructors

        /// <summary>Initializes a new instance of the <see cref="CrawlSettings"/> class.</summary>
        public CrawlSettings()
        {
            this.AutoSpeedLimit = false;
            this.EscapeLinks = new List<string>();
            this.KeepCookie = true;
            this.HrefKeywords = new List<string>();
            this.LockHost = true;
            this.RegularFilters = new List<string>();
            this.SeedsAddress = new List<string>();
        }

        #endregion

        #region Public Properties

        /// <summary>Gets or sets a value indicating whether auto speed limit.</summary>
        public bool AutoSpeedLimit { get; set; }

        /// <summary>Gets or sets the depth.</summary>
        public byte Depth
        {
            get { return this.depth; }
            set { this.depth = value; }
        }

        /// <summary>Gets the escape links.</summary>
        public List<string> EscapeLinks { get; private set; }

        /// <summary>Gets or sets a value indicating whether keep cookie.</summary>
        public bool KeepCookie { get; set; }

        /// <summary>Gets the href keywords.</summary>
        public List<string> HrefKeywords { get; private set; }

        /// <summary>Gets or sets a value indicating whether lock host.</summary>
        public bool LockHost
        {
            get { return this.lockHost; }
            set { this.lockHost = value; }
        }

        /// <summary>Gets the regular filters.</summary>
        public List<string> RegularFilters { get; private set; }

        /// <summary>Gets the seeds address.</summary>
        public List<string> SeedsAddress { get; private set; }

        /// <summary>Gets or sets the thread count.</summary>
        public byte ThreadCount
        {
            get { return this.threadCount; }
            set { this.threadCount = value; }
        }

        /// <summary>Gets or sets the timeout.</summary>
        public int Timeout
        {
            get { return this.timeout; }
            set { this.timeout = value; }
        }

        /// <summary>Gets or sets the user agent.</summary>
        public string UserAgent
        {
            get { return this.userAgent; }
            set { this.userAgent = value; }
        }

        #endregion
    }
}
CrawlSettings.cs
namespace Feng.SimpleCrawler
{
    /// <summary>The crawl status.</summary>
    public enum CrawlStatus
    {
        /// <summary>The url has been crawled.</summary>
        Completed = 1,

        /// <summary>The url has not been visited yet.</summary>
        NeverBeen = 2
    }
}
CrawlStatus.cs
namespace Feng.SimpleCrawler
{
    using System;

    /// <summary>The data received event handler.</summary>
    /// <param name="args">The args.</param>
    public delegate void DataReceivedEventHandler(DataReceivedEventArgs args);

    /// <summary>The data received event args.</summary>
    public class DataReceivedEventArgs : EventArgs
    {
        #region Public Properties

        /// <summary>Gets or sets the depth.</summary>
        public int Depth { get; set; }

        /// <summary>Gets or sets the html.</summary>
        public string Html { get; set; }

        /// <summary>Gets or sets the url.</summary>
        public string Url { get; set; }

        #endregion
    }
}
DataReceivedEventArgs.cs
namespace Feng.SimpleCrawler
{
    using System.Collections.Generic;
    using System.Threading;

    /// <summary>A thread-safe queue.</summary>
    /// <typeparam name="T">Any reference type.</typeparam>
    public abstract class SecurityQueue<T>
        where T : class
    {
        #region Fields

        /// <summary>The inner queue.</summary>
        protected readonly Queue<T> InnerQueue = new Queue<T>();

        /// <summary>The sync object.</summary>
        protected readonly object SyncObject = new object();

        /// <summary>The auto reset event.</summary>
        private readonly AutoResetEvent autoResetEvent;

        #endregion

        #region Constructors and Destructors

        /// <summary>Initializes a new instance of the <see cref="SecurityQueue{T}"/> class.</summary>
        protected SecurityQueue()
        {
            this.autoResetEvent = new AutoResetEvent(false);
        }

        #endregion

        #region Delegates

        /// <summary>Called before an item is enqueued; return false to reject it.</summary>
        /// <param name="target">The target.</param>
        /// <returns>The <see cref="bool"/>.</returns>
        public delegate bool BeforeEnQueueEventHandler(T target);

        #endregion

        #region Public Events

        /// <summary>The before-enqueue event.</summary>
        public event BeforeEnQueueEventHandler BeforeEnQueueEvent;

        #endregion

        #region Public Properties

        /// <summary>Gets the auto reset event.</summary>
        public AutoResetEvent AutoResetEvent
        {
            get { return this.autoResetEvent; }
        }

        /// <summary>Gets the count.</summary>
        public int Count
        {
            get
            {
                lock (this.SyncObject)
                {
                    return this.InnerQueue.Count;
                }
            }
        }

        /// <summary>Gets a value indicating whether the queue has items.</summary>
        public bool HasValue
        {
            get { return this.Count != 0; }
        }

        #endregion

        #region Public Methods and Operators

        /// <summary>Dequeues an item, or returns null when the queue is empty.</summary>
        public T DeQueue()
        {
            lock (this.SyncObject)
            {
                if (this.InnerQueue.Count > 0)
                {
                    return this.InnerQueue.Dequeue();
                }

                return default(T);
            }
        }

        /// <summary>Enqueues an item, subject to the BeforeEnQueueEvent filter.</summary>
        /// <param name="target">The target.</param>
        public void EnQueue(T target)
        {
            lock (this.SyncObject)
            {
                if (this.BeforeEnQueueEvent != null)
                {
                    if (this.BeforeEnQueueEvent(target))
                    {
                        this.InnerQueue.Enqueue(target);
                    }
                }
                else
                {
                    this.InnerQueue.Enqueue(target);
                }

                this.AutoResetEvent.Set();
            }
        }

        #endregion
    }
}
SecurityQueue.cs
namespace Feng.SimpleCrawler
{
    /// <summary>The url info.</summary>
    public class UrlInfo
    {
        #region Fields

        /// <summary>The url.</summary>
        private readonly string url;

        #endregion

        #region Constructors and Destructors

        /// <summary>Initializes a new instance of the <see cref="UrlInfo"/> class.</summary>
        /// <param name="urlString">The url string.</param>
        public UrlInfo(string urlString)
        {
            this.url = urlString;
        }

        #endregion

        #region Public Properties

        /// <summary>Gets or sets the depth.</summary>
        public int Depth { get; set; }

        /// <summary>Gets the url string.</summary>
        public string UrlString
        {
            get { return this.url; }
        }

        /// <summary>Gets or sets the status.</summary>
        public CrawlStatus Status { get; set; }

        #endregion
    }
}
UrlInfo.cs
namespace Feng.SimpleCrawler
{
    /// <summary>
    /// The url queue.
    /// </summary>
    public class UrlQueue : SecurityQueue<UrlInfo>
    {
        #region Constructors and Destructors

        /// <summary>
        /// Prevents a default instance of the <see cref="UrlQueue"/> class from being created.
        /// </summary>
        private UrlQueue()
        {
        }

        #endregion

        #region Public Properties

        /// <summary>
        /// Gets the instance.
        /// </summary>
        public static UrlQueue Instance
        {
            get
            {
                return Nested.Inner;
            }
        }

        #endregion

        /// <summary>
        /// The nested.
        /// </summary>
        private static class Nested
        {
            #region Static Fields

            /// <summary>
            /// The inner.
            /// </summary>
            internal static readonly UrlQueue Inner = new UrlQueue();

            #endregion
        }
    }
}
UrlQueue.cs
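UrlQueue is a classic "nested class" singleton: the CLR initializes Nested's static field exactly once, on first access, and type initialization is guaranteed to be thread-safe, so no explicit locking is needed. The same idea in isolation (CrawlerConfig and SingletonDemo are illustrative names, not from the project):

```csharp
using System;

// Illustrative nested-class singleton: lazy and thread-safe by virtue of
// the CLR's type-initialization guarantees, with no explicit lock.
public sealed class CrawlerConfig
{
    // Private constructor prevents outside instantiation.
    private CrawlerConfig()
    {
    }

    public static CrawlerConfig Instance
    {
        get { return Nested.Inner; }
    }

    private static class Nested
    {
        // Initialized once by the CLR, on first access to Nested.
        internal static readonly CrawlerConfig Inner = new CrawlerConfig();
    }
}

public static class SingletonDemo
{
    public static void Main()
    {
        // Every access yields the same instance.
        Console.WriteLine(ReferenceEquals(CrawlerConfig.Instance, CrawlerConfig.Instance)); // True
    }
}
```

Putting the field in a nested class (rather than directly on CrawlerConfig) keeps initialization truly lazy: it is deferred until Instance is first read, not triggered by any other static member.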

5. Creating the Windows service

With all the preparation done, we finally come to the main event. We all know console applications are not very stable, and this job of crawling articles from cnblogs needs to keep going for a long time, and to do so very reliably, so I thought of a Windows service. Create the Windows service; the code is as follows.
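One caveat worth noting before the listing (this is not how the original code does it): the service control manager expects OnStart to return promptly, so long-running work such as a crawl loop is commonly kicked off on a background thread. A minimal sketch with illustrative names:

```csharp
using System.ServiceProcess;
using System.Threading;

// Illustrative skeleton only: OnStart returns quickly after handing the
// long-running work to a background thread, keeping the SCM happy.
public class DemoCrawlerService : ServiceBase
{
    private Thread worker;

    protected override void OnStart(string[] args)
    {
        this.worker = new Thread(this.DoWork) { IsBackground = true };
        this.worker.Start();
    }

    protected override void OnStop()
    {
        // In a real service, signal the worker to finish up here.
    }

    private void DoWork()
    {
        // The long-running crawl loop would live here.
    }
}
```

The article's service below calls its processing method directly from OnStart; that works because CrawlMaster manages its own crawl threads, but the pattern above is the safer default for arbitrary long-running work.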

using System;
using System.Data;
using System.ServiceProcess;
using Feng.SimpleCrawler;
using Feng.DbHelper;
using Feng.Log;
using HtmlAgilityPack;

namespace Feng.Demo
{
    /// <summary>
    /// The Windows service.
    /// </summary>
    partial class FengCnblogsService : ServiceBase
    {
        #region Constructor

        /// <summary>
        /// Constructor.
        /// </summary>
        public FengCnblogsService()
        {
            InitializeComponent();
        }

        #endregion

        #region Fields and properties

        /// <summary>
        /// The crawler settings.
        /// </summary>
        private static readonly CrawlSettings Settings = new CrawlSettings();

        /// <summary>
        /// Temporary in-memory table for buffering data.
        /// </summary>
        private static DataTable dt = new DataTable();

        /// <summary>
        /// About the Bloom filter: http://www.cnblogs.com/heaad/archive/2011/01/02/1924195.html
        /// </summary>
        private static BloomFilter<string> filter;

        #endregion

        #region Start the service

        /// <summary>
        /// TODO: add code here to start the service.
        /// </summary>
        /// <param name="args"></param>
        protected override void OnStart(string[] args)
        {
            ProcessStart();
        }

        #endregion

        #region Stop the service

        /// <summary>
        /// TODO: add code here to perform the shutdown work needed to stop the service.
        /// </summary>
        protected override void OnStop()
        {
        }

        #endregion

        #region Main processing

        /// <summary>
        /// Main processing.
        /// </summary>
        private void ProcessStart()
        {
            dt.Columns.Add("BlogTitle", typeof(string));
            dt.Columns.Add("BlogUrl", typeof(string));
            dt.Columns.Add("BlogAuthor", typeof(string));
            dt.Columns.Add("BlogTime", typeof(string));
            dt.Columns.Add("BlogMotto", typeof(string));
            dt.Columns.Add("BlogDepth", typeof(string));
            filter = new BloomFilter<string>(200000);
            const string CityName = "";

            #region Seed addresses

            // Seed addresses.
            Settings.SeedsAddress.Add(string.Format("http://www.cnblogs.com/{0}", CityName));
            Settings.SeedsAddress.Add("http://www.cnblogs.com/artech");
            Settings.SeedsAddress.Add("http://www.cnblogs.com/wuhuacong/");
            Settings.SeedsAddress.Add("http://www.cnblogs.com/dudu/");
            Settings.SeedsAddress.Add("http://www.cnblogs.com/guomingfeng/");
            Settings.SeedsAddress.Add("http://www.cnblogs.com/daxnet/");
            Settings.SeedsAddress.Add("http://www.cnblogs.com/fenglingyi");
            Settings.SeedsAddress.Add("http://www.cnblogs.com/ahthw/");
            Settings.SeedsAddress.Add("http://www.cnblogs.com/wangweimutou/");

            #endregion

            #region URL keywords

            Settings.HrefKeywords.Add("a/");
            Settings.HrefKeywords.Add("b/");
            Settings.HrefKeywords.Add("c/");
            Settings.HrefKeywords.Add("d/");
            Settings.HrefKeywords.Add("e/");
            Settings.HrefKeywords.Add("f/");
            Settings.HrefKeywords.Add("g/");
            Settings.HrefKeywords.Add("h/");
            Settings.HrefKeywords.Add("i/");
            Settings.HrefKeywords.Add("j/");
            Settings.HrefKeywords.Add("k/");
            Settings.HrefKeywords.Add("l/");
            Settings.HrefKeywords.Add("m/");
            Settings.HrefKeywords.Add("n/");
            Settings.HrefKeywords.Add("o/");
            Settings.HrefKeywords.Add("p/");
            Settings.HrefKeywords.Add("q/");
            Settings.HrefKeywords.Add("r/");
            Settings.HrefKeywords.Add("s/");
            Settings.HrefKeywords.Add("t/");
            Settings.HrefKeywords.Add("u/");
            Settings.HrefKeywords.Add("v/");
            Settings.HrefKeywords.Add("w/");
            Settings.HrefKeywords.Add("x/");
            Settings.HrefKeywords.Add("y/");
            Settings.HrefKeywords.Add("z/");

            #endregion

            // Number of crawler threads.
            Settings.ThreadCount = 1;

            // Crawl depth.
            Settings.Depth = 55;

            // Links to ignore while crawling; matched by suffix, multiple entries may be added.
            Settings.EscapeLinks.Add("http://www.oschina.net/");

            // Automatic rate limiting: a random 1–5 second delay between requests.
            Settings.AutoSpeedLimit = false;

            // Lock the host: strip the second-level domain, then compare domains;
            // equal domains are treated as the same site.
            Settings.LockHost = false;

            Settings.RegularFilters.Add(@"http://([w]{3}.)+[cnblogs]+.com/");

            var master = new CrawlMaster(Settings);
            master.AddUrlEvent += MasterAddUrlEvent;
            master.DataReceivedEvent += MasterDataReceivedEvent;
            master.Crawl();
        }

        #endregion

        #region Print the URL

        /// <summary>
        /// The master add url event.
        /// </summary>
        /// <param name="args">
        /// The args.
        /// </param>
        /// <returns>
        /// The <see cref="bool"/>.
        /// </returns>
        private static bool MasterAddUrlEvent(AddUrlEventArgs args)
        {
            if (!filter.Contains(args.Url))
            {
                filter.Add(args.Url);
                Console.WriteLine(args.Url);
                if (dt.Rows.Count > 200)
                {
                    MssqlHelper.InsertDb(dt);
                    dt.Rows.Clear();
                }

                return true;
            }

            return false; // Returning false means: do not add the URL to the queue.
        }

        #endregion

        #region Parse the HTML

        /// <summary>
        /// The master data received event.
        /// </summary>
        /// <param name="args">
        /// The args.
        /// </param>
        private static void MasterDataReceivedEvent(SimpleCrawler.DataReceivedEventArgs args)
        {
            // Parse the page here; you can use something like HtmlAgilityPack (an HTML
            // parsing component), regular expressions, or your own string analysis.
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(args.Html);
            HtmlNode node = doc.DocumentNode.SelectSingleNode("//title");
            string title = node.InnerText;
            HtmlNode node2 = doc.DocumentNode.SelectSingleNode("//*[@id='post-date']");
            string time = node2.InnerText;
            HtmlNode node3 = doc.DocumentNode.SelectSingleNode("//*[@id='topics']/div/div[3]/a[1]");
            string author = node3.InnerText;
            HtmlNode node6 = doc.DocumentNode.SelectSingleNode("//*[@id='blogTitle']/h2");
            string motto = node6.InnerText;
            MssqlHelper.GetData(title, args.Url, author, time, motto, args.Depth.ToString(), dt);
            LogHelper.WriteLog(title);
            LogHelper.WriteLog(args.Url);
            LogHelper.WriteLog(author);
            LogHelper.WriteLog(time);
            LogHelper.WriteLog(motto == "" ? "null" : motto);
            LogHelper.WriteLog(args.Depth + "&dt.Rows.Count=" + dt.Rows.Count);

            // Write to the database whenever more than 100 rows have accumulated;
            // adjust the threshold to suit your own situation.
            if (dt.Rows.Count > 100)
            {
                MssqlHelper.InsertDb(dt);
                dt.Rows.Clear();
            }
        }

        #endregion
    }
}

Here the crawler fetches posts from cnblogs, and we use the third-party library HtmlAgilityPack to parse out the fields we need: the post title, the author, the post URL, and so on. At the same time we can configure some information for the service.
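For readers unfamiliar with HtmlAgilityPack, the parsing in MasterDataReceivedEvent boils down to loading the HTML text and querying nodes by XPath. A minimal sketch against a made-up HTML fragment (the fragment and its values are invented for illustration; the ids mirror the cnblogs ones used above):

```csharp
using System;
using HtmlAgilityPack;

public static class ParseDemo
{
    public static void Main()
    {
        // A made-up fragment standing in for a downloaded blog page.
        const string html =
            "<html><head><title>Demo Post - Blog Garden</title></head>" +
            "<body><span id=\"post-date\">2015-01-01 12:00</span></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath queries, as in MasterDataReceivedEvent above. SelectSingleNode
        // returns null when nothing matches, so real code should null-check
        // before reading InnerText.
        string title = doc.DocumentNode.SelectSingleNode("//title").InnerText;
        string time = doc.DocumentNode.SelectSingleNode("//*[@id='post-date']").InnerText;

        Console.WriteLine(title); // Demo Post - Blog Garden
        Console.WriteLine(time);  // 2015-01-01 12:00
    }
}
```

Note that XPath expressions tied to a page's exact layout (like the //*[@id='topics']/div/div[3]/a[1] used above) break whenever the site changes its markup, which is worth keeping in mind for a long-running service.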

In the crawler we need to set a few parameters: the seed addresses, the URL keywords, the crawl depth, and so on. Once that is done, all that remains is to install our Windows service, and the job is finished. Heh heh...

6. Installing the Windows service

Here we use the tool that ships with Visual Studio to install the Windows service.

After the installation succeeds, open the Windows Services manager and you can see the service we just installed.

We can also check the log files to see the information parsed out of the crawled posts, as shown below.

Now take a look at the database; this service of mine has already been running for a whole day...

If you found this article worthwhile, please recommend it. My abilities are limited, so if anything in it is off the mark, feel free to point it out. If any of you want the source code, leave your email address...
