
Python novel-scraping tutorial: scraping novels with urllib2 and BeautifulSoup

東西二王 > Python

2019.05.20


Libraries

urllib2

Sends HTTP requests and fetches the HTML (Python 2 only).

BeautifulSoup

Selects DOM nodes with selectors; see a CSS selector reference for the syntax.
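Since urllib2 exists only in Python 2, readers on Python 3 may find the following sketch useful. It uses urllib.request, the module urllib2 was folded into in Python 3. The helper names build_request and fetch_html, the User-Agent value, and the timeout are assumptions for illustration and do not appear in the original post.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Build a GET request; the User-Agent value is an assumed placeholder."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Fetch a page and decode it as UTF-8 (performs a real network call)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Building the request does not touch the network, so it is safe to inspect.
req = build_request("https://www.qidian.com/free/all")
```

Only fetch_html contacts the site; keeping request construction separate makes it possible to check the URL and headers offline.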

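Scraping logic

1. Browse Qidian's list of free novels: https://www.qidian.com/free/all

2. First work out the scraping logic for a single book

2.1 Get the book's link and title with a selector

bookCover = book.select('div[class="book-mid-info"] h4 > a')[0]

A CSS selector takes us straight to the element we need: the a directly inside the h4 of the book-mid-info div.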
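To make concrete what that selector matches, here is a minimal sketch that does the same lookup with the standard library's html.parser instead of BeautifulSoup, so it runs without extra packages. The sample markup is hypothetical, modelled on the selector in the text; the real page may be structured differently. BookLinkParser is an illustrative name, not part of any library.

```python
from html.parser import HTMLParser

# Hypothetical markup modelled on the selector; the live page may differ.
SAMPLE = """
<ul class="all-img-list cf">
  <li><div class="book-mid-info"><h4><a href="//book.qidian.com/info/1">Book One</a></h4></div></li>
  <li><div class="book-mid-info"><h4><a href="//book.qidian.com/info/2">Book Two</a></h4></div></li>
</ul>
"""


class BookLinkParser(HTMLParser):
    """Collects (href, title) for <a> directly inside <h4> under div[class="book-mid-info"]."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags as (tag, class attribute)
        self.books = []    # collected (href, title) pairs
        self._href = None  # href of the matching <a> being read, if any
        self._text = []

    def _matches(self):
        # The parent must be <h4>, with the book-mid-info div somewhere above it.
        if not self.stack or self.stack[-1][0] != "h4":
            return False
        return any(t == "div" and c == "book-mid-info" for t, c in self.stack[:-1])

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and self._matches():
            self._href = attrs.get("href")
            self._text = []
        self.stack.append((tag, attrs.get("class") or ""))

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.books.append((self._href, "".join(self._text).strip()))
            self._href = None
        # Pop back to the matching open tag, ignoring stray end tags.
        if any(t == tag for t, _ in self.stack):
            while self.stack.pop()[0] != tag:
                pass


parser = BookLinkParser()
parser.feed(SAMPLE)
books = parser.books
```

With BeautifulSoup the whole class collapses into the single select call from the text, which is the main reason to use it.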

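2.2 Create and open the file

bookFile = open('crawler/books/' + bookCover.string + '.txt', 'a+')

The file is opened in 'a+' mode: it is created if it does not exist, and content is appended if it does. The .txt file name is the text of the DOM node scraped above. The crawler/books directory must already exist.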
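One caveat with using the scraped title as a file name is that a title may contain characters such as "/" that are not valid in a path. The sketch below, written for Python 3, shows one way to handle that while keeping the append behaviour. safe_filename and open_book_file are hypothetical helpers, and the replacement rule is an assumption rather than anything from the original post.

```python
import os
import re
import tempfile


def safe_filename(title):
    """Replace characters that are unsafe in file names (hypothetical helper)."""
    return re.sub(r'[\\/:*?"<>|]', "_", title).strip() or "untitled"


def open_book_file(directory, title):
    """Create the directory if needed and open <title>.txt in append mode."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, safe_filename(title) + ".txt")
    return open(path, "a+", encoding="utf-8")


# Usage: opening the same book twice appends instead of overwriting.
demo_dir = tempfile.mkdtemp()
for chunk in ("chapter 1\n", "chapter 2\n"):
    with open_book_file(demo_dir, "a/b") as f:
        f.write(chunk)

with open(os.path.join(demo_dir, "a_b.txt"), encoding="utf-8") as f:
    content = f.read()
```

Because the mode is append, re-running the scraper for a book that was already partly downloaded adds to the existing file, so duplicates are possible unless the file is removed first.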

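2.3 Go to the book's text

First take the href of the node matched by 'div[class="book-mid-info"] h4 > a' and fetch the page it points to (the original post shows a screenshot here).

Then take the href of the "免費試讀" (free trial read) node and fetch the page it points to.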
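The hrefs on these pages are protocol-relative (they start with "//"), which is why the full script prepends 'https:' by hand. As an alternative sketch in Python 3, urllib.parse.urljoin resolves such links against the page they came from. The book href below is a made-up example, and absolute is an illustrative helper name.

```python
from urllib.parse import urljoin

LIST_URL = "https://www.qidian.com/free/all"


def absolute(base, href):
    """Resolve a relative or protocol-relative href against the page URL."""
    return urljoin(base, href)


# Hypothetical href of the kind the selector returns.
book_url = absolute(LIST_URL, "//book.qidian.com/info/123")
next_page_url = absolute(LIST_URL, "/free/all?page=2")
```

This also covers path-relative links, which plain string concatenation would get wrong.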

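2.4 Recursively fetch each chapter and write it to the file

Get the chapter node's content by its class, then take the href of the next-chapter link and recurse to fetch every chapter.

When the link reads "書末頁" (end of book) instead of pointing to a next chapter, the last chapter has been reached; the recursion ends and the whole book has been fetched.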
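One recursive call per chapter means a long book can run into Python's recursion limit (1000 frames by default in CPython). The following sketch shows the same chapter walk as a loop. The site is replaced by a small in-memory dictionary, so FAKE_CHAPTERS, fetch_chapter, and write_book are all hypothetical names, and None stands in for the end-of-book link.

```python
import io

# Hypothetical stand-in for the site: url -> (title, paragraphs, next url or None).
FAKE_CHAPTERS = {
    "ch1": ("Chapter 1", ["first paragraph"], "ch2"),
    "ch2": ("Chapter 2", ["second paragraph"], "ch3"),
    "ch3": ("Chapter 3", ["last paragraph"], None),  # None plays the end-of-book link
}


def fetch_chapter(url):
    """Return (title, paragraphs, next_url) for a chapter; stubbed for the demo."""
    return FAKE_CHAPTERS[url]


def write_book(out, first_url, fetch=fetch_chapter):
    """Follow next-chapter links with a loop, writing each chapter to `out`."""
    url, count = first_url, 0
    seen = set()  # guards against a link cycle
    while url is not None and url not in seen:
        seen.add(url)
        title, paragraphs, next_url = fetch(url)
        out.write(title + "\n")
        for paragraph in paragraphs:
            out.write(paragraph + "\n")
        count += 1
        url = next_url
    return count


buf = io.StringIO()
chapters_written = write_book(buf, "ch1")
```

In a real scraper, fetch would download the page and extract the title, paragraphs, and next link with the selectors described above.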

3. Loop over every book on the current page

Each book is an li element; select all of them and process each one as in step 2.

4. Loop over the books on every page

Once every book on the current page has been scraped, take the href of the ">" (next page) link, fetch the page it points to, and keep looping.

On the last page, the ">" node gains an extra class, lbf-pagination-disabled, which can be used to detect that there are no more pages.
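The page loop and its stopping condition can be sketched as follows. The listing pages are faked with a dictionary so the example runs offline; FAKE_PAGES, is_last_page, and crawl_all_pages are hypothetical names. Passing the classes as a list mirrors how BeautifulSoup exposes a tag's class attribute.

```python
# Hypothetical stand-in for the listing: page number -> (titles, classes of the ">" link).
FAKE_PAGES = {
    1: (["A", "B"], ["lbf-pagination-next"]),
    2: (["C"], ["lbf-pagination-next"]),
    3: (["D"], ["lbf-pagination-next", "lbf-pagination-disabled"]),
}


def is_last_page(next_link_classes):
    """The ">" link carries lbf-pagination-disabled on the final page."""
    return "lbf-pagination-disabled" in next_link_classes


def crawl_all_pages(fetch_page, start_page=1):
    """Collect the titles from every page, stopping once ">" is disabled."""
    titles = []
    page = start_page
    while True:
        books, next_classes = fetch_page(page)
        titles.extend(books)
        if is_last_page(next_classes):
            break
        page += 1
    return titles


all_titles = crawl_all_pages(FAKE_PAGES.__getitem__)
```

Checking the class explicitly makes the exit condition visible, whereas the full script relies on the next href no longer starting with "//".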

Result (screenshot in the original post)

Full code

# coding=utf-8
import urllib2
import sys
from bs4 import BeautifulSoup

# Set the default encoding (Python 2 only)
reload(sys)
sys.setdefaultencoding('utf-8')

startIndex = 0  # start from book 0 by default
startPage = 0   # start from page 0 by default


# Fetch the content of one chapter
def getChapterContent(file, url):
    try:
        bookContentRes = urllib2.urlopen(url)
        bookContentSoup = BeautifulSoup(bookContentRes.read(), 'html.parser')
        file.write(bookContentSoup.select('h3[class="j_chapterName"]')[0].string + '\n')
        for p in bookContentSoup.select('.j_readContent p'):
            file.write(p.next + '\n')
    except BaseException as e:
        # If something goes wrong, run it again (note: retries without limit)
        print(e)
        getChapterContent(file, url)
    else:
        chapterNext = bookContentSoup.select('a#j_chapterNext')[0]
        # '書末頁' is the text of the end-of-book link
        if chapterNext.string != '書末頁':
            nextUrl = 'https:' + chapterNext['href']
            getChapterContent(file, nextUrl)


# Fetch the content of every book on the current page
def getCurrentUrlBooks(url):
    response = urllib2.urlopen(url)
    the_page = response.read()
    soup = BeautifulSoup(the_page, 'html.parser')
    bookArr = soup.select('ul[class="all-img-list cf"] > li')
    global startIndex
    if startIndex > 0:
        bookArr = bookArr[startIndex:]
        startIndex = 0
    for book in bookArr:
        bookCover = book.select('div[class="book-mid-info"] h4 > a')[0]
        print 'Book title: ' + bookCover.string
        # Create the .txt file first, then fetch the text and write it
        bookFile = open('crawler/books/' + bookCover.string + '.txt', 'a+')
        bRes = urllib2.urlopen('https:' + bookCover['href'])
        bSoup = BeautifulSoup(bRes.read(), 'html.parser')
        bookContentHref = bSoup.select('a[class="red-btn J-getJumpUrl "]')[0]['href']
        getChapterContent(bookFile, 'https:' + bookContentHref)
        bookFile.close()
    nextPage = soup.select('a.lbf-pagination-next')[0]
    return nextPage['href']


if len(sys.argv) == 1:
    pass
elif len(sys.argv) == 2:
    startPage = int(sys.argv[1]) // 20   # page to start downloading from
    startIndex = int(sys.argv[1]) % 20   # book to start downloading from
elif len(sys.argv) > 2:
    startPage = int(sys.argv[1])
    startIndex = int(sys.argv[2])

# Use the arguments to decide where to start downloading
url = '//www.qidian.com/free/all?orderId=&vip=hidden&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=1&page=' + str(startPage + 1)

# Loop until there is no next page
while True:
    if url.startswith('//'):
        url = getCurrentUrlBooks('https:' + url)
    else:
        break
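Two details of the script are easy to misread: how a single numeric argument is split into a start page and a start index, and the fact that a failing chapter is retried without any limit. The Python 3 sketch below isolates both. parse_start and with_retries are hypothetical helpers, and the default of three attempts is an assumption, not something the original script does.

```python
PAGE_SIZE = 20  # matches pageSize=20 in the listing URL


def parse_start(args):
    """Return (start_page, start_index) from the arguments after the script name."""
    if len(args) == 0:
        return 0, 0
    if len(args) == 1:
        # One number is an overall book offset: 45 means page 2, book 5.
        return divmod(int(args[0]), PAGE_SIZE)
    return int(args[0]), int(args[1])


def with_retries(action, attempts=3):
    """Call action() until it succeeds, re-raising the last error after `attempts` tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except Exception as exc:
            last_error = exc
    raise last_error
```

A call such as with_retries(lambda: fetch(url)) would then give up after a few failures instead of recursing forever on a page that never loads; fetch here is a placeholder for whatever download function is in use.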
