Preface

I. Basic environment setup

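Python version: Python 3.8.3

Editor: Spyder under Anaconda3

Browser version: Google Chrome 87.0.4280.88

Browser driver: this article uses the webdriver module in selenium to drive the browser and simulate human clicks in order to scrape information, so you also need to download the driver that matches your browser version. ChromeDriver download address: https://npm.taobao.org/mirrors/chromedriver

That address hosts drivers for many versions. Find the ChromeDriver for your own Chrome version (I downloaded /87.0.4280.20), download it, place it in Chrome's installation directory, and add that path to the PATH environment variable.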
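Choosing the right driver comes down to comparing version strings. The sketch below shows one way to check this, assuming that agreement on the first three components (major.minor.build) is enough for a driver to work with a browser. The function names are hypothetical helpers, not part of selenium.

```python
def version_prefix(version, parts=3):
    """Return the first `parts` numeric components of a dotted version string."""
    return tuple(int(p) for p in version.strip().split(".")[:parts])


def driver_matches_browser(browser_version, driver_version):
    """True when browser and driver agree on major.minor.build (an assumed rule)."""
    return version_prefix(browser_version) == version_prefix(driver_version)


# The two versions quoted in this article.
print(driver_matches_browser("87.0.4280.88", "87.0.4280.20"))  # True
# An illustrative mismatch: a driver built for a different major release.
print(driver_matches_browser("87.0.4280.88", "86.0.4240.22"))  # False
```

If the check fails, download a different driver build before going further; a mismatched driver usually fails as soon as the browser is launched.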

II. Steps

1. Importing libraries

The code is as follows (example):

    from bs4 import BeautifulSoup
    from selenium import webdriver
    # ActionChains automates basic interactions such as mouse moves and clicks,
    # and can chain several steps into a single action
    from selenium.webdriver import ActionChains
    import PIL
    from PIL import Image
    import time
    import base64  # Base64 encodes binary data as text
    import threading
    import pandas as pd

What the main libraries do:

BeautifulSoup: a Python library for extracting data from HTML or XML files. It works with the parser of your choice and provides idiomatic ways to navigate, search, and modify the document. This article uses it to parse the page source fetched by webdriver and pull out the data we want. For details, see: https://beautifulsoup.readthedocs.io/zh_CN/latest/

selenium.webdriver: Python scrapers commonly use the urlopen method in urllib, which returns a response object; calling read on it gives the HTML of the URL, and beautifulsoup combined with regular expressions then extracts the content of a given tag. urlopen can also send the cookies obtained after logging in, which avoids logging in on every request. The limitation of urllib is that it only retrieves the static HTML of a page. Content generated dynamically is not part of the static HTML, so the HTML that is fetched sometimes differs from what we actually see in the browser. The selenium module can retrieve the content that a dynamic page generates.

Image: Image.open() opens an image file from disk. This article uses it to open the Taobao login QR code that has been saved locally; scanning the enlarged QR code completes the Taobao login.

time: this article calls time.sleep() to pause execution so that requests are not fast enough to be identified as a crawler.

2. Worked example

Logging in to Taobao:

    def login_first(self):
        # Taobao home page link; with https, requests.get initially returned status_code 502
        # pageview_url = 'http://www.taobao.com/?spm=a1z02.1.1581860521.1.CPoW0X'  (already defined as a class attribute)
        # PhantomJS vs. Chrome: Chrome opens a visible browser, so every step the code performs can be watched;
        # PhantomJS simulates a headless browser and operates on pages without opening a window
        # driver = webdriver.PhantomJS()
        # driver = webdriver.Chrome()  (already defined as a class attribute)
        # get() can take a very long time (possibly a network problem), so stop loading after 40 seconds
        self.driver.set_page_load_timeout(40)
        self.driver.set_script_timeout(40)
        try:
            self.driver.get(self.pagelogin_url)
        except Exception:
            print('Page is loading too slowly; stopping the load and continuing')
            self.driver.execute_script('window.stop()')
        # Wait here; otherwise the statements below run before the page has finished loading
        # and the elements cannot be found
        time.sleep(40)
        # Find the login button and click it
        # Originally the cookies were passed to the requests library to visit pages that require login,
        # but that has no effect on webdriver
        # WebDriver element-location methods: https://www.cnblogs.com/yufeihlf/p/5717291.html
        # Web automation must first locate an element before operating on it (type, click, clear, submit, ...)
        # XPath is a language for locating elements in an XML document and is a commonly used locator
        # Locate by attribute: find_element_by_xpath("//tag[@attribute='value']")
        # Possible attributes: id, class, name, maxlength
        # Locate by tag name (input elements): find_element_by_xpath('//input')
        # Locate by parent and child: find_element_by_xpath('//span/input')
        # Locate by text content: find_element_by_xpath("//p[contains(text(),'京公網')]")
        # '//' searches from any node instead of walking down from the root one level at a time,
        # e.g. the first div under the tag with id='login'
        # To find the XPath of a button or page region: inspect the page, click the arrow icon in the
        # top-left corner, click the target content (Elements jumps to its tag), then right-click and Copy XPath
        # self.driver.find_element_by_xpath("//*[@class='btn-login ml1 tb-bg weight']").click()
        # time.sleep(40)
        # Switch to QR-code login: find the QR-code toggle and click it
        try:
            self.driver.find_element_by_xpath("//*[@id='login']/div[1]/i").click()
        except Exception:
            pass
        # Pause for a while, otherwise the script is detected as a crawler; also lets the page finish loading
        time.sleep(20)
        # Run JavaScript to read the QR code from the canvas (located by tag name)
        JS = "return document.getElementsByTagName('canvas')[0].toDataURL('image/png');"
        im_info = self.driver.execute_script(JS)  # data URL of the image
        im_base64 = im_info.split(',')[1]  # Base64-encoded image data
        im_bytes = base64.b64decode(im_base64)  # convert to bytes
        time.sleep(2)
        # Save the QR code to disk
        with open('E:/學習/login.png', 'wb') as f:
            f.write(im_bytes)
        # Open the QR code image; it must be scanned by hand to log in
        t = threading.Thread(target=self.opening, args=('E:/學習/login.png',))
        t.start()
        print('Logining...Please sweep the code!\n')
        # Poll for the cookies set after login (they contain the user name, not the account or password)
        while True:
            c = self.driver.get_cookies()
            if len(c) > 20:  # login succeeded and the cookies are available
                # Commented out: cookies reduced to name and value work with requests only;
                # passing them to webdriver's add_cookie raises InvalidCookieDomainException
                # cookies = {}
                # for i in range(len(c)):
                #     cookies[c[i]['name']] = c[i]['value']
                # Closes the current window; skip this call if later methods reuse self.driver
                self.driver.close()
                print('Login in successfully!\n')
                # return cookies
                return c
            time.sleep(10)
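The QR-code step relies on the fact that `toDataURL('image/png')` returns a string of the form `data:image/png;base64,<payload>`. The following is a minimal, browser-free sketch of just the decoding part. `decode_data_url` is a hypothetical helper, and the sample string is built locally from the PNG file signature rather than taken from a real canvas.

```python
import base64


def decode_data_url(data_url):
    """Split 'data:<mime>;base64,<payload>' and decode the payload to bytes."""
    header, _, payload = data_url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URL")
    return base64.b64decode(payload)


# Stand-in for the string a canvas would return: the 8-byte PNG signature, Base64-encoded.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
sample_url = "data:image/png;base64," + base64.b64encode(PNG_SIGNATURE).decode("ascii")

image_bytes = decode_data_url(sample_url)
print(image_bytes == PNG_SIGNATURE)  # True
```

Checking the header before decoding gives a clearer error than an `IndexError` from `split(',')[1]` when the script returns something unexpected.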

After the code above runs, the Taobao login is complete and the browser has been redirected to the My Taobao page.
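The login step waits for the scan by polling `get_cookies()` in an endless loop. One possible refinement is a polling helper with a deadline, sketched below. `wait_until` and `fake_get_cookies` are hypothetical names; the fake stands in for `driver.get_cookies()` so that the example runs without a browser. Selenium also ships its own `WebDriverWait` class for this purpose.

```python
import time


def wait_until(predicate, timeout=120.0, interval=1.0):
    """Call predicate() every `interval` seconds until it is truthy or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(interval)


# Fake cookie source: returns 10 more cookies on every call.
calls = {"n": 0}


def fake_get_cookies():
    calls["n"] += 1
    return [{"name": "c%d" % i, "value": "x"} for i in range(calls["n"] * 10)]


def logged_in():
    """Same test as the login loop: more than 20 cookies means the scan succeeded."""
    cookies = fake_get_cookies()
    return cookies if len(cookies) > 20 else None


cookies = wait_until(logged_in, timeout=5, interval=0.01)
print(len(cookies))  # 30
```

With a deadline, a QR code that is never scanned ends in a `TimeoutError` instead of a script that hangs forever.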

Going from My Taobao to the Taobao home page:

    def my_split(self, s, seps):
        '''Split s on every separator in seps.'''
        res = [s]
        for sep in seps:
            t = []
            for ss in res:
                t.extend(ss.split(sep))
            res = t
        return res

    def is_Chinese(self, word):
        '''Return True if word contains at least one Chinese character.'''
        for ch in word:
            if '\u4e00' <= ch <= '\u9fff':
                return True
        return False

    def get_Cates(self):
        '''After login the browser lands on the My Taobao page; this method goes to the
        Taobao home page and collects all product categories.'''
        # Login is done and we are on the My Taobao page; click "Taobao home" to go to the home page
        time.sleep(10)
        # Inspect the page source to find the attributes of the "Taobao home" button, then click it
        self.driver.find_element_by_xpath("//*[@class='site-nav-menu site-nav-home']").click()
        # On the home page, first collect every category in the left-hand sidebar
        # Taobao renders content with JavaScript, so what the requests library fetches differs from
        # what the page inspector shows; that is why selenium is needed
        # How to tell static pages from dynamic ones: https://www.jianshu.com/p/236fc043db0b
        # Inspect the source to find the class of the target content (the category panel)
        driver_data = self.driver.find_element_by_xpath("//*[@class='screen-outer clearfix']")
        html_doc = self.driver.page_source
        # driver.quit()
        # Parse the page source with BeautifulSoup
        soup = BeautifulSoup(html_doc, 'lxml')
        # Find all categories on the home page; the class is found by checking which element encloses them
        soup_data_list = soup.find('div', attrs={'class': 'service J_Service'})
        # Take the text of that element, i.e. all of Taobao's category names,
        # and split it on the unwanted characters with the custom split function
        separators = ['\n', '\\', '\ue62e', '/', '\t', ' ']
        cate_list = self.my_split(soup_data_list.text, separators)
        # Keep only the pieces that contain Chinese text
        for i in cate_list:
            if self.is_Chinese(i):
                self.cate_list_final.append(i)
        time.sleep(10)
        return self.cate_list_final

After the code above runs, the names of all categories on the Taobao home page have been collected.
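The split-then-filter cleanup can also be written with the standard `re` module. The sketch below is an alternative to `my_split` and `is_Chinese`, not the article's own implementation. `extract_chinese_tokens` is a hypothetical name, and the sample text is made up to imitate the sidebar text.

```python
import re

# Same separators as in get_Cates, compiled into one character class.
SEPARATORS = ["\n", "\\", "\ue62e", "/", "\t", " "]
SPLIT_PATTERN = re.compile("[" + re.escape("".join(SEPARATORS)) + "]")
# Same code-point range that is_Chinese tests.
CJK_PATTERN = re.compile("[\u4e00-\u9fff]")


def extract_chinese_tokens(text):
    """Split on any separator and keep the pieces that contain Chinese characters."""
    tokens = SPLIT_PATTERN.split(text)
    return [t for t in tokens if CJK_PATTERN.search(t)]


# Made-up sample that mixes separators, an icon code point, and a non-Chinese token.
sample = "女裝 / 男裝\n內衣\ue62e鞋靴 / abc\t箱包"
print(extract_chinese_tokens(sample))
# ['女裝', '男裝', '內衣', '鞋靴', '箱包']
```

A single compiled pattern walks the text once, whereas the loop version re-splits the whole list once per separator. For sidebar-sized text the difference is negligible, so the choice is mostly about readability.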

Type the category into the search box, click Search, click the sort-by-sales button, then fetch the page content and parse it:

    def search_Taobao(self, cate):
        print('Category being searched: %s' % cate)
        # The "Taobao home" link has the same class on the My Taobao page and on every search
        # result page, so the click that returns to the home page is done by the caller
        time.sleep(10)
        # Type the search term into the search box on the home page and click Search
        # The search box consists of two overlapping inputs: the placeholder input must be clicked
        # first, and only then does the real input appear and accept text
        # So first click the wrapper to make the box interactive
        self.driver.find_element_by_xpath("//*[@class='search-combobox-input-wrap']").click()
        # Then find the real input by its class and type the search term
        driver_input_data = self.driver.find_element_by_xpath("//*[@class='search-combobox-input']")
        # Fill in the category to search (for some reason this step still asks for login even with cookies)
        # driver_input_data.send_keys('女裝')
        driver_input_data.send_keys(cate)
        # Pause for 8 seconds; going too fast gets the script identified as a crawler
        time.sleep(8)
        # Find the Search button on the page and click it
        try:
            submit = self.driver.find_element_by_xpath("//*[@class='search-button']")
            submit.click()
        except Exception:
            pass
        time.sleep(5)

    def get_Catinfo(self, cate):
        # self.login_first()
        time.sleep(20)
        self.search_Taobao(cate)
        # We are now on the result page for the category; get the first page sorted by sales, descending
        time.sleep(50)
        # Find the sort-by-sales element and click it to sort the products in descending order
        submit_order = self.driver.find_element_by_xpath("//*[@class='J_Ajax link  ']")
        submit_order.click()
        time.sleep(5)
        # Get the full page source
        html_doc = self.driver.page_source
        # Extract the required fields of each product from the page source
        soup = BeautifulSoup(html_doc, 'lxml')
        shop_data_list = soup.find('div', class_='grid g-clearfix').find_all_next('div', class_='items')
        for shop_data in shop_data_list:
            # The fields are spread over two different classes
            shop_data_a = shop_data.find_all('div', class_='ctx-box J_MouseEneterLeave J_IconMoreNew')
            shop_data_b = shop_data.find_all('div', class_='pic-box J_MouseEneterLeave J_PicBox')
            for goods_contents_b in shop_data_b:
                # Extra column: the category being scraped
                self.shop_cate_list.append(cate)
                # Product name
                goods_name = goods_contents_b.find('div', class_='pic').find_all('img', class_='J_ItemPic img')[0]['alt']
                self.goods_name_list.append(goods_name)
                # Product image
                goods_pic = goods_contents_b.find('div', class_='pic').find_all('img', class_='J_ItemPic img')[0]['src']
                self.goods_pic_list.append(goods_pic)
            for goods_contents_a in shop_data_a:
                # Product price (trace-price attribute)
                goods_price = goods_contents_a.find_all('a', class_='J_ClickStat')[0]['trace-price']
                self.goods_price_list.append(goods_price)
                # goods_price = goods_contents_a.find('div', class_='price g_price g_price-highlight')
                # goods_price_list.append(goods_price)
                # Product sales count (stored as a Tag; use .text to get the string)
                goods_salenum = goods_contents_a.find('div', class_='deal-cnt')
                self.goods_salenum_list.append(goods_salenum)
                # Product id
                goods_id = goods_contents_a.find_all('a', class_='J_ClickStat')[0]['data-nid']
                self.goods_id_list.append(goods_id)
                # Product link
                goods_href = goods_contents_a.find_all('a', class_='J_ClickStat')[0]['href']
                self.goods_href_list.append(goods_href)
                # Shop name
                goods_store = goods_contents_a.find('a', class_='shopname J_MouseEneterLeave J_ShopInfo').contents[3]
                # goods_store = goods_contents.find_all('span', class_='dsrs')
                self.goods_store_list.append(goods_store)
                # Shop location
                goods_address = goods_contents_a.find('div', class_='location').contents
                self.goods_address_list.append(goods_address)
        # Arrange the scraped results as rows for a DataFrame
        for j in range(min(
                len(self.goods_name_list), len(self.goods_id_list), len(self.goods_price_list),
                len(self.goods_salenum_list), len(self.goods_pic_list), len(self.goods_href_list),
                len(self.goods_store_list), len(self.goods_address_list))):
            self.data.append([self.shop_cate_list[j], self.goods_name_list[j], self.goods_id_list[j],
                              self.goods_price_list[j], self.goods_salenum_list[j], self.goods_pic_list[j],
                              self.goods_href_list[j], self.goods_store_list[j], self.goods_address_list[j]])
        # out_df = pd.DataFrame(self.data, columns=['shop_cate',
        #                                           'goods_name',
        #                                           'goods_id',
        #                                           'goods_price',
        #                                           'goods_salenum',
        #                                           'goods_pic',
        #                                           'goods_href',
        #                                           'goods_store',
        #                                           'goods_address'])
        # self.Back_Homepage()
        # Without this pause the click may happen before the page has loaded and hit the wrong target
        time.sleep(20)
        self.driver.find_element_by_xpath("//*[@class='site-nav-menu site-nav-home']").click()
        return self.data

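After the code above runs, the product information on the first page of each category's search results, sorted by sales in descending order, has been collected.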
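The last part of `get_Catinfo` turns several parallel lists into rows, stopping at the shortest list. The sketch below shows the same idea with `zip`, which stops at the shortest input in the same way. It writes CSV with the standard `csv` module instead of pandas so that it has no dependencies. `build_rows` and `rows_to_csv` are hypothetical helpers, and the product values are invented for illustration.

```python
import csv
import io

COLUMNS = ["shop_cate", "goods_name", "goods_id", "goods_price"]


def build_rows(*columns):
    """Combine parallel lists into rows; zip stops at the shortest list."""
    return [list(row) for row in zip(*columns)]


def rows_to_csv(rows, header):
    """Serialize rows to CSV text, with the header as the first line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Invented sample data; goods_ids is deliberately one item short.
shop_cates = ["女裝", "女裝", "女裝"]
goods_names = ["dress A", "dress B", "dress C"]
goods_ids = ["1001", "1002"]
goods_prices = ["59.00", "129.00", "88.00"]

rows = build_rows(shop_cates, goods_names, goods_ids, goods_prices)
print(len(rows))  # 2
print(rows_to_csv(rows, COLUMNS))
```

Because the shortest list silently decides the row count, it is worth comparing the list lengths before building rows, so that a missing field shows up as a warning rather than as lost products.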

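III. How do you locate the tags and attributes of page content?

The code above is the source for scraping Taobao's top-selling products per category with selenium, but the most important part of scraping is working out how to locate the tags and attributes of the content you need on the page. The method below comes from a web scraping tutorial on Tencent Video:
1. Right-click on the page and choose Inspect Element (or Inspect). The panel that opens on the right shows the markup of the page. Click the arrow icon in the top-left corner of that panel (to the left of Elements; its tooltip reads "Select an element in the page to inspect it"), then click anywhere on the page to jump to the corresponding tag and its attributes.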
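Once the inspector has shown a tag and its attributes, they can be turned into an XPath expression like the ones used in this article. The sketch below tries two such expressions on a small hand-written page using the standard library's `xml.etree.ElementTree`. This is an assumption-laden stand-in: ElementTree supports only a subset of XPath (for example, no `contains()`), it needs well-formed markup, and paths start with `.//` rather than `//`. The page content itself is made up.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for a real page.
PAGE = """
<html>
  <body>
    <div id="login">
      <div><i class="iconfont">qr</i></div>
      <div>password form</div>
    </div>
    <a class="site-nav-menu site-nav-home" href="/">Home</a>
    <p>some footer text</p>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)

# Locate by attribute value: any tag whose class matches exactly.
home = root.find(".//*[@class='site-nav-menu site-nav-home']")
print(home.tag, home.text)  # a Home

# Locate by id, then by position: the <i> inside the first div under id='login'.
qr = root.find(".//*[@id='login']/div[1]/i")
print(qr.text)  # qr

# Locate by tag name: every div in the document.
print(len(root.findall(".//div")))  # 3
```

Note that `@class='...'` compares the whole attribute string, so an expression copied from the inspector stops matching as soon as the site adds or reorders a class name.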

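Afterword

This article uses Taobao sales by category as a practice scraping project. While writing the scraper, though, I found that I still lack a complete picture of HTML structure, of how webdriver and beautifulsoup locate elements, and of how to use regular expressions. I will organize those topics in a later post.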
