Scraping a Classic Joke Site with BeautifulSoup

First, let's take a look at the site we'll be scraping: http://xiaohua.zol.com.cn/
1. Preparation
1.1 Python 3. This post is written for Python 3; if it isn't on your machine, install Python 3 first.
1.2 The Requests library, an upgraded take on urllib that bundles all of its functionality behind a simpler API. Install it with:
pip install requests
1.3 The BeautifulSoup library, a Python library for pulling data out of HTML and XML files. Working with your parser of choice, it provides idiomatic ways of navigating, searching, and modifying the parse tree (a minimal usage sketch follows this list). Install it with:
pip install beautifulsoup4
1.4 lxml, the parser BeautifulSoup will use under the hood. (If you are not using Anaconda, you may find that pip fails to install this package on Windows.) Install it with:
pip install lxml
1.5 PyCharm, a powerful Python IDE. Download the official edition and activate it for free with a license server (similar JetBrains products work the same way); see http://www.cnblogs.com/hanggegege/p/6763329.html for details.
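Before moving on, here is a minimal sketch of what BeautifulSoup does. The HTML snippet and the class name "joke" are made up for illustration, and the built-in html.parser is used so the example runs even before lxml is installed:

from bs4 import BeautifulSoup

# Parse a tiny, made-up HTML snippet and pull the text out of one tag.
snippet = '<html><body><p class="joke">An example joke</p></body></html>'
soup = BeautifulSoup(snippet, 'html.parser')
print(soup.find('p', class_='joke').text)   # prints: An example joke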
2. Scraping Walkthrough and Analysis
from bs4 import BeautifulSoup
import os
import requests
Import the required libraries; the os library will be used later to save the scraped content.
Next, clicking into the jokes channel, we find an "All Jokes" tab, which lets us scrape every joke in the archive as efficiently as possible!
Let's use the requests library to look at this page's source code:
from bs4 import BeautifulSoup
import os
import requests

all_url = 'http://xiaohua.zol.com.cn/new/'
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
all_html = requests.get(all_url, headers=headers)
print(all_html.text)
headers is the HTTP request header; without it, scraping most sites will simply fail.
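Whether a given site actually rejects bare requests varies, so treat the following as a quick sanity check rather than a rule; it simply compares the status codes with and without a browser-like User-Agent:

import requests

url = 'http://xiaohua.zol.com.cn/new/'
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}

# Some sites answer a bare request with a 403 or an error page.
bare = requests.get(url)
with_ua = requests.get(url, headers=headers)
print(bare.status_code, with_ua.status_code)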
Part of what print(all_html.text) shows:
Analyzing the source shows that we still can't get every joke directly from this page, so we look for an indirect route on it.
Opening one joke to view its full text, the URL changes to http://xiaohua.zol.com.cn/detail58/57681.html; opening other jokes, we find that the URLs all take the form http://xiaohua.zol.com.cn/detail?/?.html. We'll use this as our way in and scrape all the content.
Our goal, then, is to find every URL of the form http://xiaohua.zol.com.cn/detail?/?.html and scrape its content.
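As an aside: because the detail URLs are so regular, a regular expression over the raw HTML is one possible alternative to walking the parse tree. This is only a sketch (the function name find_detail_links is mine); the rest of the article sticks with BeautifulSoup:

import re

# Matches the observed /detail<digits>/<digits>.html format.
detail_pattern = re.compile(r'/detail\d+/\d+\.html')

def find_detail_links(html_text):
    # Deduplicate while preserving the order of first appearance.
    links = []
    for match in detail_pattern.findall(html_text):
        if match not in links:
            links.append(match)
    return links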
我們?cè)凇叭啃υ挕表?yè)面隨便翻到一頁(yè):http://xiaohua.zol.com.cn/new/5.html ,按下F12查看其源代碼,按照其布局發(fā)現(xiàn) :
每個(gè)笑話對(duì)應(yīng)其中一個(gè)
from bs4 import BeautifulSoup
import os
import requests

all_url = 'http://xiaohua.zol.com.cn/new/'
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
all_html = requests.get(all_url, headers=headers)
# print(all_html.text)
soup1 = BeautifulSoup(all_html.text, 'lxml')
list1 = soup1.find_all('li', class_='article-summary')   # one <li> per joke
for i in list1:
    # print(i)
    soup2 = BeautifulSoup(i.prettify(), 'lxml')
    list2 = soup2.find_all('a', target='_blank', class_='all-read')   # the "read full text" link
    for b in list2:
        href = b['href']
        print(href)
With the code above, we successfully obtain the URL suffix of every joke on the page:
In other words, we only need to loop over all the page numbers to get every joke.
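As a minimal sketch of that idea (the range of 100 pages anticipates the full loop at the end of the article):

# Build the listing-page URLs for pages 1 through 100.
page_urls = ['http://xiaohua.zol.com.cn/new/' + str(num) + '.html'
             for num in range(1, 101)]
print(page_urls[0])   # http://xiaohua.zol.com.cn/new/1.html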
Here is the code above, tidied up:
from bs4 import BeautifulSoup
import os
import requests

all_url = 'http://xiaohua.zol.com.cn/new/5.html'

def Gethref(url):
    headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
    html = requests.get(url, headers=headers)
    soup_first = BeautifulSoup(html.text, 'lxml')
    list_first = soup_first.find_all('li', class_='article-summary')
    for i in list_first:
        soup_second = BeautifulSoup(i.prettify(), 'lxml')
        list_second = soup_second.find_all('a', target='_blank', class_='all-read')
        for b in list_second:
            href = b['href']
            print(href)

Gethref(all_url)
The following code turns those suffixes into complete joke URLs:
from bs4 import BeautifulSoup
import os
import requests

all_url = 'http://xiaohua.zol.com.cn/new/5.html'

def Gethref(url):
    list_href = []
    headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
    html = requests.get(url, headers=headers)
    soup_first = BeautifulSoup(html.text, 'lxml')
    list_first = soup_first.find_all('li', class_='article-summary')
    for i in list_first:
        soup_second = BeautifulSoup(i.prettify(), 'lxml')
        list_second = soup_second.find_all('a', target='_blank', class_='all-read')
        for b in list_second:
            href = b['href']
            list_href.append(href)
    return list_href

def GetTrueUrl(liebiao):
    # liebiao is the list of href suffixes returned by Gethref
    for i in liebiao:
        url = 'http://xiaohua.zol.com.cn' + str(i)
        print(url)

GetTrueUrl(Gethref(all_url))
After a quick look at the HTML of an individual joke page, we next fetch the content of every joke on a listing page:
from bs4 import BeautifulSoup
import os
import requests

all_url = 'http://xiaohua.zol.com.cn/new/5.html'

def Gethref(url):
    list_href = []
    headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
    html = requests.get(url, headers=headers)
    soup_first = BeautifulSoup(html.text, 'lxml')
    list_first = soup_first.find_all('li', class_='article-summary')
    for i in list_first:
        soup_second = BeautifulSoup(i.prettify(), 'lxml')
        list_second = soup_second.find_all('a', target='_blank', class_='all-read')
        for b in list_second:
            href = b['href']
            list_href.append(href)
    return list_href

def GetTrueUrl(liebiao):
    url_list = []   # renamed from 'list' to avoid shadowing the built-in
    for i in liebiao:
        url = 'http://xiaohua.zol.com.cn' + str(i)
        url_list.append(url)
    return url_list

def GetText(url):
    for i in url:
        html = requests.get(i)
        soup = BeautifulSoup(html.text, 'lxml')
        content = soup.find('div', class_='article-text')
        print(content.text)

GetText(GetTrueUrl(Gethref(all_url)))
The output looks like this:
Now let's save the jokes to disk! This is where the os library comes in.
The following code saves the full contents of one page of jokes:
from bs4 import BeautifulSoup
import os
import requests

all_url = 'http://xiaohua.zol.com.cn/new/5.html'

os.makedirs('/home/lei/zol', exist_ok=True)   # create the output directory; exist_ok avoids a crash on reruns

def Gethref(url):
    list_href = []
    headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
    html = requests.get(url, headers=headers)
    soup_first = BeautifulSoup(html.text, 'lxml')
    list_first = soup_first.find_all('li', class_='article-summary')
    for i in list_first:
        soup_second = BeautifulSoup(i.prettify(), 'lxml')
        list_second = soup_second.find_all('a', target='_blank', class_='all-read')
        for b in list_second:
            href = b['href']
            list_href.append(href)
    return list_href

def GetTrueUrl(liebiao):
    url_list = []
    for i in liebiao:
        url = 'http://xiaohua.zol.com.cn' + str(i)
        url_list.append(url)
    return url_list

def GetText(url):
    for i in url:
        html = requests.get(i)
        soup = BeautifulSoup(html.text, 'lxml')
        content = soup.find('div', class_='article-text')
        title = soup.find('h1', class_='article-title')
        SaveText(title.text, content.text)

def SaveText(TextTitle, text):
    os.chdir('/home/lei/zol/')
    f = open(str(TextTitle) + '.txt', 'w', encoding='utf-8')   # '.txt', not 'txt'; utf-8 for Chinese text
    f.write(text)
    f.close()

GetText(GetTrueUrl(Gethref(all_url)))
The result:
(My system is Linux; adjust the paths to suit your own machine.)
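One caveat worth noting: a joke title can contain characters that are illegal in file names (a '/' on Linux, and several more on Windows), which would make SaveText fail. A small sanitizer, assuming it is acceptable to replace such characters with '_', guards against that:

import re

def safe_filename(title):
    # Replace characters that are problematic in file names with '_'.
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()

# SaveText could then open the file as:
# f = open(safe_filename(str(TextTitle)) + '.txt', 'w', encoding='utf-8')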
Our goal isn't just to grab a single page of jokes; the next step is to walk through every page we need!
Observation shows that each listing page's URL is http://xiaohua.zol.com.cn/new/ + page number + .html, so next we iterate over the first 100 pages and download every joke on them!
We revise the code once more:
from bs4 import BeautifulSoup
import os
import requests

num = 1
url = 'http://xiaohua.zol.com.cn/new/' + str(num) + '.html'

os.makedirs('/home/lei/zol', exist_ok=True)   # create the output directory; exist_ok avoids a crash on reruns

def Gethref(url):
    list_href = []
    headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
    html = requests.get(url, headers=headers)
    soup_first = BeautifulSoup(html.text, 'lxml')
    list_first = soup_first.find_all('li', class_='article-summary')
    for i in list_first:
        soup_second = BeautifulSoup(i.prettify(), 'lxml')
        list_second = soup_second.find_all('a', target='_blank', class_='all-read')
        for b in list_second:
            href = b['href']
            list_href.append(href)
    return list_href

def GetTrueUrl(liebiao):
    url_list = []
    for i in liebiao:
        url = 'http://xiaohua.zol.com.cn' + str(i)
        url_list.append(url)
    return url_list

def GetText(url):
    for i in url:
        html = requests.get(i)
        soup = BeautifulSoup(html.text, 'lxml')
        content = soup.find('div', class_='article-text')
        title = soup.find('h1', class_='article-title')
        SaveText(title.text, content.text)

def SaveText(TextTitle, text):
    os.chdir('/home/lei/zol/')
    f = open(str(TextTitle) + '.txt', 'w', encoding='utf-8')   # '.txt', not 'txt'; utf-8 for Chinese text
    f.write(text)
    f.close()

while num <= 100:
    url = 'http://xiaohua.zol.com.cn/new/' + str(num) + '.html'
    GetText(GetTrueUrl(Gethref(url)))
    num = num + 1
And that's it! All that's left is to wait for the files to finish downloading!
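One optional refinement, not part of the original script: pausing briefly between pages keeps the load on the server modest. Assuming the functions defined in the script above are in scope, the final loop could become:

import time

num = 1
while num <= 100:
    url = 'http://xiaohua.zol.com.cn/new/' + str(num) + '.html'
    GetText(GetTrueUrl(Gethref(url)))   # functions from the script above
    time.sleep(1)   # assumption: a one-second pause per page is polite enough
    num = num + 1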
The result: