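I. Introduction
An HTTP status code is a three-digit code that a web server sends back to the client (for example, a browser) as part of its response. Standard codes such as 200 OK tell the client the state of the requested page. For Baidu to collect data normally and store it in its database, its crawler must follow the relevant rules of the HTTP protocol. This article therefore gives a plain introduction to how HTTP responses work and which status codes are returned in which situations.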

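II. HTTP Status Codes
1. 200 OK: One of the most common and most important HTTP status codes. In most cases it means the web server has successfully processed the request.
2. 301 Moved Permanently: A permanent redirect. For a given link, such as www.example.com/old-page.html, the URL is redirected to www.example.com/new-page.html.
3. 302 Found (Moved Temporarily): A temporary redirect. It is similar to 301 Moved Permanently, except that the URL change is only temporary.
4. 404 Not Found: The requested page could not be found.
5. 403 Forbidden: A web server sometimes blocks specific IP addresses, or blocks specific IP addresses from using specific methods (for example POST). In that case it returns 403 Forbidden.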
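As a rough illustration of the list above, the following Python sketch looks up the reason phrase for a status code and labels it by its numeric range. The helper name `describe` and the category labels are assumptions made for this example; only `http.HTTPStatus` comes from the Python standard library.

```python
from http import HTTPStatus


def describe(code: int) -> str:
    """Return "<code> <reason phrase>: <category>" for a standard status code."""
    status = HTTPStatus(code)  # raises ValueError for codes the stdlib does not know
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "informational"
    return f"{code} {status.phrase}: {category}"


if __name__ == "__main__":
    # The five codes discussed in this section.
    for code in (200, 301, 302, 404, 403):
        print(describe(code))
```

Note that the standard library reports 302 with the phrase "Found"; "Moved Temporarily" is an older name for the same code.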
III. HTTP Status Code Return Mechanism of the Baidu Crawler
1. The Baidu crawler first sends a request to the server and waits for the response in order to retrieve the web page or other resource. If no response arrives within a certain time limit, the crawler treats the request as failed and stops crawling that page.
2. When a response arrives, the crawler checks the HTTP status code to see whether it indicates an error. If it is an error code such as 404 Not Found or 403 Forbidden, the crawler stops crawling the page immediately, with no further processing. If it is a normal status code such as 200 OK, the crawler continues and starts downloading the page content.
3. In addition to checking status codes, Baidu checks the Robots Exclusion Protocol (robots.txt) before sending requests, so that it does not waste resources on pages that site owners have excluded from crawling through the robots.txt file on their site.
4. Once the content of pages that returned a normal status code has been downloaded, the Baidu spider stores it in a database for later use, for example indexing it so that it can appear in search results when users enter related keywords in the search box on Baidu's homepage.
5. Finally, once all of the steps above have completed without errors, the Baidu spider moves on to the next page in the task queue and repeats the process until every queued page has been processed in order.
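The five steps above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not Baidu's actual implementation: the function name `crawl`, the user agent string `ExampleSpider`, the in-memory `dict` standing in for the database, and the injected `fetch` callable (which returns `None` to represent a timeout) are all hypothetical. Only `urllib.robotparser.RobotFileParser` and `http.HTTPStatus` are real standard-library APIs. Redirect handling is left out because the steps above do not describe it.

```python
from collections import deque
from http import HTTPStatus
from typing import Callable, Dict, Iterable, Optional, Tuple
from urllib.robotparser import RobotFileParser

# Hypothetical fetcher type: returns (status_code, body), or None on timeout.
Fetcher = Callable[[str], Optional[Tuple[int, str]]]


def crawl(
    urls: Iterable[str],
    fetch: Fetcher,
    robots: RobotFileParser,
    user_agent: str = "ExampleSpider",  # assumed name, not Baidu's real agent
) -> Dict[str, str]:
    """Process a task queue of URLs and return the stored page bodies."""
    store: Dict[str, str] = {}  # stands in for the database in step 4
    queue = deque(urls)
    while queue:  # step 5: move through the queue in order
        url = queue.popleft()
        # Step 3: honor robots.txt before sending any request.
        if not robots.can_fetch(user_agent, url):
            continue
        # Step 1: send the request; None models "no response in time".
        response = fetch(url)
        if response is None:
            continue
        status, body = response
        # Step 2: stop on anything other than 200 OK (404, 403, ...).
        if status != HTTPStatus.OK:
            continue
        # Step 4: store the downloaded content for later indexing.
        store[url] = body
    return store


if __name__ == "__main__":
    rules = RobotFileParser()
    rules.parse(["User-agent: *", "Disallow: /private/"])
    fake_site = {
        "http://example.com/": (200, "home page"),
        "http://example.com/missing": (404, ""),
        "http://example.com/private/x": (200, "hidden"),
    }
    # dict.get returns None for unknown URLs, which the sketch treats as a timeout.
    print(crawl(list(fake_site) + ["http://example.com/slow"], fake_site.get, rules))
```

Passing the fetcher in as a parameter keeps the sketch runnable without network access; a real crawler would replace it with an HTTP client call that enforces a timeout.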
Title: A Brief Discussion of the HTTP Status Code Return Mechanism of the Baidu Crawler
Link: http://fisionsoft.com.cn/article/ccocpog.html


