Douban Book Rating Scraper Script in Python
2011·07·15 #Works
Machine-translated from Chinese.
This script crawls a book's rating from Douban given its ISBN. data.csv is the input list file: each line contains one book's ISBN.
This version is single-threaded and only reads from a CSV file. A friend asked for it and the data volume is small, so I will improve it gradually if the need arises.
The whole Python script is simple; it mainly uses BeautifulSoup to extract content from the HTML.
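For illustration, data.csv might look like this (hypothetical ISBNs, one per line):

```
9787020002207
9787111213826
9787532736553
```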
P.S. I have several interesting projects in hand recently, hoping to complete them soon :)
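Before the full script, here is a minimal sketch of the core extraction step — finding the first `<a class="nbg">` search-result link — using only Python 3's stdlib `html.parser`, for readers without the old BeautifulSoup 3 package. The names `LinkFinder` and `first_nbg_link` are my own for this sketch, not part of the script below.

```python
from html.parser import HTMLParser

class LinkFinder(HTMLParser):
    """Collect the href of the first <a class="nbg"> tag, mirroring
    what the script does with soup.find("a", {"class": "nbg"})."""
    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if self.href is None and tag == "a":
            attr_dict = dict(attrs)
            if attr_dict.get("class") == "nbg":
                self.href = attr_dict.get("href")

def first_nbg_link(html):
    """Return the href of the first <a class="nbg">, or None."""
    finder = LinkFinder()
    finder.feed(html)
    return finder.href
```

In the real script this href is the URL of the book's detail page, which is then fetched and parsed for the rating.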
import urllib, urllib2
import BeautifulSoup


def isbn_2_score(isbn):
    """Search Douban for an ISBN and return the book's rating, or 0.0 on any failure."""
    url = 'http://www.douban.com/subject_search?search_text='
    try:
        response = urllib2.urlopen(url + isbn)
    except Exception, e:
        return 0.0
    doc = response.read()
    soup = BeautifulSoup.BeautifulSoup(doc)
    try:
        # The first search result links to the book's detail page.
        book_info = soup.find("a", {"class": "nbg"})
    except Exception, e:
        return 0.0
    if isinstance(book_info, BeautifulSoup.Tag):
        url_book_info = book_info['href']
        try:
            response = urllib2.urlopen(url_book_info)
        except Exception, e:
            return 0.0
        book_page = response.read()
        soup = BeautifulSoup.BeautifulSoup(book_page)
        score_info = soup.find('strong', 'll rating_num')
        # Note: test score_info here, not book_info as in the first draft.
        if isinstance(score_info, BeautifulSoup.Tag):
            score = score_info.string
            return score
        return 0.0
    return 0.0
def read_file(file_name):
    """Open the ISBN list file for reading."""
    file_handler = open(file_name, 'r')
    return file_handler


def return_isbn(file_handler):
    """Read the next ISBN (one per line); returns '' at end of file."""
    isbn = file_handler.readline()
    return isbn
if __name__ == '__main__':
    data = read_file('data.csv')
    f = open('dump', 'w')
    k = return_isbn(data)
    while k:
        # Strip the trailing newline before using the ISBN.
        isbn = k.rstrip('\n')
        score = isbn_2_score(isbn)
        result = isbn + ":" + str(score) + "\n"
        print result
        f.write(result)
        k = return_isbn(data)
    f.close()
    data.close()
Project repository: https://github.com/quake0day/douban_crawler