
Douban Book Rating Scraper Script in Python

2011·07·15

Machine-translated from Chinese.

This script looks up Douban book ratings by ISBN. data.csv is the input file and lists one ISBN per line. This version crawls with a single thread and reads its input only from that CSV file. A friend asked for it and the data volume is small, so I will improve it later only if the need arises. The script itself is simple and relies mainly on BeautifulSoup to extract content from the HTML.
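
As a rough illustration of the input and output formats, here is a minimal Python 3 sketch that reads one ISBN per line and builds the `isbn:score` lines that the script writes to its dump file. The helper names `read_isbns` and `format_result` are hypothetical and the ISBNs are placeholders. The script itself targets Python 2 and handles this inline.

```python
import io

def read_isbns(lines):
    """Yield one ISBN per non-empty line, with surrounding whitespace removed."""
    for line in lines:
        isbn = line.strip()
        if isbn:
            yield isbn

def format_result(isbn, score):
    """Build one output line in the form 'isbn:score'."""
    return "{}:{}\n".format(isbn, score)

# Placeholder ISBNs, not real books; a real run would pass open('data.csv').
sample = io.StringIO("9780000000001\n\n9780000000002")
isbns = list(read_isbns(sample))
```

Skipping blank lines here is a choice made for the sketch: it keeps a stray empty line in the file from triggering a pointless search request.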

P.S. I have several interesting projects on hand at the moment and hope to finish them soon :)

import urllib2
import BeautifulSoup

def isbn_2_score(isbn):
    url = 'http://www.douban.com/subject_search?search_text='
    try:
        # strip() removes the trailing newline that readline() leaves on the ISBN
        response = urllib2.urlopen(url+isbn.strip())
    except Exception,e:
        return 0.0
    doc = response.read()
    soup = BeautifulSoup.BeautifulSoup(doc)
    try:
        book_info = soup.find("a",{"class":"nbg"})
    except Exception,e:
        return 0.0
    if isinstance(book_info,BeautifulSoup.Tag):
        url_book_info = book_info['href']
        try:
            response = urllib2.urlopen(url_book_info)
        except Exception,e:
            return 0.0
        book_page = response.read()
        soup = BeautifulSoup.BeautifulSoup(book_page)
        score_info = soup.find('strong','ll rating_num')
        # find() returns None when the book page shows no rating
        if isinstance(score_info,BeautifulSoup.Tag):
            score = score_info.string
            return score
        return 0.0
    return 0.0

def read_file(file_name):
    file_handler = open(file_name,'r')
    return file_handler

def return_isbn(file_handler):
    isbn = file_handler.readline()
    return isbn


if __name__ == '__main__':
    data = read_file('data.csv')
    f = open('dump','w')
    k = return_isbn(data)
    while k is not None and k != '':
        score = isbn_2_score(k)
        # strip() drops the line ending without cutting a digit when the last line has no newline
        result = k.strip()+":"+str(score)+"\n"
        print result
        f.write(result)
        k=return_isbn(data)
    f.close()
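
The script above targets Python 2 (urllib2, BeautifulSoup 3, the print statement). For readers on Python 3, here is a minimal sketch of the rating-extraction step only. It assumes the rating still sits in a `<strong>` tag with the `rating_num` class, as the script's `soup.find` call expects. It uses the standard library's `html.parser` in place of BeautifulSoup so that it runs without third-party packages. The names `RatingParser` and `extract_score` are hypothetical, the sample HTML is made up, and converting the score to a float is a choice of this sketch rather than something the script does.

```python
from html.parser import HTMLParser

class RatingParser(HTMLParser):
    """Collect the text of the first <strong> whose class list includes 'rating_num'."""

    def __init__(self):
        super().__init__()
        self._inside = False   # True while inside the matching <strong>
        self.score = None      # text of the rating, once found

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "strong" and "rating_num" in classes and self.score is None:
            self._inside = True

    def handle_data(self, data):
        if self._inside and data.strip():
            self.score = data.strip()

    def handle_endtag(self, tag):
        if tag == "strong":
            self._inside = False

def extract_score(html):
    """Return the rating as a float, or 0.0 when none is found (the script's fallback)."""
    parser = RatingParser()
    parser.feed(html)
    try:
        return float(parser.score)
    except (TypeError, ValueError):
        return 0.0

# Made-up fragment shaped like the markup the script searches for.
sample_page = '<div><strong class="ll rating_num"> 8.7 </strong></div>'
print(extract_score(sample_page))
```

Fetching the pages is left out on purpose: the search and book URLs would be requested with `urllib.request.urlopen` in Python 3, and the current Douban markup may differ from what this 2011 script expects.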


Project repository: https://github.com/quake0day/douban_crawler