Getting Amazon Reviews for Library Books with Python

Often when I want to read about a new technical subject, I'll see what's available for free at the local library before buying a new book. To decide which book to check out, I browse through the library's catalogue and look up the reviews for each book on Amazon to see which book is the best rated. Of course, doing this manually is tedious and repetitive which is usually a sign that it should be automated.

In this post I'll go over a script I developed that queries a library catalogue for books on a particular subject and then returns the books sorted according to their Amazon rating.

Overview

My local library uses SirsiDynix for its catalogue software. Here's a screenshot of their advanced search page:

Search Image

As you can see, searches can be filtered by format type (Electronic Resources, Sound Recording, Books, etc.) and language. For our purposes we only want English language books. Once you submit the search, you get a results page like the following:

Results Image

What part of the results do we need to scrape? Well, we'd like the title obviously, and also the book's ISBN number so we can look it up on Amazon to get its rating. Since results are sorted by relevance we'll limit our scrape to the first page of results.

Here's a summary of everything we'll need the script to do:

Implementation

We'll use Mechanize to send our requests and BeautifulSoup to parse the HTML. The script will take two arguments. One argument to specify the url of the advanced search page and the other to specify the search string.

#!/usr/bin/env python

import re
import argparse
import mechanize
from bs4 import BeautifulSoup

AMAZON_URL = 'http://www.amazon.com/gp/product/'

class Scraper(object):
    def __init__(self):
        self.br = mechanize.Browser()
        self.br.set_handle_robots(False)
        self.br.addheaders = [('User-agent', 
                               'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7')]

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("-u", "--url",   help="Url", required=True)
    parser.add_argument("-q", "--query", help="Query", required=True)
    args = parser.parse_args()

    scraper = Scraper()
    scraper.scrape(args.url, args.query)

A top level scrape() method encapsulates all of the logic.

def scrape(self, url, q):
   books = self.search_library_books(url=url, q=q)
   self.get_amazon_reviews(books)
   books = self.rank_by_reviews(books)

   for b in books:
       print b['rating'], b['title']

Scraping the Library Catalogue

To scrape the book titles and ISBN numbers from the Sirsi catalogue, we open up the advanced search page, fill out and submit the form. The search results are then filtered based on the class and id attributes of the elements containing the title and ISBN number.

def search_library_books(self, url, q):
    '''
    Scrape the first page of books for the given query.
    Calculate the ISBN10 value for each book so we can 
    look it up on Amazon
    '''
    books = []

    def select_form(form):
        return form.attrs.get('id', None) == 'advancedSearchForm'

    self.br.open(url)
    self.br.select_form(predicate=select_form)
    self.br.form['allWordsField'] = q
    self.br.form['formatTypeDropDown'] = ['BOOK']
    self.br.form['languageDropDown'] = ['ENG']
    self.br.submit()

    s = BeautifulSoup(self.br.response().read())
    x = {'class': 'isbnValue'}
    y = re.compile(r'^results_cell\d+$')
    z = re.compile(r'^results_bio\d+$')

    for i in s.findAll('input', attrs=x):
        d = i.findParent('div', id=y)
        d = d.find('div', id=z)

        if not i.get('value'):
            continue

        book = {}
        book['title'] = d.a['title']
        book['isbn13'] = i['value']
        book['isbn10'] = self.isbn13to10(book['isbn13'])
        books.append(book)

    return books

Note that we scrape the ISBN13 number and then convert it to ISBN10. That's because while the Sirsi catalogue uses ISBN13 numbers, Amazon uses ISBN10 numbers for product links. [1] [2]

ISBN13 to ISBN10 Conversion

Wikipedia has a nice writeup on how ISBN numbers are formatted. To convert an ISBN13 number to ISBN10, we chop off the first 3 digits of the ISBN13 number, then calculate the ISBN10 check digit for the next nine numbers. The nine numbers plus the check digit is the ISBN10 number.

def isbn13to10(self, isbn13):
    '''
    Convert an ISBN13 number to ISBN10 by chomping off the
    first 3 digits, then calculating the ISBN10 check digit 
    for the next nine digits to come up with the ISBN10 
    '''
    first9 = isbn13[3:][:9]
    isbn10 = first9 + self.isbn10_check_digit(first9)
    return isbn10

The formula for calculating an ISBN10 check digit (the 10th number) given the first nine numbers is as follows:

(11 - ((10x1 + 9x2 + 8x3 + 7x4 + 6x5 + 5x6 + 4x7 + 3x8 + 2x9) mod 11)) mod 11

def isbn10_check_digit(self, isbn10):
    '''
    Given the first 9 digits of an ISBN10 number calculate
    the final (10th) check digit
    '''
    i = 0
    s = 0

    for n in xrange(10,1,-1):
        s += n * int(isbn10[i])
        i += 1

    s = s % 11
    s = 11 - s
    v = s % 11
        
    if v == 10:
        return 'x'
    else:
        return str(v)

Amazon Reviews

Once we've got the ISBN10 number for the book, we can look it up on Amazon using the following link format:

http://www.amazon.com/gp/product/<ISBN10>

Then we can go to that page and scrape the rating for the book.

def get_amazon_reviews(self, books):
    '''
    Get the Amazon review/rating for each book in books[]
    '''
    for b in books:
        url = AMAZON_URL + b['isbn10']

        self.br.open(url)

        s = BeautifulSoup(self.br.response().read())
        d = s.find('div', id='avgRating')
        m = re.search(r'\d+(\.\d+)?', d.text.strip())
        f = float(m.group(0))

        b['rating'] = f

Sorting the Results

Once we've got all the rating for all of the books, we sort them according to their rating.

def rank_by_reviews(self, books):
    ''' 
    Sort books by rating with highest rated books first
    '''
    newlist = sorted(books, key=lambda k: k['rating'], reverse=True) 
    return newlist

Running It

Let's try it out.

$ ./scraper.py -u https://mdpl.ent.sirsi.net/client/catalog/search/advanced -q javascript
4.7 Beginning JavaScript / Paul Wilton.
4.6 JavaScript : the definitive guide / David Flanagan.
4.6 Professional JavaScript for Web developers / Nicholas C. Zakas.
4.5 High performance JavaScript / Nicholas C. Zakas.
4.5 JavaScript : the definitive guide / David Flanagan.
4.3 PPK on JavaScript  / Peter-Paul Koch.
4.1 Pure JavaScript / Jason Gilliam, Charlton Ting, R. Allen Wyke.
4.0 JavaScript bible / Danny Goodman, Michael Morrison.
4.0 JavaScript : a beginner's guide / John Pollock.
3.7 JavaScript & Ajax for dummies / Andy Harris.
3.5 Head first JavaScript / Michael Morrison.
2.2 JavaScript for dummies / by Emily A. Vander Veer.

Each result from the library catalogue is printed out along with its Amazon rating, with the highest rated books listed first.

If you'd like to see the full implementation, the source code for this article is available on github.

Shameless Plug

Have a scraping project you'd like done? I'm available for hire. Contact me for a free quote.