wiki:Projects/Scraper

Scraper

Scraping news headlines from web sites was a fun project of mine way back in the days before RSS took off. My crappy project for that has long been obsoleted.

I currently rely on dailystrips to gather web comics for me. Many of these already have RSS feeds that provide everything I want, but it would be great to be able to extract just the information I need into a common format from the ones that don't. dailystrips does just that, but that's as far as it goes.

There are plenty of other places where lots of data is freely available but in an extremely awkward format to do anything with. Ideally everyone will embrace the semantic web, but until then, I want a straightforward way to force the semantic web on these sites.

Ways to combine some of my thoughts about these things into a happy, effective webapp have been on my mind. I finally sat down and tried to whip up some parts of what I wanted, and I found most of this to be so blazingly simple that I felt like documenting it.

Schema

I actually started with the database schema, since this all revolved around the data I wanted eventually be able to extract and store.

I am currently working with something like this:

CREATE TABLE `files` ( `url` text, `accessed` datetime default NULL, `headers` text, `content` blob, `actual_url` text);
CREATE TABLE `templates` (`url` text, `type` text, `pattern` text, `meaning` text, `format` text);
CREATE TABLE `data` ( `created` datetime default NULL, `meaning` text, `url` text, `value` text);
  • Files is a set of filenames, file contents, a timestamp, and then some metadata. One program will look for files that need to be fetched and fetch them into this table.
  • Templates is a set of patterns that describe how data is extracted from the files. More on this later.
  • Data is the extracted data, in RDF-esque (url, meaning, value) triplets plus a timestamp.

Fetching

Here's the current code to fetch pages. The actual fetching is as simple as urllib2.urlopen(url).read(). The first real part of the code grabs next and previous links from the database and adds any new ones to the list of files to fetch. The second part downloads the files and stores them.

import urllib2
import MySQLdb

db = MySQLdb.connect(user='user', passwd='passwd', host='host', db='db')
cursor = db.cursor()

cursor.execute('SELECT value FROM data WHERE meaning in ("comic:next", "comic:previous")')
insert = db.cursor()
for (url,) in cursor:
    insert.execute('SELECT 1 FROM files WHERE url=%s', (url,))
    if not insert.fetchone():
        cursor.execute('INSERT INTO files (url) VALUES (%s)', (url,))
db.commit()

cursor.execute('SELECT url FROM files WHERE accessed IS NULL')

for (url,) in cursor:
    print 'Fetching %s...' % url
    u = urllib2.urlopen(url)
    data = u.read()
    cursor.execute('UPDATE files SET content=%s, headers=%s, accessed=NOW(), actual_url=%s WHERE url=%s',
        (data, str(u.headers), u.url, url))
    db.commit()

Scraping

Scraping is reasonably simple as well. All of the real work is done in the loop at the bottom. It iterates over files and templates that match that file, finds any matches for the template, stores them in the data table, and then removes any entries from previous runs.

There are currently exactly three types of templates: regex, xpath, and urlregex. regex is obvious. You supply a regular expression, and any captures are stored as values for that match. xpath converts the file to an XML tree using lxml.etree.HTMLParser and then executes the xpath query on it. urlregex is just like regex except that it operates on the URL instead of the file contents. This was just added as an afterthought to be able to extract data from the URL.

Another afterthought was manipulating the extracted data. This is done through a python string comprehension and the FancyFormatter class, which provides access to multiple named values or does a url join on the base URL and a relative path.

import MySQLdb
import re
from StringIO import StringIO
from lxml import etree
import datetime
from urllib import basejoin

db = MySQLdb.connect(user='user', passwd='passwd', host='host', db='db')
cursor = db.cursor()

def xpath_search(content, query):
    tree = etree.parse(StringIO(content), etree.HTMLParser())
    find = etree.XPath(query)
    return find(tree)

class FancyFormatter(object):
    def __init__(self, dictionary):
        self._dict = dictionary

    def __getitem__(self, item):
        if item == 'makeurl':
            return basejoin(self._dict['url'], self._dict['value'])
        else:
            return self._dict[item]

    def __str__(self):
        return self._dict['value']

def add_tuple(meaning, url, value, format):
    if format:
        value = format % FancyFormatter({'url': url, 'value': value})
    cursor.execute('INSERT INTO data (created, meaning, url, value) VALUES (NOW(), %s, %s, %s)',
        (meaning, url, value))

print 'Scraping...'
cursor.execute('SELECT url, content FROM files WHERE content IS NOT NULL')
for (url, content) in cursor:
    start = datetime.datetime.now()

    templates = db.cursor()
    templates.execute('SELECT type, pattern, meaning, format FROM templates WHERE %s LIKE url', (url,))
    for (type, pattern, meaning, format) in templates:
        if type == 'xpath':
            for value in xpath_search(content, pattern):
                add_tuple(meaning, url, value, format)
        elif type == 'regex':
            for value in re.search(pattern, content, re.S|re.M):
                add_tuple(meaning, url, value, format)
        elif type == 'urlregex':
            match = re.search(pattern, url, re.S|re.M)
            if match:
                for value in match.groups():
                    add_tuple(meaning, url, value, format)
        else:
            raise RuntimeError('Unknown template type: "%s".' % (type,))
    cursor.execute('DELETE FROM data WHERE url=%s AND created<%s', (url, start))
    db.commit()
print 'Done.'

Examples

That was more than enough code to get me started. Here are the templates I used for testing. I extract an image, alt text, next and previous links, and id for each comic. The links are understood by the fetcher. url is a pattern which describes which files a template applies to. A MySQL LIKE is being used for that right now, hence the '%'s. You can see I prefer the xpath queries so far. They seem to be quite robust for this purpose.

+------------------------------------------+----------+---------------------------------------------------+----------------+-------------+
| url                                      | type     | pattern                                           | meaning        | format      |
+------------------------------------------+----------+---------------------------------------------------+----------------+-------------+
| http://www.penny-arcade.com/comic/%      | xpath    | //div[@id="comicstrip"]/img/@src                  | comic:img      | %(makeurl)s | 
| http://www.penny-arcade.com/comic/%      | xpath    | //div[@id="comicstrip"]/img/@alt                  | comic:alt      | NULL        | 
| http://www.penny-arcade.com/comic/%      | xpath    | //a[img[@alt="Next"]]/@href                       | comic:next     | %(makeurl)s | 
| http://www.penny-arcade.com/comic/%      | xpath    | //a[img[@alt="Back"]]/@href                       | comic:previous | %(makeurl)s | 
| http://www.penny-arcade.com/comic/%      | xpath    | //input[@name="Date"]/@value                      | comic:id       | NULL        | 
| http://questionablecontent.net/view.php% | xpath    | (//a[text()="Next"]/@href)[1]                     | comic:next     | %(makeurl)s | 
| http://questionablecontent.net/view.php% | xpath    | (//a[text()="Previous"]/@href)[1]                 | comic:previous | %(makeurl)s | 
| http://questionablecontent.net/view.php% | xpath    | //center/img[starts-with(@src, "./comics/")]/@src | comic:img      | %(makeurl)s | 
| http://questionablecontent.net/view.php% | urlregex | (\d+)$                                            | comic:id       | NULL        | 
| http://sinfest.net/archive_page.php%     | xpath    | //img[contains(@src, "/comics/")]/@src            | comic:img      | %(makeurl)s | 
| http://sinfest.net/archive_page.php%     | xpath    | //img[contains(@src, "/comics/")]/@alt            | comic:alt      | NULL        | 
| http://sinfest.net/archive_page.php%     | xpath    | //a[img[@alt="Next"]]/@href                       | comic:next     | %(makeurl)s | 
| http://sinfest.net/archive_page.php%     | xpath    | //a[img[@alt="Previous"]]/@href                   | comic:previous | %(makeurl)s | 
| http://sinfest.net/archive_page.php%     | urlregex | (\d+)$                                            | comic:id       | NULL        | 
+------------------------------------------+----------+---------------------------------------------------+----------------+-------------+

The results? It is working exactly as I expected. I seeded a few URLs from each comic and then alternated running the fetcher and scraper, and my collection of structured data about these comics grew.

Presentation

I know for this to be useful I need to be able to easily produce appealing-looking reports of this data. My first attempt is with producing RSS feeds with genshi.

My no-nonsense template looks like this:

<rss version="2.0"
    xmlns:py="http://genshi.edgewall.org/">
    <channel>
        <title>${title}</title>
        <py:for each="item in items">
            <item>
                <title>${item.alt}</title>
                <description>&lt;img src="${item.img}" alt="${item.alt}" /&gt;</description>
                <guid>${item.url}#${item.id}</guid>
            </item>
        </py:for>
    </channel>
</rss>

The program to put everything together looks like this:

import MySQLdb
from genshi.template import TemplateLoader
import sys

db = MySQLdb.connect(user='user', passwd='passwd', host='host', db='db')
cursor = db.cursor()

if len(sys.argv) != 3:
    print 'Usage: %s urlpattern title'
    sys.exit(1)
(urlpattern, title) = sys.argv[1:]

items = []
cursor.execute('SELECT url FROM data WHERE meaning="comic:id" AND url LIKE "%s" ORDER BY value DESC' % urlpattern)
fields = db.cursor()
for (url,) in cursor:
    fields.execute('SELECT meaning, value FROM data WHERE url=%s AND meaning LIKE "comic:%%"', url)
    item = {'url': url}
    for (meaning, value) in fields:
        item[meaning.split(':', 1)[1]] = unicode(value, 'utf8')
    items.append(item)

loader = TemplateLoader('.')
template = loader.load('comicrss.xml')
stream = template.generate(title=title, items=items)
print stream.render('xml')

The script works under the assumption that there is one comic:id per comic, and any other tuples for a url with a comic:id is data relevant to that comic. The template binds to the specific fields it cares about and produces an RSS XML document.

Thoughts on Improvements

I know this thing has a lot of shortcomings, but I think it is well on its way to being what I want. Here are some of the things I have in mind at the moment.

  • There should be a web interface for manipulating templates, files, and fetching. Editing the DB contents by hand is far from ideal, and a web interface could show results very clearly.
  • Pages that change haven't been accounted for. For example, the page for the latest comic might not have a next link until the next comic is available. The fetcher needs to know to re-fetch that page in those circumstances.
  • Comic images should be fetched and referenced locally.
  • The fetcher should be rate limited. I am currently running it only periodically in a way that it only fetches one or two pages per site, but something should be built in so that it doesn't hammer sites.
  • The scripts should have a common configuration instead of hardcoded DB connection data in each one.
  • This needs to be tested with many more comics.
  • This needs to be tested with something that is entirely unlike comics.
  • genshi for templating works great for this specific case, but it might be preferable to allow for user-defined templates, which might require a sandboxable template system.
  • I should learn how badly I'm butchering RDF concepts.
  • Document templates need to be decoupled from the program which generates documents from them.
  • The scraper needs to be aware of which files and templates have changed and only re-examine those.

AddComment

Last modified 14 years ago Last modified on 12/28/07 16:15:09
Note: See TracWiki for help on using the wiki.