Version 2 (modified by Cory McWilliams, 16 years ago)
--

Scraper

Scraping news headlines from web sites was a fun project of mine way back in the days before RSS took off. My crappy project for that has long been obsoleted.

I currently rely on dailystrips to gather web comics for me. Many of these already have RSS feeds that provide everything I want, but it would be great to be able to extract just the information I need into a common format from the ones that don't. dailystrips does just that, but that's as far as it goes.

There are plenty of other places where lots of data is freely available but in an extremely awkward format to do anything with. Ideally everyone will embrace the semantic web, but until then, I want a straightforward way to force the semantic web on these sites.

Ways to combine some of my thoughts about these things into a happy, effective webapp have been on my mind. I finally sat down and tried to whip up some parts of what I wanted, and I found most of this to be so blazingly simple that I felt like documenting it.

Schema

I actually started with the database schema, since this all revolves around the data I eventually want to be able to extract and store.

I am currently working with something like this:

CREATE TABLE `files` ( `url` text, `accessed` datetime default NULL, `headers` text, `content` blob, `actual_url` text);
CREATE TABLE `templates` (`url` text, `type` text, `pattern` text, `meaning` text, `format` text);
CREATE TABLE `data` ( `created` datetime default NULL, `meaning` text, `url` text, `value` text);

Files is a set of URLs, file contents, an access timestamp, and some metadata (the response headers and the final URL after any redirects). One program looks for files that have not been fetched yet and fetches them into this table.
Templates is a set of patterns that describe how data is extracted from the files. More on this later.
Data is the extracted data, in RDF-esque (url, meaning, value) triples plus a timestamp.
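
To get a feel for how the three tables fit together, here is a minimal sketch using Python's built-in sqlite3 module in place of MySQL. The SQLite stand-in, the example.com URLs, and the sample template row are all assumptions for illustration, not part of my actual setup.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL database (assumption).
db = sqlite3.connect(':memory:')
db.executescript('''
CREATE TABLE files (url text, accessed datetime default NULL, headers text, content blob, actual_url text);
CREATE TABLE templates (url text, type text, pattern text, meaning text, format text);
CREATE TABLE data (created datetime default NULL, meaning text, url text, value text);
''')

# Hypothetical rows: one queued page, one template, one extracted triple.
page = 'http://example.com/comic/42'
db.execute('INSERT INTO files (url) VALUES (?)', (page,))
db.execute('INSERT INTO templates VALUES (?, ?, ?, ?, ?)',
           ('http://example.com/comic/%', 'urlregex', r'/comic/(\d+)', 'comic:number', None))
db.execute("INSERT INTO data VALUES (datetime('now'), ?, ?, ?)",
           ('comic:number', page, '42'))

# templates.url holds a LIKE pattern, so the page URL goes on the left-hand side.
matching = db.execute('SELECT type, meaning FROM templates WHERE ? LIKE url', (page,)).fetchall()
triples = db.execute('SELECT url, meaning, value FROM data').fetchall()
# A file with no accessed timestamp is one that still needs to be fetched.
queued = db.execute('SELECT url FROM files WHERE accessed IS NULL').fetchall()
```

Here matching holds the one template whose pattern covers the page, triples holds the (url, meaning, value) row, and queued lists the page because it has not been downloaded yet.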

Fetching

Here's the current code to fetch pages. The actual fetching is as simple as urllib2.urlopen(url).read(). The first real part of the code grabs next and previous links from the database and adds any new ones to the list of files to fetch. The second part downloads the files and stores them.

import urllib2
import MySQLdb

db = MySQLdb.connect(user='user', passwd='passwd', host='host', db='db')
cursor = db.cursor()

cursor.execute('SELECT value FROM data WHERE meaning in ("comic:next", "comic:previous")')
insert = db.cursor()
for (url,) in cursor.fetchall():
    insert.execute('SELECT 1 FROM files WHERE url=%s', (url,))
    if not insert.fetchone():
        # Insert through the second cursor so the link query's results survive.
        insert.execute('INSERT INTO files (url) VALUES (%s)', (url,))
db.commit()

cursor.execute('SELECT url FROM files WHERE accessed IS NULL')

# Fetch the URL list first, because the UPDATE below reuses this cursor.
for (url,) in cursor.fetchall():
    print 'Fetching %s...' % url
    u = urllib2.urlopen(url)
    data = u.read()
    cursor.execute('UPDATE files SET content=%s, headers=%s, accessed=NOW(), actual_url=%s WHERE url=%s',
        (data, str(u.headers), u.url, url))
    db.commit()
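
The link-queueing half of the fetcher can be tried without a network connection or a MySQL server. The sketch below is a stand-in under several assumptions: it uses sqlite3 instead of MySQLdb, the helper name queue_new_links is hypothetical, the URLs are made up, and it only covers queueing, not downloading.

```python
import sqlite3

def queue_new_links(db):
    """Queue every comic:next / comic:previous value not yet in files.

    Returns the list of URLs that were added.
    """
    # Read all candidate links up front so later inserts cannot disturb the result set.
    links = db.execute(
        "SELECT DISTINCT value FROM data "
        "WHERE meaning IN ('comic:next', 'comic:previous') "
        "ORDER BY value").fetchall()
    added = []
    for (url,) in links:
        known = db.execute('SELECT 1 FROM files WHERE url = ?', (url,)).fetchone()
        if not known:
            db.execute('INSERT INTO files (url) VALUES (?)', (url,))
            added.append(url)
    db.commit()
    return added

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE files (url text, accessed datetime, headers text, content blob, actual_url text)')
db.execute('CREATE TABLE data (created datetime, meaning text, url text, value text)')

# Hypothetical starting state: page 2 is already known and links to pages 1 and 3.
db.execute("INSERT INTO files (url) VALUES ('http://example.com/comic/2')")
db.executemany('INSERT INTO data (meaning, url, value) VALUES (?, ?, ?)', [
    ('comic:next', 'http://example.com/comic/2', 'http://example.com/comic/3'),
    ('comic:previous', 'http://example.com/comic/2', 'http://example.com/comic/1'),
    ('comic:previous', 'http://example.com/comic/3', 'http://example.com/comic/2'),
])

first_pass = queue_new_links(db)   # pages 1 and 3 are new
second_pass = queue_new_links(db)  # nothing left to add
```

Running the helper twice shows that the step is safe to repeat: the second pass finds every link already in files and adds nothing.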

Scraping

Scraping is reasonably simple as well. All of the real work is done in the loop at the bottom. It iterates over the files and, for each file, the templates whose URL pattern matches it. It finds any matches for each template, stores them in the data table, and then removes that file's entries from previous runs.

There are currently exactly three types of templates: regex, xpath, and urlregex. regex is obvious. You supply a regular expression, and any captures are stored as values for that match. xpath converts the file to an XML tree using lxml.etree.HTMLParser and then executes the xpath query on it. urlregex is just like regex except that it operates on the URL instead of the file contents. This was just added as an afterthought to be able to extract data from the URL.
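
As a rough illustration of the difference between regex and urlregex, here is a standalone sketch. The function name apply_template, the sample page, and the patterns are assumptions made up for this example. The xpath type is left out because it depends on lxml, which is not in the standard library.

```python
import re

def apply_template(kind, pattern, url, content):
    """Return the captured values for one template, or [] if it does not match."""
    if kind == 'regex':
        target = content      # search the downloaded page
    elif kind == 'urlregex':
        target = url          # search the page's URL instead
    else:
        raise ValueError('Unsupported template type in this sketch: %r' % (kind,))
    match = re.search(pattern, target, re.S | re.M)
    # Each capture group becomes one stored value.
    return list(match.groups()) if match else []

# Hypothetical page and URL.
url = 'http://example.com/comic/2009/05/17'
content = '<html><img src="/strips/0517.png" alt="Monday"></html>'

image = apply_template('regex', r'<img src="([^"]+)"', url, content)
date = apply_template('urlregex', r'/(\d{4})/(\d{2})/(\d{2})$', url, content)
missing = apply_template('regex', r'<video src="([^"]+)"', url, content)
```

The same pattern syntax works for both types; the only thing that changes is which string the pattern is searched against.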

Another afterthought was manipulating the extracted data. This is done through Python's % string formatting and the FancyFormatter class. A template's format is a %-style format string; FancyFormatter gives it access to the named values url and value, plus makeurl, which does a URL join on the page's URL and an extracted relative path.

import MySQLdb
import re
from StringIO import StringIO
from lxml import etree
import datetime
from urllib import basejoin

db = MySQLdb.connect(user='user', passwd='passwd', host='host', db='db')
cursor = db.cursor()

def xpath_search(content, query):
    tree = etree.parse(StringIO(content), etree.HTMLParser())
    find = etree.XPath(query)
    return find(tree)

class FancyFormatter(object):
    def __init__(self, dictionary):
        self._dict = dictionary

    def __getitem__(self, item):
        if item == 'makeurl':
            return basejoin(self._dict['url'], self._dict['value'])
        else:
            return self._dict[item]

    def __str__(self):
        return self._dict['value']

def add_tuple(meaning, url, value, format):
    if format:
        value = format % FancyFormatter({'url': url, 'value': value})
    cursor.execute('INSERT INTO data (created, meaning, url, value) VALUES (NOW(), %s, %s, %s)',
        (meaning, url, value))

print 'Scraping...'
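cursor.execute('SELECT url, content FROM files WHERE content IS NOT NULL')
# Fetch every row first: add_tuple and the DELETE below reuse this cursor.
for (url, content) in cursor.fetchall():
    # Rows created before this moment are leftovers from earlier runs. Whole
    # seconds only, so rows NOW() stamps within this second are not deleted.
    start = datetime.datetime.now().replace(microsecond=0)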

    templates = db.cursor()
    templates.execute('SELECT type, pattern, meaning, format FROM templates WHERE %s LIKE url', (url,))
    for (type, pattern, meaning, format) in templates:
        if type == 'xpath':
            for value in xpath_search(content, pattern):
                add_tuple(meaning, url, value, format)
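        elif type == 'regex':
            # re.search returns one match object (or None); store its captures.
            match = re.search(pattern, content, re.S|re.M)
            if match:
                for value in match.groups():
                    add_tuple(meaning, url, value, format)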
        elif type == 'urlregex':
            match = re.search(pattern, url, re.S|re.M)
            if match:
                for value in match.groups():
                    add_tuple(meaning, url, value, format)
        else:
            raise RuntimeError('Unknown template type: "%s".' % (type,))
    cursor.execute('DELETE FROM data WHERE url=%s AND created<%s', (url, start))
    db.commit()
print 'Done.'
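
The format column is easiest to understand in isolation. The following is a sketch rather than the code above: it is written for Python 3, where urljoin lives in urllib.parse (the listing uses Python 2's urllib.basejoin), and the names Formatter and apply_format as well as the URLs are assumptions for illustration.

```python
from urllib.parse import urljoin

class Formatter(object):
    """Mapping-like wrapper so '%' formatting can see url, value, and makeurl."""
    def __init__(self, url, value):
        self._dict = {'url': url, 'value': value}

    def __getitem__(self, item):
        if item == 'makeurl':
            # Resolve the extracted value against the URL of the page it came from.
            return urljoin(self._dict['url'], self._dict['value'])
        return self._dict[item]

    def __str__(self):
        # A bare %s in the format string yields the raw value.
        return self._dict['value']

def apply_format(fmt, url, value):
    """Apply a template's format string, or pass the value through if there is none."""
    return fmt % Formatter(url, value) if fmt else value

# Hypothetical page URL and extracted values.
page = 'http://example.com/comics/today.html'
absolute = apply_format('%(makeurl)s', page, '../strips/0517.png')
labelled = apply_format('%(value)s (from %(url)s)', page, 'Monday')
plain = apply_format('%s', page, 'Monday')
untouched = apply_format(None, page, 'Monday')
```

A format of %(makeurl)s is the useful case for comics: image paths scraped from a page are often relative, and this turns them into absolute URLs before they are stored.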
