Version 1 (modified by Cory McWilliams, 16 years ago)

Scraper

Scraping news headlines from web sites was a fun project of mine way back in the days before RSS took off. My crappy project for that has long since been obsolete.

I currently rely on dailystrips to gather web comics for me. Many of these already have RSS feeds that provide everything I want, but it would be great to be able to extract just the information I need into a common format from the ones that don't. dailystrips does just that, but that's as far as it goes.

There are plenty of other places where lots of data is freely available, but in a format that is extremely awkward to do anything with. Ideally everyone will embrace the semantic web, but until then, I want a straightforward way to force the semantic web on these sites.

I have been thinking about how to combine some of my thoughts on these things into a happy, effective webapp. I finally sat down and tried to whip up some parts of what I wanted, and I found most of it to be so blazingly simple that I felt like documenting it.

Schema

I actually started with the database schema, since this all revolves around the data I want to eventually be able to extract and store.

I am currently working with something like this:

CREATE TABLE `files` ( `url` text, `accessed` datetime default NULL, `headers` text, `content` blob, `actual_url` text);
CREATE TABLE `templates` (`url` text, `type` text, `pattern` text, `meaning` text, `format` text);
CREATE TABLE `data` ( `created` datetime default NULL, `meaning` text, `url` text, `value` text);
  • Files is a set of URLs, file contents, a timestamp, and then some metadata (the response headers and the URL the content was actually served from). One program will look for files that need to be fetched and fetch them into this table.
  • Templates is a set of patterns that describe how data is extracted from the files. More on this later.
  • Data is the extracted data, in RDF-esque (url, meaning, value) triples plus a timestamp.
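
As a rough illustration of how the (url, meaning, value) rows in the data table might be written and read back, here is a minimal sketch. It assumes Python 3 and uses the standard sqlite3 module in place of MySQL so that it runs without a server. The helpers make_db, add_triple and values_for, and the example.com URLs, are made up for this example and are not part of the scraper itself.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL here; the columns mirror the schema above.
SCHEMA = """
CREATE TABLE files (url TEXT, accessed DATETIME DEFAULT NULL, headers TEXT,
                    content BLOB, actual_url TEXT);
CREATE TABLE templates (url TEXT, type TEXT, pattern TEXT, meaning TEXT, format TEXT);
CREATE TABLE data (created DATETIME DEFAULT NULL, meaning TEXT, url TEXT, value TEXT);
"""


def make_db():
    """Create an in-memory database containing the three tables."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db


def add_triple(db, url, meaning, value):
    """Store one (url, meaning, value) row, stamped with the current time."""
    db.execute(
        "INSERT INTO data (created, meaning, url, value) "
        "VALUES (datetime('now'), ?, ?, ?)",
        (meaning, url, value),
    )


def values_for(db, url, meaning):
    """Return every value recorded for one page and one meaning, oldest first."""
    rows = db.execute(
        "SELECT value FROM data WHERE url = ? AND meaning = ? ORDER BY rowid",
        (url, meaning),
    )
    return [value for (value,) in rows]


db = make_db()
# Hypothetical pages: comic 1 links forward to comic 2.
add_triple(db, "http://example.com/comic/1", "comic:next", "http://example.com/comic/2")
print(values_for(db, "http://example.com/comic/1", "comic:next"))
```

Keeping everything as generic rows means a new kind of fact only needs a new meaning string, not a schema change.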

Fetching

Here's the current code to fetch pages. The actual fetching is as simple as urllib2.urlopen(url).read(). The first real part of the code grabs next and previous links from the database and adds any new ones to the list of files to fetch. The second part downloads the files and stores them.

#!/usr/bin/env python

import urllib2
import MySQLdb

db = MySQLdb.connect(user='user', passwd='passwd', host='host', db='db')
cursor = db.cursor()

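# Part 1: queue any next/previous links that are not in `files` yet.
cursor.execute('SELECT value FROM data WHERE meaning in ("comic:next", "comic:previous")')
insert = db.cursor()
for (url,) in cursor:
    insert.execute('SELECT 1 FROM files WHERE url=%s', (url,))
    if not insert.fetchone():
        # Use the second cursor: executing on `cursor` here would discard
        # the result set this loop is still iterating over.
        insert.execute('INSERT INTO files (url) VALUES (%s)', (url,))
db.commit()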

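# Part 2: download every file that has not been fetched yet.
cursor.execute('SELECT url FROM files WHERE accessed IS NULL')

# fetchall() copies the rows out first, so the UPDATE can safely reuse `cursor`.
for (url,) in cursor.fetchall():
    print 'Fetching %s...' % url
    u = urllib2.urlopen(url)
    data = u.read()
    # u.url is the final URL, which differs from `url` after a redirect.
    cursor.execute('UPDATE files SET content=%s, headers=%s, accessed=NOW(), actual_url=%s WHERE url=%s',
        (data, str(u.headers), u.url, url))
    db.commit()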
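
For anyone who wants to try the same two steps without a MySQL server or Python 2, the following is a minimal sketch under a few assumptions: it is written for Python 3, uses the standard sqlite3 module instead of MySQLdb, and takes the download step as a parameter so it can be exercised without a network. The function names (make_db, queue_linked_pages, fetch_pending, http_fetch) are invented for this sketch, and queuing is done with a single INSERT ... SELECT rather than the row-by-row check in the script above.

```python
import sqlite3
import urllib.request


def make_db():
    """In-memory stand-in for the MySQL database (only the tables used here)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE files (url TEXT, accessed DATETIME DEFAULT NULL,
                            headers TEXT, content BLOB, actual_url TEXT);
        CREATE TABLE data (created DATETIME DEFAULT NULL, meaning TEXT,
                           url TEXT, value TEXT);
    """)
    return db


def queue_linked_pages(db):
    """Add every next/previous link found in `data` that `files` lacks."""
    db.execute("""
        INSERT INTO files (url)
        SELECT DISTINCT d.value FROM data AS d
        WHERE d.meaning IN ('comic:next', 'comic:previous')
          AND NOT EXISTS (SELECT 1 FROM files AS f WHERE f.url = d.value)
    """)
    db.commit()


def http_fetch(url):
    """Real downloader: the Python 3 counterpart of urllib2.urlopen(url)."""
    with urllib.request.urlopen(url) as response:
        return response.read(), str(response.headers), response.url


def fetch_pending(db, fetch=http_fetch):
    """Download each file with no `accessed` time; return the URLs fetched.

    `fetch(url)` must return (content_bytes, headers_text, final_url).
    """
    # Materialize the list first so the UPDATEs cannot disturb the iteration.
    pending = db.execute("SELECT url FROM files WHERE accessed IS NULL").fetchall()
    for (url,) in pending:
        content, headers, actual_url = fetch(url)
        db.execute(
            "UPDATE files SET content = ?, headers = ?, "
            "accessed = datetime('now'), actual_url = ? WHERE url = ?",
            (content, headers, actual_url, url),
        )
        db.commit()  # commit per file so an interrupted run keeps its progress
    return [url for (url,) in pending]


if __name__ == "__main__":
    db = make_db()
    # Hypothetical seed data: one known page that links to an unknown one.
    db.execute("INSERT INTO files (url) VALUES ('http://example.com/comic/1')")
    db.execute("INSERT INTO data (meaning, url, value) VALUES "
               "('comic:next', 'http://example.com/comic/1', 'http://example.com/comic/2')")
    queue_linked_pages(db)
    # A fake downloader keeps the demo offline; pass nothing to use http_fetch.
    fake = lambda url: (b"<html></html>", "Content-Type: text/html", url)
    print(fetch_pending(db, fake))
```

Passing the downloader in as an argument is only a convenience for testing; calling fetch_pending(db) with no second argument goes over the network.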