Changes between Initial Version and Version 1 of Projects/Scraper


Ignore:
Timestamp:
Dec 28, 2007, 3:37:55 PM (16 years ago)
Author:
Cory McWilliams
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • Projects/Scraper

    v1 v1  
     1= Scraper =
     2Scraping news headlines from web sites was a fun project of mine way back in the days before RSS took off.  My crappy project for that has long been obsoleted.
     3
     4I currently rely on [http://dailystrips.sourceforge.net/ dailystrips] to gather web comics for me.  Many of these already have RSS feeds that provide everything I want, but it would be great to be able to extract just the information I need into a common format from the ones that don't.  dailystrips does just that, but that's as far as it goes.
     5
     6There are plenty of other places where lots of data is freely available but in an extremely awkward format to do anything with.  Ideally everyone will embrace the semantic web, but until then, I want a straightforward way to force the semantic web on these sites.
     7
     8Ways to combine some of my thoughts about these things into a happy, effective webapp have been on my mind.  I finally sat down and tried to whip up some parts of what I wanted, and I found most of this to be so blazingly simple that I felt like documenting it.
     9
     10== Schema ==
     11I actually started with the database schema, since this all revolved around the data I wanted eventually be able to extract and store.
     12
     13I am currently working with something like this:
     14{{{
     15#!sql
     16CREATE TABLE `files` ( `url` text, `accessed` datetime default NULL, `headers` text, `content` blob, `actual_url` text);
     17CREATE TABLE `templates` (`url` text, `type` text, `pattern` text, `meaning` text, `format` text);
     18CREATE TABLE `data` ( `created` datetime default NULL, `meaning` text, `url` text, `value` text);
     19}}}
     20
     21 * ''Files'' is a set of filenames, file contents, a timestamp, and then some metadata.  One program will look for files that need to be fetched and fetch them into this table.
     22 * ''Templates'' is a set of patterns that describe how data is extracted from the files.  More on this later.
     23 * ''Data'' is the extracted data, in RDF-esque (url, meaning, value) triplets plus a timestamp.
     24
     25== Fetching ==
     26Here's the current code to fetch pages.  The actual fetching is as simple as ''urllib2.urlopen(url).read()''.  The first real part of the code grabs ''next'' and ''previous'' links from the database and adds any new ones to the list of files to fetch.  The second part downloads the files and stores them.
     27
     28{{{
     29#!python
     30#!/usr/bin/env python
     31
     32import urllib2
     33import MySQLdb
     34
     35db = MySQLdb.connect(user='user', passwd='passwd', host='host', db='db')
     36cursor = db.cursor()
     37
     38cursor.execute('SELECT value FROM data WHERE meaning in ("comic:next", "comic:previous")')
     39insert = db.cursor()
     40for (url,) in cursor:
     41    insert.execute('SELECT 1 FROM files WHERE url=%s', (url,))
     42    if not insert.fetchone():
     43        cursor.execute('INSERT INTO files (url) VALUES (%s)', (url,))
     44db.commit()
     45
     46cursor.execute('SELECT url FROM files WHERE accessed IS NULL')
     47
     48for (url,) in cursor:
     49    print 'Fetching %s...' % url
     50    u = urllib2.urlopen(url)
     51    data = u.read()
     52    cursor.execute('UPDATE files SET content=%s, headers=%s, accessed=NOW(), actual_url=%s WHERE url=%s',
     53        (data, str(u.headers), u.url, url))
     54    db.commit()
     55}}}