Changes between Version 2 and Version 3 of Projects/Scraper


Timestamp: Dec 28, 2007, 3:57:39 PM
Author: Cory McWilliams

  • Projects/Scraper

print 'Done.'
}}}

== Examples ==
That was more than enough code to get me started.  Here are the templates I used for testing.  For each comic I extract an image, its alt text, next and previous links, and an id.  The fetcher understands the links.  ''url'' is a pattern describing which pages a template applies to; it is currently matched with a MySQL ''LIKE'', hence the '%'s.  As you can see, I prefer the XPath queries so far; they seem to be quite robust for this purpose.

{{{
+------------------------------------------+----------+---------------------------------------------------+----------------+-------------+
| url                                      | type     | pattern                                           | meaning        | format      |
+------------------------------------------+----------+---------------------------------------------------+----------------+-------------+
| http://www.penny-arcade.com/comic/%      | xpath    | //div[@id="comicstrip"]/img/@src                  | comic:img      | %(makeurl)s |
| http://www.penny-arcade.com/comic/%      | xpath    | //div[@id="comicstrip"]/img/@alt                  | comic:alt      | NULL        |
| http://www.penny-arcade.com/comic/%      | xpath    | //a[img[@alt="Next"]]/@href                       | comic:next     | %(makeurl)s |
| http://www.penny-arcade.com/comic/%      | xpath    | //a[img[@alt="Back"]]/@href                       | comic:previous | %(makeurl)s |
| http://www.penny-arcade.com/comic/%      | xpath    | //input[@name="Date"]/@value                      | comic:id       | NULL        |
| http://questionablecontent.net/view.php% | xpath    | (//a[text()="Next"]/@href)[1]                     | comic:next     | %(makeurl)s |
| http://questionablecontent.net/view.php% | xpath    | (//a[text()="Previous"]/@href)[1]                 | comic:previous | %(makeurl)s |
| http://questionablecontent.net/view.php% | xpath    | //center/img[starts-with(@src, "./comics/")]/@src | comic:img      | %(makeurl)s |
| http://questionablecontent.net/view.php% | urlregex | (\d+)$                                            | comic:id       | NULL        |
| http://sinfest.net/archive_page.php%     | xpath    | //img[contains(@src, "/comics/")]/@src            | comic:img      | %(makeurl)s |
| http://sinfest.net/archive_page.php%     | xpath    | //img[contains(@src, "/comics/")]/@alt            | comic:alt      | NULL        |
| http://sinfest.net/archive_page.php%     | xpath    | //a[img[@alt="Next"]]/@href                       | comic:next     | %(makeurl)s |
| http://sinfest.net/archive_page.php%     | xpath    | //a[img[@alt="Previous"]]/@href                   | comic:previous | %(makeurl)s |
| http://sinfest.net/archive_page.php%     | urlregex | (\d+)$                                            | comic:id       | NULL        |
+------------------------------------------+----------+---------------------------------------------------+----------------+-------------+
}}}
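To make the table a little more concrete, here is a rough sketch of how a single ''urlregex'' row might be applied to a page URL.  It is not the scraper's actual code.  The helper names (`like_to_regex`, `format_value`, `apply_urlregex`) and the sample URLs are made up for illustration.  It assumes that ''makeurl'' means "resolve the extracted value against the page URL", and the ''LIKE'' translation is simplified (no escape characters, no collation rules).

```python
import re
from urllib.parse import urljoin


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into a compiled regex (simplified).

    '%' matches any run of characters and '_' matches exactly one.
    """
    parts = []
    for ch in pattern:
        if ch == '%':
            parts.append('.*')
        elif ch == '_':
            parts.append('.')
        else:
            parts.append(re.escape(ch))
    return re.compile(''.join(parts) + r'\Z', re.DOTALL)


def format_value(fmt, raw, page_url):
    """Apply a row's format column; None stands in for SQL NULL."""
    if fmt is None:
        return raw
    # Assumption: 'makeurl' is the raw value resolved against the page URL.
    return fmt % {'makeurl': urljoin(page_url, raw)}


def apply_urlregex(row, page_url):
    """Return (meaning, value) for a 'urlregex' row, or None if it does not apply."""
    if not like_to_regex(row['url']).match(page_url):
        return None
    found = re.search(row['pattern'], page_url)
    if found is None:
        return None
    return (row['meaning'], format_value(row['format'], found.group(1), page_url))


# One row from the table above, written as a dict.
row = {
    'url': 'http://questionablecontent.net/view.php%',
    'type': 'urlregex',
    'pattern': r'(\d+)$',
    'meaning': 'comic:id',
    'format': None,
}

# Hypothetical page URL that matches the row's pattern.
page = 'http://questionablecontent.net/view.php?comic=1050'
print(apply_urlregex(row, page))
print(format_value('%(makeurl)s', 'view.php?comic=1051', page))
```

An ''xpath'' row would follow the same shape, with the pattern evaluated against the fetched document instead of the URL.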

The results?  It works exactly as I expected.  I seeded a few URLs from each comic, then alternated running the fetcher and the scraper, and my collection of structured data about these comics grew.

== Presentation ==
For this to be useful, I need to be able to easily produce appealing reports from this data.  My first attempt produces RSS feeds with [http://genshi.edgewall.org/ Genshi].

My no-nonsense template looks like this:
{{{
#!xml
<rss version="2.0"
    xmlns:py="http://genshi.edgewall.org/">
    <channel>
        <title>${title}</title>
        <py:for each="item in items">
            <item>
                <title>${item.alt}</title>
                <description>&lt;img src="${item.img}" alt="${item.alt}" /&gt;</description>
                <guid>${item.url}#${item.id}</guid>
            </item>
        </py:for>
    </channel>
</rss>
}}}
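For a feel of what this template produces without installing Genshi, the same document shape can be approximated with the standard library.  This is only a sketch: `render_feed` and the sample item are hypothetical, and unlike the template it also HTML-escapes the attribute values inside the embedded `<img>` tag.

```python
import html
import xml.etree.ElementTree as ET


def render_feed(title, items):
    """Build an RSS 2.0 document with one <item> per comic dict."""
    rss = ET.Element('rss', version='2.0')
    channel = ET.SubElement(rss, 'channel')
    ET.SubElement(channel, 'title').text = title
    for item in items:
        node = ET.SubElement(channel, 'item')
        ET.SubElement(node, 'title').text = item['alt']
        # The description carries HTML as plain text; ElementTree escapes
        # the angle brackets when serializing, giving &lt;img ... /&gt;.
        img_tag = '<img src="%s" alt="%s" />' % (
            html.escape(item['img'], quote=True),
            html.escape(item['alt'], quote=True),
        )
        ET.SubElement(node, 'description').text = img_tag
        ET.SubElement(node, 'guid').text = '%s#%s' % (item['url'], item['id'])
    return ET.tostring(rss, encoding='unicode')


# Made-up sample data using the field names the template binds to.
sample = {
    'url': 'http://example.com/view.php?comic=1',
    'id': '1',
    'img': 'http://example.com/comics/1.png',
    'alt': 'Hello',
}
feed = render_feed('Sample comic', [sample])
print(feed)
```

Building the tree rather than concatenating strings leaves the XML escaping to the library, which is the same job the template engine does for the `${...}` substitutions.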

The program to put everything together looks like this:
{{{
#!python
import MySQLdb
from genshi.template import TemplateLoader
import sys

db = MySQLdb.connect(user='user', passwd='passwd', host='host', db='db')
cursor = db.cursor()

if len(sys.argv) != 3:
    print 'Usage: %s urlpattern title' % sys.argv[0]
    sys.exit(1)
(urlpattern, title) = sys.argv[1:]

# Every URL with a comic:id is one comic; highest id value first.
# The pattern is passed as a query parameter so the driver quotes it.
items = []
cursor.execute('SELECT url FROM data WHERE meaning="comic:id" AND url LIKE %s ORDER BY value DESC', (urlpattern,))
fields = db.cursor()
for (url,) in cursor:
    # Gather all comic:* tuples for this URL into one dict,
    # keyed by the part of the meaning after "comic:".
    fields.execute('SELECT meaning, value FROM data WHERE url=%s AND meaning LIKE "comic:%%"', (url,))
    item = {'url': url}
    for (meaning, value) in fields:
        item[meaning.split(':', 1)[1]] = unicode(value, 'utf8')
    items.append(item)

# Render the collected items through the RSS template.
loader = TemplateLoader('.')
template = loader.load('comicrss.xml')
stream = template.generate(title=title, items=items)
print stream.render('xml')
}}}

The script assumes that there is exactly one ''comic:id'' per comic, and that any other tuples for a URL with a ''comic:id'' are data relevant to that comic.  The template binds to the specific fields it cares about and produces an RSS XML document.
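The grouping assumption can be tried out without a MySQL server.  The sketch below uses an in-memory SQLite database as a stand-in.  The `data(url, meaning, value)` layout is inferred from the queries in the script, the column types and sample rows are made up, and `collect_items` is a hypothetical helper.

```python
import sqlite3


def collect_items(db, urlpattern):
    """Build one dict per comic from (url, meaning, value) tuples."""
    items = []
    ids = db.execute(
        "SELECT url FROM data WHERE meaning = 'comic:id' AND url LIKE ? "
        "ORDER BY value DESC",
        (urlpattern,),
    ).fetchall()
    for (url,) in ids:
        item = {'url': url}
        fields = db.execute(
            "SELECT meaning, value FROM data "
            "WHERE url = ? AND meaning LIKE 'comic:%'",
            (url,),
        )
        for meaning, value in fields:
            # 'comic:img' becomes the key 'img', and so on.
            item[meaning.split(':', 1)[1]] = value
        items.append(item)
    return items


db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE data (url TEXT, meaning TEXT, value TEXT)')
db.executemany('INSERT INTO data VALUES (?, ?, ?)', [
    ('http://example.com/view.php?comic=1', 'comic:id', '1'),
    ('http://example.com/view.php?comic=1', 'comic:img', 'http://example.com/comics/1.png'),
    ('http://example.com/view.php?comic=2', 'comic:id', '2'),
    ('http://example.com/view.php?comic=2', 'comic:img', 'http://example.com/comics/2.png'),
    # No comic:id for this URL, so it never becomes an item.
    ('http://example.com/about.php', 'comic:alt', 'stray row'),
])

items = collect_items(db, 'http://example.com/view.php%')
for item in items:
    print(item)
```

A URL only becomes an item if it has a ''comic:id'' row, which is why the stray row is skipped.  Because `value` is stored as text here, `ORDER BY value DESC` sorts ids as strings.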