Changes between Version 2 and Version 3 of Projects/Scraper

Timestamp:: Dec 28, 2007, 3:57:39 PM
Author:: Cory McWilliams
Comment:: --

Projects/Scraper

-              v2
+              v3
 print 'Done.'
 }}}
+== Examples ==
+That was more than enough code to get me started.  Here are the templates I used for testing.  For each comic I extract an image, alt text, next and previous links, and an id.  The fetcher understands the links.  ''url'' is a pattern that selects the pages a template applies to; it is currently matched with a MySQL ''LIKE'', hence the '%' wildcards.  As you can see, I prefer the xpath queries so far; they seem quite robust for this purpose.
+{{{
++------------------------------------------+----------+---------------------------------------------------+----------------+-------------+
+| url                                      | type     | pattern                                           | meaning        | format      |
++------------------------------------------+----------+---------------------------------------------------+----------------+-------------+
+| http://www.penny-arcade.com/comic/%      | xpath    | //div[@id="comicstrip"]/img/@src                  | comic:img      | %(makeurl)s |
+| http://www.penny-arcade.com/comic/%      | xpath    | //div[@id="comicstrip"]/img/@alt                  | comic:alt      | NULL        |
+| http://www.penny-arcade.com/comic/%      | xpath    | //a[img[@alt="Next"]]/@href                       | comic:next     | %(makeurl)s |
+| http://www.penny-arcade.com/comic/%      | xpath    | //a[img[@alt="Back"]]/@href                       | comic:previous | %(makeurl)s |
+| http://www.penny-arcade.com/comic/%      | xpath    | //input[@name="Date"]/@value                      | comic:id       | NULL        |
+| http://questionablecontent.net/view.php% | xpath    | (//a[text()="Next"]/@href)[1]                     | comic:next     | %(makeurl)s |
+| http://questionablecontent.net/view.php% | xpath    | (//a[text()="Previous"]/@href)[1]                 | comic:previous | %(makeurl)s |
+| http://questionablecontent.net/view.php% | xpath    | //center/img[starts-with(@src, "./comics/")]/@src | comic:img      | %(makeurl)s |
+| http://questionablecontent.net/view.php% | urlregex | (\d+)$                                            | comic:id       | NULL        |
+| http://sinfest.net/archive_page.php%     | xpath    | //img[contains(@src, "/comics/")]/@src            | comic:img      | %(makeurl)s |
+| http://sinfest.net/archive_page.php%     | xpath    | //img[contains(@src, "/comics/")]/@alt            | comic:alt      | NULL        |
+| http://sinfest.net/archive_page.php%     | xpath    | //a[img[@alt="Next"]]/@href                       | comic:next     | %(makeurl)s |
+| http://sinfest.net/archive_page.php%     | xpath    | //a[img[@alt="Previous"]]/@href                   | comic:previous | %(makeurl)s |
+| http://sinfest.net/archive_page.php%     | urlregex | (\d+)$                                            | comic:id       | NULL        |
++------------------------------------------+----------+---------------------------------------------------+----------------+-------------+
+}}}
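To make the ''url'', ''urlregex'' and ''format'' columns concrete, here is a minimal Python 3 sketch using only the standard library. It is not the scraper's actual code. The helper names (`like_to_regex`, `apply_urlregex`, `apply_format`) and the sample page URL are made up for illustration, and it assumes that `%(makeurl)s` means "resolve the extracted value against the page URL", which the table suggests but does not state. MySQL's `LIKE` is normally case-insensitive; this approximation is case-sensitive.

```python
import re
from urllib.parse import urljoin

def like_to_regex(pattern):
    """Compile a SQL LIKE pattern: '%' matches any run, '_' any single character."""
    parts = []
    for ch in pattern:
        if ch == '%':
            parts.append('.*')
        elif ch == '_':
            parts.append('.')
        else:
            parts.append(re.escape(ch))
    return re.compile(''.join(parts) + r'\Z', re.DOTALL)

def apply_urlregex(url, pattern):
    """Return the first capture group of pattern searched in url, or None."""
    match = re.search(pattern, url)
    return match.group(1) if match else None

def apply_format(fmt, page_url, value):
    """Apply a template's format column; None (SQL NULL) leaves the value as is.

    ASSUMPTION: %(makeurl)s stands for the value resolved against the page URL.
    """
    if fmt is None:
        return value
    return fmt % {'makeurl': urljoin(page_url, value)}

# Hypothetical page URL, used only to exercise the helpers.
page = 'http://questionablecontent.net/view.php?comic=1050'
applies = like_to_regex('http://questionablecontent.net/view.php%').match(page) is not None
comic_id = apply_urlregex(page, r'(\d+)$')
img = apply_format('%(makeurl)s', page, './comics/1050.png')
```

The xpath templates are left out because they need a third-party XPath library; the standard library's ElementTree cannot select attributes with `/@src`.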
+The results?  It is working exactly as I expected.  I seeded a few URLs from each comic and then alternated running the fetcher and scraper, and my collection of structured data about these comics grew.
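The alternation described above can be pictured with a toy, in-memory model. This is only a sketch of the idea, not the real fetcher or scraper. The `PAGES` dictionary stands in for the web, and the function names and URLs are hypothetical. Each fetcher pass "downloads" the pending URLs, and each scraper pass records tuples and queues any next or previous link that has not been fetched yet.

```python
# Hypothetical stand-in for the web: url -> tuples a scraper would extract.
PAGES = {
    'comic/1': {'comic:next': 'comic/2'},
    'comic/2': {'comic:next': 'comic/3', 'comic:previous': 'comic/1'},
    'comic/3': {'comic:previous': 'comic/2'},
}

def fetch(pending, fetched):
    """Fetcher pass: 'download' every pending URL that exists, then empty the queue."""
    for url in sorted(pending):
        if url in PAGES:
            fetched.add(url)
    pending.clear()

def scrape(fetched, data, pending):
    """Scraper pass: record tuples for fetched pages and queue links not yet fetched."""
    for url in sorted(fetched):
        for meaning, value in PAGES[url].items():
            data.add((url, meaning, value))
            if meaning in ('comic:next', 'comic:previous') and value not in fetched:
                pending.add(value)

def crawl(seeds):
    """Alternate fetcher and scraper passes until no new links turn up."""
    pending, fetched, data = set(seeds), set(), set()
    while pending:
        fetch(pending, fetched)
        scrape(fetched, data, pending)
    return fetched, data

fetched, data = crawl(['comic/2'])
```

Seeding a single page in the middle of the chain is enough here: the first round discovers its two neighbours, and the second round fetches them and finds nothing new.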
+== Presentation ==
+I know that for this to be useful I need to be able to easily produce appealing reports from this data.  My first attempt produces RSS feeds with [http://genshi.edgewall.org/ genshi].
+My no-nonsense template looks like this:
+{{{
+#!xml
+<rss version="2.0"
+    xmlns:py="http://genshi.edgewall.org/">
+    <channel>
+        <title>${title}</title>
+        <py:for each="item in items">
+            <item>
+                <title>${item.alt}</title>
+                <description>&lt;img src="${item.img}" alt="${item.alt}" /&gt;</description>
+                <guid>${item.url}#${item.id}</guid>
+            </item>
+        </py:for>
+    </channel>
+</rss>
+}}}
+The program to put everything together looks like this:
+{{{
+#!python
+import sys
+import MySQLdb
+from genshi.template import TemplateLoader
+# Check the arguments before opening a database connection.
+if len(sys.argv) != 3:
+    print 'Usage: %s urlpattern title' % sys.argv[0]
+    sys.exit(1)
+(urlpattern, title) = sys.argv[1:]
+db = MySQLdb.connect(user='user', passwd='passwd', host='host', db='db')
+cursor = db.cursor()
+fields = db.cursor()
+# Find one url per comic:id, ordered by id descending.  The pattern is passed
+# as a query parameter so that the driver quotes it.
+cursor.execute('SELECT url FROM data WHERE meaning="comic:id" AND url LIKE %s ORDER BY value DESC', (urlpattern,))
+items = []
+for (url,) in cursor:
+    # Collect every comic:* tuple for this url into one dict, keyed by the part after "comic:".
+    fields.execute('SELECT meaning, value FROM data WHERE url=%s AND meaning LIKE "comic:%%"', (url,))
+    item = {'url': url}
+    for (meaning, value) in fields:
+        item[meaning.split(':', 1)[1]] = unicode(value, 'utf8')
+    items.append(item)
+# Render the collected items through the RSS template.
+loader = TemplateLoader('.')
+template = loader.load('comicrss.xml')
+stream = template.generate(title=title, items=items)
+print stream.render('xml')
+}}}
+The script works under the assumption that there is one ''comic:id'' per comic, and that any other tuples for a url with a ''comic:id'' are data relevant to that comic.  The template binds to the specific fields it cares about and produces an RSS XML document.
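The grouping assumption can be tried out without MySQL or genshi. The following Python 3 sketch swaps in the standard library's `sqlite3` (in-memory) and `xml.etree.ElementTree` for both. The `data` table layout is inferred from the queries above, and the rows, the `example.com` URLs and the function names are invented for illustration. Note that `sqlite3` uses `?` placeholders where MySQLdb uses `%s`.

```python
import sqlite3
import xml.etree.ElementTree as ET

def load_items(db, urlpattern):
    """Build one dict per comic:id row, merging all comic:* tuples for its url."""
    ids = db.execute(
        "SELECT url FROM data WHERE meaning = 'comic:id' AND url LIKE ? "
        "ORDER BY value DESC", (urlpattern,))
    items = []
    for (url,) in ids.fetchall():
        item = {'url': url}
        rows = db.execute(
            "SELECT meaning, value FROM data "
            "WHERE url = ? AND meaning LIKE 'comic:%'", (url,))
        for meaning, value in rows:
            # 'comic:img' becomes the key 'img', and so on.
            item[meaning.split(':', 1)[1]] = value
        items.append(item)
    return items

def render_rss(title, items):
    """Produce the same element structure as the template, with missing fields left empty."""
    rss = ET.Element('rss', version='2.0')
    channel = ET.SubElement(rss, 'channel')
    ET.SubElement(channel, 'title').text = title
    for item in items:
        node = ET.SubElement(channel, 'item')
        alt = item.get('alt', '')
        ET.SubElement(node, 'title').text = alt
        # ElementTree escapes < and > in text, so the img markup is emitted escaped.
        ET.SubElement(node, 'description').text = (
            '<img src="%s" alt="%s" />' % (item.get('img', ''), alt))
        ET.SubElement(node, 'guid').text = '%s#%s' % (item['url'], item.get('id', ''))
    return ET.tostring(rss, encoding='unicode')

# Invented sample rows in an in-memory database.
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE data (url TEXT, meaning TEXT, value TEXT)')
db.executemany('INSERT INTO data VALUES (?, ?, ?)', [
    ('http://example.com/view.php?comic=1', 'comic:id', '1'),
    ('http://example.com/view.php?comic=1', 'comic:img', 'http://example.com/comics/1.png'),
    ('http://example.com/view.php?comic=1', 'comic:alt', 'First'),
    ('http://example.com/view.php?comic=2', 'comic:id', '2'),
    ('http://example.com/view.php?comic=2', 'comic:img', 'http://example.com/comics/2.png'),
])
items = load_items(db, 'http://example.com/view.php%')
feed = render_rss('Example comic', items)
```

In this sketch a comic that lacks a field, such as the second one's alt text, simply ends up with an empty title in the feed.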