Changes between Version 3 and Version 4 of Projects/Scraper


Timestamp: Dec 28, 2007, 4:09:13 PM
Author: Cory McWilliams
The script assumes that there is one ''comic:id'' per comic, and that any other tuples for a URL with a ''comic:id'' are data relevant to that comic.  The template binds to the specific fields it cares about and produces an RSS XML document.
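
As a rough illustration of that assumption, here is a minimal Python sketch.  It is not the actual script: it assumes the tuples are available as `(url, key, value)` rows, the `comic:title` field and the function names are hypothetical, and it builds the RSS with the standard library's `xml.etree.ElementTree` instead of a Genshi template.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET


def group_comics(tuples):
    """Group (url, key, value) tuples by url and keep only urls with a comic:id.

    Returns a dict mapping each comic:id to that url's fields (plus the url).
    """
    by_url = defaultdict(dict)
    for url, key, value in tuples:
        by_url[url][key] = value
    return {
        fields["comic:id"]: dict(fields, url=url)
        for url, fields in by_url.items()
        if "comic:id" in fields
    }


def to_rss(comics, title="Comics"):
    """Render the grouped comics as a small RSS 2.0 document (a string)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    for comic_id in sorted(comics):
        fields = comics[comic_id]
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "guid").text = str(comic_id)
        ET.SubElement(item, "link").text = fields["url"]
        # "comic:title" is a hypothetical field; fall back to the id.
        ET.SubElement(item, "title").text = fields.get("comic:title", str(comic_id))
    return ET.tostring(rss, encoding="unicode")
```

A URL without a ''comic:id'' tuple is simply ignored, which mirrors the idea that only pages identified as comics contribute items to the feed.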

== Thoughts on Improvements ==
I know this thing has a lot of shortcomings, but I think it is well on its way to being what I want.  Here are some of the things I have in mind at the moment.

 * There should be a web interface for manipulating templates, files, and fetching.  Editing the DB contents by hand is far from ideal, and a web interface could show results very clearly.
 * Pages that change haven't been accounted for.  For example, the page for the latest comic might not have a ''next'' link until the next comic is available.  The fetcher needs to know to re-fetch that page in those circumstances.
 * Comic images should be fetched and referenced locally.
 * The fetcher should be rate-limited.  I currently run it only periodically, so that each run fetches just one or two pages per site, but something should be built in so that it doesn't hammer sites.
 * The scripts should share a common configuration instead of hardcoding the DB connection data in each one.
 * This needs to be tested with many more comics.
 * This needs to be tested with something that is entirely unlike comics.
 * Genshi works great for templating in this specific case, but it might be preferable to allow user-defined templates, which might require a sandboxable template system.
 * I should learn how badly I'm butchering RDF concepts.
 * Document templates need to be decoupled from the program that generates documents from them.
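
The rate-limiting item could be approached with a small per-host limiter along these lines.  This is only a sketch under assumptions: the class name, the five-second default, and the idea of keying on the URL's host are all hypothetical choices, not something the fetcher does today.

```python
import time
from urllib.parse import urlsplit


class HostRateLimiter:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_interval=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between hits on one host
        self._clock = clock               # injectable so it can be tested
        self._sleep = sleep
        self._last = {}                   # host -> time of the latest request

    def wait(self, url):
        """Block until `url` may be fetched; return the seconds slept."""
        host = urlsplit(url).netloc
        now = self._clock()
        delay = 0.0
        if host in self._last:
            delay = max(0.0, self._last[host] + self.min_interval - now)
        if delay:
            self._sleep(delay)
        # Record when this request is considered to have happened.
        self._last[host] = now + delay
        return delay
```

The fetcher would call `wait(url)` immediately before each request.  Different hosts do not delay each other, so a run that touches many sites stays fast while no single site gets hammered.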

[[AddComment]]