Harvesting design

Michael Allan mike at zelea.com
Sat Jul 30 16:03:30 EDT 2011


conseo wrote:
> > Ditto for processing the incoming mail, which is no longer
> > essential to generating the diff feed.  If I understand correctly,
> > the diff feed is now to be generated entirely by parsing the
> > mailing list's web archive?
> 
> Nope, it is to be generated from the raw archive in the
> maildir. The MailHarvester is already pretty much in the proposed
> shape. I am afraid the web archive of mailman is not
> sufficient. It does not expose mail addresses, for example, which are
> crucial to check for authorship via MailishUsername. I came across
> this problem during scraping and had to scrape by timestamp
> instead of by the user's identity.

Pipermail does expose the sender's address.  You must be thinking of
Google Groups.
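
For what it's worth, a pipermail page renders each sender's address in
the obfuscated "user at host" form (this message's own header shows
"mike at zelea.com"), so a harvester can recover addresses with a
simple pattern match.  A minimal sketch, assuming the typical Mailman 2
obfuscation format; the class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Recovers mail addresses from pipermail's obfuscated "user at host" form.
  * A sketch only; assumes the common Mailman 2 rendering. */
public class PipermailAddresses {

    // Pipermail renders addresses as e.g. "mike at zelea.com" (assumed format).
    private static final Pattern OBFUSCATED =
        Pattern.compile("([\\w.+-]+) at ([\\w.-]+\\.[A-Za-z]{2,})");

    /** Extracts every address found in a chunk of archive text. */
    public static List<String> extract(String archiveText) {
        List<String> addresses = new ArrayList<>();
        Matcher m = OBFUSCATED.matcher(archiveText);
        while (m.find()) {
            addresses.add(m.group(1) + "@" + m.group(2)); // de-obfuscate
        }
        return addresses;
    }

    public static void main(String[] args) {
        System.out.println(
            extract("Michael Allan mike at zelea.com Sat Jul 30"));
        // prints [mike@zelea.com]
    }
}
```

This still would not work for Google Groups, which hides addresses
entirely, but it is enough for a MailishUsername check against a
pipermail archive.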

> I am rather proposing that we (or a typical client program for the
> communicational medium we want to access) store all the data
> ourselves in a first step, so as not to lose any detail; then we
> don't have to worry about future needs or adjustments to our
> database, as we can easily harvest such data pools (we can even
> exchange them between different voteservers). The admin can also
> create his own archive if the web service is borked intentionally (a
> closed and locked-up service) or unintentionally (the fluctuation of
> web services) by the service's admin.

I understand, and agree with the need.  But storing message data is
also the function of the message archive.  If we worked entirely from
that archive, then we'd gain some advantages:

  * Code is simplified because we only need a single parser.  Instead
    of a parser for the live stream and another for the archive, we
    need only the latter.

  * We can generate a diff feed based entirely on the archive, going
    back many years - even if we weren't subscribed to the live
    stream.

  * No need to code the subscriptions to the live feed.

> But if we don't have a copy, what do we do if the web archive (e.g.
> Metagovernment's mailman archive) goes off the net? We would only
> have our poor feed information, which is pretty much useless without
> a link to the full communicational data. We would then have to
> remove the data from the DB (very bad :-( for our users and for the
> historical record of the polls).

If the archive is valuable, then the users should back it up. :-) It's
their list, and their responsibility, and they shouldn't entrust it to
others.

> > 1. Harvesters.  We code a separate harvester for every format of
> >    archive.  Each harvester has a command line interface (CLI) by
> >    which the admin can launch it.  It runs once and then terminates.
> >    It parses a specific archive, for a specific period of time:
> >
> >      voharvester-pipermail --url=http://foo.de/ --start=2011-6
> >
> >    It can also be launched via Java API, of course.
> >
> >    This alone is sufficient to generate a diff feed, albeit
> >    manually.  The rest is just icing on the cake.
> 
> You really like the split programs approach, while I feel
> uncomfortable with it here. The harvesters are daemons (actually it
> is one for me) and will only be started once in a while (likely
> months)...

You are used to looking at them as daemons, but they are not
continuous processes by nature.  Maybe we should Skype first ...
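
To make the run-once shape concrete: such a harvester need be nothing
more than an ordinary main method that reads its options, walks the
archive for the requested period, and terminates.  A rough sketch of
the CLI proposed above; the class name and internals are illustrative,
not Votorola's actual API:

```java
import java.time.YearMonth;

/** Sketch of a run-once archive harvester with a minimal CLI.
  * Invoked as:  voharvester-pipermail --url=http://foo.de/ --start=2011-6
  * Names and structure are illustrative, not Votorola's actual API. */
public class Harvester {

    /** Reads a "--key=value" option from the argument list. */
    static String option(String[] args, String key) {
        for (String a : args) {
            if (a.startsWith("--" + key + "=")) {
                return a.substring(key.length() + 3); // skip "--key="
            }
        }
        throw new IllegalArgumentException("missing --" + key);
    }

    public static void main(String[] args) {
        String url = option(args, "url");
        // Accept "2011-6" as well as "2011-06".
        String[] ym = option(args, "start").split("-");
        YearMonth start = YearMonth.of(
            Integer.parseInt(ym[0]), Integer.parseInt(ym[1]));
        System.out.println("harvesting " + url + " from " + start);
        // ... fetch each monthly archive page from `start` to the present,
        // parse it, emit diff-feed records, then exit ...
    }
}
```

The same main-line logic can be wrapped in a public method and called
from the Java API, so the CLI and API entry points share one code path.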

-- 
Michael Allan

Toronto, +1 416-699-9528
http://zelea.com/


