Harvesting design
Michael Allan
mike at zelea.com
Sat Jul 30 16:03:30 EDT 2011
conseo wrote:
> > Ditto for processing the incoming mail, which is no longer
> > essential to generating the diff feed. If I understand correctly,
> > the diff feed is now to be generated entirely by parsing the
> > mailing list's web archive?
>
> Nope, it is to be generated from the raw archive in the
> maildir. The MailHarvester is already pretty much in the proposed
> shape. I am afraid the web archive of Mailman is not
> sufficient. It does not expose mail addresses, for example, which
> are crucial for checking authorship via MailishUsername. I ran into
> this problem during scraping and had to scrape by timestamp
> instead of by the user's identity.
Pipermail does expose the sender's address. You must be thinking of
Google Groups.
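To be precise, Pipermail renders the sender in obscured form, e.g.
"mike at zelea.com", when Mailman's address obscuring is switched on,
so a harvester can still recover the plain address. A minimal sketch
(the class and method names are hypothetical, not part of Votorola):

```java
/** Sketch: recover a plain mail address from Pipermail's obscured
  * "user at host.tld" rendering. Hypothetical helper, not part of
  * Votorola.
  */
public final class PipermailAddress
{
    /** Replaces the first " at " with '@'. Assumes Mailman's standard
      * address obscuring; a list that disables obscuring needs no
      * translation at all.
      */
    public static String deobscure( final String rendered )
    {
        return rendered.replaceFirst( " at ", "@" );
    }
}
```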
> I am rather proposing that we (or a typical client program for the
> communication medium we want to access) store all the data
> ourselves in a first step, losing no detail, so we don't have to
> worry about future needs or adjustments to our database, as we can
> easily harvest such data pools (we can even exchange them between
> different voteservers). The admin can also create his own archive
> if the web service is borked intentionally (a closed and locked-up
> service) or unintentionally (fluctuation of web services) by the
> service's admin.
I understand, and agree with the need. But storing message data is
also the function of the message archive. If we worked entirely from
that archive, then we'd gain some advantages:
* Code is simplified because we need only a single parser: one for
the archive, rather than one for the live stream and another for
the archive.
* We can generate a diff feed based entirely on the archive, going
back many years - even if we weren't subscribed to the live
stream.
* No need to code the subscriptions to the live feed.
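As a rough illustration of the archive-only approach: Pipermail
publishes one downloadable mbox per month under names like
"2011-July.txt", so a harvester can enumerate every month from a
start date to the present with no live subscription. A sketch under
that naming assumption (the class name is hypothetical; some
installations serve ".txt.gz" instead):

```java
import java.time.YearMonth;
import java.time.format.TextStyle;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Sketch: enumerate the monthly mbox URLs of a Pipermail archive.
  * Hypothetical helper, assuming the usual "YYYY-Month.txt" naming.
  */
public final class ArchiveMonths
{
    /** Lists the URLs from month `start` to month `end`, inclusive.
      */
    public static List<String> urls( final String archiveBase,
        final YearMonth start, final YearMonth end )
    {
        final List<String> list = new ArrayList<>();
        for( YearMonth m = start; !m.isAfter( end ); m = m.plusMonths( 1 ))
        {
            final String month = m.getMonth().getDisplayName(
                TextStyle.FULL, Locale.ENGLISH ); // e.g. "July"
            list.add( archiveBase + "/" + m.getYear() + "-" + month + ".txt" );
        }
        return list;
    }
}
```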
> But if we don't have a copy, what do we do if the web archive (e.g.
> Metagovernment's Mailman archive) goes off the net? We'd be left
> with only our poor feed information, which is pretty much useless
> without a link to the full communicational data. We would then have
> to remove the data from the DB (very bad :-( for our users and for
> the historical record of the polls).
If the archive is valuable, then the users should back it up. :-) It's
their list, and their responsibility, and they shouldn't entrust it to
others.
> > 1. Harvesters. We code a separate harvester for every format of
> > archive. Each harvester has a command line interface (CLI) by
> > which the admin can launch it. It runs once and then terminates.
> > It parses a specific archive, for a specific period of time:
> >
> > voharvester-pipermail --url=http://foo.de/ --start=2011-6
> >
> > It can also be launched via Java API, of course.
> >
> > This alone is sufficient to generate a diff feed, albeit
> > manually. The rest is just icing on the cake.
>
> You really like the split-programs approach, while I feel
> uncomfortable with it here. The harvesters are daemons (actually
> for me it is a single one) and will only be started once in a
> while (likely months apart)...
You are used to looking at them as daemons, but they are not
continuous processes by nature. Maybe we should Skype first ...
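To show what I mean, the run-once shape proposed earlier might reduce
to something like this. The flag names follow the example in the
thread; the harvest step itself is elided, and the class name is
hypothetical, not the actual Votorola harvester:

```java
/** Sketch of the run-once command-line interface proposed in the
  * thread: parse the two flags, harvest, terminate. No daemon
  * remains after the run.
  */
public final class VoharvesterPipermail
{
    /** Returns { url, start } if both flags are present, else null.
      */
    static String[] parse( final String[] args )
    {
        String url = null, start = null;
        for( final String a: args )
        {
            if( a.startsWith( "--url=" )) url = a.substring( 6 );
            else if( a.startsWith( "--start=" )) start = a.substring( 8 );
        }
        return url == null || start == null? null: new String[] { url, start };
    }

    public static void main( final String[] args )
    {
        final String[] p = parse( args );
        if( p == null )
        {
            System.err.println(
                "Usage: voharvester-pipermail --url=URL --start=YYYY-M" );
            System.exit( 1 );
        }
        // ... parse the archive at p[0] for the period starting at
        //     p[1], generate the diff feed, then terminate
    }
}
```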
--
Michael Allan
Toronto, +1 416-699-9528
http://zelea.com/