Harvesting design

Sat Jul 30 06:26:17 EDT 2011

conseo wrote:
> Proposed solution (activity diagram attached)
> 
> The current harvester's functionality would be at the parsing
> step. This means in the future we will not use an IRCBot as a
> harvester but an IRC log parser which parses the written logs in
> real time. The current IRCBot will have to move to a separate
> program/module (either merged with a simple [standard-format] log
> writer itself or a separate bot).

Ditto for processing the incoming mail, which is no longer essential
to generating the diff feed.  If I understand correctly, the diff feed
is now to be generated entirely by parsing the mailing list's web
archive?

> The URL parsing will happen for all data the bot parses, meaning
> that it will reparse any data which will be deployed in its archive
> again. We need some mailbox-like mechanism to have a "new" queue of
> logs and a "current" queue of already read logs. Moving them to the
> new queue will make the harvester reparse their data and update the
> db.

But the logs may be on a remote server.  When harvesting from a
mailing list, we'll be reading directly from that list's archive
(again, if I understand the design).  Same when harvesting from an IRC
channel?  We read directly from that channel's own logs?  Example:
http://irclog.perlgeek.de/perl6/2011-07-30

> Suggestions, criticism or simple questions? Tell me! :-)

I can picture 4 components, in priority order:

  1. Harvesters.  We code a separate harvester for every format of
     archive.  Each harvester has a command line interface (CLI) by
     which the admin can launch it.  It runs once and then terminates.
     It parses a specific archive, for a specific period of time:

       voharvester-pipermail --url=http://foo.de/ --start=2011-6

     It can also be launched via Java API, of course.

     This alone is sufficient to generate a diff feed, albeit
     manually.  The rest is just icing on the cake.

  2. Harvest prompter.  A control in crossforum theatre that requests
     an instant refresh via the feed's web API.  The feed service
     responds by launching the appropriate harvesters for the poll and
     replying asynchronously via JSONP, "I'm done".  The feed client
     then refreshes as it normally would.

     This component must eventually be disabled for all but
     admins/devs.  We don't want to poll the archives too frequently
     and get the harvesters banned.

     Problem:

         The feed service might want to consult the Count API in order
         to learn the crossforum extent of the poll (a counted
         resource), and thus discover exactly which archives need to
         be harvested for that particular poll.  But crossforum extent
         is not likely to be counted for *all* branches of the poll.
         It is only of interest for users near the leaves.  So
         although the Count API may be sufficient to discover *new*
         archives, it won't store a complete list including the old
         ones, many of which might be quite active downstream in the
         tree.  Somehow the diff code must keep its own list of
         archives for every poll.

         Adding to the problem, the admin must be able to point to a
         message in the diff feed (perhaps a duplicate of another
         message); learn why the feed is harvesting from that
         particular archive URL; and put a stop to it if it's
         inappropriate.

  3. Harvest timer.  This is a cron job that regularly calls the
     harvesters in order to discover new diff messages and to parse
     them.  Basically it implements automatic polling of the archives.
     We can probably get away with polling every 4 hours without
     having to worry too much about robots.txt.

     This component needn't be designed in advance.  It can be added
     at any time.

  4. Harvest kicker.  Sits on the mailing list or IRC channel and
     looks for diff URLs in the stream of incoming messages.  For each
     diff URL detected, it launches the appropriate harvester to
     parse the message and add it to the feed.

     This component needn't be designed in advance.  It can be added
     at any time.  Until it is added, the diff feed will not operate
     in real time.
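
To make component 1 a little more concrete, here is a rough sketch of
how a harvester's CLI might read its options.  Only the --url and
--start options come from the example above.  The class name
HarvesterArgs, its fields, and the error handling are hypothetical,
and the actual harvesting is left out.  Treat it as an illustration
under those assumptions, not as the design.

```java
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder for the options of a single harvester run.
final class HarvesterArgs {

    final String url;       // base URL of the archive to parse
    final YearMonth start;  // first month of the period to harvest

    HarvesterArgs(String url, YearMonth start) {
        this.url = url;
        this.start = start;
    }

    // Parses arguments of the form --name=value.
    static HarvesterArgs parse(String[] args) {
        Map<String, String> opts = new HashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (!arg.startsWith("--") || eq < 3) {
                throw new IllegalArgumentException(
                    "Expected --name=value, got: " + arg);
            }
            opts.put(arg.substring(2, eq), arg.substring(eq + 1));
        }
        String url = opts.get("url");
        String start = opts.get("start");
        if (url == null || start == null) {
            throw new IllegalArgumentException(
                "Both --url and --start are required");
        }
        // A single 'M' accepts one or two digit months: 2011-6 or 2011-06.
        YearMonth month = YearMonth.parse(
            start, DateTimeFormatter.ofPattern("uuuu-M"));
        return new HarvesterArgs(url, month);
    }

    public static void main(String[] args) {
        HarvesterArgs h = parse(args);
        // A real harvester would fetch and parse the archive here,
        // run once, and then terminate.
        System.out.println("Would harvest " + h.url + " from " + h.start);
    }
}
```

The same parse method could back the Java API, so that the prompter,
timer and kicker all launch a harvester the same way the admin does.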

I guess we had our priorities reversed in the first impl.  We didn't
realize that archive parsing is the crucial thing on which everything
else has to be built.

-- 
Michael Allan

Toronto, +1 416-699-9528
http://zelea.com/