Harvesting design
Michael Allan
mike at zelea.com
Sat Jul 30 06:26:17 EDT 2011
conseo wrote:
> Proposed solution (activity diagram attached)
>
> The current harvester's functionality would be at the parsing
> step. This means in the future we will not use an IRCBot as a
> harvester but an IRC log parser which parses the written logs in
> real time. The current IRCBot will have to move to a separate
> program/module (either merged with a simple [standard-format] log
> writer itself or a separate bot).
Ditto for processing the incoming mail, which is no longer essential
to generating the diff feed. If I understand correctly, the diff feed
is now to be generated entirely by parsing the mailing list's web
archive?
> The URL parsing will happen for all data the bot parses, meaning
> that it will reparse any data which will be deployed in its archive
> again. We need some mailbox like mechanism to have a new queue of
> logs and "current" queue of already read logs. Moving them to the
> new queue will make the harvester reparse their data and update the
> db.
But the logs may be on a remote server. When harvesting from a
mailing list, we'll be reading directly from that list's archive
(again, if I understand the design). Same when harvesting from an IRC
channel? We read directly from that channel's own logs? Example:
http://irclog.perlgeek.de/perl6/2011-07-30
> Suggestions, criticism or simple questions? Tell me! :-)
I can picture 4 components, in priority order:
1. Harvesters. We code a separate harvester for every format of
archive. Each harvester has a command line interface (CLI) by
which the admin can launch it. It runs once and then terminates.
It parses a specific archive, for a specific period of time:
voharvester-pipermail --url=http://foo.de/ --start=2011-6
It can also be launched via Java API, of course.
This alone is sufficient to generate a diff feed, albeit
manually. The rest is just icing on the cake.
2. Harvest prompter. A control in crossforum theatre that requests
an instant refresh via the feed's web API. The feed service
responds by launching the appropriate harvesters for the poll and
replying asynchronously via JSONP, "I'm done". The feed client
then refreshes as it normally would.
This component must eventually be disabled for all but
admins/devs. We don't want to poll the archives too frequently
and get the harvesters banned.
Problem:
The feed service might want to consult the Count API in order
to learn the crossforum extent of the poll (a counted
resource), and thus discover exactly which archives need to
be harvested for that particular poll. But crossforum extent
is not likely to be counted for *all* branches of the poll.
It is only of interest for users near the leaves. So
although the Count API may be sufficient to discover *new*
archives, it won't store a complete list including the old
ones, many of which might be quite active downstream in the
tree. Somehow the diff code must keep its own list of
archives for every poll.
Adding to the problem, the admin must be able to point to a
message in the diff feed (perhaps a duplicate of another
message); learn why the feed is harvesting from that
particular archive URL; and put a stop to it if it's
inappropriate.
3. Harvest timer. This is a cron job that regularly calls the
harvesters in order to discover new diff messages and to parse
them. Basically it implements automatic polling of the archives.
We can probably get away with polling every 4 hours without
having to worry too much about robots.txt.
This component needn't be designed in advance. It can be added
at any time.
4. Harvest kicker. Sits on the mailing list or IRC channel and
looks for diff URLs in the stream of incoming messages. For each
diff URL detected, it launches the appropriate harvester to
parse the message and add it to the feed.
This component needn't be designed in advance. It can be added
at any time. Until it is added, the diff feed will not operate
in real time.
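For the problem noted under component 2, here is a minimal sketch of how
the diff code might keep its own list of archives for every poll, along
with the reason each archive was added, so that an admin can ask why an
archive is being harvested and put a stop to it. It is in Python purely
for illustration. All of the names (ArchiveEntry, ArchiveRegistry, add,
why, stop, to_harvest) and the poll name "poll-1" are hypothetical;
nothing like this exists in the code yet.

```python
from dataclasses import dataclass, field


@dataclass
class ArchiveEntry:
    url: str
    reason: str           # why this archive is harvested (assumed free text)
    enabled: bool = True  # an admin can switch this off


@dataclass
class ArchiveRegistry:
    # poll name -> {archive URL -> ArchiveEntry}
    _by_poll: dict = field(default_factory=dict)

    def add(self, poll, url, reason):
        """Record an archive for a poll, keeping the first reason if known."""
        entries = self._by_poll.setdefault(poll, {})
        return entries.setdefault(url, ArchiveEntry(url, reason))

    def why(self, poll, url):
        """Return the recorded reason, or None if the archive is unknown."""
        entry = self._by_poll.get(poll, {}).get(url)
        return entry.reason if entry else None

    def stop(self, poll, url):
        """Disable harvesting of one archive; return False if unknown."""
        entry = self._by_poll.get(poll, {}).get(url)
        if entry is None:
            return False
        entry.enabled = False
        return True

    def to_harvest(self, poll):
        """List the archive URLs still enabled for a poll."""
        entries = self._by_poll.get(poll, {})
        return [e.url for e in entries.values() if e.enabled]
```

A harvest prompter or timer would call to_harvest() for the poll and
launch one harvester per URL returned. Old archives stay in the list
until an admin calls stop(), whether or not the Count API still reports
them.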
I guess we had our priorities reversed in the first impl. We didn't
realize that archive parsing is the crucial thing on which everything else
has to be built.
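Since the run-once harvester of component 1 is the foundation, here is a
rough sketch of what its command line might look like. It is in Python
only to keep the sketch short; the real thing would presumably be Java.
The --url and --start options come from the example above. Everything
else is an assumption: the function names, the choice to harvest every
month from --start through the current month, and harvest_month(), which
is a placeholder that parses nothing.

```python
import argparse
from datetime import date


def parse_start(text):
    """Parse a 'YYYY-M' value such as '2011-6' into (year, month)."""
    year_text, month_text = text.split("-", 1)
    year, month = int(year_text), int(month_text)
    if not 1 <= month <= 12:
        raise argparse.ArgumentTypeError(f"month out of range: {text}")
    return year, month


def months_from(start, end):
    """Yield (year, month) pairs from start to end, inclusive."""
    year, month = start
    while (year, month) <= end:
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def build_parser():
    parser = argparse.ArgumentParser(prog="voharvester-pipermail")
    parser.add_argument("--url", required=True,
                        help="base URL of the archive")
    parser.add_argument("--start", required=True, type=parse_start,
                        help="first month to harvest, as YYYY-M")
    return parser


def harvest_month(base_url, year, month):
    """Placeholder: a real harvester would fetch and parse the archive."""
    print(f"would harvest {base_url} for {year}-{month:02d}")


def main(argv=None, today=None):
    """Run once over the requested period, then return."""
    args = build_parser().parse_args(argv)
    today = today or date.today()
    # Assumption: the period runs from --start through the current month.
    periods = list(months_from(args.start, (today.year, today.month)))
    for year, month in periods:
        harvest_month(args.url, year, month)
    return periods
```

Calling main(["--url=http://foo.de/", "--start=2011-6"]) corresponds to
the command line shown under component 1. The same main() could serve
the prompter, timer and kicker, which is the sense in which the other
components are built on this one.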
--
Michael Allan
Toronto, +1 416-699-9528
http://zelea.com/