Harvesting design

conseo 4consensus at web.de
Sat Jul 30 12:30:32 EDT 2011


On Saturday, July 30, 2011, Michael Allan wrote:
> conseo wrote:
> > Proposed solution (activity diagram attached)
> > 
> > The current harvester's functionality would be at the parsing
> > step. This means in the future we will not use an IRCBot as a
> > harvester but an IRC log parser which parses the written logs in
> > real time. The current IRCBot will have to move to a separate
> > program/module (either merged with a simple [standard-format] log
> > writer itself or a separate bot).
> 
> Ditto for processing the incoming mail, which is no longer essential
> to generating the diff feed.  If I understand correctly, the diff feed
> is now to be generated entirely by parsing the mailing list's web
> archive?

Nope, it is to be generated from the raw archive in the maildir. The 
MailHarvester is already pretty much in the proposed shape. I'm afraid the web 
archive of Mailman is not sufficient, though: it does not expose mail addresses, 
for example, which are crucial for checking authorship via MailishUsername. I 
ran into this problem while scraping and had to match on the timestamp instead 
of the user's identity.
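
Roughly, the parsing step over the raw maildir looks like the sketch below. 
This is illustrative only, not the actual MailHarvester code: the class name 
and paths are made up, and it assumes JavaMail on the classpath.

  import java.io.File;
  import java.io.FileInputStream;
  import java.io.InputStream;
  import java.util.Properties;
  import javax.mail.Address;
  import javax.mail.Session;
  import javax.mail.internet.InternetAddress;
  import javax.mail.internet.MimeMessage;

  /** Illustrative sketch only, not the actual MailHarvester: walk the
    * maildir and pull out each message's sender address, which is what
    * an authorship check via MailishUsername needs and what the Mailman
    * web archive hides. */
  public class MaildirSketch
  {
      public static void main( String[] args ) throws Exception
      {
          File maildir = new File( args[0] ); // e.g. /home/list/Maildir
          Session session = Session.getInstance( new Properties() );
          for( String sub : new String[] { "cur", "new" } )
          {
              File[] files = new File( maildir, sub ).listFiles();
              if( files == null ) continue; // subdirectory missing

              for( File file : files )
              {
                  InputStream in = new FileInputStream( file );
                  try
                  {
                      MimeMessage m = new MimeMessage( session, in );
                      Address[] from = m.getFrom();
                      if( from == null || from.length == 0 ) continue;
                      System.out.println( ((InternetAddress)from[0]).getAddress()
                          + "  " + m.getSentDate() + "  " + m.getSubject() );
                  }
                  finally { in.close(); }
              }
          }
      }
  }
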
What I am proposing instead is that we (or a typical client program for the 
communication medium we want to access) store all the data ourselves as a first 
step, without losing any detail. Then we don't have to worry about future needs 
or adjustments to our database, because we can easily re-harvest such data 
pools (we can even exchange them between different vote-servers). The admin can 
also build his own archive if the web service is broken intentionally (a 
closed, locked-up service) or unintentionally (the usual churn of web services) 
by the service's admin.

But if we don't have a copy, what do we do if the web archive (e.g. 
Metagovernment's Mailman archive) goes off the net? We are left with only our 
meagre feed information, which is pretty much useless without a link to the 
full communication data. We would then have to remove the data from the DB, 
which is very bad :-( for our users and for the historical record of the polls.

> 
> > The URL parsing will happen for all data the bot parses, meaning
> > that it will reparse any data which will be deployed in its archive
> > again. We need some mailbox-like mechanism with a "new" queue of
> > logs and a "current" queue of already-read logs. Moving them to the
> > new queue will make the harvester reparse their data and update the
> > DB.
> 
> But the logs may be on a remote server.  When harvesting from a
> mailing list, we'll be reading directly from that list's archive
> (again, if I understand the design).  Same when harvesting from an IRC
> channel?  We read directly from that channel's own logs?  Example:
> http://irclog.perlgeek.de/perl6/2011-07-30

"Read" == scrape here, since the web representations are not equivalent 
representations of data, but rather merged with the visualization (If XML+XSLT 
would have succeeded like in a perfect open world, we might not have that 
problem, but web interfaces are closed and obfuscated by their layout 
representation). Scraping is much worse and flakey than to read from a well-
defined data format. For IRC this would for example mean to interpret the 
archives colors to understand what the message means or to lose channel 
metadata like joins or leaves (if the log bot does not expose them) and 
basically to guess all kind of admin adjustments (not to speak about 
adjustments between new versions of the archive generating program). 
While there are countless types of web renderings for IRC archives and even 
different IRC log formats, IRC itself is a plaintext format which can be simply 
written to disk line by line.
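
To illustrate what I mean, such a log writer is little more than the following. 
This is a minimal sketch, not a proposal for the real bot; the server, nick, 
channel and log path are placeholders.

  import java.io.BufferedReader;
  import java.io.FileWriter;
  import java.io.InputStreamReader;
  import java.io.OutputStreamWriter;
  import java.io.PrintWriter;
  import java.net.Socket;
  import java.text.SimpleDateFormat;
  import java.util.Date;

  /** Illustrative sketch only: sit on a channel and append every raw
    * protocol line to a dated plaintext log, one line per line of IRC
    * traffic.  A real bot would wait for the server's 001 welcome
    * before joining, and handle reconnects. */
  public class IrcLogWriterSketch
  {
      public static void main( String[] args ) throws Exception
      {
          Socket socket = new Socket( "irc.example.net", 6667 );
          BufferedReader in = new BufferedReader(
              new InputStreamReader( socket.getInputStream(), "UTF-8" ));
          PrintWriter out = new PrintWriter(
              new OutputStreamWriter( socket.getOutputStream(), "UTF-8" ), true );
          out.println( "NICK vologger" );
          out.println( "USER vologger 0 * :Votorola log writer" );
          out.println( "JOIN #votorola" );
          SimpleDateFormat day = new SimpleDateFormat( "yyyy-MM-dd" );
          for( String line = in.readLine(); line != null; line = in.readLine() )
          {
              if( line.startsWith( "PING " )) // keep the connection alive
              {
                  out.println( "PONG " + line.substring( 5 ));
                  continue;
              }
              FileWriter log = new FileWriter(
                  "/var/log/irc/votorola." + day.format( new Date() ) + ".log",
                  /*append*/true );
              try { log.write( line + "\n" ); }
              finally { log.close(); }
          }
      }
  }
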

> 
> > Suggestions, criticism or simple questions? Tell me! :-)
> 
> I can picture 4 components, in priority order:
> 
>   1. Harvesters.  We code a separate harvester for every format of
>      archive.  Each harvester has a command line interface (CLI) by
>      which the admin can launch it.  It runs once and then terminates.
>      It parses a specific archive, for a specific period of time:
> 
>        voharvester-pipermail --url=http://foo.de/ --start=2011-6
> 
>      It can also be launched via Java API, of course.
> 
>      This alone is sufficient to generate a diff feed, albeit
>      manually.  The rest is just icing on the cake.

You really like the split-programs approach, but I feel uncomfortable with it 
here. The harvesters are daemons (actually a single one, in my view) and will 
only be restarted once in a while (likely every few months). You have also 
already created an init script, so I don't see the benefit of writing the 
daemons as a (potentially large) set of command-line utilities. If we change 
the daemon approach to something started by cron every 15 minutes (for 
example), we only gain drawbacks: the data is no longer live; the harvester has 
to be smart enough to find its resume point in the data (e.g. --start=2011-5 is 
not enough, or all data since May will be processed again on every run); and 
memory consumption rises because the harvesters run as separate processes 
(separate JVMs) and may be started in parallel, which means temporarily high 
CPU load. This can be tuned, of course, but where is the benefit of separate 
executables?
Note: the code is modularized anyway and runs in completely separate threads, 
which are configured via a single ecmascript configuration. If we imagine a 
future setup of ten harvesters, it is also likely that the admin will share 
configuration between them with some helper routines in ecmascript (as we 
already do for wiki access).
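
Roughly what I picture for the single daemon is sketched below. The Harvester 
interface and the two stub classes are invented for this example; the real set 
of harvesters would come from the ecmascript configuration.

  import java.util.Arrays;
  import java.util.List;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;

  /** Illustrative sketch only: one daemon, one JVM, each harvester in
    * its own thread, so the data stays live without cron. */
  public class HarvestDaemonSketch
  {
      /** A harvester runs until shutdown, tailing its archive. */
      interface Harvester extends Runnable {}

      public static void main( String[] args )
      {
          // both stubs are placeholders for real harvester modules
          List<Harvester> harvesters = Arrays.<Harvester>asList(
              new MailHarvesterStub( "/home/list/Maildir" ),
              new IrcHarvesterStub( "/var/log/irc" ));

          ExecutorService pool = Executors.newFixedThreadPool( harvesters.size() );
          for( Harvester h : harvesters ) pool.execute( h );
          // pool.shutdown() would be called when the daemon is stopped
      }

      static class MailHarvesterStub implements Harvester
      {
          MailHarvesterStub( String maildir ) { this.maildir = maildir; }
          private final String maildir;
          public void run() { /* tail the maildir and update the DB */ }
      }

      static class IrcHarvesterStub implements Harvester
      {
          IrcHarvesterStub( String logDir ) { this.logDir = logDir; }
          private final String logDir;
          public void run() { /* tail the IRC logs and update the DB */ }
      }
  }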

> 
>   2. Harvest prompter.  A control in crossforum theatre that requests
>      an instant refresh via the feed's web API.  The feed service
>      responds by launching the appropriate harvesters for the poll and
>      replying asynchronously via JSONP, "I'm done".  The feed client
>      then refreshes as it normally would.

Why not harvest all the time and be greedy? :-) If you have web archives (i.e. 
scraping) in mind, then triggering an update might make some sense, since 
otherwise we would have to poll the data regularly. But then users will face 
outdated data until the admin (or a separate kicker program, which means an 
additional live client anyway, as you mention) triggers an update.

> 
>      This component must eventually be disabled for all but
>      admins/devs.  We don't want to poll the archives too frequently
>      and get the harvesters banned.
> 
>      Problem:
> 
>          The feed service might want to consult the Count API in order
>          to learn the crossforum extent of the poll (a counted
>          resource), and thus discover exactly which archives need to
>          be harvested for that particular poll.  But crossforum extent
>          is not likely to be counted for *all* branches of the poll.
>          It is only of interest for users near the leaves.  So
>          although the Count API may be sufficient to discover *new*
>          archives, it won't store a complete list including the old
>          ones, many of which might be quite active downstream in the
>          tree.  Somehow the diff code must keep its own list of
>          archives for every poll.
> 
>          Adding to the problem, the admin must be able to point to a
>          message in the diff feed (perhaps a duplicate of another
>          message); learn why the feed is harvesting from that
>          particular archive URL; and put a stop to it if it's
>          inappropriate.
> 
>   3. Harvest timer.  This is a cron job that regularly calls the
>      harvesters in order to discover new diff messages and to parse
>      them.  Basically it implements automatic polling of the archives.
>      We can probably get away with polling every 4 hours without
>      having to worry too much about robots.txt.

If we have our own dump, we can digest it in any way we like, even live.
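
For illustration only (the file name, sleep interval and parse() hook are made 
up): a tail-style reader over the local dump lets the parsing step run live 
instead of on a cron schedule.

  import java.io.BufferedReader;
  import java.io.FileReader;

  /** Illustrative sketch only: follow the local log file as it grows
    * (tail -f style) and hand each new line to the parsing step.  A
    * real implementation would also handle partial last lines and
    * log rotation. */
  public class LiveDigestSketch
  {
      public static void main( String[] args ) throws Exception
      {
          BufferedReader in = new BufferedReader(
              new FileReader( "/var/log/irc/votorola.current.log" ));
          for( ;; )
          {
              String line = in.readLine();
              if( line == null ) { Thread.sleep( 1000 ); continue; } // wait for more
              parse( line ); // stands in for the harvester's parsing step
          }
      }

      static void parse( String line ) { System.out.println( "parsed: " + line ); }
  }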

> 
>      This component needn't be designed in advance.  It can be added
>      at any time.
> 
>   4. Harvest kicker.  Sits on the mailing list or IRC channel and
>      looks for diff URLs in the stream of incoming messages.  For each
>      diff URL detected, it launches the appropriate harvester to
>      parse the message and add it to the feed.
> 
>      This component needn't be designed in advance.  It can be added
>      at any time.  Until it is added, the diff feed will not operate
>      in real time.
> 
> I guess we had our priorities reversed in the first impl.  We didn't
> realize that archive parsing is crucial thing on which everything else
> has to be built.

Yep, the archive is important, especially since the historical development is 
crucial for mutual understanding and for studying the consensus-making effort.

What do you think about my perspective? Feel free to ping me on IRC (conseo on 
#votorola) if you have any questions.

c
