Harvesting design

conseo 4consensus at web.de
Fri Jul 29 12:44:25 EDT 2011


Dear list readers,

some technical stuff ahead, but you might be interested in how to approach 
web services openly, allowing outreach without limiting the tooling to 
yourself. So feel free to comment...

Some technical design questions need to be solved to keep the harvesting 
flexible and allow us to keep references to the upstream communication data 
(URLs) working.

Outline of the harvester's tasks

The harvesters are responsible for parsing different communication channels for 
difference-related discussions and for generating a small "twitter-like" feed of 
discussions around differences in Votorola, to be presented in our cross-project 
Crossforum UI. This is really smart since: 

a) We don't need to show all the cluttered data flowing through the 
communication channels, but can easily filter for the few related discussions 
by checking for diff-URLs and by matching the authors of the related difference 
against the author/addressee of the message. This avoids clutter and lets us 
show only the really relevant discussions from general-purpose channels. 

And b) we only reference the external discussion and don't have to create our 
own communication medium, which would be less attractive than the places 
where people usually discuss related issues. People simply won't move to a new 
discussion medium without strong reasons, and even if they did, the move would 
fragment the discussion further, which is very counterproductive and hides the 
discussion even more. 
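The filtering rule in a) could be sketched roughly as follows. The diff-URL 
pattern and the message shape here are illustrative assumptions, not 
Votorola's actual formats:

```python
import re

# Hypothetical diff-URL pattern; a real deployment would match the
# vote server's actual difference URLs.
DIFF_URL = re.compile(r"https?://\S+/v/w/D\?[\w&=;.-]+")

def is_relevant(message, diff_authors):
    """Decide whether a channel message belongs in the difference feed.

    message      -- dict with 'author', 'addressees' and 'body' (assumed shape)
    diff_authors -- set of authors involved in known differences
    """
    # Relevant if the message cites a diff-URL outright...
    if DIFF_URL.search(message["body"]):
        return True
    # ...or if both sides of the exchange are parties to a known difference.
    people = {message["author"], *message["addressees"]}
    return len(people & diff_authors) >= 2
```

Everything else in the channel is simply dropped, which is what keeps the 
feed uncluttered.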

We also don't have to care about the further development of these media, since 
enough really well-designed and proven communication software (mailing lists, 
forums, and the like) is already available, and providers can choose which 
media to offer with their vote server. By dividing the problem correctly we 
allow free combination of web services without losing functionality, and we 
don't have to reinvent the wheel.

The current design decision

At the moment two different harvesters exist: one for mail and Mailman mailing 
lists, and another one as an IRCBot. The problem is case b) above. We want to 
link to the external discussion from Crossforum, yet we don't own or control 
the (web) services exposing it. This means a) we can't be sure a URL to the 
post is provided (Mailman, for example, does not export URLs to the web 
archive), so in at least some cases we have to scrape them; and b) URLs 
already stored in the DB are subject to change if the web provider of the 
communication medium changes the URL layout or even migrates the web interface 
to a new platform. This means we need the ability to regenerate URLs in the 
future. 
Now we have two different concepts:

a) The MailHarvester approach

We store all communication data in a mailbox and can simply rerun the 
harvester on the stored mail to update our database. Scraping happens while 
parsing the data, so it is very easy to fix the database by fixing the code 
and rerunning the harvester (even without service interruption). This has 
proved very comfortable for expanding or changing the code base without losing 
the history of discussions. It also keeps all data which might be necessary 
for scraping. Scraping is a very fragile process that may depend on any unique 
bit of information hidden in the communication data. Since future web UIs of 
alien services (which are the targets of the URLs) might need such information 
to be scraped successfully, it is risky to trust the data stored in the DB to 
be sufficient for the future, even if it is well designed.

b) The IRCHarvester approach

Have an active bot listening to the channel and allow it to talk back to 
the channel with useful information. In our case this means showing 
information and links for diff-URLs mentioned in channel, which fits IRC very 
well. We parse messages as they arrive, but don't store them. The problem is 
that past data is lost if we need to a) regenerate data for the difference 
feed, and b) especially, recalculate the URLs. So we lose history here, and we 
might miss data needed to recalculate/scrape a new URL.

Proposed solution (activity diagram attached)

The current harvester's functionality would sit at the parsing step. This 
means that in the future we will not use an IRCBot as a harvester, but an IRC 
log parser which parses the written logs in real time. The current IRCBot will 
have to move to a separate program/module (either merged with a simple 
[standard-format] log writer itself, or as a separate bot).
URL parsing will happen for all data the parser reads, meaning it will reparse 
any data redeployed into its archive. We need a mailbox-like mechanism with a 
"new" queue of logs and a "current" queue of already-read logs. Moving logs 
back to the new queue will make the harvester reparse their data and update 
the DB.
A side effect of this data mining: the vote-server provider can always 
generate her own archive from the stored data for historical reasons. This 
allows more aggressive strategies for linking in other services, as the admin 
is in total control of all (gathered) communication data and not only of the 
bites in the DB. All this communication happens on public channels, so there 
is no fundamental privacy issue here. The data might be subject to copyright 
or other restrictions, but that lies in the responsibility of the admin.

Alternative solution

Try to store as much relevant data in the DB as possible, and recalculate the 
URL if it is missing (deleted) with a helper utility from the data stored in 
the DB. This means we cannot correct wrong or limited data in the DB, but we 
will store less data and can avoid the extra storage step. We can also write 
integrated live agents for each channel which can talk back the queried diff-
data; this is only useful in very fast communication channels like IRC, though.

In both situations the recalculation of the URL will be scriptable, with sane 
defaults and examples in the config script, allowing immediate and individual 
fixes by the admin on URL changes, even while the diff-bite servlet is 
running.
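Such a scriptable recalculation hook could be as small as the sketch below. 
The pipermail-style template is purely illustrative (not Votorola's actual 
archive layout); the admin would override it in the config script whenever 
the archive's URL scheme changes:

```python
# Illustrative default; the admin swaps in the real archive's layout.
ARCHIVE_TEMPLATE = "http://mail.example.org/pipermail/{list}/{yyyymm}/{seq:06d}.html"

def url_for(record, template=ARCHIVE_TEMPLATE):
    """Regenerate an archive URL from fields kept in the DB.

    record is assumed to carry 'list', 'yyyymm' and 'seq' fields; which
    fields are actually needed depends on the target archive's scheme.
    """
    return template.format(list=record["list"],
                           yyyymm=record["yyyymm"],
                           seq=record["seq"])
```

Because the template is plain data, URLs can be rewritten in bulk after a 
layout change without touching the harvester code.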

Suggestions, criticism or simple questions? Tell me! :-)

c
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Harvester.png
Type: image/png
Size: 14055 bytes
Desc: not available
URL: <http://mail.zelea.com/list/votorola/attachments/20110729/4d4143c8/attachment.png>


