Harvester Roadmap

conseo 4consensus at web.de
Thu May 10 20:39:09 EDT 2012


Hi,

> 
> > 1. Extend the PipermailHarvester to track any number of forums and not
> > just one.
> > 2. Track the state properly for each forum (internal to
> > PipermailHarvester).
> > 3. Make this state persistent on disk (internal to
> > PipermailHarvester).
> > 4. Design a proper SQL table layout for the gathered
> > messages (internal to DiffMessageTable).
> > 5. Configure the forums by querying the wiki on startup instead of
> > hardcoding them (PipermailHarvester). This should also happen from time to
> > time during runtime.

DONE (1). The new Configurator (2) loads all forum configurations from the wiki
(this might become a separate forum wiki later, as Mike has suggested, to ease
cross-project collaboration) and crawls all Pipermail archives.
I have also been able to get proper TCP session reuse in the HTTP client code
by using the newly released httpasyncclient from Apache (3). The current setup
of just a few lines might or might not hold up in the long run, but it fits our
use case nicely and is still accessible through the http-client/core APIs, so
thanks, Apache! HarvestRunner is now at least 100 lines smaller.
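
To illustrate points 2 and 3 of the quoted list (per-forum state, persisted on
disk), here is a minimal sketch using only the Java standard library. It is not
the actual PipermailHarvester code: the class name ForumStateStore, its
methods, and the idea of storing the last harvested archive month per forum URL
are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Hypothetical per-forum harvest state, kept in a properties file on disk. */
public class ForumStateStore {
    private final Path file;
    private final Properties state = new Properties();

    /** Loads any previously stored state from the given file, if it exists. */
    public ForumStateStore(Path file) throws IOException {
        this.file = file;
        if (Files.exists(file)) {
            try (InputStream in = Files.newInputStream(file)) {
                state.load(in);
            }
        }
    }

    /** Returns the last harvested archive month for a forum, or null if none. */
    public String lastHarvested(String forumUrl) {
        return state.getProperty(forumUrl);
    }

    /** Records progress for one forum and writes the whole state back to disk. */
    public void markHarvested(String forumUrl, String month) throws IOException {
        state.setProperty(forumUrl, month);
        try (OutputStream out = Files.newOutputStream(file)) {
            state.store(out, "harvest state (illustrative format)");
        }
    }
}
```

Each forum is keyed by its archive URL, so tracking another forum is just one
more key, and a restart picks up where the last run stopped.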

I have added votorola/a/diff/harvest/harvest-cache.js as an example config
file showing how one can crawl for all subdomains of zelea.com (this config
file is optional). Config updates (including update crawls) currently happen
every 3 hours (hardcoded).
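
For illustration, a periodic refresh like the 3-hour config update could be
scheduled roughly as follows with java.util.concurrent. This is only a sketch
under assumptions: ConfigRefresher and its methods are hypothetical names, and
the real code may schedule its updates differently.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper that re-runs a config reload at a fixed interval. */
public class ConfigRefresher {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    /** Runs the task once immediately, then again after each interval. */
    public ScheduledFuture<?> start(final Runnable task, long interval, TimeUnit unit) {
        Runnable guarded = new Runnable() {
            public void run() {
                try {
                    task.run();
                } catch (RuntimeException e) {
                    // An uncaught exception would cancel all future runs,
                    // so log it and keep the schedule alive.
                    System.err.println("config refresh failed: " + e);
                }
            }
        };
        return scheduler.scheduleWithFixedDelay(guarded, 0, interval, unit);
    }

    /** Stops the scheduler; no further refreshes will run. */
    public void stop() {
        scheduler.shutdownNow();
    }
}
```

With this, the hardcoded interval becomes a parameter, e.g.
start(reloadTask, 3, TimeUnit.HOURS), where reloadTask stands in for whatever
reloads the forum configurations and triggers the update crawls.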

I see two possible roads from here. 1) Develop a sample Detector for
Mailman/Maildir and hammer out the API for it, so we get <10s updates to the
feed. 2) Build the talk track and embed it in the HUD on all (or most) pages.
Since the benefits of 1) only play out if people actually see the messages
(assuming the current feed is only a demo and not used), and the Detector API
can still be hooked in later, I lean towards 2), which also seems to follow
our design philosophy. What do you think?

conseo

(1) http://votorola.polyc0l0r.net/hg/rev/552a9eec9d96

(2) 
http://votorola.polyc0l0r.net/javadoc/votorola/a/diff/harvest/Configurator.html
Or is ConfigDetector a better name, as it also emits Kicks and detects
something? I think it is still distinct from MaildirDetector or IRCDetector.

(3)
https://hc.apache.org/httpcomponents-asyncclient-dev/index.html
It is still in beta, but the API is supposed to be stable. I will upgrade it
once it is released. It builds upon the already used http-core, http-client,
and http-client-nio packages.


