Harvesting design
conseo
4consensus at web.de
Fri Jul 29 12:44:25 EDT 2011
Dear list readers,
some technical stuff is ahead, but you might be interested in how to
approach web services openly, allowing outreach without limiting the tooling
to yourself. So feel free to comment...
Some technical design questions need to be solved to keep the harvesting
flexible and to ensure that our references to the upstream communication data
(URLs) keep working.
Outline of the harvester's tasks
The harvesters are responsible for parsing different communication channels
for difference-related discussions and for generating a small "twitter-like"
feed of discussions around differences in Votorola, to be presented in our
cross-project Crossforum UI. This is a smart approach because:
a) We don't need to show all of the cluttered data flowing through the
communication channels; we can easily filter for the few related discussions
by matching diff-URLs, and by matching the authors of the related difference
against the author/addressee of the message. This avoids clutter and lets us
show only the really relevant discussions in general-purpose channels.
And b) we only reference the external discussion and don't have to create our
own communication medium, which would be less attractive than the places
where people usually discuss related issues. They simply won't move to our
discussion medium; they never do without strong reasons, and even if they
did, it would fragment the discussion further, which is very
counterproductive and hides the discussion even more.
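The filtering rule in a) could be sketched roughly like this; the diff-URL
shape, the message fields and is_relevant() are purely illustrative
assumptions, not Votorola's actual API:

```python
import re

# Hypothetical sketch of the relevance filter from a): keep a message only if
# it mentions a diff-URL, or if its author/addressee matches an author of a
# tracked difference. The URL pattern and field names are illustrative only.
DIFF_URL = re.compile(r"http://\S+/v/w/D\?\S+")

def is_relevant(message, tracked_authors):
    """Return True if the message belongs in the difference feed."""
    if DIFF_URL.search(message["body"]):
        return True
    participants = {message.get("author"), message.get("addressee")}
    return not participants.isdisjoint(tracked_authors)
```

Everything else that floats through the channel falls through this filter and
never reaches the feed, which is what keeps the Crossforum view uncluttered.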
We also don't have to develop or maintain these media ourselves, since plenty
of well-designed and proven communication software such as mailing lists and
forums is already available, letting you choose which media to provide with
your vote server. By dividing the problem correctly we allow free combination
of web services without losing functionality, and we don't have to reinvent
the wheel.
The current design decision
At the moment two different harvesters exist: one for mail and Mailman
mailing lists, and another one as an IRC bot. The problem is case b) above.
We want to link to the external discussion from Crossforum, yet we don't own
or control the (web) services exposing it. This means a) we can't be sure a
URL to the post is provided (Mailman, for example, does not export URLs to
the web archive), so we have to scrape them at least in some cases, and
b) URLs already stored in the DB are subject to change if the web provider of
the communication medium decides to change the URL layout or even migrates
the web interface to a new platform. This means we need the ability to
regenerate URLs in the future.
Now we have two different concepts:
a) The MailHarvester approach
We store all communication data in a mailbox and can simply rerun the
harvester on the stored mail data to update our database. Scraping happens
while parsing the data, and it is very easy to fix the database by fixing the
code and rerunning the harvester (even without service interruption). This
has proven very comfortable for expanding or changing the code base without
losing the history of discussions. It also keeps all data which could
potentially be necessary for scraping. Scraping is a very fragile process
which may depend on any piece of unique information hidden in the
communication data. Since future web UIs of alien services (which are the
target of the URLs) might need such information to be scraped successfully,
it is risky to trust that the data stored in the DB will be sufficient for
the future, even if it is well designed.
b) The IRCHarvester approach
Have an active bot listening to the channel and allow it to talk back to
the channel to provide some useful information. In our case this means
showing information and links for diff-URLs mentioned in the channel, which
fits IRC very well. We parse messages as they arrive, but don't store them.
The problem is that past data is lost if we need to a) regenerate data for
the difference feed or b) especially, recalculate the URLs. So we lose
history here, and we might miss data needed to recalculate/scrape a new URL.
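The back-talk behaviour amounts to something like the following sketch, where
lookup_diff() and the URL shape are hypothetical placeholders for the real
Votorola calls:

```python
import re

# Minimal sketch of approach b): when a diff-URL appears in a channel
# message, the bot answers with a one-line summary. Nothing is stored, which
# is exactly the weakness discussed above.
DIFF_URL = re.compile(r"http://\S+/v/w/D\?\S+")

def on_channel_message(nick, text, lookup_diff, say):
    """React to one incoming IRC message, then forget it."""
    for url in DIFF_URL.findall(text):
        info = lookup_diff(url)   # e.g. the authors and position titles
        say(nick + ": " + info)   # reply in channel
```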
Proposed solution (activity diagram attached)
The current harvesters' functionality would sit at the parsing step. This
means that in the future we will not use an IRC bot as a harvester, but an
IRC log parser which parses the written logs in real time. The current IRC
bot will have to move to a separate program/module (either merged with a
simple [standard-format] log writer itself, or a separate bot).
URL parsing will happen for all data the harvester reads, meaning that it
will reparse any data that is placed in its archive again. We need some
mailbox-like mechanism with a "new" queue of logs and a "current" queue of
already-read logs. Moving logs back to the new queue will make the harvester
reparse their data and update the DB.
A side effect of this data mining: the vote-server provider can always
generate her own archive for historical reasons from the stored data. This
allows more aggressive strategies for linking in other services, as the admin
is in total control of all (gathered) communication data and not only of the
bites in the DB. All this communication happens on public channels, so there
is no fundamental privacy issue here. The data might be subject to copyright
or other restrictions, but that lies in the responsibility of the admin.
Alternative solution
Try to store as much relevant data in the DB as possible and, if the URL is
missing (deleted), recalculate it with a helper utility from the data stored
in the DB. This means we cannot correct wrong or limited data in the DB, but
we will store less data and can avoid the extra storage step. We can also
write integrated live agents for each channel which can talk back the queried
diff-data. This is only useful in very fast communication channels like IRC,
though.
In both cases the recalculation of the URL will be scriptable, with sane
defaults and examples in the config script, allowing immediate, individual
solutions by the admin on URL changes, even while the diff-bite servlet is
running.
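A scriptable recalculation could be as small as a template the admin edits in
the config script; the template shape, host name and field names below are
made up for illustration:

```python
# Sketch of the URL-recalculation hook: the admin supplies a template (or a
# small function) in the config script, and it is called whenever a stored
# URL is missing or stale. The pipermail-style layout here is an assumption.
DEFAULT_TEMPLATE = "http://mail.example.org/list/{list}/{date}/{digest}.html"

def recalculate_url(record, template=DEFAULT_TEMPLATE):
    """Rebuild an archive URL from the data stored in the DB."""
    return template.format(**record)
```

If the archive provider changes its URL layout, the admin only edits the
template, and links served by the diff-bite servlet are correct again
immediately.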
Suggestions, criticism or simple questions? Tell me! :-)
c
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Harvester.png
Type: image/png
Size: 14055 bytes
Desc: not available
URL: <http://mail.zelea.com/list/votorola/attachments/20110729/4d4143c8/attachment.png>