Harvester Roadmap

conseo 4consensus at web.de
Tue Apr 17 15:11:57 EDT 2012


Hey M,

> 
> Thomas von der Elbe said:
> > Hi Conseo, ... thanks for the update and the work you have been
> > doing! ...
> 
> Thanks from me, too!  The work is easier when the effort is shared.

Thanks back to you. It is amazing that we still stick with Votorola just 
because we know it has a unique design.

> 
> I see.  The only problem then (which also crops up elsewhere) is that
> a timestamp is a bad primary key.  A primary key must be unique.
> Maybe replace "parsed date" with an integer counter (call it "serial")
> that is incremented for every new message added to the cache.  I think
> the DB provides counters for this purpose.

You are right. We then just use an integer primary key internally and make 
parsed_date a mandatory query parameter. If we used sent_date, we couldn't 
cache the information, because that range is never finished: it can always be 
extended by new harvests of old messages. We do want to cache all queries on 
the client, though (for this static harvest data). We will still return the 
data ordered by sent_date, so that a multi-fetch query gets the newest posts 
first. Does that make sense to you?
The client will fetch something like startDate=PREVMONTH, endDate=new Date() 
for the current feed, until it has enough messages to fill the track (and in 
the future it will do the same for backward browsing).
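
To make that concrete, here is a rough sketch of the kind of table and query 
I mean. It is only a sketch: harvest_cache and its column names are 
placeholders, not the final schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.Timestamp;

/** Rough sketch only; harvest_cache and its columns are placeholders,
  * not the actual schema.
  */
public final class HarvestCacheSketch
{
    public static void main( final String[] args ) throws Exception
    {
        try( final Connection c = DriverManager.getConnection(
            "jdbc:postgresql://localhost/harvest", "user", "password" ))
        {
            try( final Statement s = c.createStatement() )
            {
                // serial is the internal primary key; parsed_date and
                // sent_date are what the client actually queries on.
                s.execute( "CREATE TABLE IF NOT EXISTS harvest_cache"
                    + " (serial SERIAL PRIMARY KEY,"
                    + "  parsed_date TIMESTAMP NOT NULL,"
                    + "  sent_date TIMESTAMP NOT NULL,"
                    + "  poll_name VARCHAR,"
                    + "  author_name VARCHAR,"
                    + "  content VARCHAR)" );
            }
            // A fetch for the current feed: everything parsed inside the
            // requested bracket, ordered with the newest sent_date first.
            try( final PreparedStatement q = c.prepareStatement(
                "SELECT * FROM harvest_cache"
                + " WHERE parsed_date >= ? AND parsed_date < ?"
                + " ORDER BY sent_date DESC" ))
            {
                q.setTimestamp( 1, Timestamp.valueOf( "2012-03-17 00:00:00" )); // startDate
                q.setTimestamp( 2, new Timestamp( System.currentTimeMillis() )); // endDate
                try( final ResultSet r = q.executeQuery() )
                {
                    while( r.next() ) System.out.println( r.getTimestamp( "sent_date" ));
                }
            }
        }
    }
}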

> In addition to specifying a "serial" number (don't send messages I
> already saw) the client could specify a "time" bracket (only send what
> was posted in the last month) and other filters (only messages for a
> given poll, or from a given author, etc.).

The client should cache the messages for queries like [pollname][username] as 
well, so the parsed date will apply in the same way. There are also caches for 
[] (all polls) and for just [pollname].
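
Just to pin down the bracket notation, a hypothetical sketch of the 
client-side cache key; the class and method names are made up.

/** Hypothetical sketch of the client-side cache key.  Each filter
  * combination maps to one string key: "" for [], "poll23" for
  * [pollname], "poll23/alice" for [pollname][username].
  */
public final class QueryCacheKey
{
    private QueryCacheKey() {}

    public static String of( final String pollName, final String userName )
    {
        if( pollName == null ) return "";
        return userName == null ? pollName : pollName + "/" + userName;
    }
}
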
> 
> Crucially there is no connection between "serial" filtering which is
> based on order of entry in the cache, and the other filter parameters
> which are based on other criteria.  Each serves its own purposes.
> 
> For efficiency, you could look at adding indeces to the table for some
> of the other filters.  But knowing that's always possible, I wouldn't
> worry too much about it up front.

Yes, especially for the parsed dates a b-tree index makes sense imo. 
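
Concretely, something like this; again a sketch reusing the placeholder 
harvest_cache table from above. Postgres builds a b-tree by default, which is 
what range scans on parsed_date need.

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

/** Sketch only; reuses the hypothetical harvest_cache table from the
  * earlier example.
  */
public final class HarvestCacheIndexes
{
    private HarvestCacheIndexes() {}

    /** One-time setup; fails if the index already exists. */
    public static void createParsedDateIndex( final Connection c ) throws SQLException
    {
        try( final Statement s = c.createStatement() )
        {
            s.execute( "CREATE INDEX harvest_cache_parsed_date_idx"
                + " ON harvest_cache (parsed_date)" );
        }
    }
}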

> 
> I think the only problem here is the lack of a primary key.  If you
> add a counter ("serial" or whatever), then it should be OK, right?

For the DB, yes. For harvesting we should ensure that the same message never 
lands in the DB twice. Until now I have done that with an MD5 hash of key 
values, but I will remove that and instead update the row by those core 
parameters.
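
Roughly like this. It is only a sketch: which columns count as the "core 
parameters" is an assumption here (poll_name, author_name and sent_date stand 
in for them), and a UNIQUE constraint over those columns would be the real 
guard against duplicates.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;

/** Sketch only; assumes the hypothetical harvest_cache table from above. */
public final class HarvestCacheStore
{
    private HarvestCacheStore() {}

    public static void store( final Connection c, final String poll,
        final String author, final Timestamp sent, final Timestamp parsed,
        final String content ) throws SQLException
    {
        // First try to refresh an existing row...
        try( final PreparedStatement u = c.prepareStatement(
            "UPDATE harvest_cache SET parsed_date = ?, content = ?"
            + " WHERE poll_name = ? AND author_name = ? AND sent_date = ?" ))
        {
            u.setTimestamp( 1, parsed );
            u.setString( 2, content );
            u.setString( 3, poll );
            u.setString( 4, author );
            u.setTimestamp( 5, sent );
            if( u.executeUpdate() > 0 ) return; // message was already cached
        }
        // ... otherwise insert it as a new row.
        try( final PreparedStatement i = c.prepareStatement(
            "INSERT INTO harvest_cache"
            + " (poll_name, author_name, sent_date, parsed_date, content)"
            + " VALUES (?, ?, ?, ?, ?)" ))
        {
            i.setString( 1, poll );
            i.setString( 2, author );
            i.setTimestamp( 3, sent );
            i.setTimestamp( 4, parsed );
            i.setString( 5, content );
            i.executeUpdate();
        }
    }
}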

> 
> If dupication of author and poll names ever poses a problem for
> storage space, we could deal with it.  I would aim for a simple design
> that works, and let the minor problems float up and speak for
> themselves later.  The trick to avoiding premature optimization is to
> discriminate between the wicked problems that must be addressed in the
> design and the trivial ones that can be left for rainy days.
> 
Ok, it is easy to separate these columns into their own tables. I just 
haven't yet written any software which updates the schema.

> 
> I would avoid putting limits on the string lengths unless there's a
> good reason *and* I know that longer strings are impossible.  For the
> difference, I would use 4 separate integers.
Ok, VARCHAR and INTEGER[4] it is, and we do need the difference information.
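
In table terms roughly this, as a placeholder sketch only; the diff_* names 
carry no meaning yet, they just reserve room for the four integers of 
difference information, and strings stay as plain, unbounded VARCHAR.

/** Placeholder sketch only; extends the hypothetical harvest_cache table. */
public final class HarvestCacheDiffColumns
{
    private HarvestCacheDiffColumns() {}

    public static final String ADD_DIFF_COLUMNS =
        "ALTER TABLE harvest_cache"
        + " ADD COLUMN diff_1 INTEGER,"
        + " ADD COLUMN diff_2 INTEGER,"
        + " ADD COLUMN diff_3 INTEGER,"
        + " ADD COLUMN diff_4 INTEGER";
}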

> 
> I guess most of these columns will be defined by reference to the
> HarvestWAP javadocs?  Your javadoc server is currently down, so I'll
> wait to see your next sketch before commenting further.
Back online, but it has not changed. We might add hSentDateStart and 
hSentDateEnd to allow another query dimension, but I don't see a use case for 
it yet(?).


> By config, you mean reading the properties of forum pages, and such?
> http://zelea.com/w/Concept:Forum
> Why not just take whatever the wiki cache provides?
> http://zelea.com/project/votorola/_/javadoc/votorola/a/WikiCache.html
> 
> The cache is churned (at least on our reference server) on every trace
> of the trust network.  That means your forum properties (archive URL,
> etc) will rarely be more than 4 hours old.

Sure, that is possible. But since the harvesters are "daemons" on account of 
the detectors, we need some live configurator which pulls that information 
from the WikiCache.
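
Something like this is what I mean by a live configurator. It is only a 
sketch: the ForumProperties type and readForumProperties() are placeholders, 
not the real WikiCache API.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Rough sketch of a live configurator for the harvester daemons. */
public final class LiveHarvestConfig
{
    /** Placeholder for the forum properties (archive URL etc.) as read
      * from the cached wiki pages.
      */
    public static final class ForumProperties
    {
        public ForumProperties( final String archiveUrl ) { this.archiveUrl = archiveUrl; }
        public final String archiveUrl;
    }

    private final AtomicReference<ForumProperties> current =
        new AtomicReference<ForumProperties>( readForumProperties() );

    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    public LiveHarvestConfig()
    {
        // Re-read the cached pages every few hours, roughly matching how
        // often the wiki cache itself is churned on the reference server.
        scheduler.scheduleAtFixedRate( new Runnable()
        {
            public void run() { current.set( readForumProperties() ); }
        }, 4, 4, TimeUnit.HOURS );
    }

    /** The latest forum properties, as seen by the long-running detectors. */
    public ForumProperties properties() { return current.get(); }

    private static ForumProperties readForumProperties()
    {
        // Placeholder: the real code would ask the WikiCache for the
        // Concept:Forum pages and parse the archive URL out of them.
        return new ForumProperties( "http://example.com/archive" );
    }
}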

conseo


