Harvester Roadmap

Michael Allan mike at zelea.com
Sun Apr 15 23:38:10 EDT 2012


Hey C,

Thomas von der Elbe said:
> Hi Conseo, ... thanks for the update and the work you have been
> doing! ...

Thanks from me, too!  The work is easier when the effort is shared.

> > (2) the purpose of "parsed date" is unclear.  don't clients only
> > care about time of posting?

C said:
> That makes sense from the application/client perspective, but parsed
> date is much better for efficient access and caching. ...  Parsed
> date is the primary key, to sequentially query for new messages and
> cache them locally on the client. Sent date is not consistent as
> older messages can be added by a later back-crawl of a newly
> harvested archive.  The client simply keeps a newest_parsed_date
> marker for each poll and can then request all messages from then
> on, caching the data locally at the same time.

I see.  The only problem then (which also crops up elsewhere) is that
a timestamp is a bad primary key.  A primary key must be unique.
Maybe replace "parsed date" with an integer counter (call it "serial")
that is incremented for every new message added to the cache.  I think
the DB provides counters for this purpose.
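A rough sketch of the idea (using SQLite just for illustration; the table and column names are invented, not Votorola's actual schema):

```python
import sqlite3

# Sketch: "serial" as an auto-incrementing primary key, so the client can
# fetch only messages it has not yet seen.  All names are illustrative.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE message (
        serial      INTEGER PRIMARY KEY AUTOINCREMENT,  -- unique, monotonic
        parsed_date INTEGER NOT NULL,  -- when the harvester cached it
        sent_date   INTEGER NOT NULL,  -- when the author posted it
        summary     TEXT
    )""")

# A back-crawl may insert an *older* sent_date, but serial still advances,
# so the client's marker never misses it.
db.executemany(
    "INSERT INTO message (parsed_date, sent_date, summary) VALUES (?, ?, ?)",
    [(100, 100, "new post"), (101, 50, "old post found by back-crawl")])

# The client keeps only its newest seen serial and asks for anything later.
newest_seen = 0
rows = db.execute(
    "SELECT serial, summary FROM message WHERE serial > ? ORDER BY serial",
    (newest_seen,)).fetchall()
# rows contains both messages, in cache-entry order
```

Unlike a timestamp, the counter is guaranteed unique even when two messages are parsed in the same instant.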

> > "db will be ok, if we only request the newest parsed bites".
> > almost always client will request new messages, but why is
> > requesting old ones a problem?
>
> The client can cache in the same history "dimension" as the
> harvesting process and will still be largely in sync. (Note again:
> Newest posts still occur first on the client feed (sentDate).)  I am
> just worried about keeping this beast scalable, but maybe I am
> wrong? It is potentially the biggest load of the whole architecture,
> so some thoughts on this should be done imo.

In addition to specifying a "serial" number (don't send messages I
already saw) the client could specify a "time" bracket (only send what
was posted in the last month) and other filters (only messages for a
given poll, or from a given author, etc.).

Crucially, there is no connection between "serial" filtering, which is
based on order of entry in the cache, and the other filter parameters,
which are based on other criteria.  Each serves its own purpose.

For efficiency, you could look at adding indices to the table for some
of the other filters.  But knowing that's always possible, I wouldn't
worry too much about it up front.
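For instance, the serial marker and the content filters combine in a single query, and the secondary indices can be bolted on whenever they prove necessary (again a SQLite sketch with invented names):

```python
import sqlite3

# Sketch: "serial" marker plus independent content filters (time bracket,
# poll name).  Table, column and poll names are illustrative only.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE message (
    serial    INTEGER PRIMARY KEY AUTOINCREMENT,
    sent_date INTEGER NOT NULL,
    pollname  TEXT NOT NULL)""")
db.executemany("INSERT INTO message (sent_date, pollname) VALUES (?, ?)",
               [(10, "G/p/sandbox"), (20, "G/p/sandbox"), (30, "G/p/other")])

# Secondary indices make the non-serial filters cheap; they can be added
# later without changing the design.
db.execute("CREATE INDEX idx_sent ON message (sent_date)")
db.execute("CREATE INDEX idx_poll ON message (pollname)")

# "Don't resend what I already saw" plus "only recent posts of this poll".
newest_seen, since = 1, 15
rows = db.execute(
    """SELECT serial FROM message
       WHERE serial > ? AND sent_date >= ? AND pollname = ?
       ORDER BY serial""",
    (newest_seen, since, "G/p/sandbox")).fetchall()
```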

> > (6) "seperate table for the usernames and pollnames, because many
> > of them will be duplicates".  i don't understand.  what problem is
> > caused by single table for harvest cache?

> The database won't store duplicate entries efficiently. ...

I think the only problem here is the lack of a primary key.  If you
add a counter ("serial" or whatever), then it should be OK, right?

If duplication of author and poll names ever poses a problem for
storage space, we could deal with it.  I would aim for a simple design
that works, and let the minor problems float up and speak for
themselves later.  The trick to avoiding premature optimization is to
discriminate between the wicked problems that must be addressed in the
design and the trivial ones that can be left for rainy days.

> Do we need the difference really? It is core to the information, but not 
> really necessary on the client-side, right? We have then
> 
> TIMESTAMP: parsed_date
> TIMESTAMP: sent_date
> VARCHAR(150): summary
> VARCHAR(255): sender
> VARCHAR(255): addressee (opposite of difference, or should this be calculated in 
> servlet through voteserver from the difference and the sender only? Do we need 
> who is who? I am not sure because of the Biters and filtering.)
> VARCHAR(255): pollname
> VARCHAR(255): url (relative to base_url)
> VARCHAR(255): base_url (unique for each archive usually)
> INTEGER[2][2]: difference

I would avoid putting limits on the string lengths unless there's a
good reason *and* I know that longer strings are impossible.  For the
difference, I would use 4 separate integers.
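Putting those two suggestions together, the draft table might look roughly like this (SQLite sketch; the column names follow C's draft, but the serial key and the diff_* names are my own invention):

```python
import sqlite3

# Sketch of the proposed message table with the two changes applied: no
# arbitrary VARCHAR length caps (TEXT is unbounded in SQLite anyway) and
# the 2x2 "difference" stored as four separate integer columns.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE message (
        serial      INTEGER PRIMARY KEY AUTOINCREMENT,
        parsed_date INTEGER NOT NULL,
        sent_date   INTEGER NOT NULL,
        summary     TEXT,
        sender      TEXT NOT NULL,
        addressee   TEXT,
        pollname    TEXT NOT NULL,
        url         TEXT NOT NULL,   -- relative to base_url
        base_url    TEXT NOT NULL,   -- usually unique per archive
        diff_a1     INTEGER,         -- the four cells of the 2x2 difference
        diff_a2     INTEGER,
        diff_b1     INTEGER,
        diff_b2     INTEGER
    )""")
cols = [row[1] for row in db.execute("PRAGMA table_info(message)")]
```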

I guess most of these columns will be defined by reference to the
HarvestWAP javadocs?  Your javadoc server is currently down, so I'll
wait to see your next sketch before commenting further.

> > > 5. Configure the forums by querying the wiki on startup instead
> > > of hardcoding them (PipermailHarvester). This should also happen
> > > from time to time during runtime.
> 
> I thought about this and think it is best if the Harvesters
> automatically configure new forums on an event by a Controller,
> which reads the Wiki-Config and emits a one time NewForumKick once
> it finds new forums. Does that sound reasonable to you? The
> controller would also start the Detectors. It could also be a boot
> routine only and some separate background updater, but I think this
> does not diminish the life-config controller.
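(For reference, the proposed controller might be sketched like this; apart from "NewForumKick", which is C's term, every name here is invented:)

```python
from dataclasses import dataclass

# Sketch of C's proposal: a controller re-reads the wiki config at boot
# and periodically during runtime, and fires a one-time NewForumKick for
# each forum it has not seen before.
@dataclass(frozen=True)
class NewForumKick:
    forum_url: str

class Controller:
    def __init__(self, read_wiki_config, on_kick):
        self.read_wiki_config = read_wiki_config  # returns a set of forum URLs
        self.on_kick = on_kick                    # harvester callback
        self.known = set()

    def poll(self):
        """Called at boot and from time to time during runtime."""
        for url in self.read_wiki_config() - self.known:
            self.known.add(url)
            self.on_kick(NewForumKick(url))  # one-time per new forum

kicks = []
c = Controller(lambda: {"http://zelea.com/w/Concept:Forum"}, kicks.append)
c.poll()
c.poll()  # second poll finds nothing new, so no duplicate kick
```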

By config, you mean reading the properties of forum pages, and such?
http://zelea.com/w/Concept:Forum
Why not just take whatever the wiki cache provides?
http://zelea.com/project/votorola/_/javadoc/votorola/a/WikiCache.html

The cache is churned (at least on our reference server) on every trace
of the trust network.  That means your forum properties (archive URL,
etc) will rarely be more than 4 hours old.

-- 
Michael Allan

Toronto, +1 416-699-9528
http://zelea.com/


