Harvester Roadmap

Michael Allan mike at zelea.com
Thu Apr 19 10:34:41 EDT 2012


So the cache has this web API:
http://zelea.com/project/votorola/_/javadoc/votorola/s/wap/HarvestWAP.html

And this structure:

> id        serial primary key,
> author    character varying, 
> poll_name character varying,
> summary   character varying,
> diff_key  integer array[4],
> base_url  character varying,
> url       character varying,
> sent_ts   timestamp with time zone,
> parsed_ts timestamp with time zone

diff_key is actually 4 integers, not an array.  I would use 4 integer
columns for it.
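For example, a rough sketch in Python (sqlite3 standing in for PostgreSQL here; the diff_key_* column names are just my guesses):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE harvest_cache (
        id         INTEGER PRIMARY KEY,   -- 'serial primary key' in PostgreSQL
        author     TEXT,                  -- 'character varying'
        poll_name  TEXT,
        diff_key_1 INTEGER,               -- the four parts of diff_key,
        diff_key_2 INTEGER,               -- one column each instead of
        diff_key_3 INTEGER,               -- an integer array[4]
        diff_key_4 INTEGER
    )""")
db.execute(
    "INSERT INTO harvest_cache"
    " (author, poll_name, diff_key_1, diff_key_2, diff_key_3, diff_key_4)"
    " VALUES (?, ?, ?, ?, ?, ?)",
    ("alice", "G/p/sandbox", 10, 11, 12, 13))

# Separate columns let each part of the key be matched directly:
row = db.execute(
    "SELECT author FROM harvest_cache WHERE diff_key_1 = ? AND diff_key_2 = ?"
    " AND diff_key_3 = ? AND diff_key_4 = ?", (10, 11, 12, 13)).fetchone()
print(row)  # ('alice',)
```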

Why base_url and url?  Don't you need only one for Message.location?
http://zelea.com/project/votorola/_/javadoc/votorola/s/gwt/stage/Message.html

> ... We then just use a primary key as integer internally and use
> parsed_date as a mandatory query parameter. If we use sent_date, we
> can't cache the information, because the range is not finished and
> can be extended by new harvests of old messages, we want to cache
> all queries on the client though (for this static harvest data).
> ...

It looks like you're using "parsed" dates as primary keys in the API
(to prevent duplicates on the wire), but the primary key is actually
'id'.  Why not use 'id' and delete parsed_ts from the structure?
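Concretely (same sqlite3 sketch, table trimmed to two columns), the client could remember the highest 'id' it has seen and ask only for newer rows, which makes parsed_ts redundant as a duplicate guard:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE harvest_cache (id INTEGER PRIMARY KEY, summary TEXT)")
db.executemany("INSERT INTO harvest_cache (summary) VALUES (?)",
               [("first",), ("second",), ("third",)])

last_seen = 1  # high-water mark remembered by the client
new_rows = db.execute(
    "SELECT id, summary FROM harvest_cache WHERE id > ? ORDER BY id",
    (last_seen,)).fetchall()
print(new_rows)  # [(2, 'second'), (3, 'third')]
```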

> ... We will just return the data ordered by sent-date though, then
> we get the newest posts of the query first in a multi-fetch
> query. Does that make sense to you?  ...

It seems you must.  I don't know how the feed client works, but surely
it wants the N latest messages by send date, and not the rest.
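That's just a sort and a limit.  A sketch (sqlite3 again; ISO timestamps sort chronologically as text, so they stand in for 'timestamp with time zone'):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE harvest_cache (id INTEGER PRIMARY KEY, sent_ts TEXT)")
db.executemany("INSERT INTO harvest_cache (sent_ts) VALUES (?)",
               [("2012-04-01T10:00:00Z",),
                ("2012-04-19T09:00:00Z",),
                ("2012-03-15T12:00:00Z",)])

N = 2  # size of the feed client's buffer
latest = [r[0] for r in db.execute(
    "SELECT sent_ts FROM harvest_cache ORDER BY sent_ts DESC LIMIT ?",
    (N,))]
print(latest)  # ['2012-04-19T09:00:00Z', '2012-04-01T10:00:00Z']
```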

Once it's all tested and working, you can add an index on sent_ts, so
your sorts don't take forever.  Grep the code for 'CREATE INDEX' and
you'll see an example.
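The statement itself is the same in PostgreSQL (modulo types); in the sqlite3 sketch, the planner then satisfies the sort from the index:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE harvest_cache (id INTEGER PRIMARY KEY, sent_ts TEXT)")
db.execute("CREATE INDEX harvest_cache_sent_ts ON harvest_cache (sent_ts)")

# The plan shows the sort being served by the index, not a full sort:
plan = db.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM harvest_cache ORDER BY sent_ts DESC"
).fetchall()
print(plan)
```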

> Ok, it is easy to separate these columns in separate tables. I just
> haven't rolled software which updates the schema.

You could delete the table(s) instead of updating, because it's only a
cache.  Updating can be left as an optimization.
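In other words, a schema change can just rebuild the table on startup (sketch; the function name is my invention):

```python
import sqlite3

def rebuild_cache(db):
    # Because the table is only a cache, drop and recreate it rather
    # than migrating the schema in place.
    db.execute("DROP TABLE IF EXISTS harvest_cache")
    db.execute("""
        CREATE TABLE harvest_cache (
            id        INTEGER PRIMARY KEY,
            author    TEXT,
            poll_name TEXT
        )""")

db = sqlite3.connect(":memory:")
rebuild_cache(db)
rebuild_cache(db)  # idempotent: safe even when the table already exists
```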

> ... We might add hSentDateStart, hSentDateEnd to allow other query
> dimensions, but I don't see a use case for it yet(?).

It depends on the client's algorithm.  Assuming the current diff feed
knows how to fill its buffer (with N latest messages) then you can
just provide it whatever parameters it needs for that.
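If the bracket parameters do turn out to be needed, they are only a range condition on sent_ts (sketch; hSentDateStart/hSentDateEnd are the names you proposed):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE harvest_cache (id INTEGER PRIMARY KEY, sent_ts TEXT)")
db.executemany("INSERT INTO harvest_cache (sent_ts) VALUES (?)",
               [("2012-02-01T00:00:00Z",),
                ("2012-03-20T00:00:00Z",),
                ("2012-04-10T00:00:00Z",)])

# hSentDateStart / hSentDateEnd become an inclusive bracket, newest first:
start, end = "2012-03-01T00:00:00Z", "2012-04-19T00:00:00Z"
rows = [r[0] for r in db.execute(
    "SELECT sent_ts FROM harvest_cache"
    " WHERE sent_ts BETWEEN ? AND ? ORDER BY sent_ts DESC",
    (start, end))]
print(rows)  # ['2012-04-10T00:00:00Z', '2012-03-20T00:00:00Z']
```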

> > By config, you mean reading the properties of forum pages, and such?
> > http://zelea.com/w/Concept:Forum
> > Why not just take whatever the wiki cache provides?
> > http://zelea.com/project/votorola/_/javadoc/votorola/a/WikiCache.html
> > 
> > The cache is churned (at least on our reference server) on every
> > trace of the trust network.  That means your forum properties
> > (archive URL, etc) will rarely be more than 4 hours old.
> 
> Sure that is possible. But since the harvesters are "daemons"
> because of the detectors, we need some live configurator pulling the
> information from the WikiCache.

Just look at this.  If it changes (which it does every 4 hours), then
any parsed RDF you have lying around should be re-parsed:
http://zelea.com/project/votorola/_/javadoc/votorola/a/WikiCache.html#lastChurnTime()

Later (if needed), we can expose the timestamps for each page in the
wiki cache, and reparse on those instead.
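The idea in Python (lastChurnTime() itself is real, per the javadoc above; the helper and its timestamps here are my invention):

```python
def should_reparse(last_parsed_at, last_churn_time):
    """Re-parse whenever the cache has churned since our last parse.

    last_parsed_at  -- when the daemon last parsed, or None if never
    last_churn_time -- hypothetical stand-in for WikiCache.lastChurnTime()
    """
    return last_parsed_at is None or last_churn_time > last_parsed_at

print(should_reparse(None, 1000))   # True: never parsed yet
print(should_reparse(1000, 1000))   # False: nothing churned since
print(should_reparse(1000, 15400))  # True: churned ~4 h later
```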

(I don't understand your last question below, and will ask you to
explain by IRC.)

-- 
Michael Allan

Toronto, +1 416-699-9528
http://zelea.com/


conseo said:
> Hey M,
> 
> > 
> > Thomas von der Elbe said:
> > > Hi Conseo, ... thanks for the update and the work you have been
> > > doing! ...
> > 
> > Thanks from me, too!  The work is easier when the effort is shared.
> 
> Thanks to you back. It is amazing that we still stick to Votorola just because 
> we know it has a unique design.
> 
> > 
> > I see.  The only problem then (which also crops up elsewhere) is that
> > a timestamp is a bad primary key.  A primary key must be unique.
> > Maybe replace "parsed date" with an integer counter (call it "serial")
> > that is incremented for every new message added to the cache.  I think
> > the DB provides counters for this purpose.
> 
> You are right. We then just use a primary key as integer internally and use 
> parsed_date as a mandatory query parameter. If we use sent_date, we can't 
> cache the information, because the range is not finished and can be extended 
> by new harvests of old messages, we want to cache all queries on the client 
> though (for this static harvest data). We will just return the data ordered by 
> sent-date though, then we get the newest posts of the query first in a multi-
> fetch query. Does that make sense to you?
> The client will fetch something from startDate=PREVMONTH endDate=new Date() 
> for the current feed until it has enough messages to fill the track (and in 
> the future with backward browsing as well).
> 
> > In addition to specifying a "serial" number (don't send messages I
> > already saw) the client could specify a "time" bracket (only send what
> > was posted in the last month) and other filters (only messages for a
> > given poll, or from a given author, etc.).
> 
> The client should cache the messages for queries like [pollname][username] as 
> well, so parsed date will apply the same way. There are also caches for [] all 
> polls and just [pollname].
> > 
> > Crucially there is no connection between "serial" filtering which is
> > based on order of entry in the cache, and the other filter parameters
> > which are based on other criteria.  Each serves its own purposes.
> > 
> > For efficiency, you could look at adding indices to the table for some
> > of the other filters.  But knowing that's always possible, I wouldn't
> > worry too much about it up front.
> 
> Yes, especially for the parsed dates a b-tree index makes sense imo. 
> 
> > 
> > I think the only problem here is the lack of a primary key.  If you
> > add a counter ("serial" or whatever), then it should be OK, right?
> 
> For the DB, yes. For harvesting we should ensure that the same message never 
> lands in the DB twice. I have done that by using a md5 hash of key values 
> until now, but I will remove that and update the row by those core parameters.
> 
> > 
> > If duplication of author and poll names ever poses a problem for
> > storage space, we could deal with it.  I would aim for a simple design
> > that works, and let the minor problems float up and speak for
> > themselves later.  The trick to avoiding premature optimization is to
> > discriminate between the wicked problems that must be addressed in the
> > design and the trivial ones that can be left for rainy days.
> > 
> Ok, it is easy to separate these columns in separate tables. I just haven't 
> rolled software which updates the schema.
> 
> > 
> > I would avoid putting limits on the string lengths unless there's a
> > good reason *and* I know that longer strings are impossible.  For the
> > difference, I would use 4 separate integers.
> Ok, VARCHAR and INTEGER[4] and we need the difference information.
> 
> > 
> > I guess most of these columns will be defined by reference to the
> > HarvestWAP javadocs?  Your javadoc server is currently down, so I'll
> > wait to see your next sketch before commenting further.
> Back online, but it is not changed. We might add hSentDateStart, hSentDateEnd 
> to allow other query dimensions, but I don't see a use case for it yet(?).
> 
> 
> > By config, you mean reading the properties of forum pages, and such?
> > http://zelea.com/w/Concept:Forum
> > Why not just take whatever the wiki cache provides?
> > http://zelea.com/project/votorola/_/javadoc/votorola/a/WikiCache.html
> > 
> > The cache is churned (at least on our reference server) on every trace
> > of the trust network.  That means your forum properties (archive URL,
> > etc) will rarely be more than 4 hours old.
> 
> Sure that is possible. But since the harvesters are "daemons" because of the 
> detectors, we need some live configurator pulling the information from the 
> WikiCache.
> 
> conseo


conseo said:
> Hey M,
> 
> I need some help.
> 
> Ok, the following columns are clear so far:
> 
> id serial primary key,
> author character varying, 
> poll_name character varying,
> summary character varying,
> diff_key integer array[4],
> base_url character varying,
> url character varying,
> sent_ts timestamp with time zone,
> parsed_ts timestamp with time zone
> 
> 
> votorola.a.diff.harvest.Message.sender() is not the
> same as author, because it refers to the nick in the
> archive. Author is the mailish username, parsed only if the message is 
> proven to be a valid diff message. 
> 
> Now my problem is, should I add
> 
> addressee character varying,
> author_is_candidate boolean
> 
> which is in fact also present in the difference information. It is needed for 
> the feed to properly function (the below-bubble-mnemonic order) or should I 
> parse this in the servlet (bad performance wise as we need another query for 
> the difference on *each* request) or fetch completely separate in the client 
> and cache there (the difference information for each key it deals with). 
> 
> Is it ok to add them?
> 
> "hUser" parameter in HarvestWAP would query both addressee and author.
> 
> conseo
