Harvester Roadmap

Michael Allan mike at zelea.com
Sat May 12 09:10:46 EDT 2012


conseo said:
> > I do not think we should poll the archives, not even as a fall
> > back.  It's maybe okay for initializing the forums, especially the
> > inactive ones. ...
> 
> Not only maybe, history recovery was the whole design rationale for
> harvesting (crawling). We can only omit harvesting all archives in
> the back-end if we have a 100% reliable DiffKick source, which can
> even point us to the time frame of the message, because we then will
> only burst back until we found it and leave the rest of the archive
> alone (most of it). Yet for history recovery we have no choice
> except using Google or some other off-site index to only crawl our
> messages containing a diff-url.

If I understand, I agree.  We definitely need to harvest from inactive
archives that reference diffs.  Ordinary detectors will not send kicks
for these because there is no activity to detect.  Therefore we need a
special kind of detector to bootstrap the harvesting of inactive
archives.

We also need it to promptly re-harvest from *active* archives in the
event the cache is cleared.

But this does not require polling the archives.  Once a back-harvest
is done, it stays done.

> > ... the difference will resolve to the drafters, the drafters to
> > the candidate, and the candidate to the forum!  ...  All these
> > expensive resolutions will have to be skipped ofc for redundant
> > kicks.  That probably means the determination of redundancy must
> > be a function of the difference key itself.
>
> Yes, this intersection is possible and it is a smart idea. Yet
> author and addressee, can share many forums, at least 5 is
> reasonable I guess, candidate won't do, because the relation between
> both drafters can be covoters or something else. ...

Let's define co-voters as voters of the same candidate.  We can assume
that all messages between members of a group (co-voters + candidate)
are communicated in the forums defined by the candidate.  That's maybe
1-3 forums.  So 1-3 harvesters would respond to each kick and usually
only one would find anything new.  But the others would update their
archive markers.  No problem.

That's for intra-group communications.  For *inter*-group
communications, the simplest thing is not to harvest them at all.
That means messages between persons in different branches or trees,
or separated by one or more delegates in the same branch (e.g. a
voter talking to her candidate's candidate).  We ignore these, because
it's currently too complicated or inefficient to harvest them quickly.
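
To make the intra-group / inter-group distinction concrete, here is a
minimal sketch in Java.  It assumes the votes are available as a plain
map from voter to candidate; the class GroupRelation, the method
isIntraGroup and the use of String names for persons are all
hypothetical and only for illustration.

```java
import java.util.Map;

/** Hypothetical helper; not part of any existing API. */
final class GroupRelation
{
    /**
     * Tests whether persons a and b belong to the same group, where a
     * group is a candidate plus that candidate's voters (co-voters).
     * The map 'votes' (voter to candidate) is an assumed representation.
     */
    static boolean isIntraGroup( Map<String,String> votes, String a, String b )
    {
        String ca = votes.get( a ); // candidate of a, or null if a casts no vote
        String cb = votes.get( b ); // candidate of b, or null if b casts no vote
        if( b.equals( ca ) || a.equals( cb )) return true; // voter and own candidate
        return ca != null && ca.equals( cb );               // co-voters
    }
}
```

With this rule a voter and her candidate's candidate come out as
inter-group, which matches the example above.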

> ... Do you have an idea to get closer to the true forum? The only
> thing I came up with is referrer-id, which is not good enough imo,
> because drafters won't follow links from the archive and it is not
> reliable in any sense (browsers can deactivate or fake it).

But it looks like complexity is the price of proximity.  Let's try for
a simpler solution.  The only cost appears to be a constraint on what
kind of messages we can harvest.  If we cannot harvest a given type of
message quickly, then probably we should not harvest that type at all.
So no inter-group messages (above), and no diff echoes (below).

> >   boolean isHarvested( DiffKey diff )

I'm now thinking we need something a little more sophisticated than
that.  Maybe:

  HarvestState getHarvestState( DiffKey diff )

   where HarvestState is one of:
     1.  unknown

     2a. harvesting
     2b. ignoring

     3a. harvested
     3b. error

It's crucial for efficiency that every potential kick (every DiffKey)
quickly resolve to a definite state.  To know whether to send a kick,
the kicker looks at the current state:

    if state is "unknown"
        if person-person relation is inter-group
            setHarvestState( "ignoring" )
        else kick all harvesters of candidate

The harvesters that respond set the state to "harvesting" at once, so
they are well shielded from redundant kicks for the same key.
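
The state lookup and the kicker's rule might be sketched in Java as
follows.  This is only an illustration under assumptions: the in-memory
map, the class KickFilter, the method shouldKick and the String-based
DiffKey are hypothetical stand-ins, and the inter-group test is passed
in as a boolean rather than computed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** The states listed above. */
enum HarvestState { UNKNOWN, HARVESTING, IGNORING, HARVESTED, ERROR }

/** Hypothetical stand-in for the real difference key. */
final class DiffKey
{
    private final String id;
    DiffKey( String id ) { this.id = id; }
    @Override public boolean equals( Object o )
    {
        return o instanceof DiffKey && ((DiffKey)o).id.equals( id );
    }
    @Override public int hashCode() { return id.hashCode(); }
}

/** Hypothetical state table plus the kicker's decision rule. */
final class KickFilter
{
    private final Map<DiffKey,HarvestState> states = new ConcurrentHashMap<>();

    /** A key with no recorded state is "unknown". */
    HarvestState getHarvestState( DiffKey diff )
    {
        return states.getOrDefault( diff, HarvestState.UNKNOWN );
    }

    void setHarvestState( DiffKey diff, HarvestState state ) { states.put( diff, state ); }

    /** Returns true if the kicker should kick all harvesters of the candidate. */
    boolean shouldKick( DiffKey diff, boolean isInterGroup )
    {
        if( getHarvestState(diff) != HarvestState.UNKNOWN ) return false; // redundant kick
        if( isInterGroup )
        {
            setHarvestState( diff, HarvestState.IGNORING ); // never harvested
            return false;
        }
        return true; // a responding harvester is expected to set HARVESTING
    }
}
```

A harvester that responds would call setHarvestState( diff,
HarvestState.HARVESTING ), after which shouldKick returns false for
that key.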

> Each key can occur in any number of messages (e.g. if it is a
> substantial difference at the core of debate over a long period of
> time). We cannot store the necessary information in the URL, because
> it can be referenced from many messages.

Maybe we shouldn't harvest the echoes.  Consider two messages:

  (1)  Blah blah?
       DIFF-URL-1

  (2)  > Blah blah?
       > DIFF-URL-1

       Blah blah blah!

We don't have a rapid detector for message (2), or for any message
that re-references the same difference.  So let's not harvest those
messages at all.  The talk track will only show messages that concern
new differences.  The user can browse the archive to follow the entire
discussion.  Some problems with this:

  a) Since harvesting generally proceeds backwards, the harvester must
     not store a message in the cache till it completes its backward
     walk and removes all echoes.

     But that's too complicated.  A better approach might be to allow
     echoes to be *subsequently* marked "ignore" in the cache.  This
     would add some complexity to the API.  The client must somehow
     learn when a previously fetched message is subsequently marked
     "ignore".

  b) If the same difference is later discussed in a separate forum,
     then the message might not be harvested quickly.  As a rule, we
     probably should not harvest it at all.  That means all but the
     earliest message are marked "ignore".

     The problem is, this is a little weird.  The users might wonder
     why the message is not showing up.
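
The rule that all but the earliest message for a difference are marked
"ignore" could be sketched in Java like this.  It is a rough
illustration only: the classes CachedMessage and EchoCache, the use of
the diff URL as key and the sent time as ordering are assumptions, and
a real cache would need some way to notify clients.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical cache entry for a harvested message. */
final class CachedMessage
{
    final String diffUrl; // the difference the message references
    final long sentTime;  // assumed ordering of messages
    boolean ignored;      // may be set after the message was first stored

    CachedMessage( String diffUrl, long sentTime )
    {
        this.diffUrl = diffUrl;
        this.sentTime = sentTime;
    }
}

/** Hypothetical cache that keeps only the earliest message per diff visible. */
final class EchoCache
{
    private final Map<String,CachedMessage> earliest = new HashMap<>();

    /**
     * Stores m.  Returns the message that became ignored as a result
     * (m itself, or a previously stored later message), or null if none.
     */
    CachedMessage store( CachedMessage m )
    {
        CachedMessage e = earliest.get( m.diffUrl );
        if( e == null ) // first message seen for this diff
        {
            earliest.put( m.diffUrl, m );
            return null;
        }
        if( m.sentTime < e.sentTime ) // backward walk found an earlier one
        {
            e.ignored = true; // the stored message turns out to be an echo
            earliest.put( m.diffUrl, m );
            return e;
        }
        m.ignored = true; // m is itself an echo
        return m;
    }
}
```

Because harvesting proceeds backwards, the common case is the middle
branch: a message already in the cache is later found to be an echo.
The return value is one possible way for the client to learn of it.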

-- 
Michael Allan

Toronto, +1 416-699-9528
http://zelea.com/


