Harvester Roadmap

conseo 4consensus at web.de
Tue May 15 12:27:43 EDT 2012


Michael Allan wrote:
> 
> We also need it to promptly re-harvest from *active* archives in the
> event the cache is cleared.
> 
> But this does not require polling the archives.  Once a back-harvest
> is done, it stays done.

True. Polling was only meant to keep the marker close to the present. If we
only follow quick bursts back from a timestamped event, we don't need polling
at all, because we no longer need to harvest the whole archive (all
diff-messages including echoes). Polling by itself only costs two fetches and
a bit of regexp, though, so it is no big deal. I think the expensive part is
harvesting whole forums in general once we track thousands of them.
 
> Let's define co-voters as voters of the same candidate.  We can assume
> that all messages between members of a group (co-voters + candidate)
> are communicated in the forums defined by the candidate.  That's maybe
> 1-3 forums.  

Hmm, it could also be twenty for a professional organizer/organisation, or if
the person gets support. We could also use the intersection of voter and
candidate forums, which will reduce these worst cases. In some cases the
referrer will also be helpful, btw. (mainly for web forums), but we can add
these heuristics later.

> So 1-3 harvesters would respond to each kick and usually
> only one would find anything new.  But the others would update their
> archive markers.  No problem.
> 
> That's for intra-group communications.  For *inter*group
> communications, the simplest thing is not to harvest them at all.  That means
> messages between persons in different branches or trees, or separated
> by one or more delegates in the same branch (e.g. a voter talking to
> her candidate's candidate).  We ignore these, because it's currently
> too complicated or inefficient to harvest them quickly.

Ok, so basically you strip the event source for DiffKick down to something
reliable, which we can map to some forums. +1

> 
> But it looks like complexity is the price of proximity.  Let's try for
> a simpler solution.  The only cost appears to be a constraint on what
> kind of messages we can harvest.  If we cannot harvest a given type of
> message quickly, then probably we should not harvest that type at all.
> So no inter-group messages (above), and no diff echoes (below).
> 
> > >   boolean isHarvested( DiffKey diff )
> 
> I'm now thinking we need something a little more sophisticated than
> that.  Maybe:
> 
>   HarvestState getHarvestState( DiffKey diff )
> 
>    where HarvestState is one of:
>      1.  unknown
> 
>      2a. harvesting
>      2b. ignoring
> 
>      3a. harvested
>      3b. error

Ok, sounds reasonable. The Kicker will keep track of the keys at runtime,
while it can query HarvestCache about harvested diff keys.
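
A rough sketch of how the state lookup could look in Java. This is only an
illustration: HarvestStateTable is a made-up name, the diff key is modelled as
a plain String instead of a real DiffKey, and nothing here is existing
Votorola code.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// The five states from the quoted list (1, 2a, 2b, 3a, 3b).
enum HarvestState { UNKNOWN, HARVESTING, IGNORING, HARVESTED, ERROR }

// Hypothetical runtime table kept by the kicker (assumption, not real API).
final class HarvestStateTable
{
    private final ConcurrentMap<String,HarvestState> states =
      new ConcurrentHashMap<>();

    // A diff key that was never recorded resolves to UNKNOWN.
    HarvestState getHarvestState( String diffKey )
    {
        return states.getOrDefault( diffKey, HarvestState.UNKNOWN );
    }

    // Records the state for a diff key, replacing any earlier one.
    void setHarvestState( String diffKey, HarvestState state )
    {
        states.put( diffKey, state );
    }
}
```

The point of the default is that every key resolves to a definite state
without a lookup in the cache; only UNKNOWN keys need further work.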
 
> 
> It's crucial for efficiency that every potential kick (every DiffKey)
> quickly resolve to a definite state.  To know whether to send a kick,
> the kicker looks at the current state:
> 
>     if state is "unknown"
>         if person-person relation is inter-group
>             setHarvestState( "ignoring" )
>         else kick all harvesters of candidate
> 
> The harvesters that respond immediately set the state to harvesting.
> So they'll be well shielded from redundant kicks.

I agree.
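
To make the kicker rule quoted above concrete, here is a minimal sketch. It
simplifies one thing: the kicker itself sets "harvesting" when it sends the
kick, whereas in your description the responding harvesters do that. All
names (KickerSketch, KickState, onDiff, kicksSent) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

enum KickState { UNKNOWN, HARVESTING, IGNORING, HARVESTED, ERROR }

final class KickerSketch
{
    private final Map<String,KickState> states = new HashMap<>();

    // Stands in for the real kicks to the candidate's harvesters.
    final List<String> kicksSent = new ArrayList<>();

    synchronized KickState getHarvestState( String diffKey )
    {
        return states.getOrDefault( diffKey, KickState.UNKNOWN );
    }

    // Applies the rule to one potential kick; returns true if a kick was sent.
    synchronized boolean onDiff( String diffKey, boolean interGroup )
    {
        KickState next = interGroup? KickState.IGNORING: KickState.HARVESTING;

        // putIfAbsent stores the state only if the key is still unknown,
        // so a key that already has a state is never kicked again.
        if( states.putIfAbsent( diffKey, next ) != null ) return false;

        if( interGroup ) return false; // ignored, no kick
        kicksSent.add( diffKey );
        return true;
    }
}
```

Calling onDiff twice with the same key sends only one kick, which is the
shielding from redundant kicks you describe.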

> 
> > Each key can occur in any number of messages (e.g. if it is a
> > substantial difference at the core of debate over a long period of
> > time). We cannot store the necessary information in the URL, because
> > it can be referenced from many messages.
> 
> Maybe we shouldn't harvest the echoes.  Consider two messages:
> 
>   (1)  Blah blah?
>        DIFF-URL-1
> 
>   (2)  > Blah blah?
> 
>        > DIFF-URL-1

Quotes are already stripped out *before* searching for diff-urls, so these
echoes have never been in the table (except through a bug in the beginning).
Note also that there are different differences within one diff-key, which
might be discussed in different forum threads, because changes to drafts won't
necessarily happen incrementally. We will miss these discussion updates, too.
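
For illustration, stripping quotes before the search could look roughly like
this. It is not the harvester's actual code: DiffUrlScanner is a made-up
name, and the URL pattern is only a placeholder, because the real diff-url
format is not spelled out in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DiffUrlScanner
{
    // A line counts as quoted if it starts with optional whitespace and '>'.
    private static final Pattern QUOTED_LINE = Pattern.compile( "^\\s*>" );

    // Placeholder (assumption): any http(s) URL whose path contains "/diff".
    private static final Pattern DIFF_URL =
      Pattern.compile( "https?://\\S*/diff\\S*" );

    // Returns the diff URLs found in the unquoted lines of a message body.
    static List<String> findDiffUrls( String body )
    {
        List<String> urls = new ArrayList<>();
        for( String line: body.split( "\\R" ))
        {
            if( QUOTED_LINE.matcher(line).find() ) continue; // skip the echo
            Matcher m = DIFF_URL.matcher( line );
            while( m.find() ) urls.add( m.group() );
        }
        return urls;
    }
}
```

With this, your message (2) yields no hit at all, because its only diff-url
sits in a quoted line.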

> 
>        Blah blah blah!
> 
> We don't have a rapid detector for message (2), or for any message
> that re-references the same difference.  So let's not harvest those
> messages at all.  The talk track will only show messages that concern
> new differences.  The user can browse the archive to follow the entire
> discussion.  Some problems with this:
> 
>   a) Since harvesting generally proceeds backwards, the harvester must
>      not store a message in the cache till it completes its backward
>      walk and removes all echoes.
> 
>      But that's too complicated.  A better approach might be to allow
>      echoes to be *subsequently* marked "ignore" in the cache.  This
>      would add some complexity to the API.  The client must somehow
>      learn when a previously fetched message is subsequently marked
>      "ignore".

This is no problem. We simply go back until the timestamp (+x) of the DiffKick.
If we find earlier hits, we simply overwrite the DB entry (update where the
diff-key matches and the new sent-date is earlier than the stored one). At the
end of the burst we will have the right message, and for the HarvestState of
the diff-key this is not relevant. The diff-key would be the key of the table
in this design as well.
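
A small sketch of that backward walk, with an in-memory map standing in for
the DB table. Everything here is assumed for illustration: the names
(BurstSketch, Msg, harvestBurst), the reading of "+x" as a slack subtracted
from the kick time, and sent-dates as plain long values.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BurstSketch
{
    // Hypothetical record of one message that references a diff-key.
    static final class Msg
    {
        final String diffKey; final long sentDate; final String id;
        Msg( String diffKey, long sentDate, String id )
        {
            this.diffKey = diffKey; this.sentDate = sentDate; this.id = id;
        }
    }

    // Stand-in for the table, keyed by diff-key.
    final Map<String,Msg> table = new HashMap<>();

    // Walks messages from newest to oldest and stops at the first one older
    // than kickTime - slack.  Returns the number of messages visited.
    int harvestBurst( List<Msg> newestFirst, long kickTime, long slack )
    {
        int visited = 0;
        for( Msg m: newestFirst )
        {
            if( m.sentDate < kickTime - slack ) break; // end of the burst
            ++visited;
            // Overwrite only if this hit was sent earlier than the stored one.
            table.merge( m.diffKey, m,
              (old, neu) -> neu.sentDate < old.sentDate? neu: old );
        }
        return visited;
    }
}
```

Because the walk goes backwards, a later echo is stored first and then
replaced by the earlier original, so at the end of the burst the table holds
the earliest message per diff-key.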

> 
>   b) If the same difference is later discussed in a separate forum,
>      then the message might not be harvested quickly.  As a rule, we
>      probably should not harvest it at all.  That means all but the
>      earliest message are marked "ignore".
> 
>      The problem is, this is a little weird.  The users might wonder
>      why the message is not showing up.

Yes. Alternatively we can guarantee that the first message for a diff-key is
updated in <10s, while the other messages are picked up by polling/updates in
the background (e.g. after a burst for a diff-key has cooled down, we update
the rest of the archive in the background). This still allows the HarvestState
reaction as you drafted it, if I haven't missed a problem.
The only potential problem is that we update all forums for each kick, but if
we want to track all messages reliably in this concept, we have to do that
anyway. Forums that don't get a kick can still be polled by the ConfigKick
updates every 3 hours, so we don't miss too many messages and have a constant
load. We can also leave polling out of this concept completely, as you
proposed, which means some diff-keys on some forums won't be tracked reliably
(e.g. inter-group messages, as you pointed out, or messages where no diff-kick
happens, e.g. because the people have already discussed the difference in
private and the harvester therefore never sees a kick). This is a general
problem of using only your event source: we don't even know whether we get a
kick for every difference. There are many private discussion channels out
there...

 
Imo the problem boils down to:
Whether we harvest whole forums or only the small part back to the last
diff-key event (likely only seconds). This can make a huge difference in load,
but it also means we get fewer messages. I would harvest everything, but I
fear this will not scale beyond 100 concurrent harvests on a cheap
single-core machine (which should still allow >>1000 active forums).
If we want to know what people need from the talk track, we imo should try to
harvest everything and adjust the design later if we cannot scale. The
harvesting infrastructure is also already in place, so we don't add complexity
here; we just degrade to slow updates after each burst.
Note: If we harvest everything, we might also kick on inter-group diff-keys,
because we then simply update the forums of the author and the addressee of
the key (which at worst gives us the 2 fetches per forum mentioned above).


So you will hook me into the diff-bridge, and I have 10s to reply to the
callback and update the talk track from then on for each diff-key triggered?
So far this sounds reasonable, apart from the decision of whether to harvest
everything or only back to the DiffKick timestamp.

conseo

P.S.: Sorry for the long text, I am torn between the two solutions and a bit
confused. Maybe we can skype?


