Harvester Roadmap

conseo 4consensus at web.de
Fri May 11 09:41:28 EDT 2012


Hi M,

> > 
> > What do you mean by "sync"? Atm. we have a <3 hour update interval,
> > which 1) already makes a track quite usable imo and 2) already gives
> > us everything we need to write a talk track. But if we come across
> > issues in the track design, we can fix them before we target the
> > detector handling or further back-end work.
> 
> I do not think we should poll the archives, not even as a fallback.
> It's maybe okay for initializing the forums, especially the inactive
> ones.  But for day-to-day use, the front will be too slow (3 hours),
> and the back won't scale (too many forums in the world).

Not just maybe: history recovery was the whole design rationale for harvesting
(crawling). We can only skip harvesting entire archives in the back-end if we
have a 100% reliable DiffKick source that can also point us to the time
frame of the message, because then we only burst back until we find it
and leave the rest (most) of the archive alone. Yet for history recovery
we have no choice but to use Google or some other off-site index, so that we
crawl only those messages of ours that contain a diff-url.
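
To make the "burst back" idea concrete, here is a minimal sketch, assuming the DiffKick hands us a diff-url and a rough earliest time for the message. The names ArchivedMessage, BurstBack and notBefore are hypothetical and only for illustration; they are not existing harvester code.

```java
import java.time.Instant;
import java.util.List;
import java.util.Optional;

/** Hypothetical archive entry; not an existing harvester type. */
final class ArchivedMessage {
    final Instant sent;
    final String body;

    ArchivedMessage(Instant sent, String body) {
        this.sent = sent;
        this.body = body;
    }
}

final class BurstBack {
    /**
     * Walks the archive from newest to oldest and returns the first message
     * whose body contains diffUrl. Stops as soon as a message is older than
     * notBefore, so the rest of the archive is never touched.
     * Assumes newestFirst is sorted by descending send time.
     */
    static Optional<ArchivedMessage> find(List<ArchivedMessage> newestFirst,
                                          String diffUrl, Instant notBefore) {
        for (ArchivedMessage m : newestFirst) {
            if (m.sent.isBefore(notBefore)) {
                break; // outside the time frame given by the kick
            }
            if (m.body.contains(diffUrl)) {
                return Optional.of(m);
            }
        }
        return Optional.empty();
    }
}
```

Without a time frame from the kick source, the loop has no stopping point and degrades to a full crawl, which is exactly the history recovery case.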

> 
> > > I have one doubt about that.  I vaguely recollect discussing an
> > > email-based subscription detector for Mailman. (?)  Why did we
> > > discuss that?  Do you recall?  Hopefully it's not needed, because
> > > a bridge detector is much easier.
> > 
> > There are two problems with it: 1) The kick event does not have to
> > occur, so messages might get lost if the author of the message does
> > not trigger the difference event after sending to the forum. ...
> 
> I guess you're right, but that's only because the difference bridge is
> not the only place to view a difference.  So maybe we should raise a
> kick on every request to the difference *cache*.  That would cover the
> bridge itself, the bridge footings in the draft, and anything else we
> added in future.  If no drafter cares to look at the posted difference
> in *some* manner, then it's not important and there's no need to
> trigger a kick or to harvest anything.
> 
> Again, this approach is simple.  But more than ever, it places a
> burden on efficiency.  There will be many redundant kicks.
> 
> > ... 2) We don't know where the event comes from, because it is
> > likely that it is clicked in the mail (or other native) client, and
> > we have ruled out referrer-id for that reason. (We cannot burst on
> > all forums all the time.) ...
> 
> You mean the difference bridge (or cache, or whatever) won't know what
> forum the difference was posted in?  True, but the difference will
> resolve to the drafters, the drafters to the candidate, and the
> candidate to the forum!  (Ref the use cases linked in my last).  All
> these expensive resolutions will have to be skipped ofc for redundant
> kicks.  That probably means the determination of redundancy must be a
> function of the difference key itself.

Yes, this intersection is possible and it is a smart idea. Yet author and
addressee can share many forums (at least 5 is reasonable, I guess), and the
candidate won't do, because the relation between the two drafters can be
covoters or something else. Do you have an idea for getting closer to the true
forum? The only thing I came up with is referrer-id, which is not good enough
imo, because drafters won't follow links from the archive and it is not
reliable in any sense (browsers can disable or fake it).
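
The intersection itself is cheap; the ambiguity is the problem. A minimal sketch, assuming we can look up the set of forums each drafter participates in (ForumResolver and the forum names are hypothetical):

```java
import java.util.Set;
import java.util.TreeSet;

final class ForumResolver {
    /**
     * Returns the forums that both drafters share. The result is only a
     * candidate set: with more than one element we still do not know in
     * which forum the difference was actually posted.
     */
    static Set<String> sharedForums(Set<String> authorForums,
                                    Set<String> addresseeForums) {
        Set<String> shared = new TreeSet<>(authorForums);
        shared.retainAll(addresseeForums); // set intersection
        return shared;
    }
}
```

If the result has a single element we are done; otherwise every forum in it would have to be burst, which is what we wanted to avoid.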

> 
>   boolean isHarvested( DiffKey diff )

Each key can occur in any number of messages (e.g. if it is a substantial 
difference at the core of debate over a long period of time). We cannot store 
the necessary information in the URL, because it can be referenced from many 
messages.

> 
> If the kicker called that function and aborted redundant kicks, then
> it would be very fast indeed.

True.
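
A minimal sketch of a kicker that calls isHarvested and aborts redundant kicks. DiffKey is a hypothetical stand-in here, the Kicker class and its counter are illustrative only, and the sketch is not thread-safe.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

/** Hypothetical stand-in for the difference key. */
final class DiffKey {
    private final String value;

    DiffKey(String value) {
        this.value = Objects.requireNonNull(value);
    }

    @Override public boolean equals(Object o) {
        return o instanceof DiffKey && ((DiffKey) o).value.equals(value);
    }

    @Override public int hashCode() {
        return value.hashCode();
    }
}

final class Kicker {
    private final Set<DiffKey> harvested = new HashSet<>();
    private int expensiveResolutions = 0;

    boolean isHarvested(DiffKey diff) {
        return harvested.contains(diff);
    }

    /** Returns true if the kick was raised, false if it was aborted as redundant. */
    boolean kick(DiffKey diff) {
        if (isHarvested(diff)) {
            return false; // redundant: skip all resolution work
        }
        expensiveResolutions++; // placeholder for difference -> drafters -> candidate -> forum
        harvested.add(diff);
        return true;
    }

    int expensiveResolutions() {
        return expensiveResolutions;
    }
}
```

Note that this keys redundancy on the difference alone, so a later message that reuses the same key would be skipped; that is the limitation described above.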

> 
> > ... I'm afraid we need a Maildir/Mailman detector to get our <10s
> > goal reliably.  Whether we simply write a detector for Maildir
> > which can handle all kinds of forum updates (often you can get
> > notified of new messages by mail, which might work fairly well
> > for many forums) or we separate it for each forum type is still open.
> 
> Administering all those subscriptions will be complicated and will only
> work for mail-based forums.  We gotta try for a more elegant solution.

I agree, although a MailDetector might get us pretty far and would allow us to
emit DiffKicks reliably, exactly once per message. Auto-subscription is indeed
the most difficult issue, I guess.
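
What I mean by "exactly once", as a minimal sketch: the detector sees each delivered message, and a kick is emitted only the first time a given Message-ID carrying a diff-url shows up. The class layout, the kick sink and the URL prefix are assumptions for illustration, not existing code, and the seen-set is in memory only.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

final class MailDetector {
    private final Set<String> seenMessageIds = new HashSet<>();
    private final String diffUrlPrefix;      // assumed way to recognize a diff-url
    private final Consumer<String> kickSink; // receives the Message-ID of each kick

    MailDetector(String diffUrlPrefix, Consumer<String> kickSink) {
        this.diffUrlPrefix = diffUrlPrefix;
        this.kickSink = kickSink;
    }

    /** Returns true if a kick was emitted for this delivery. */
    boolean onMessage(String messageId, String body) {
        if (!body.contains(diffUrlPrefix)) {
            return false; // no difference link, nothing to harvest
        }
        if (!seenMessageIds.add(messageId)) {
            return false; // duplicate delivery of a message already kicked
        }
        kickSink.accept(messageId);
        return true;
    }
}
```

Because the detector sits on the delivery path, it knows the forum and the message directly, so no resolution or bursting is needed. The subscription management is not covered by this sketch.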

I have to think about it again, but this <10s update problem has proven to be 
tricky :-). We need to get close to the discussion media somehow to solve this 
problem.

conseo


