Harvest scheduling and job management

Michael Allan mike at zelea.com
Fri Mar 23 16:43:24 EDT 2012


Hi C,

Here's a more complete sketch.  I answer your comments on the earlier
sketch further below.

(a) Start up
------------
  (a1) For each forum, schedule update job (c)

(b) Kick reception
------------------
  (b1) Receive kick for a forum
  (b2) Schedule update job (c) for that forum
  (b3) Let update job run and schedule further jobs as needed

(c) Forward update job  scheduled by (a, b, c)
----------------------
   fM: Front marker recording the latest message locally cached from a
       given forum's remote archive.  When up to date, it points to
       the *end* of the archive.

  (c1) Read fM+0 (latest message cached) from fM.  If no fM+0 exists,
       then schedule a back-crawl job (d) and QUIT.
  (c2) Find fM+0 in the remote archive.
  (c3) If fM+0 is the latest message (no more to read), then QUIT.
  (c4) Try incrementing fM to next message fM+1, or goto (c1) if
       another job has since incremented it.
  (c5) Read fM+1 from the remote archive.
  (c6) If fM+1 contains a diff URL, then cache it.
  (c7) If fM+1 is the latest message (no more to read), then QUIT.
  (c8) Schedule another update job (c).
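
For concreteness, here's a rough Java sketch of (c).  The Archive,
Cache and Scheduler interfaces are stand-ins I made up for this
sketch, not actual harvester code; only the shape of the algorithm
matters.

    import java.util.concurrent.atomic.AtomicReference;

    /** A sketch of forward update job (c).  Archive, Cache and
      * Scheduler are hypothetical stand-ins.
      */
    final class UpdateJob implements Runnable
    {
        interface Archive
        {
            String messageAfter( String id ); // next message, or null at end
            String diffURL( String id );      // diff URL in message, or null
        }
        interface Cache { void cache( String diffURL ); }
        interface Scheduler { void schedule( Runnable job ); }

        UpdateJob( Archive archive, Cache cache, Scheduler scheduler,
          AtomicReference<String> fM )
        {
            this.archive = archive;
            this.cache = cache;
            this.scheduler = scheduler;
            this.fM = fM;
        }

        private final Archive archive;
        private final Cache cache;
        private final Scheduler scheduler;
        private final AtomicReference<String> fM; // front marker, shared by jobs

        public void run()
        {
            for( ;; )
            {
                final String m0 = fM.get(); // (c1) latest message cached
                if( m0 == null )
                {
                    // (c1) no fM+0 exists: schedule a back-crawl job (d) here
                    return; // QUIT
                }
                final String m1 = archive.messageAfter( m0 ); // (c2)
                if( m1 == null ) return; // (c3) m0 is the latest, QUIT
                if( !fM.compareAndSet( m0, m1 )) continue; // (c4) raced, goto (c1)
                final String diffURL = archive.diffURL( m1 ); // (c5)
                if( diffURL != null ) cache.cache( diffURL ); // (c6)
                if( archive.messageAfter( m1 ) == null ) return; // (c7) QUIT
                scheduler.schedule( // (c8) schedule another update job (c)
                  new UpdateJob( archive, cache, scheduler, fM ));
                return;
            }
        }
    }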

(d) Back-crawl job  scheduled by (c, d)
------------------
   bM: Back marker recording the earliest message locally cached from
       a given forum's remote archive.  When up to date, it points to
       the *start* of the archive.

  (d1) Read bM-0 (earliest message cached) from bM.  If no bM-0
       exists, then bM-0 is effectively the last message in the
       archive *plus 1*, or the end *bound* of the remote archive.
  (d2) Find bM-0 in the remote archive (or find the end bound).
  (d3) If bM-0 is the earliest message (no more to read), then QUIT.
  (d4) Try decrementing bM to previous message bM-1, or goto (d1) if
       another job has since decremented it.
  (d5) Read bM-1 from the remote archive.
  (d6) If bM-1 contains a diff URL, then cache it.
  (d7) If bM-1 is the earliest message (no more to read), then QUIT.
  (d8) Schedule another back-crawl job (d).
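
The back-crawl mirrors the sketch above; only the marker motion
reverses.  Reusing the stand-in interfaces from the update-job sketch,
and assuming messageBefore and endBound methods in the same spirit:

    import java.util.concurrent.atomic.AtomicReference;

    /** A sketch of back-crawl job (d), mirroring UpdateJob above. */
    final class BackCrawlJob implements Runnable
    {
        interface Archive
        {
            String messageBefore( String id ); // previous message, or null at start
            String endBound();                 // pseudo-ID just past the last message
            String diffURL( String id );
        }

        BackCrawlJob( Archive archive, UpdateJob.Cache cache,
          UpdateJob.Scheduler scheduler, AtomicReference<String> bM )
        {
            this.archive = archive;
            this.cache = cache;
            this.scheduler = scheduler;
            this.bM = bM;
        }

        private final Archive archive;
        private final UpdateJob.Cache cache;
        private final UpdateJob.Scheduler scheduler;
        private final AtomicReference<String> bM; // back marker, shared by jobs

        public void run()
        {
            for( ;; )
            {
                final String m0 = bM.get(); // (d1) earliest message cached
                final String ref = m0 != null ? m0 : archive.endBound(); // (d1)
                final String m1 = archive.messageBefore( ref ); // (d2)
                if( m1 == null ) return; // (d3) nothing earlier, QUIT
                if( !bM.compareAndSet( m0, m1 )) continue; // (d4) raced, goto (d1)
                final String diffURL = archive.diffURL( m1 ); // (d5)
                if( diffURL != null ) cache.cache( diffURL ); // (d6)
                if( archive.messageBefore( m1 ) == null ) return; // (d7) QUIT
                scheduler.schedule( // (d8) schedule another back-crawl job (d)
                  new BackCrawlJob( archive, cache, scheduler, bM ));
                return;
            }
        }
    }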

conseo said:
> A Pipermail archive has three levels of HTML which we parse.  First
> is the index itself (InitJob); it schedules a MonthJob for each
> listed month's "date.html" post listing, and each MonthJob in turn
> schedules a PageJob for every posting in its list.  Each HarvestJob
> represents exactly one remote archive HTML page by scheduler design.
> These pages are the ones given by Pipermail, so I haven't added
> anything to the remote archive structure. ...

Maybe the structure of the archive does not matter for the top level
of the design.  I think all archives could be harvested with the
design sketched above; not with the same code, mind you, but with the
same rough design.

> ... These levels (scraping backwards by time) seem to be pretty
> common for most web forums.  You could call "InitJob" "UpdateJob" if
> you like; although I haven't modelled that concretely, it already
> does the same.

I do not see the same design in your code.  You appear to tie the
different job types to the structural logic of the archive, not to the
procedural logic of harvesting.  Maybe that is what confuses me when I
read your code, and maybe it also complicates your scheduling (not
sure), which is what you asked about.

In any case, if you sketch the algorithm (like above) then it'll be
easier for me to understand your design.  There is no hurry, and it's
always worth discussing design proposals in advance of coding.

> I see problems in the following proposal of yours:
> 
> > ... (c1) Read local marker recording the last message M0 cached.

> 1. Markers are a concept of ours to avoid double crawling.  They are
> not guaranteed by the remote archive.  IRC archives don't have
> message IDs, for example, so we fall back on date ordering only,
> which basically gives us a list.  Dates don't have previous and next
> items (they are non-discrete), so we cannot create such a structure
> in a Harvester per se.

I think every archive has a sequential order and we can construct
external markers that point into it.  We can increment and decrement
those markers with respect to the archive.  We do this *manually* when
reading archives ourselves, so I'm sure a program can automate it.
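
For instance, even an IRC log without message IDs still has a total
order.  A marker could pair the timestamp with an ordinal among
messages sharing that timestamp, which makes the order discrete.  The
names below are purely illustrative:

    /** One possible marker for a date-ordered archive without message
      * IDs, such as an IRC log.
      */
    final class DateMarker implements Comparable<DateMarker>
    {
        DateMarker( long msEpochTime, int ordinal )
        {
            this.msEpochTime = msEpochTime;
            this.ordinal = ordinal;
        }

        final long msEpochTime; // timestamp of the message
        final int ordinal;      // index among messages with that timestamp

        public int compareTo( DateMarker o )
        {
            if( msEpochTime != o.msEpochTime )
            {
                return msEpochTime < o.msEpochTime ? -1 : 1;
            }
            return ordinal - o.ordinal;
        }
    }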

> 2. We then harvest forward and not backward, which gives us no
> guarantee that we can meet the <10s live criterion, or else we have
> to burst forward through any number of new posts (this means we
> burst that way on every kick!).  Picture 100 posts sent since the
> last update, which we cannot rule out imo.
 
You could modify (c) to update backwards from the end bound to the
last message cached, but it would complicate the design.  I think a
forward update is cleaner and fast enough (faster than 10s).  Combined
with a back-crawl (d) to fill the cache with old messages, it probably
covers all requirements.  What do you think?

If there is a 100-post lag, for example, then it probably means the
harvester (or a kicker) was shut down.  It will catch up when
restarted, and then everything will be fast again.  Isn't that true?

> > (c5) Read M1 from the remote archive.
> > (c6) If M1 contains a diff URL, then cache it.
> > (c7) If M1 is the latest message (no more to read), then quit.

> 3. We don't know when to stop. ...

Sure we do.  When we read an archive manually, we know when to stop.
See here for example; we stop at the end of the list. :-)
http://metagovernment.org/pipermail/start_metagovernment.org/2012-March/date.html

> While I know I have maybe clarified the design rationale a bit more,
> I actually wanted to get feedback on whether the scheduling is done
> right (independent of how to run a harvester).  The concept is:
> 1) Extend a HarvestJob (I can separate it into an interface, if you
> don't like inheriting) and set the URL for each job.
> 2) Implement the run() method to read the InputStream which will be
> created by HarvestRunner, and deal with the content of this fetched
> HTML page.
> 3) Schedule the job.  The scheduler asynchronously fetches the
> job's URL at the next possible step for this host and then runs it
> inside its thread pool.

Scheduling a job seems almost trivial, at least with the design
sketched at top.

(e) Scheduling a job
--------------------
   sI: Minimum allowed interval between jobs, e.g. a 1 second gap.
   sT: Time of last job scheduled for a given forum.

  (e1) Atomically increment sT += sI.
  (e2) Run job at time sT.
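
Concretely, a minimal Java sketch of (e) might look like this; the
class and field names are illustrative only, one instance per forum:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    /** A sketch of job scheduling (e). */
    final class JobScheduler
    {
        private static final long sI = 1000; // ms, minimum interval between jobs

        private final AtomicLong sT = // time of last job scheduled for this forum
          new AtomicLong( System.currentTimeMillis() );

        private final ScheduledExecutorService executor =
          Executors.newSingleThreadScheduledExecutor();

        void schedule( Runnable job )
        {
            final long t = sT.addAndGet( sI ); // (e1) atomically sT += sI
            executor.schedule( job, // (e2) run job at time sT
              Math.max( 0, t - System.currentTimeMillis() ),
              TimeUnit.MILLISECONDS );
        }
        // A real version would also keep sT from falling behind the
        // clock after an idle spell, lest a later burst of jobs all
        // run at once.
    }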

Won't that work?

Mike


