Harvest scheduling and job management

Tue Mar 27 08:16:36 EDT 2012

Hi,

I have a modified proposal to track harvested parts of the archive.

Compile a list of all messages. This is normally done in several steps, 
starting with the base URL of the archive and then accessing it by some 
category, usually time. This is not strictly necessary as long as we know 
which sub-lists (meaning the lowest level) are finished. For Pipermail we 
fetch a sub-list for each month, for example. Each finished sub-list is 
saved (serialized) as its relative URL. On a new crawl we skip these URLs.
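
A minimal sketch of the skip step, assuming the finished sub-lists are kept in 
a plain set of relative URLs. The class and method names (FinishedSubLists, 
markDone, pending) and the example URLs are made up for illustration and are 
not existing harvester code:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical tracker for finished sub-lists, keyed by relative URL. */
class FinishedSubLists {
    // Relative URLs of sub-lists that were harvested completely.
    private final Set<String> done = new HashSet<>();

    /** Record a sub-list (e.g. "2012-February/") as fully harvested. */
    void markDone(String relativeUrl) {
        done.add(relativeUrl);
    }

    boolean isDone(String relativeUrl) {
        return done.contains(relativeUrl);
    }

    /** Return the sub-lists that still have to be crawled, in input order. */
    List<String> pending(List<String> allSubLists) {
        List<String> result = new ArrayList<>();
        for (String url : allSubLists) {
            if (!done.contains(url)) {
                result.add(url);
            }
        }
        return result;
    }
}
```

On a new crawl, the compiled list of sub-lists would be passed through 
pending() and only the remaining ones would be scheduled.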

For the sub-list reaching into the present and future (like the current 
month), we save a "stop sign", the relative URL of the newest post, once we 
have harvested all older ones plus the "stop sign" itself. If the sub-list 
changes (e.g. next 
month), we mark it as done and switch to the new one. Here we need some sort 
of time ordering, even if the total list is not ordered in time, or we have to 
temporarily store all harvested message pages for this sub-list.
NOTE: The fact that we don't need a "timeline" or similar and can simply 
compile the list from the remote archive's structure seems to indicate that 
this is a good design, right?
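
The "stop sign" handling for the open sub-list could look roughly like this. 
It is only a sketch and assumes the message URLs of the sub-list are available 
ordered from oldest to newest; StopSign, newerThan and nextStopSign are 
hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for the sub-list that is still growing. */
class StopSign {
    /**
     * Return the messages that are newer than the saved stop sign.
     * Assumes messageUrls is ordered oldest to newest. If there is no stop
     * sign yet, or it is not found in the list, everything is returned.
     */
    static List<String> newerThan(List<String> messageUrls, String stopSign) {
        int idx = (stopSign == null) ? -1 : messageUrls.indexOf(stopSign);
        return new ArrayList<>(messageUrls.subList(idx + 1, messageUrls.size()));
    }

    /** The new stop sign is the newest message, or the old one if the list is empty. */
    static String nextStopSign(List<String> messageUrls, String previous) {
        return messageUrls.isEmpty() ? previous : messageUrls.get(messageUrls.size() - 1);
    }
}
```

If the sub-list is not ordered in time, this does not work as is, which is the 
case where the harvested message pages would have to be stored temporarily.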

Saving is done by triggering a special commit-like-job after all message jobs 
inside the sub-list. We can do this on every level as long as we schedule all 
sub jobs before the "commit-job". 
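
To illustrate the ordering, here is a small sketch with a single FIFO queue, 
where the commit job is queued after all message jobs of a sub-list. The real 
scheduler may work differently; SubListScheduler and its methods are 
hypothetical, and the "fetch" is only a stand-in that records the URL:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

/** Hypothetical single-threaded scheduler with a commit job per sub-list. */
class SubListScheduler {
    private final Queue<Runnable> jobs = new ArrayDeque<>();
    private final Set<String> done = new HashSet<>();
    private final List<String> fetched = new ArrayList<>();

    /** Queue one job per message, then the commit job for the sub-list. */
    void scheduleSubList(String subListUrl, List<String> messageUrls) {
        for (String messageUrl : messageUrls) {
            // Stand-in for the actual message fetch.
            jobs.add(() -> fetched.add(messageUrl));
        }
        // Commit job: runs only after all message jobs queued above.
        jobs.add(() -> done.add(subListUrl));
    }

    /** Run at most maxJobs queued jobs; returns how many were run. */
    int run(int maxJobs) {
        int executed = 0;
        while (executed < maxJobs && !jobs.isEmpty()) {
            jobs.poll().run();
            executed++;
        }
        return executed;
    }

    boolean isDone(String subListUrl) {
        return done.contains(subListUrl);
    }

    int fetchedCount() {
        return fetched.size();
    }
}
```

Stopping after the message jobs but before the commit job leaves the sub-list 
unmarked, which is the situation described in problem 1 below.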

Potential problems: 

1) If we stop the harvester, we lose messages in the current job, as we can't 
commit anything for the current sub-list. At the moment our history tracks 
each message fetch. For a Pipermail month list, for example, the loss could 
be 1000 messages or 
more depending on the archive. Yet, the harvester is not supposed to be 
stopped and started all the time and once the archive is harvested, the 
problem disappears.

2) We need more serialization storage than for marked areas, since each 
sub-list needs to store its relative URL in a HashSet or similar. We don't 
save relative URLs to messages though, so storage size depends on list structure 
and size, not on message count. One can omit very short sub-list jobs on the 
lowest level, because ones on the higher level will cover them after the 
initial scan in problem cases.

Both seem negligible to me.

conseo