Harvest scheduling and job management
conseo
4consensus at web.de
Tue Mar 27 08:16:36 EDT 2012
Hi,
I have a modified proposal to track harvested parts of the archive.
Compile a list of all messages. This is normally done in several steps,
starting with the base URL of the archive and then accessing it by some
category, usually time. This is not strictly necessary as long as we know
which sub-lists (meaning the lowest level) are finished. For Pipermail we
fetch a sub-list for each month, for example. Sub-lists which are done are
saved (serialized) as the relative URL of this sub-list. On a new crawl we
will skip these URLs.
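
A minimal sketch of the skip logic described above, assuming the finished sub-lists are kept in a plain HashSet of relative URLs. The class name DoneSubLists, its methods, and the month URLs in the usage note are hypothetical and only serve as illustration; they are not taken from the harvester code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical tracker of finished sub-lists, keyed by relative URL. */
class DoneSubLists {
    private final Set<String> done = new HashSet<>();

    /** Records a sub-list (e.g. one month) as completely harvested. */
    void markDone(String relativeUrl) {
        done.add(relativeUrl);
    }

    boolean isDone(String relativeUrl) {
        return done.contains(relativeUrl);
    }

    /** Returns the sub-lists that still need crawling, in their given order. */
    List<String> pending(List<String> subLists) {
        List<String> result = new ArrayList<>();
        for (String url : subLists) {
            if (!done.contains(url)) {
                result.add(url);
            }
        }
        return result;
    }
}
```

On a new crawl, something like pending(allMonths) would yield only the months not yet marked as done. Since java.util.HashSet is Serializable, the set itself could be the thing that gets saved.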
For the sub-list reaching into the present and future (like the current
month), we save a "stop sign", the relative URL of the newest post, once we
have harvested all older ones plus the "stop sign" itself. If the sub-list
changes (e.g. next month), we mark it as done and switch to the new one. Here
we need some sort of time ordering, even if the total list is not ordered in
time, or we have to temporarily store all harvested message pages for this
sub-list.
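
The "stop sign" idea could look roughly like this, assuming the open sub-list can be read as a list of message URLs ordered oldest first (that ordering is exactly the requirement mentioned above). StopSign and newerThan are hypothetical names.

```java
import java.util.List;

/** Hypothetical helper for the still-open sub-list (e.g. the current month). */
class StopSign {
    /**
     * Returns the messages that come after the saved stop sign.
     * Assumes oldestFirst is ordered by time, oldest message first.
     * If there is no stop sign yet, or it is not found in the list,
     * every message is returned.
     */
    static List<String> newerThan(List<String> oldestFirst, String stopSign) {
        int index = (stopSign == null) ? -1 : oldestFirst.indexOf(stopSign);
        return oldestFirst.subList(index + 1, oldestFirst.size());
    }
}
```

After harvesting the returned messages, the last element of the full list would become the new stop sign.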
NOTE: The fact that we don't need a "timeline" or similar and can simply
compile the list from the remote archive's structure seems to indicate that
this is a good design, right?
Saving is done by triggering a special commit-like job after all message jobs
inside the sub-list. We can do this on every level as long as we schedule all
sub-jobs before the "commit job".
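
To illustrate the ordering, here is a small sketch assuming a single FIFO job queue, so that the commit job can only run once every message job of its sub-list has run. SubListScheduler and its members are hypothetical, and the fetch is replaced by a stand-in that just records the URL.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

/** Hypothetical FIFO scheduler: message jobs first, then one commit job. */
class SubListScheduler {
    private final Queue<Runnable> jobs = new ArrayDeque<>();
    final Set<String> committed = new HashSet<>();   // finished sub-lists
    final List<String> fetched = new ArrayList<>();  // stand-in for real fetches

    /** Queues one job per message, followed by the commit job for the sub-list. */
    void scheduleSubList(String subListUrl, List<String> messageUrls) {
        for (String message : messageUrls) {
            jobs.add(() -> fetched.add(subListUrl + message));
        }
        jobs.add(() -> committed.add(subListUrl));
    }

    /** Runs at most maxJobs queued jobs in order; returns how many ran. */
    int run(int maxJobs) {
        int count = 0;
        while (count < maxJobs && !jobs.isEmpty()) {
            jobs.poll().run();
            count++;
        }
        return count;
    }
}
```

If the run is cut short before the commit job, the sub-list is not recorded in committed and would be crawled again next time.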
Potential problems:
1) If we stop the harvester, we lose messages in the current job, as we can't
commit anything for the current sub-list. At the moment our history tracks
each message fetch. For a Pipermail month list, for example, the loss could be
1000 messages or more, depending on the archive. Yet the harvester is not
supposed to be stopped and started all the time, and once the archive is
harvested the problem disappears.
2) We need more serialization storage than for marked areas, since each
sub-list needs to store its relative URL in a HashSet or similar. We don't
save relative URLs to messages though, so storage size depends on list
structure and size, not on message count. One can omit very short sub-list
jobs on the lowest level, because the ones on the higher level will cover them
after the initial scan in problem cases.
Both seem negligible to me.
conseo