Harvest scheduling and job management

conseo 4consensus at web.de
Fri Mar 23 09:05:15 EDT 2012


Hi M and everybody else interested :-),

I have worked on the PipermailHarvester prototype and the scheduling framework 
for scraping jobs, which balances the load on the remote servers, provides 
almost live updates, and keeps our I/O overhead reasonable despite the many 
connections. We use the Apache HttpComponents httpcore and httpcore-nio 
libraries (1).

We had some discussion on IRC (2); I will now respond here in detail:

>[09:08:22] <mcallan> conseo: looking now at your code. (1) DiffKick looks 
>dangerous, because a harvest based on a kick should be no different than any 
>other harvest, otherwise it might not be possible to regenerate the archive by 
>a crawl harvest.  to be sure of this, it would be best not to rely on 
>contextual information from the kicker

Sure, it doesn't. We provide the context information because we can. This is 
helpful because a Kick triggers a "burst", and we can decide to end the burst 
once we have found the message matching the kick's context, then resume a 
normal harvest afterwards. All data is still parsed from the web, never taken 
out of the context. I can also make that concept private to DiffKick: instead 
of exposing the context itself, I can expose only a way to match against it.
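For example, something like this (just a sketch; the field and method names 
are hypothetical, not the current DiffKick code):

// Hypothetical sketch, not the actual DiffKick: the context stays private and
// the harvester can only ask whether a freshly scraped message matches it.
final class DiffKick
{
    private final String messageContext; // e.g. the message ID or sent date carried by the kick

    DiffKick( final String messageContext ) { this.messageContext = messageContext; }

    /** True if the scraped message matches this kick, so the burst may end. */
    boolean matchesContext( final String scrapedMessageKey )
    {
        return messageContext.equals( scrapedMessageKey );
    }
}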

>[09:08:56] <mcallan> (regenerate the cache of the archive)
>[09:17:10] <mcallan> (2) you receive a kick.  you ignore its forum property, 
>i guess because this is just test code with a hard-coded forum (ok).  then it 
>looks like you start a crawl consisting of many scheduled jobs of different 
>types.  this seems overcomplicated... or at least i don't see the design yet.

A pipermail archive has three levels of HTML which we parse. First is the 
index itself (InitJob); it schedules, for each listed month, the "date.html" 
post listing (MonthJob), which in turn schedules each posting from that list 
(PageJob). By scheduler design, each HarvestJob represents exactly one remote 
archive HTML page. These pages are the ones given by Pipermail, so I haven't 
added anything to the remote archive structure. These levels (scraping 
backwards by time) seem to be pretty common for most web forums. 
You could call "InitJob" an "UpdateJob" if you like; although I haven't 
modelled that explicitly, it already does the same thing.
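To make the chain concrete, here is a rough, self-contained sketch. HarvestJob 
and HarvestRunner are simplified stand-ins, not the prototype's actual 
signatures, and the HTML parsing is stubbed out:

import java.io.InputStream;
import java.util.Collections;
import java.util.List;

// Simplified stand-ins, only to show how each level schedules the next one.
abstract class HarvestJob
{
    final String url; // exactly one remote archive HTML page per job
    HarvestJob( String url ) { this.url = url; }
    abstract void run( InputStream page, HarvestRunner scheduler ) throws Exception;
}

interface HarvestRunner { void schedule( HarvestJob job ); }

final class InitJob extends HarvestJob // level 1: the archive index
{
    InitJob( String archiveURL ) { super( archiveURL ); }
    void run( InputStream page, HarvestRunner scheduler )
    {
        for( String monthURL : parseMonthURLs( page )) // newest month first
            scheduler.schedule( new MonthJob( monthURL ));
    }
    private List<String> parseMonthURLs( InputStream page ) { return Collections.emptyList(); } // parsing omitted
}

final class MonthJob extends HarvestJob // level 2: one date.html post listing
{
    MonthJob( String monthURL ) { super( monthURL ); }
    void run( InputStream page, HarvestRunner scheduler )
    {
        for( String postURL : parsePostURLs( page )) // newest post first
            scheduler.schedule( new PageJob( postURL ));
    }
    private List<String> parsePostURLs( InputStream page ) { return Collections.emptyList(); } // parsing omitted
}

final class PageJob extends HarvestJob // level 3: one posting
{
    PageJob( String postURL ) { super( postURL ); }
    void run( InputStream page, HarvestRunner scheduler ) { /* extract the message, cache any diff URL */ }
}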

>[09:45:03] <mcallan> I think you need a solid design before you get too far 
>into the code.  I would start with a simple napkin sketch.  Here's my rough 
>attempt:
>[09:45:15] <mcallan> (a) Receive kick
>[09:45:15] <mcallan> (b) Schedule update job
>[09:45:15] <mcallan> (c) Let update job run and schedule further update jobs 
>as needed

Yep.

>[09:45:15] <mcallan> Let's look at the detail of (c), because that's 
>obviously the heart of it.

I see problems with the following parts of your proposal:

>[09:45:15] <mcallan> (c1) Read local marker recording the last message M0 
>cached.

1. Markers are our own concept to avoid double crawling. They are not 
guaranteed by the remote archive. IRC archives don't have message IDs, for 
example, so we fall back on date ordering only, which basically gives us a 
list. Dates don't have previous and next items (they are not discrete), so we 
cannot create such a structure in a Harvester per se.

This also means, by the way, that HarvestHistory is optional, as it is not 
guaranteed to represent the remote archive structure in the best way.

>[09:45:17] <mcallan> (c2) Find M0 in the remote archive.
>[09:45:20] <mcallan> (c3) If M0 is the latest message (no more to read), then 
>quit.
>[09:45:22] <mcallan> (c4) Try incrementing local marker to next message M1, 
>or goto (c1) if another job has since incremented it.

2. We would then harvest forward and not backward, which means either we have 
no guarantee of meeting the <10s live criterion, or we have to burst forward 
through any number of new posts (and that on every Kick!). Picture 100 posts 
sent since the last update, which we cannot rule out imo.

>[09:45:25] <mcallan> (c5) Read M1 from the remote archive.
>[09:45:28] <mcallan> (c6) If M1 contains a diff URL, then cache it.
>[09:45:31] <mcallan> (c7) If M1 is the latest message (no more to read), then 
>quit.

3. We don't know when to stop. A 404 can be related to any issue, including a 
missing message ID, which even happens on the metagov pipermail. If instead we 
fetch the index of the latest month, we can go backwards until we match our 
context or reach what HarvestHistory already covers (which makes the jobs 
stop). Compared to walking the markers and waiting for a 404, this has no 
drawbacks: the overhead is the same, one page fetch to determine either the 
start-point (month of the current date) or the end-point (the 404) of the job. 
We will also very likely stay within <10s, because the Kick has just been 
received and our burst will very likely hit the kicked message first. If the 
burst goes backwards and we can match the DiffKick context, we can immediately 
degrade to the 1s stepping (schedule a normal UpdateJob or whatever it is), so 
100 new posts are no problem.
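Roughly, the burst logic I have in mind looks like this (illustrative only; 
History stands in for HarvestHistory, whose real API differs, and DiffKick is 
the hypothetical sketch from above):

import java.util.List;

// Illustrative only: walk the latest month's posts from newest to oldest and
// decide whether a normal (1s stepping) update job should follow the burst.
final class BurstSketch
{
    interface History { boolean covers( String postURL ); }

    static boolean burst( List<String> postsNewestFirst, History history, DiffKick kick )
    {
        for( String postURL : postsNewestFirst )
        {
            if( history.covers( postURL )) return false; // already cached, nothing newer was missed
            String messageKey = fetchAndCache( postURL ); // one page fetch per post
            if( kick != null && kick.matchesContext( messageKey )) return true; // kicked post found, degrade to normal stepping
        }
        return true; // walked past the listing without a match; keep updating normally
    }

    static String fetchAndCache( String postURL ) { return postURL; } // fetching and caching omitted
}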

>[09:45:33] <mcallan> (c8) Schedule another update job.
>[09:48:13] <mcallan> conseo: i'll be up in 10 hours or so, and we can discuss
>[10:07:25] <mcallan> this is what i meant by sketching the algorithm of a 
>single job.  note this design does not depend on the structure of the archive, 
>and includes very few implementation details.  the details do not matter a 
>whole lot because they can always be changed after the fact.  the design 
>cannot be changed so easily once the code is written, so it's crucial to get 
>it right.  not sure this is right, but it's a first stab

See above for the current design rationale, which I have developed through 
this prototype and my past experience with pipermail and irssilog. Sorry that 
I couldn't do it beforehand, but I wanted to get my hands a bit dirty to 
understand the potential scheduling problems better (that is what the 
prototyping was for).

While I hope I have clarified the design rationale a bit more, what I actually 
wanted was feedback on whether the scheduling is done right (independently of 
how a harvester is run). The concept is:
1) Extend HarvestJob (I can separate it into an interface, if you don't like 
inheriting) and set the URL for each job. 
2) Implement the run() method to read the InputStream which HarvestRunner 
will create, and deal with the content of the fetched HTML page. 
3) Schedule the job. (3) The scheduler asynchronously fetches the job's URL in 
the next possible slot for this host and then runs it inside its thread pool.

Do some checks (internal to the harvester) with HarvestHistory or your own 
persistent state tracker to avoid double crawls.
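On the caller's side that boils down to something like this (hypothetical and 
reusing the stand-in classes from the sketch above; only schedule(HarvestJob) 
corresponds to the real entry point (3), and how the shared runner is obtained 
is made up here):

// Hypothetical caller-side sketch of step 3.
final class KickHandlerSketch
{
    void onKick( HarvestRunner runner, String archiveURL )
    {
        // fetched asynchronously in the next possible slot for this host,
        // then run inside the scheduler's thread pool
        runner.schedule( new InitJob( archiveURL ));
    }
}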

conseo

(1) https://hc.apache.org/httpcomponents-core-ga/
(2) http://zelea.com/var/cache/irc/votorola/12-03/22 
and http://zelea.com/var/cache/irc/votorola/12-03/23

(3) http://zelea.com/project/votorola/_/javadoc/votorola/a/diff/harvest/HarvestRunner.html#schedule%28votorola.a.diff.harvest.HarvestJob%29