Harvest scheduling and job management
conseo
4consensus at web.de
Fri Mar 23 09:05:15 EDT 2012
Hi M and everybody else interested :-),
I have worked on the PipermailHarvester prototype and the scheduling framework
for scraping jobs, to balance load on the remote servers, provide almost live
updates, and keep our I/O overhead for so many connections reasonable. We use
http-core and http-core-nio (1).
We had some discussion on IRC (2), I will respond in detail here now:
>[09:08:22] <mcallan> conseo: looking now at your code. (1) DiffKick looks
dangerous, because a harvest based on a kick should be no different than any
other harvest, otherwise it might not be possible to regenerate the archive by
a crawl harvest. to be sure of this, it would be best not to rely on
contextual information from the kicker
Sure, it isn't. We provide the context information because we can. This is
helpful because a Kick triggers a "burst", and we can decide to end the burst
once we have found the message matching the kick's context, then resume a
normal harvest afterwards. All data is still parsed from the web, not taken
from the context. I can make that concept private to DiffKick instead of
exposing the context itself, and only allow matching against it.
>[09:08:56] <mcallan> (regenerate the cache of the archive)
>[09:17:10] <mcallan> (2) you receive a kick. you ignore its forum property,
i guess because this is just test code with a hard-coded forum (ok). then it
looks like you start a crawl consisting of many scheduled jobs of different
types. this seems overcomplicated... or at least i don't see the design yet.
A pipermail archive has three levels of HTML which we parse. First is the
index itself (InitJob); it schedules, for each listed month, the "date.html"
post listing (MonthJob), which in turn schedules each posting from that list
(PageJob). By scheduler design, each HarvestJob represents exactly one remote
archive HTML page. These pages are the ones given by Pipermail, so I haven't
added anything to the remote archive structure. These levels (scraping
backwards by time) seem to be pretty common for most web forums.
You could call "InitJob" "UpdateJob" if you like; although I haven't modelled
that concretely, it already does the same.
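To make the cascade concrete, here is a minimal, self-contained sketch. Only
the HarvestJob/InitJob/MonthJob/PageJob names come from the prototype; the
queue-based scheduler, the hard-coded month and post lists, and the crawl()
driver are stand-ins for illustration, not the real implementation:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Sketch of the three-level Pipermail cascade. The scheduler queue and the
// hard-coded page contents are hypothetical stand-ins for the real harvester.
public class PipermailCascade {

    /** Each job represents exactly one remote archive HTML page. */
    abstract static class HarvestJob {
        final String url;
        HarvestJob(String url) { this.url = url; }
        abstract void run(Queue<HarvestJob> scheduler);
    }

    /** Level 1: the archive index; schedules one MonthJob per listed month. */
    static class InitJob extends HarvestJob {
        InitJob(String url) { super(url); }
        void run(Queue<HarvestJob> scheduler) {
            // In the real harvester the month list is parsed from the index HTML.
            for (String month : List.of("2012-March", "2012-February")) {
                scheduler.add(new MonthJob(url + "/" + month + "/date.html"));
            }
        }
    }

    /** Level 2: a month's date.html listing; schedules one PageJob per post. */
    static class MonthJob extends HarvestJob {
        MonthJob(String url) { super(url); }
        void run(Queue<HarvestJob> scheduler) {
            for (String id : List.of("000001", "000002")) { // parsed from listing
                scheduler.add(new PageJob(url.replace("date.html", id + ".html")));
            }
        }
    }

    /** Level 3: a single posting; parses and caches it, schedules nothing. */
    static class PageJob extends HarvestJob {
        PageJob(String url) { super(url); }
        void run(Queue<HarvestJob> scheduler) { harvested.add(url); }
    }

    static final List<String> harvested = new ArrayList<>();

    /** Drains the queue, letting each job schedule its successors. */
    static List<String> crawl(String indexUrl) {
        Queue<HarvestJob> scheduler = new ArrayDeque<>();
        scheduler.add(new InitJob(indexUrl));
        while (!scheduler.isEmpty()) scheduler.poll().run(scheduler);
        return harvested;
    }
}
```

The point of the sketch is only that each job fetches exactly one page and
schedules its children, so the crawl is a flat queue of single-page jobs
rather than one monolithic recursive walk.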
>[09:45:03] <mcallan> I think you need a solid design before you get too far
into the code. I would start with a simple napkin sketch. Here's my rough
attempt:
>[09:45:15] <mcallan> (a) Receive kick
>[09:45:15] <mcallan> (b) Schedule update job
>[09:45:15] <mcallan> (c) Let update job run and schedule further update jobs
as needed
Yep.
>[09:45:15] <mcallan> Let's look at the detail of (c), because that's
obviously the heart of it.
I see problems in your following proposal:
>[09:45:15] <mcallan> (c1) Read local marker recording the last message M0
cached.
1. Markers are our own concept to avoid double crawling. They are not
guaranteed by the remote archive. IRC archives don't have message ids, for
example, so we fall back on date ordering only, which basically gives us a
list. Dates don't have previous and next items (they are not discrete), so we
cannot create such a structure in a Harvester per se.
This also means, by the way, that HarvestHistory is optional, as it is not
guaranteed to represent the remote archive structure in the best way.
>[09:45:17] <mcallan> (c2) Find M0 in the remote archive.
>[09:45:20] <mcallan> (c3) If M0 is the latest message (no more to read), then
quit.
>[09:45:22] <mcallan> (c4) Try incrementing local marker to next message M1,
or goto (c1) if another job has since incremented it.
2. We would then harvest forward, not backward, which gives us no guarantee
that we can meet the <10s live criterion, or we have to burst forward over any
number of new posts (meaning we burst that way on every Kick!). Picture 100
posts sent since the last update, which we cannot rule out imo.
>[09:45:25] <mcallan> (c5) Read M1 from the remote archive.
>[09:45:28] <mcallan> (c6) If M1 contains a diff URL, then cache it.
>[09:45:31] <mcallan> (c7) If M1 is the latest message (no more to read), then
quit.
3. We don't know when to stop. A 404 can be related to any issue, including a
missing message id, which even happens for metagov pipermail. If we instead
fetch the index of the latest month, we can go backwards until we match our
context or reach the covered HarvestHistory (which makes the jobs stop).
Compared to walking the markers and waiting for a 404, we have no drawbacks:
the overhead is the same, one page fetch to determine either the start point
(month of the current date) or the end point (with a 404) of the job.
We will also very likely be within <10s, because the Kick has just been
received and our burst will very likely hit it first. If the burst goes
backwards and we can match the DiffKick context, we can immediately degrade to
the 1s stepping (schedule a normal UpdateJob or whatever it is), so 100 new
posts are no problem.
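The stop condition of that backward burst could be sketched as follows. The
method and parameter names here are hypothetical; the real harvester would
fetch and parse each message where the sketch only records its id:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch of the backward burst stop condition: walk the latest month's
// listing newest-first and stop as soon as we either match the kick's
// context or reach a message already covered by the harvest history.
public class BurstStop {

    /** Returns the message ids fetched by the burst, newest first. */
    static List<String> burst(List<String> newestFirst, String kickContext,
                              Set<String> history) {
        List<String> fetched = new ArrayList<>();
        for (String id : newestFirst) {
            if (history.contains(id)) break;   // reached already-harvested ground
            fetched.add(id);                   // fetch + parse would happen here
            if (id.equals(kickContext)) break; // kick's message found: end burst,
                                               // degrade to normal 1s stepping
        }
        return fetched;
    }
}
```

Either exit path bounds the burst, so a backlog of new posts never forces an
open-ended forward walk.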
>[09:45:33] <mcallan> (c8) Schedule another update job.
>[09:48:13] <mcallan> conseo: i'll be up in 10 hours or so, and we can discuss
>[10:07:25] <mcallan> this is what i meant by sketching the algorithm of a
single job. note this design does not depend on the structure of the archive,
and includes very few implementation details. the details do not matter a
whole lot because they can always be changed after the fact. the design
cannot be changed so easily once the code is written, so it's crucial to get
it right. not sure this is right, but it's a first stab
See above for the current design rationale, which I have developed through
this prototype and my past experience with pipermail and irssilog. Sorry that
I couldn't do it before, but I wanted to get my hands a bit dirty to better
understand the potential problems of the scheduling (that is what the
prototyping was for).
While I hope I have clarified the design rationale a bit more, what I actually
wanted was feedback on whether the scheduling is done right (independent of
how to run a harvester). The concept is:
1) Extend HarvestJob (I can separate it into an interface if you don't like
inheriting) and set the URL for each job.
2) Implement the run() method to read the InputStream which will be created by
HarvestRunner, and deal with the content of the fetched HTML page.
3) Schedule the job (3). The scheduler asynchronously fetches the job's URL in
the next possible slot for this host and then runs the job inside its thread
pool.
4) Do some checks (internal to the harvester) with HarvestHistory or your own
persistent state tracker to avoid double crawls.
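As a rough illustration of that contract (this is a simplified stand-in, not
the real Votorola API; the synchronous schedule() below fakes the
asynchronous fetch that HarvestRunner performs):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Sketch of the job contract: the runner fetches the job's URL and hands
// the response body to run() as an InputStream inside a worker thread.
public class JobContract {

    abstract static class HarvestJob {
        final String url;
        HarvestJob(String url) { this.url = url; }
        abstract void run(InputStream in);
    }

    static class PageJob extends HarvestJob {
        String body; // parsed content would be cached here
        PageJob(String url) { super(url); }
        void run(InputStream in) {
            try {
                body = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Stand-in for scheduling: "fetch" the page, then run the job on it. */
    static void schedule(HarvestJob job, String fakeResponse) {
        job.run(new ByteArrayInputStream(
                fakeResponse.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The job itself never opens a connection; it only parses the stream it is
given, which is what lets the scheduler throttle per-host fetches centrally.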
conseo
(1) https://hc.apache.org/httpcomponents-core-ga/
(2) http://zelea.com/var/cache/irc/votorola/12-03/22
and http://zelea.com/var/cache/irc/votorola/12-03/23
(3)
http://zelea.com/project/votorola/_/javadoc/votorola/a/diff/harvest/HarvestRunner.html#schedule%28votorola.a.diff.harvest.HarvestJob%29