Webscraping
Michael Allan
mike at zelea.com
Sun Aug 7 18:29:17 EDT 2011
I prefer to roll my own scraping solutions in Java, but I can't fault
the choice of a specialized language instead. I think it comes down
to preference.
It's good to see this effort going into the diff feed, especially
at the back end. It's crucial to be flexible there because we're
supposed to be capable of aggregating thousands of lists and other
forums. Eventually it'll be too much for one developer to add support
for all the various archiving formats. Ideally each developer will be
able to take ownership of his own additions to the code.
My own rule of thumb is to avoid forcing developers (including myself)
into restrictive and complicated frameworks. It's better to give them
a set of useful tools and components from which each can build his own
solution, in his own style. Generally that means an API of small
tools, components and utility methods, most of which are optional and
can be used independently of each other.
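
For example (a rough sketch only, the names are hypothetical), instead of a
mandatory framework class I'd expose free-standing helpers that a developer
can pick up one at a time:

import java.io.*;
import java.net.URL;

/** Small, independent scraping utilities.  Each method stands alone,
  * so a developer can use one without buying into the rest.
  */
public final class ScrapeKit
{
    private ScrapeKit() {}

    /** Reads the page at the given URL into a string.
      */
    public static String fetchPage( URL url ) throws IOException
    {
        BufferedReader in = new BufferedReader(
          new InputStreamReader( url.openStream(), "UTF-8" ));
        try
        {
            StringBuilder b = new StringBuilder();
            String line;
            while( (line = in.readLine()) != null ) b.append( line ).append( '\n' );
            return b.toString();
        }
        finally{ in.close(); }
    }
}

A feed parser or diff writer would be a separate tool in the same vein,
usable with or without fetchPage.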
M
conseo wrote:
> Hey guys,
>
> as Mike has pointed out, our current harvesting solution for gathering
> communicational information from channels like e-mail and IRC has a
> fundamental flaw: it cannot scan backwards, but only captures what arrives
> after you have joined the channel live. To address this, Mike has proposed
> web scraping as the default, since we have to scrape for the URL anyway
> (well, we might also get it from the channel, but even something as open
> as Pipermail does not expose URLs without scraping). The one drawback is
> the up-to-dateness of the data, which mostly hurts on IRC and similar
> mediums, since crossforum gives a delayed realtime view. We might want to
> add a live bot there, which would also support the discussion by posting
> links to the positions of diff URLs in the channel, and could trigger web
> scans from there. This issue is still to be resolved.
>
> I have looked into several solutions. My first hit was a web crawler for
> Java (crawler4j), which was nice and would likely ease the task of writing
> web scrapers, but it is still only a crawler, and all parsing has to be
> done separately.
>
> I also came across http://web-harvest.sourceforge.net/. It is a perfect
> match so far. It combines many standard Java facilities, like the Apache
> libraries, XPath parsing with Saxon, and embedded scripting in BeanShell
> (Java-level access to the JVM that is running the code), Groovy and
> JavaScript, and it defines a set of processors which significantly reduce
> the amount of code needed for scraping. I have attached a first prototype
> of a Pipermail scraper for us and it works great so far, in under 65 lines
> of code :-D. Fetch the
> http://web-harvest.sourceforge.net/download/webharvest2b1-exe.zip archive
> (the other one didn't work for me, I likely messed up something in my
> classpath), extract it and execute the following (beware: this command will
> try to fetch the whole archive in a short period of time, so be nice and
> don't stress the server too much):
>
> java -jar webharvest_all_2.jar config=pipermail.xml workdir=/tmp debug=yes \
> "#startUrl=http://mail.zelea.com/list/votorola/" \
> "#diffUrl=http://u.zelea.com:8080/v/w/D"
>
> My test run took a bit more than 4 minutes for the whole archive back to 2007.
> Not bad :-D.
>
> We can also embed it as a library and write our own code around it in Java,
> starting with a transparent plugin API. All dependencies are BSD, Apache,
> MPL or LGPL licensed. The nice thing is that it is generic, with an IDE,
> docs, and known standard tools like XPath, which makes it really easy for
> admins to adjust scrapers or write their own. Besides HTTP connections, it
> allows access to JSON, XML and JDBC sources, as well as files.
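>
> Embedding looks straightforward, by the way; something like this should do
> it (an untested sketch, based on the examples in their docs):
>
> import org.webharvest.definition.ScraperConfiguration;
> import org.webharvest.runtime.Scraper;
>
> public class PipermailHarvest
> {
>     public static void main( String[] args ) throws Exception
>     {
>         // Load the same configuration file we run on the command line:
>         ScraperConfiguration config = new ScraperConfiguration( "pipermail.xml" );
>         Scraper scraper = new Scraper( config, "/tmp" ); // working directory
>         scraper.setDebug( true );
>         scraper.execute(); // runs every processor defined in pipermail.xml
>         // (Passing #startUrl and #diffUrl would go through the scraper's
>         // context; I still have to check the exact call for that.)
>     }
> }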
>
> Even if we don't use it in the end for some reason (I couldn't find one
> yet, but you never know), it gives me a good idea of how to use XML tools
> and, in general, how to approach scraping. XPath and XQuery are new and
> very interesting to me. So far I like them much better than our older Java
> code, for obvious reasons (code size and robustness). So I will focus on
> getting a first Pipermail scraper running with the DB by Sunday, and then
> we will see further.
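>
> Just to show what I mean about code size: pulling the monthly thread
> indexes out of a Pipermail archive page reduces to a single XPath
> expression, where our older Java code needed a hand-written parser.
> Roughly like this in plain JAXP (it assumes the page was first tidied
> into well-formed XML, as Web-Harvest's html-to-xml processor does):
>
> import java.io.File;
> import javax.xml.parsers.DocumentBuilderFactory;
> import javax.xml.xpath.*;
> import org.w3c.dom.Document;
> import org.w3c.dom.NodeList;
>
> public class ArchiveLinks
> {
>     public static void main( String[] args ) throws Exception
>     {
>         Document doc = DocumentBuilderFactory.newInstance()
>           .newDocumentBuilder().parse( new File( "archive-index.xml" ));
>         XPath xpath = XPathFactory.newInstance().newXPath();
>         NodeList links = (NodeList)xpath.evaluate( // one link per month
>           "//a[contains(@href,'thread.html')]/@href", doc,
>           XPathConstants.NODESET );
>         for( int i = 0; i < links.getLength(); ++i )
>             System.out.println( links.item( i ).getNodeValue() );
>     }
> }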
>
> What do you think? Are there any considerations I have overlooked?
>
> c
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: pipermail.xml
> Type: application/xml
> Size: 1801 bytes
> Desc: not available
> URL: <http://mail.zelea.com/list/votorola/attachments/20110805/c805c66a/attachment.xml>