Webscraping
conseo
4consensus at web.de
Fri Aug 5 00:38:46 EDT 2011
Hey guys,
as Mike has pointed out, our current harvesting solution for gathering
communication data from channels like e-mail and IRC has a fundamental
flaw: it cannot scan backwards in time, but only collects messages once you
have joined the channel live. To address this, Mike has proposed using web
scraping by default, since we have to scrape the URL anyway (well, we might
also get it from the channel, but even something as open as Pipermail does
not expose URLs without scraping). The one drawback is the up-to-dateness
of the data, which mostly hurts on IRC and similar media, since crossforum
then gives a delayed real-time view. We might want to add a live bot there,
which would also let us support the discussion with links to the positions of
diff-URLs in the channel, and which could trigger web scans from there. This
issue is still to be resolved.
I have had a look at several solutions. My first hit was a web crawler for
Java (crawler4j), which is nice and would likely ease the task of writing web
scrapers, but it only crawls, and all the parsing has to be done separately.
I also came across http://web-harvest.sourceforge.net/. It is a perfect match
so far. It combines many standard Java pieces, like the Apache libraries,
XPath parsing with Saxon, and embedded scripting in BeanShell (giving scripts
Java-level access to the JVM which is running the code), Groovy and
JavaScript, and it defines a set of processors which significantly reduce the
amount of code needed for scraping. I have attached a first prototype of a
Pipermail scraper for us, and it works great so far, in under 65 lines of
code :-D. Fetch the
http://web-harvest.sourceforge.net/download/webharvest2b1-exe.zip archive
(the other one didn't work for me; I have likely messed up something in my
classpath), extract it and execute the command below (beware: once you run
this command you will access the whole archive in a short period of time, so
be nice and don't stress the server too much):
java -jar webharvest_all_2.jar config=pipermail.xml workdir=/tmp debug=yes \
"#startUrl=http://mail.zelea.com/list/votorola/" \
"#diffUrl=http://u.zelea.com:8080/v/w/D"
My test run took a bit more than 4 minutes for the whole archive back to 2007.
Not bad :-D.
We can also embed it as a library and write our own code around it in Java,
starting with a transparent plugin API. All dependencies are BSD, Apache,
MPL or LGPL licensed. The nice thing is that it is generic, with an IDE, docs
and well-known standard tools like XPath, which makes it really easy for
admins to adjust scrapers or write their own. Besides HTTP connections, it
also gives access to JSON, XML and JDBC sources as well as files.
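Embedding could then mean driving the same config file from our own Java
code. A rough sketch, based on my reading of Web-Harvest's documented Java
usage; the class and method names here are assumptions from its docs, so
check them against the actual jar before relying on this:

```java
import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;

public class PipermailRunner {
    public static void main(String[] args) throws Exception {
        // Load the same XML config that the command-line run used
        ScraperConfiguration config = new ScraperConfiguration("pipermail.xml");
        Scraper scraper = new Scraper(config, "/tmp"); // working directory
        // Pass in the parameters we gave on the command line as #startUrl=...
        scraper.addVariableToContext("startUrl",
                "http://mail.zelea.com/list/votorola/");
        scraper.execute();
    }
}
```

That would let a plugin hand results straight to our DB layer instead of
going through files in the working directory.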
Even if we don't use it in the end for some reason (I couldn't find one yet,
but you never know), it gives me a good idea of how to use XML tools and how
to approach scraping in general. XPath and XQuery are new to me and very
interesting. So far I like them much better than our older Java code, for
obvious reasons (code size and robustness). So I will focus on getting a
first Pipermail scraper running against the DB by Sunday, and then we will
see further.
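For anyone who hasn't played with XPath from Java yet: the JDK already ships
everything needed to try it, no Web-Harvest required. A small self-contained
sketch (the HTML snippet and link names are made up, just a stand-in for one
month's Pipermail index page after html-to-xml cleaning):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class PipermailXPathSketch {
    // Return the href of the first link in a snippet of archive markup
    public static String firstLink(String html) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes("UTF-8")));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("//a[1]/@href", doc);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a cleaned Pipermail index page (well-formed XML)
        String html = "<html><body>"
                + "<a href='000123.html'>[MG] Metaquestion</a>"
                + "<a href='000124.html'>Webscraping</a>"
                + "</body></html>";
        System.out.println(firstLink(html)); // prints 000123.html
    }
}
```

One XPath expression replaces what would be a page of hand-written string
matching in our older Java code, which is exactly the robustness win I mean.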
What do you think? Are there any other considerations I have overlooked?
c
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pipermail.xml
Type: application/xml
Size: 1801 bytes
Desc: not available
URL: <http://mail.zelea.com/list/votorola/attachments/20110805/c805c66a/attachment.xml>
-------------- next part --------------
Author: ThomasvonderElbe at gmx.de
-----------------------------------------------------
Title: [MG] Metaquestion for Metagovernment
-----------------------------------------------------
Cool! Thank you Paul and Matteo! Since yesterday we have 6 participating
people and all votes on a consensus draft. So if this holds for 3 weeks,
it will come into effect.
In the meantime, you can still work on improving the solution in your
own drafts and with each other, as Mike wants to do together with Ed.
... just don't pull back your vote from the current consensus until we
have a new one; otherwise the waiting period would have to start anew.
Thomas
On Sun, 01 May 2011 0:16, Michael Allan wrote: