Webscraping
conseo
4consensus at web.de
Fri Aug 5 00:38:46 EDT 2011
Hey guys,
as Mike has pointed out, our current harvesting solution for gathering
communication data from channels like e-mail and IRC has a fundamental
flaw: it cannot scan backwards in time, but only collects messages once you
have joined the channel live. To address this, Mike has proposed using web
scraping by default, since we have to scrape the URL anyway (well, we might
also get it from the channel, but even something as open as Pipermail does
not expose URLs without scraping). The one drawback is the up-to-dateness
of the data, which mostly hurts on IRC and similar media, since crossforum
then gives a delayed real-time view. We might want to add a live bot there,
which would also let us support the discussion with links to the positions of
diff-URLs in the channel, and which could trigger web scans from there. This
issue is still to be resolved.
I have had a look at several solutions. My first hit was a web crawler for
Java (crawler4j), which is nice and would likely ease the task of writing web
scrapers, but it only crawls, and all the parsing has to be done separately.
I also came across http://web-harvest.sourceforge.net/. It is a perfect match
so far. It combines many standard Java pieces, like the Apache libraries,
XPath parsing with Saxon, and embedded scripting in BeanShell (giving scripts
Java-level access to the JVM which is running the code), Groovy and
JavaScript, and it defines a set of processors which significantly reduce the
amount of code needed for scraping. I have attached a first prototype of a
Pipermail scraper for us, and it works great so far, in under 65 lines of
code :-D. Fetch the
http://web-harvest.sourceforge.net/download/webharvest2b1-exe.zip archive
(the other one didn't work for me; I have likely messed up something in my
classpath), extract it and execute the command below (beware: once you run
this command you will access the whole archive in a short period of time, so
be nice and don't stress the server too much):
java -jar webharvest_all_2.jar config=pipermail.xml workdir=/tmp debug=yes \
"#startUrl=http://mail.zelea.com/list/votorola/" \
"#diffUrl=http://u.zelea.com:8080/v/w/D"
My test run took a bit more than 4 minutes for the whole archive back to 2007.
Not bad :-D.
We can also embed it as a library and write our own code around it in Java,
starting with a transparent plugin API. All dependencies are BSD, Apache,
MPL or LGPL licensed. The nice thing is that it is generic, with an IDE, docs
and well-known standard tools like XPath, which makes it really easy for
admins to adjust scrapers or write their own. Besides HTTP connections, it
also gives access to JSON, XML and JDBC sources as well as files.
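Embedding could then mean driving the same config file from our own Java
code. A rough sketch, based on my reading of Web-Harvest's documented Java
usage; the class and method names here are assumptions from its docs, so
check them against the actual jar before relying on this:

```java
import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;

public class PipermailRunner {
    public static void main(String[] args) throws Exception {
        // Load the same XML config that the command-line run used
        ScraperConfiguration config = new ScraperConfiguration("pipermail.xml");
        Scraper scraper = new Scraper(config, "/tmp"); // working directory
        // Pass in the parameters we gave on the command line as #startUrl=...
        scraper.addVariableToContext("startUrl",
                "http://mail.zelea.com/list/votorola/");
        scraper.execute();
    }
}
```

That would let a plugin hand results straight to our DB layer instead of
going through files in the working directory.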
Even if we don't use it in the end for some reason (I couldn't find one yet,
but you never know), it gives me a good idea of how to use XML tools and how
to approach scraping in general. XPath and XQuery are new to me and very
interesting. So far I like them much better than our older Java code, for
obvious reasons (code size and robustness). So I will focus on getting a
first Pipermail scraper running against the DB by Sunday, and then we will
see further.
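For anyone who hasn't played with XPath from Java yet: the JDK already ships
everything needed to try it, no Web-Harvest required. A small self-contained
sketch (the HTML snippet and link names are made up, just a stand-in for one
month's Pipermail index page after html-to-xml cleaning):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class PipermailXPathSketch {
    // Return the href of the first link in a snippet of archive markup
    public static String firstLink(String html) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes("UTF-8")));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("//a[1]/@href", doc);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a cleaned Pipermail index page (well-formed XML)
        String html = "<html><body>"
                + "<a href='000123.html'>[MG] Metaquestion</a>"
                + "<a href='000124.html'>Webscraping</a>"
                + "</body></html>";
        System.out.println(firstLink(html)); // prints 000123.html
    }
}
```

One XPath expression replaces what would be a page of hand-written string
matching in our older Java code, which is exactly the robustness win I mean.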
What do you think? Are there any other considerations I have overlooked?
c
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pipermail.xml
Type: application/xml
Size: 1801 bytes
Desc: not available
URL: <http://mail.zelea.com/list/votorola/attachments/20110805/c805c66a/attachment.xml>
-------------- next part --------------
Author: ThomasvonderElbe at gmx.de
-----------------------------------------------------
Title: [MG] Metaquestion for Metagovernment
-----------------------------------------------------
Cool! Thank you Paul and Matteo! Since yesterday we have 6 participating
people and all votes on a consensus draft. So if this holds for 3 weeks,
it will come into effect.
In the meantime, you can still work on improving the solution in your
own drafts and with each other, as Mike wants to do together with Ed.
... just don't pull back your vote from the current consensus until we
have a new one; otherwise the waiting period would have to start anew.
Thomas
On Sun, 01 May 2011 0:16, Michael Allan wrote: