Adding jar dependencies to Votorola
Michael Allan
mike at zelea.com
Tue Aug 30 04:11:00 EDT 2011
Hey C, following up on our IRC chat about this,
I found a web-harvest package with separate jars:
http://zelea.com/var/tmp-public/
Here's how I would normally go about adding the minimum set of
necessary jars to Votorola. Maybe this is the best approach for
webharvest too. (In any case, it's worth posting for future
reference.)
1. Begin with the main jar, webharvest_2.jar.
2. Add that jar to the proper directory under votorola/g.
For webharvest.jar, I guess it's votorola/g/web.
3. Remove the version number from the jar's file name.
4. Add a .txt file with the same name (an example file is sketched
below). This file should contain:
* URL of the project's home page.
* Version number of the jar.
* Formal pointer to the location of the jar's licence file in the
votorola codebase, e.g.: votorola/_/licence/Apache-2.0.txt
* List of dependants of this jar, i.e. code or other jars that
require it.
5. Do a clean build (if needed) and run it. If you get no errors,
then you're done.
Otherwise you probably got a NoClassDefFoundError, so:
6. Locate the jar that contains the missing class (a quick way to
search is sketched just below) and repeat steps 2 to 5.
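For step 6, the missing class comes straight out of the error
message; here is roughly how I would search the unpacked web-harvest
jars for it (a sketch; the class name below is just an illustration):

    # NoClassDefFoundError reports e.g. org.webharvest.runtime.Scraper;
    # convert the dots to slashes and look for the .class entry.
    for j in *.jar; do
        jar tf "$j" | grep -q 'org/webharvest/runtime/Scraper.class' \
            && echo "$j"
    done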
Of course, don't duplicate any of the jars we already have; instead just:
* List votorola/g/web/webharvest.jar as one of its dependants
* Update the version, if necessary
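For example, the metadata file votorola/g/web/webharvest.txt might
read something like this (the version, licence pointer and dependants
are placeholders, not checked):

    Home page: http://web-harvest.sourceforge.net/
    Version: 2.0b1
    Licence: votorola/_/licence/BSD.txt
    Dependants: (list the code or other jars that require it)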
Currently we have:
arq.jar
commons-codec.jar
commons-httpclient.jar
commons-logging.jar
concurrent.jar
getopt.jar
google-gson-stream.jar
gwt-openlayers-client.jar
icu4j.jar
irclib.jar
iri.jar
javamaildir.jar
jena.jar
jersey-client.jar
jersey-core.jar
lib-gwt-svg.jar
mail.jar
nekohtml.jar
openid4java.jar
postgresql-jdbc.jar
servlet-api.jar
slf4j-api.jar
slf4j-jdk14.jar
wicket.jar
xercesImpl.jar
This is a lot of work (I know), but I think it's better than the
alternatives. If we went outside the repo and asked the admin to
install the component separately, then we'd have to document it. It
would complicate things for developers, for our release builds and for
admins.
What do you think?
--
Michael Allan
Toronto, +1 416-699-9528
http://zelea.com/
conseo wrote:
> Hey guys,
>
> as Mike has pointed out, our current harvesting solution for
> gathering communication from different channels like e-mail and IRC
> has a fundamental flaw: it cannot scan backwards, but only picks up
> messages once you have joined the channel live. To address this,
> Mike has proposed using web scraping by default, since we have to
> scrape the URL anyway (well, we might also get it from the channel,
> but even something as open as Pipermail does not expose URLs without
> scraping). The one drawback is the up-to-dateness of the data, which
> mostly hurts on IRC and similar mediums, since crossforum gives a
> delayed realtime view. We might want to add a live bot there, which
> would also let us support the discussion with links to the positions
> of diff-urls in channel, and would be able to trigger web-scans from
> there. This issue is still to be resolved.
>
> I have had a look into several solutions. My first hit was a web
> crawler for Java (crawler4j), which was nice and would likely ease
> the task of writing web scrapers, but it is still only a crawler;
> all the parsing has to be done alongside it.
>
> I also came across http://web-harvest.sourceforge.net/. It is a
> perfect match so far. It combines many standard Java facilities,
> like the Apache libraries, XPath parsing with Saxon, and embedded
> scripting in BeanShell (Java-level access to the JVM that is running
> the code), Groovy and JavaScript, and it defines a set of processors
> that significantly reduce the amount of code needed to do scraping.
> I have attached a first prototype of a Pipermail scraper for us and
> it works great so far, in under 65 lines of code :-D. Fetch the
> http://web-harvest.sourceforge.net/download/webharvest2b1-exe.zip
> archive (the other one didn't work for me; I likely messed something
> up in my classpath), extract it and execute the following (beware:
> this command will try to access the whole archive in a short period
> of time, so be nice and don't stress the server too much):
>
> java -jar webharvest_all_2.jar config=pipermail.xml workdir=/tmp debug=yes \
> "#startUrl=http://mail.zelea.com/list/votorola/" \
> "#diffUrl=http://u.zelea.com:8080/v/w/D"
>
> My test run took a bit more than 4 minutes for the whole archive back to 2007.
> Not bad :-D.
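>
> To give a flavour of the config format, here is a stripped-down
> sketch of the kind of thing pipermail.xml does (not the actual
> attachment; the XPath expression is illustrative):
>
> <config charset="UTF-8">
>     <!-- Fetch the archive index and normalize the HTML to XML. -->
>     <var-def name="index">
>         <html-to-xml>
>             <http url="${startUrl}"/>
>         </html-to-xml>
>     </var-def>
>     <!-- Pull the per-month archive links out of the index. -->
>     <var-def name="months">
>         <xpath expression="//a[contains(@href,'date.html')]/@href">
>             <var name="index"/>
>         </xpath>
>     </var-def>
> </config>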
>
> We can also embed it as a library and write our own code around it
> in Java, starting with a transparent plugin API (a rough sketch of
> the embedding follows). All dependencies are BSD, Apache licensed,
> MPL or LGPL. The nice thing is that it is generic, with an IDE, docs
> and known standard tools like XPath, which makes it really easy for
> admins to adjust scrapers or write their own. Besides HTTP
> connections, it allows access to JSON, XML and JDBC sources, as well
> as files.
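>
> Embedding looks roughly like this (from memory of the docs, so treat
> the exact class and method names as unverified):
>
> import org.webharvest.definition.ScraperConfiguration;
> import org.webharvest.runtime.Scraper;
>
> // Run pipermail.xml from Java instead of the command line.
> ScraperConfiguration config = new ScraperConfiguration("pipermail.xml");
> Scraper scraper = new Scraper(config, "/tmp"); // working directory
> scraper.addVariableToContext("startUrl",
>         "http://mail.zelea.com/list/votorola/");
> scraper.addVariableToContext("diffUrl", "http://u.zelea.com:8080/v/w/D");
> scraper.setDebug(true);
> scraper.execute();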
>
> Even if we don't use it in the end for some reason (I couldn't find
> one yet, but we never know), it gives me a good idea of how to use
> XML tools and in general how to approach scraping. XPath and XQuery
> are new and very interesting to me. So far I like them much better
> than our older Java code, for obvious reasons (code size and
> robustness). So I will focus on getting a first Pipermail scraper
> running with the DB by Sunday, and then we will see.
>
> What do you think? Any other considerations I have overlooked?
>
> c