Adding jar dependencies to Votorola
Michael Allan
mike at zelea.com
Tue Aug 30 04:11:00 EDT 2011
Hey C, following up on our IRC chat about this,
I found a web-harvest package with separate jars:
http://zelea.com/var/tmp-public/
Here's how I would normally go about adding the minimum set of
necessary jars to Votorola. Maybe this is the best approach for
webharvest too. (In any case, it's worth posting for future
reference.)
1. Begin with the main jar, webharvest_2.jar.
2. Add that jar to the proper directory under votorola/g.
For webharvest.jar, I guess it's votorola/g/web.
3. Remove the version number from the jar's file name.
4. Add a .txt file with the same name (an example file is sketched
below). This file should contain:
* URL of the project's home page.
* Version number of the jar.
* Formal pointer to the location of the jar's licence file in the
votorola codebase, e.g.: votorola/_/licence/Apache-2.0.txt
* List of dependants of this jar, i.e. code or other jars that
require it.
5. Do a clean build (if needed) and run it. If you get no errors,
then you're done.
Otherwise you probably got a NoClassDefFoundError, so:
6. Locate the jar that contains the missing class (a quick way to
search is sketched just below) and repeat steps 2 to 5.
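For step 6, the missing class comes straight out of the error
message; here is roughly how I would search the unpacked web-harvest
jars for it (a sketch; the class name below is just an illustration):

    # NoClassDefFoundError reports e.g. org.webharvest.runtime.Scraper;
    # convert the dots to slashes and look for the .class entry.
    for j in *.jar; do
        jar tf "$j" | grep -q 'org/webharvest/runtime/Scraper.class' \
            && echo "$j"
    done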
Of course, don't duplicate any of the jars we already have; instead just:
* List votorola/g/web/webharvest.jar as one of its dependants
* Update the version, if necessary
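For example, the metadata file votorola/g/web/webharvest.txt might
read something like this (the version, licence pointer and dependants
are placeholders, not checked):

    Home page: http://web-harvest.sourceforge.net/
    Version: 2.0b1
    Licence: votorola/_/licence/BSD.txt
    Dependants: (list the code or other jars that require it)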
Currently we have:
arq.jar
commons-codec.jar
commons-httpclient.jar
commons-logging.jar
concurrent.jar
getopt.jar
google-gson-stream.jar
gwt-openlayers-client.jar
icu4j.jar
irclib.jar
iri.jar
javamaildir.jar
jena.jar
jersey-client.jar
jersey-core.jar
lib-gwt-svg.jar
mail.jar
nekohtml.jar
openid4java.jar
postgresql-jdbc.jar
servlet-api.jar
slf4j-api.jar
slf4j-jdk14.jar
wicket.jar
xercesImpl.jar
This is a lot of work (I know), but I think it's better than the
alternatives. If we went outside the repo and asked the admin to
install the component separately, then we'd have to document it. It
would complicate things for developers, for our release builds and for
admins.
What do you think?
--
Michael Allan
Toronto, +1 416-699-9528
http://zelea.com/
conseo wrote:
> Hey guys,
>
> as Mike has pointed out, our current harvesting solution for
> gathering communication from different channels like e-mail and IRC
> has a fundamental flaw: it cannot scan backwards, but only picks up
> messages once you have joined the channel live. To address this,
> Mike has proposed using web scraping by default, since we have to
> scrape the URL anyway (well, we might also get it from the channel,
> but even something as open as Pipermail does not expose URLs without
> scraping). The one drawback is the up-to-dateness of the data, which
> mostly hurts on IRC and similar mediums, since crossforum gives a
> delayed realtime view. We might want to add a live bot there, which
> would also let us support the discussion with links to the positions
> of diff-urls in channel, and would be able to trigger web-scans from
> there. This issue is still to be resolved.
>
> I have had a look into several solutions. My first hit was a web
> crawler for Java (crawler4j), which was nice and would likely ease
> the task of writing web scrapers, but it is still only a crawler;
> all the parsing has to be done alongside it.
>
> I also came across http://web-harvest.sourceforge.net/. It is a
> perfect match so far. It combines many standard Java facilities,
> like the Apache libraries, XPath parsing with Saxon, and embedded
> scripting in BeanShell (Java-level access to the JVM that is running
> the code), Groovy and JavaScript, and it defines a set of processors
> that significantly reduce the amount of code needed to do scraping.
> I have attached a first prototype of a Pipermail scraper for us and
> it works great so far, in under 65 lines of code :-D. Fetch the
> http://web-harvest.sourceforge.net/download/webharvest2b1-exe.zip
> archive (the other one didn't work for me; I likely messed something
> up in my classpath), extract it and execute the following (beware:
> this command will try to access the whole archive in a short period
> of time, so be nice and don't stress the server too much):
>
> java -jar webharvest_all_2.jar config=pipermail.xml workdir=/tmp debug=yes \
> "#startUrl=http://mail.zelea.com/list/votorola/" \
> "#diffUrl=http://u.zelea.com:8080/v/w/D"
>
> My test run took a bit more than 4 minutes for the whole archive back to 2007.
> Not bad :-D.
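>
> To give a flavour of the config format, here is a stripped-down
> sketch of the kind of thing pipermail.xml does (not the actual
> attachment; the XPath expression is illustrative):
>
> <config charset="UTF-8">
>     <!-- Fetch the archive index and normalize the HTML to XML. -->
>     <var-def name="index">
>         <html-to-xml>
>             <http url="${startUrl}"/>
>         </html-to-xml>
>     </var-def>
>     <!-- Pull the per-month archive links out of the index. -->
>     <var-def name="months">
>         <xpath expression="//a[contains(@href,'date.html')]/@href">
>             <var name="index"/>
>         </xpath>
>     </var-def>
> </config>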
>
> We can also embed it as a library and write our own code around it
> in Java, starting with a transparent plugin API (a rough sketch of
> the embedding follows). All dependencies are BSD, Apache licensed,
> MPL or LGPL. The nice thing is that it is generic, with an IDE, docs
> and known standard tools like XPath, which makes it really easy for
> admins to adjust scrapers or write their own. Besides HTTP
> connections, it allows access to JSON, XML and JDBC sources, as well
> as files.
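>
> Embedding looks roughly like this (from memory of the docs, so treat
> the exact class and method names as unverified):
>
> import org.webharvest.definition.ScraperConfiguration;
> import org.webharvest.runtime.Scraper;
>
> // Run pipermail.xml from Java instead of the command line.
> ScraperConfiguration config = new ScraperConfiguration("pipermail.xml");
> Scraper scraper = new Scraper(config, "/tmp"); // working directory
> scraper.addVariableToContext("startUrl",
>         "http://mail.zelea.com/list/votorola/");
> scraper.addVariableToContext("diffUrl", "http://u.zelea.com:8080/v/w/D");
> scraper.setDebug(true);
> scraper.execute();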
>
> Even if we don't use it in the end for some reason (I couldn't find
> one yet, but we never know), it gives me a good idea of how to use
> XML tools and in general how to approach scraping. XPath and XQuery
> are new and very interesting to me. So far I like them much better
> than our older Java code, for obvious reasons (code size and
> robustness). So I will focus on getting a first Pipermail scraper
> running with the DB by Sunday, and then we will see.
>
> What do you think? Any other considerations I have overlooked?
>
> c