[MG] Summary parsing (was Re: Visualization problems for crossforum theatre)
Michael Allan
mike at zelea.com
Tue Jan 18 09:44:41 EST 2011
We spoke since, and I think we agreed (just for starters) to go with a
single summary per message regardless of the number of diff URLs
embedded in it. The logical composition of a bite is then:
  * bite
      * message
          * url
          * summary
              * text
              * isTruncated
      * diff
          * url
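
As a rough illustration only, the composition above could be modelled as nested records. This is a sketch in Python, not the actual Votorola code: the class names and field names are my own assumptions taken from the list, and I am assuming that text and isTruncated belong to the summary and that each bite carries exactly one diff.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    text: str            # leading, unquoted text of the message
    is_truncated: bool   # True if the text was cut off mid-sentence


@dataclass
class Message:
    url: str             # archive URL of the discussion post
    summary: Summary     # one summary per message


@dataclass
class Diff:
    url: str             # URL of the difference being discussed


@dataclass
class Bite:
    message: Message
    diff: Diff


# Hypothetical usage: two diff URLs in one message yield two bites
# that share the same message and therefore the same summary.
post = Message("http://example.org/post.html", Summary("Hey Thomas,", True))
bites = [Bite(post, Diff(u)) for u in ("http://example.org/d/1",
                                       "http://example.org/d/2")]
```

The point of sharing the Message object is that the summary is computed once per post, however many diff URLs it contains.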
The summary text we agreed might simply be the leading text of the
message stripped of any quoted material. A better parsing algorithm
for this (simpler than what we spoke of) might be:
    loop
    {
        read next word of message;
        if( word.length + summary.length > 150 ) end loop;
        else summary += word;
    }
    isTruncated = not( summary.lastWord has proper sentence ending );
So forget about trying to parse for whole sentences. Just properly
set the isTruncated flag - or maybe even scratch that, and leave it
for the client to figure out. The simpler the better.
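
For what it's worth, the loop above might look like this in Python. It is only a sketch under a few assumptions of mine: quoted material is any line starting with ">", words are joined by single spaces, a "proper sentence ending" means ".", "!" or "?", and the names summarize, unquoted_words and SENTENCE_ENDINGS are made up for the example.

```python
from typing import Iterator, Tuple

SENTENCE_ENDINGS = (".", "!", "?")  # assumed meaning of "proper sentence ending"


def unquoted_words(body: str) -> Iterator[str]:
    """Yield the words of the message, skipping lines quoted with '>'."""
    for line in body.splitlines():
        if line.lstrip().startswith(">"):
            continue
        yield from line.split()


def summarize(body: str, limit: int = 150) -> Tuple[str, bool]:
    """Return (summary text, isTruncated) for a message body."""
    summary = ""
    for word in unquoted_words(body):
        candidate = word if not summary else summary + " " + word
        if len(candidate) > limit:
            break  # the next word would exceed the limit
        summary = candidate
    # Flag the summary unless its last word ends a sentence.
    is_truncated = not summary.endswith(SENTENCE_ENDINGS)
    return summary, is_truncated
```

Note that an empty summary (a message with nothing but quoted text) comes out flagged as truncated, which the client would have to handle if we keep the flag at all.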
--
Michael Allan
Toronto, +1 416-699-9528
http://zelea.com/
conseo wrote:
> > Thomas von der Elbe wrote:
> > > Conseo or Mike, can you please briefly explain what information
> > > exactly the difference feed shows! Something like: user X compared
> > > position a and position b of poll p at time t?
> >
> > That's pretty much it. Except it's always tied to a particular
> > discussion post in which a difference is being discussed. Last Conseo
> > and I spoke, I think we agreed that each bite of the diff feed will
> > contain:
> >
> > a) Post URL, e.g.:
> >
> > http://metagovernment.org/pipermail/start_metagovernment.org/2010-September/003091.html
> >
> > b) Short tweet-like summary of the post
> >
> > Hey Thomas, (sorry for the delay) | I made the terminology changes
> > ("namegiver", "donation") that we agreed to previously.
>
> I'd like to have some feedback about how this should be done, because it is
> not easy to parse a summary out of an e-mail. At the moment I search for the
> position of each diff URL and grep a text string of ~300 characters around it.
> I have approached it that way because relevant information is likely to be
> found around the occurrence of the diff URL. On the other hand, it might be
> very specific text in the middle of an argument, which cannot be understood
> easily. I could also parse, let's say, ~3 sentences: the one before, the one
> with the URL and the one after.
> Using the approach proposed by Michael would mean that we simply grep the
> first non-quoted paragraph. It is difficult to see if there is relevant data
> in it. If we do language and term checks, we have to do them for every
> language. It might also be specific to the way the involved parties
> communicate, e.g. if they usually try to advertise themselves and their party
> first. They might even try to get better coverage in Crossforum (which is not
> such a big problem at the moment, though).
> Another problem is that the first paragraph might be specific to one of the
> diff URLs, while you can and likely will discuss several differences in one
> mail to give a picture of your point of view. If we use some generic summary
> it will be the same for every diff URL occurring, even if they cover different
> parts of the argument and the summary only covers the first diff URL.
>
> We could leave it to the writer to tag it with <summary url="...">Some
> summary, ....</summary>, but any formatting is likely to be discarded by
> users.
>
> What should I do?
Originally posted to the mailing list of the Metagovernment Project:
http://metagovernment.org/mailman/listinfo/start_metagovernment.org