[MG] Summary parsing (was Re: Visualization problems for crossforum theatre)

Tue Jan 18 09:44:41 EST 2011

We have spoken since, and I think we agreed (just for starters) to go
with a single summary per message, regardless of the number of diff
URLs embedded in it.  The logical composition of a bite is then:

  * bite

      * message

          * url

          * summary
              * text
              * isTruncated

      * diff

          * url
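
To make the nesting concrete, here is a minimal sketch of that
composition as Python dataclasses.  The class and field names simply
mirror the outline above, and the URLs in the usage lines are
placeholders; none of this is taken from the actual feed code.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    text: str
    is_truncated: bool  # "isTruncated" in the outline


@dataclass
class Message:
    url: str
    summary: Summary  # a single summary per message


@dataclass
class Diff:
    url: str


@dataclass
class Bite:
    message: Message
    diff: Diff


# Usage with placeholder URLs (hypothetical, for illustration only).
bite = Bite(
    message=Message(
        url="http://example.org/post.html",
        summary=Summary(text="I made the terminology changes.", is_truncated=False),
    ),
    diff=Diff(url="http://example.org/diff"),
)
```

Two bites that refer to the same message would then carry the same
Summary, which is the consequence of going with one summary per
message.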

We agreed that the summary text might simply be the leading text of
the message, stripped of any quoted material.  A better parsing
algorithm for this (simpler than what we spoke of) might be:

    loop
    {
        read next word of message;

        if( no more words ) end loop;
        if( summary.length + word.length + 1 > 150 ) end loop;
        else summary += word + " ";
    }

    isTruncated = not( summary.lastWord has proper sentence ending );

So forget about trying to parse for whole sentences.  Just properly
set the isTruncated flag - or maybe even scratch that, and leave it
for the client to figure out.  The simpler the better.
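
For what it's worth, the loop above could be sketched in runnable
Python roughly as follows.  Several details are assumptions of mine
rather than anything we agreed on: quoted material is taken to be any
line starting with ">", a "proper sentence ending" is taken to be a
final ".", "!" or "?", and the names strip_quoted and summarize are
made up for the example.

```python
from typing import Tuple

MAX_LENGTH = 150  # the limit used in the loop above
SENTENCE_ENDINGS = (".", "!", "?")  # assumption: what counts as a proper ending


def strip_quoted(message: str) -> str:
    """Drop lines that start with '>' (assumed quoting convention).

    Attribution lines such as "X wrote:" are not detected here.
    """
    kept = [line for line in message.splitlines()
            if not line.lstrip().startswith(">")]
    return "\n".join(kept)


def summarize(message: str, max_length: int = MAX_LENGTH) -> Tuple[str, bool]:
    """Return (summary_text, is_truncated) for a message body."""
    summary = ""
    for word in strip_quoted(message).split():
        candidate = word if not summary else summary + " " + word
        if len(candidate) > max_length:
            break  # the next word would not fit
        summary = candidate
    # Flag the summary as truncated unless it ends like a sentence.
    is_truncated = not summary.endswith(SENTENCE_ENDINGS)
    return summary, is_truncated


text, truncated = summarize("> old text\nHey Thomas, I made the terminology changes.")
# text == "Hey Thomas, I made the terminology changes.", truncated == False
```

If the flag is left for the client to figure out, summarize would
just return the text and the last two lines of the function go away.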

-- 
Michael Allan

Toronto, +1 416-699-9528
http://zelea.com/

conseo wrote:
> > Thomas von der Elbe wrote:
> > > Conseo or Mike, can you please briefly explain what information
> > > exactly the difference feed shows! Something like: user X compared
> > > position a and position b of poll p at time t?
> > 
> > That's pretty much it.  Except it's always tied to a particular
> > discussion post in which a difference is being discussed.  Last Conseo
> > and I spoke, I think we agreed that each bite of the diff feed will
> > contain:
> > 
> >  a) Post URL, e.g.:
> >    
> > http://metagovernment.org/pipermail/start_metagovernment.org/2010-September/003091.html
> > 
> >  b) Short tweet-like summary of the post
> > 
> >     Hey Thomas, (sorry for the delay) | I made the terminology changes
> >     ("namegiver", "donation") that we agreed to previously.
> 
> I'd like some feedback on how this should be done, because it is not
> easy to parse a summary out of an e-mail. At the moment I search for the
> position of each diff URL and grep a text string of ~300 characters
> around it. I approached it that way because relevant information is
> likely to appear near the occurrence of the diff URL. On the other hand,
> it might be very specific text in the middle of an argument, which
> cannot be understood easily. I could also parse, let's say, ~3
> sentences: the one before, the one with the URL, and the one after.
> Using the approach proposed by Michael would mean that we simply grep the
> first unquoted paragraph. It is difficult to tell whether there is relevant
> data in it. If we do language and term checks, we have to do them for every
> language.
> It might also depend on the way the involved parties communicate, e.g. if
> they usually try to advertise themselves and their party first. They might
> even try to get better coverage in Crossforum (which is not such a big
> problem atm., though).
> Another problem is that the first paragraph might be specific to one of the
> diff URLs, while you can and likely will discuss several differences in one
> mail to give a picture of your point of view. If we use some generic summary,
> it will be the same for every diff URL that occurs, even if they cover
> different parts of the argument and the summary only covers the first diff URL.
> 
> We could leave it to the writer to tag it with <summary url="...">Some 
> summary, ....</summary>, but any formatting is likely to be discarded by 
> users.
> 
> What should I do? 
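
For comparison, the window approach conseo describes in the quoted
message (roughly 300 characters of text around each diff URL) might
look something like the sketch below.  This is a guess at the idea,
not conseo's actual code; the function name and the choice to centre
the window on the URL are assumptions.

```python
def window_around(message: str, diff_url: str, width: int = 300) -> str:
    """Return about `width` characters of text centred on diff_url.

    Returns '' if the URL does not occur in the message.
    """
    position = message.find(diff_url)
    if position < 0:
        return ""
    centre = position + len(diff_url) // 2
    start = max(0, centre - width // 2)
    end = min(len(message), start + width)
    # Collapse line breaks and runs of spaces into single spaces.
    return " ".join(message[start:end].split())
```

Unlike the leading-text summary, this yields a different snippet for
each diff URL in a message, at the cost of possibly starting and
ending in mid-sentence.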

Originally posted to the mailing list of the Metagovernment Project:
http://metagovernment.org/mailman/listinfo/start_metagovernment.org