Similarity Metrics and Clustering and Merging of Positions

Sat Dec 5 01:52:12 EST 2009

Hi Kai,

Some of my favourite courses at school (way back) had to do with
clustering and other applications of multivariate analysis, mostly in
the context of population biology.  I don't remember much of the math
(not much chance to use it) but I know that some of it applies here.
(Apologies again for holding up your original post.)

> 2009/12/3 Thomas von der Elbe <ThomasvonderElbe at gmx.de>:
> > ... Everybody can have his own position, but because they are
> > structured in trees, the vast amount of them will be arranged
> > according to affinity. This will be done by the users themselves:
> > they look for the best place to vote by going from the trunk of
> > the tree further out to the leaves. This way they would only have
> > to choose a few times (at each bifurcation) between a low number
> > of alternatives...
> 
> This is very similar to what I proposed, but it assumes that a single
> hierarchy can be imposed onto the list of opinions. This is hardly a
> valid assumption.

A hierarchy is imposed but (just to be clear) it's not mainly for
purposes of classification.  It's for communication.  It just happens
that the resulting communicative structure is also a good (though not
always ideal) class structure.  As Thomas says, there'll be an
affinity among the neighbouring texts.  That's because they are in
communication.

> Both techniques rely on user feedback and analysis, so there is no
> difference when it comes to that. Both take their intelligence from
> the users not from some kind of AI.

Except that the "technique" he describes is not classification, but
rather search within a pre-given class structure.  You propose a
novel technique of classification.

> Assume a position is composed of many sub-positions. Let's denote
> these with capital letters from A to Z.

A position is something we formalize as a text document, e.g. a wiki
page.  We don't yet formalize sub-positions.  But I don't think your
method depends on this.

You propose clustering based on a pairwise comparison of whole
positions.  We probably don't need that for positions because, as
Thomas says, the positions within a branch are going to be very, very
similar to each other.  (They have to be, or the resulting "noise"
will disrupt the lines of communication.)  So your method is likely to
end up replicating the existing tree structure.
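
To make "pairwise comparison of whole positions" concrete, here is a
minimal sketch in Python.  It is not your actual method (which isn't
spelled out in this message), only an illustration under assumptions:
each position is modelled as a set of sub-position letters (A to Z),
Jaccard similarity stands in for the similarity metric, and any two
positions at or above a threshold are linked into the same cluster
(single linkage).  The names jaccard, cluster and the sample data are
all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster(positions, threshold):
    """Single-linkage clustering over pairwise similarities.

    positions maps a (hypothetical) position name to its set of
    sub-positions.  Two positions are linked whenever their Jaccard
    similarity is at least `threshold`; clusters are the connected
    groups of linked positions.
    """
    # Union-find: each position starts as its own cluster.
    parent = {name: name for name in positions}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    # Compare every pair once and merge the sufficiently similar ones.
    for p, q in combinations(positions, 2):
        if jaccard(positions[p], positions[q]) >= threshold:
            parent[find(p)] = find(q)

    groups = {}
    for name in positions:
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(g) for g in groups.values())


# Assumed sample data: three near-identical positions and two others.
positions = {
    "p1": set("ABCD"),
    "p2": set("ABCE"),
    "p3": set("ABDE"),
    "p4": set("XYZ"),
    "p5": set("WXYZ"),
}
clusters = cluster(positions, threshold=0.5)
print(clusters)
```

With this sample data the three positions that share most of their
sub-positions fall into one cluster and the remaining two into
another, which is roughly what a branch of the tree would already
express.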

There *are* places where clustering would be useful, though.  It would
be useful for classifying polls, for example.  So the voters could
find their way around in issue space: over here are the polls related
to tax law, over here foreign policy, and so forth.  There was a brief
mention of that here:
http://groups.google.com/group/votorola/t/ccc5f17ec83f2ad8

There are other places too, I'm sure.  A little further in the future,
there'll be a formal substructure to the texts.  We're working with a
population model (recombinant text) and we hope at some point to
formalize not only the recombinant individuals (documents) but also
the sub-units of recombination.  That'll open up a new dimension for
sophisticated pattern matching, classification and search techniques.
(This is actually one of my favourite topics.  But it's still kind of
futuristic.)

Best,
-- 
Michael Allan

Toronto, +1 647-436-4521
http://zelea.com/