
Sukhotin's algorithm etc



Many thanks to everybody who kindly replied to my question.  As some of
you may remember (or not), I made an attempt to classify the
symbols in VMS as vowels and consonants using the LSC data for VMS-A and
VMS-B (whoever is interested can see it at www.bigfoot.com/~perakh/Texts/,
paper number 8).  Of the 37 distinct symbols I counted in Currier's
rendition, I tentatively identified 7 vowels and 14 consonants, the rest
being ambiguous. The ratio of the frequency of the supposed vowels in the
VMS text to that in the VMS alphabet turned out to be within the range
found for 12 natural languages. Of course, without a verification by some
other method, those results remain quite hypothetical. I thought that
maybe Sukhotin's method would provide corroboration (or refutation) of my
attempt, but now I see you do not believe Sukhotin's algorithm is reliable.
Pity. Jorge, I am looking forward to your further messages. Best to all.
Mark

Jorge Stolfi wrote:

>     > Can anybody explain how Sukhotin does the job of sorting out
>     > vowels and consonants?
>
> I recall that Jacques posted an explanation to the mailing list,
> a couple of years ago.
>
> The basis of the algorithm is the observation that in most languages
> Vs and Cs have a tendency to alternate: Vs are mostly surrounded by
> Cs and vice versa. So the algorithm tries to find a partition of the
> alphabet in two classes X and Y that maximizes the number of XY and YX
> pairs, and minimizes the number of XX and YY pairs.
>
>     > Did anybody apply that method to VMS, and if yes, were the
>     > symbols in VMS reliably shown to be either vowels or consonants?
>
> Jacques did, and I gather that the results were inconclusive.
>
> I have tried to do roughly the same thing, by hand and by ad-hoc
> algorithms. I did find some structure in the alphabet, which is now
> part of the crust-mantle-core paradigm.
>
> Basically, one can distinguish several classes of letters (gallows,
> benches, dealers, etc.) which have similar digraph statistics; but
> there doesn't seem to be any simple mapping of those classes to a
> plausible `vowels and consonants' bipartition. Moreover, although
> those statistical classes seem clear-cut, and are fairly compatible
> with the morphological classification of the symbols, if I slightly
> change the similarity measure, I get a very different set of classes
> --- also clear-cut and compatible.
>
> But two days ago I noticed another weird thing about the digraph
> frequencies, which sort of explains that ambiguity (and why Sukhotin's
> algorithm couldn't possibly work). Stay tuned...
>
> All the best,
>
> --stolfi
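
For readers who want to experiment, the alternation idea described in the
quoted message can be sketched roughly as follows. This is the greedy
procedure as it is commonly described under Sukhotin's name, not necessarily
the exact variant that Jacques ran. The function names, the choice to count
only pairs inside a word, and the toy input are assumptions made for
illustration; nothing here is tuned to any VMS transcription.

```python
from collections import defaultdict


def sukhotin_vowels(words):
    """Greedy vowel/consonant split in the style commonly attributed to Sukhotin.

    Assumption: `words` is an iterable of strings, one character per symbol.
    """
    counts = defaultdict(lambda: defaultdict(int))
    symbols = set()
    for word in words:
        symbols.update(word)
        for a, b in zip(word, word[1:]):
            if a != b:  # doubled symbols are ignored (zero diagonal)
                counts[a][b] += 1
                counts[b][a] += 1

    # Every symbol starts as a consonant; its score is its adjacency row sum.
    sums = {s: sum(counts[s].values()) for s in symbols}
    vowels = set()
    while True:
        remaining = {s: v for s, v in sums.items() if s not in vowels}
        if not remaining:
            break
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda s: remaining[s])
        if remaining[best] <= 0:
            break  # no remaining symbol looks vowel-like
        vowels.add(best)
        # Being next to a known vowel now counts against being a vowel.
        for s in symbols - vowels:
            sums[s] -= 2 * counts[s][best]
    return vowels


def alternation_score(words, vowels):
    """Count (mixed, same) pairs: XY/YX pairs versus XX/YY pairs."""
    mixed = same = 0
    for word in words:
        for a, b in zip(word, word[1:]):
            if (a in vowels) != (b in vowels):
                mixed += 1
            else:
                same += 1
    return mixed, same


if __name__ == "__main__":
    toy = ["banana", "tomato"]  # toy English input, not VMS data
    found = sukhotin_vowels(toy)
    print(sorted(found), alternation_score(toy, found))
```

On the toy input the procedure picks "a" and "o", and every adjacent pair
alternates. Note that the procedure always returns some partition, whether or
not the text really has a vowel/consonant structure, so comparing the
mixed/same counts of rival partitions is one way to see how clear-cut a
result is.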