
Sukhotin's vowel algorithm for the billions



... retrieved from a seldom visited corner of my hard
disks. I am sure that it is in the Voynich archives somewhere,
but where? This is not a straight translation. I have
found my original translation, but Sukhotin's style is so
convoluted and verbose that you need a supply of aspirin
to cope. So this is the ...  Reader's Digest (tm) version.

----------------------------------------------------------
   CLASSIFYING LETTERS INTO VOWELS AND CONSONANTS

PRELIMINARY ASSUMPTIONS

1) The set of written symbols of the text is known:
   that is, the number and the distinctive features of the
   symbols are known.
2) The symbols are alphabetic, that is, they represent
   phonemes, not syllables or words.


SET OF ACCEPTABLE SOLUTIONS

An acceptable solution is a partition of the set of the
symbols of the text S into two disjoint subsets V and C
(vowels and consonants) the union of which is S.

If the alphabet of the language consists of n letters,
there are 2**n acceptable solutions.


OBJECTIVE FUNCTION

Texts written in an alphabetical system have the following
properties:

1) vowels tend to appear next to consonants rather than next
   to vowels.

2) consonants tend to appear next to vowels rather than next
   to consonants.

3) the most frequent phoneme in all languages known to date
   is a vowel (perhaps a consequence of the fact that
   all known languages have fewer vowels than consonants).

If languages did not have properties (1) and (2), the text of
property (2) would read: "cnsnntsooa tnde to ppraea nxte
to vwlsoe rhtrae thna nxte to cnsnntsooa" or "cnsnnts tnd
t ppr nxt t vwls rthr thn nxt t cnsnnts ooa e o aea e o oe ae
a e o ooa".

The objective function must then express either the frequency with
which members of set V occur next to members of set C, or the
frequency with which members of the same set occur contiguously.
In the former case the true solution is that for which the
function reaches a maximum, in the latter case that for which it
reaches a minimum.


DECIPHERMENT ALGORITHM

Consider an arbitrary solution from the set of acceptable solutions.
It is a partition of an alphabet into two subsets, vowels and
consonants.

Record in a 2x2 matrix the number of times each vowel of the
text occurs next to a vowel, and next to a consonant, and the
number of times each consonant occurs next to a consonant,
and next to a vowel:

.                Vowels     Consonants
.             .-----------------------.
. Vowels      |   f(V,V)  |  f(V,C)   |
.             |-----------+-----------|
. Consonants  |   f(C,V)  |  f(C,C)   |
.             '-----------------------'


The sum of the entries being constant, f(V,V)+f(C,C) is minimum
when f(C,V)+f(V,C) is maximum.
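
As a rough illustration of the 2x2 matrix, here is a small Python sketch.
The function name class_matrix and the sample sentence are made up for
this example, and two choices are assumptions not stated in the text:
adjacency is counted inside words only, and each adjacent pair is counted
once in each direction, so that f(V,C) = f(C,V).

```python
def class_matrix(text, vowels):
    """Return the four counts f(V,V), f(V,C), f(C,V), f(C,C) for one partition."""
    f = {("V", "V"): 0, ("V", "C"): 0, ("C", "V"): 0, ("C", "C"): 0}
    for word in text.lower().split():
        letters = [ch for ch in word if ch.isalpha()]
        for a, b in zip(letters, letters[1:]):
            ca = "V" if a in vowels else "C"
            cb = "V" if b in vowels else "C"
            f[(ca, cb)] += 1   # a seen next to b
            f[(cb, ca)] += 1   # b seen next to a (assumed symmetric counting)
    return f

sample = "consonants tend to appear next to vowels"
good = class_matrix(sample, set("aeiou"))
bad = class_matrix(sample, set("aetns"))   # a deliberately wrong partition

print(good)
print(bad)
```

Both matrices add up to the same total, since the total depends only on
the text, but the partition with the real vowels puts more of it into the
mixed cells f(V,C) and f(C,V).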

Computing all 2x2 matrices to choose the one for which the
objective function f(C,C)+f(V,V) is minimum is too expensive
computationally, since for a language with an alphabet of n
letters there are 2**n such matrices.
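
To make the cost concrete, the exhaustive search can be sketched in Python
for a toy text. The names same_class_pairs and brute_force are invented
for this sketch, adjacency is assumed to be counted inside words only, and
the input is assumed to contain just letters and spaces.

```python
from itertools import combinations

def same_class_pairs(pairs, vowels):
    """Objective f(V,V)+f(C,C): adjacent pairs whose two letters share a class."""
    return sum((a in vowels) == (b in vowels) for a, b in pairs)

def brute_force(text):
    """Try every subset of the alphabet as V and keep the first minimum found."""
    words = text.lower().split()
    pairs = [(a, b) for w in words for a, b in zip(w, w[1:])]
    alphabet = sorted(set("".join(words)))
    best, tried = None, 0
    for k in range(len(alphabet) + 1):
        for subset in combinations(alphabet, k):
            tried += 1
            score = same_class_pairs(pairs, set(subset))
            if best is None or score < best[0]:
                best = (score, set(subset))
    return best, tried, len(alphabet)

(best_score, best_vowels), tried, n = brute_force("banana bandana")
print(n, tried, best_score, best_vowels)
```

With four distinct letters this is only 16 candidates, but a 26-letter
alphabet already gives 2**26 = 67,108,864. Note also that swapping V and C
leaves the objective unchanged, so the objective alone cannot say which of
the two subsets holds the vowels; property (3) is what breaks the tie.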

Instead, record in the entries of an nxn matrix (n being the number
of letters in the alphabet of the text) the number of times,
f(i,j), that letter i occurs next to letter j.

Fill its main diagonal with zeroes.

Let Sum(i) be the sum of the entries of the ith row.
Calculate Sum(i) for each row.
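
The matrix and the row sums might be built along these lines in Python.
This is only a sketch: the names adjacency_matrix and row_sums are
hypothetical, adjacency is assumed to be counted inside words only, and
the matrix is assumed to be symmetric (each pair is entered both ways).

```python
def adjacency_matrix(text):
    """Return f as a dict of dicts: f[i][j] = times letter i is next to letter j."""
    words = ["".join(ch for ch in w if ch.isalpha()) for w in text.lower().split()]
    alphabet = sorted(set("".join(words)))
    f = {i: {j: 0 for j in alphabet} for i in alphabet}
    for w in words:
        for a, b in zip(w, w[1:]):
            if a != b:          # doubled letters are skipped: the diagonal stays zero
                f[a][b] += 1
                f[b][a] += 1
    return f

def row_sums(f):
    """Sum(i) for every letter i."""
    return {i: sum(row.values()) for i, row in f.items()}

f = adjacency_matrix("banana bandana")
sums = row_sums(f)
print(sums)
```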

Let Cat(i) be the category (vowel or consonant) of letter i.
Set all letters to consonant, (i.e. for i:=1 to n do Cat(i):=consonant).

Repeat
    Select the letter m for which Cat(m) is consonant and Sum(m)
    is maximum.
    If Sum(m) > 0 then
       Set Cat(m) to vowel.
       Let Sum(i) = Sum(i) - f(i,m)*2 for all i's for which
       Cat(i) is consonant.
Until Sum(m) <= 0 or no consonants remain.
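
For readers who prefer something they can run, here is one possible
Python rendering of the loop above. It is a sketch, not the original
BESM-2 program: the function name sukhotin is my own, adjacency is
assumed to be counted inside words only, ties are broken alphabetically,
and the loop stops as soon as the best remaining sum is zero or negative.

```python
def sukhotin(text):
    """Return the letters classified as vowels, in the order they were picked."""
    words = ["".join(ch for ch in w if ch.isalpha()) for w in text.lower().split()]
    alphabet = sorted(set("".join(words)))

    # Symmetric adjacency counts with a zero main diagonal.
    f = {i: {j: 0 for j in alphabet} for i in alphabet}
    for w in words:
        for a, b in zip(w, w[1:]):
            if a != b:
                f[a][b] += 1
                f[b][a] += 1

    sums = {i: sum(f[i].values()) for i in alphabet}
    consonants = set(alphabet)      # every letter starts as a consonant
    vowels = []
    while consonants:
        # Consonant with the largest row sum (alphabetical tie-break, an assumption).
        m = max(sorted(consonants), key=lambda i: sums[i])
        if sums[m] <= 0:
            break
        consonants.remove(m)
        vowels.append(m)
        for i in consonants:        # penalise letters that sit next to the new vowel
            sums[i] -= 2 * f[i][m]
    return vowels

print(sukhotin("banana bandana"))
```

On this toy input the row sums start at a=10, n=8, b=2, d=2. Picking a
drops them to n=-6, b=-2, d=0, so the loop stops with a as the only vowel.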

This algorithm was programmed for a BESM-2. The texts chosen for
the experiment contained 10,000 elements each. The results are
perfect for the Russian and Spanish texts.

The letters e, a, o, i, u, y, and k were classified as vowels in the French text
and all the other letters as consonants. The letter k occurred
only six times, in abbreviations of foreign origin.

In the English text, the letters e, a, o, i, t, u, y were identified as
vowels, and the other letters as consonants. It is interesting that
t was incorrectly classified, probably because the combination th
following a consonant is extremely frequent, e.g. of the ...
As such errors occur regularly, it is desirable to build algorithms
that correct the results of the first one: they should leave its
results unchanged when they are satisfactory and improve them
when they are not.

An improved algorithm is based on the following idea: if we
have a string of five letters: x1, x2, x3, x4, x5 and if the
middle letter, x3, belongs to the vowel class, the majority
of the remaining letters x1, x2, x4, and x5 are likely to
belong to the other class (consonants) rather than to the same
(vowels). This improved algorithm, also programmed for computer,
gave satisfactory results.
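
The text gives only the idea behind the improved algorithm, not its
steps, so the following Python sketch is NOT that algorithm. It is merely
a diagnostic built on the same five-letter idea: for a given
classification it reports, per letter, how often more than two of the
four neighbours x1, x2, x4, x5 fall in the other class. The function name
opposite_majority_rate, the restriction to windows inside one word, and
the treatment of 2-2 splits as "no majority" are all assumptions of mine.

```python
def opposite_majority_rate(text, vowels):
    """Per letter: share of five-letter windows (letter in the middle) in which
    more than two of the four neighbours belong to the other class."""
    hits, totals = {}, {}
    for word in text.lower().split():
        letters = [ch for ch in word if ch.isalpha()]
        for k in range(2, len(letters) - 2):
            x3 = letters[k]
            neighbours = letters[k - 2:k] + letters[k + 1:k + 3]
            opposite = sum((n in vowels) != (x3 in vowels) for n in neighbours)
            totals[x3] = totals.get(x3, 0) + 1
            hits[x3] = hits.get(x3, 0) + (opposite > 2)
    return {ch: hits[ch] / totals[ch] for ch in totals}

as_vowel = opposite_majority_rate("brand", {"a"})
as_consonant = opposite_majority_rate("brand", set())
print(as_vowel, as_consonant)
```

A letter whose rate is low under the current classification would be a
candidate for re-examination, which is one way a longer segment could
feed back into the objective function.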

As a general rule, it seems that the improved algorithm must
be designed so as to make the increase or decrease of the
objective function depend on a segment longer than is used
in the basic algorithm.

Let us mention in conclusion that, given a text coded into its
constituent morphemes, we can expect the consonant/vowel algorithm
to analyze its components into notional morphemes (roots)
and auxiliary morphemes (such as, for instance, endings, articles,
prepositions, and conjunctions).