Re: Antoine Casanova's research
> [Adam McLean:] I have just reread Antoine Casanova's posting on
> 6th March 2000, based on his thesis ... I have not seen on the
> Voynich list a critical revue of Antoine's work, and I wonder
> how his thesis is received by the main researchers on this list.
Thanks for reminding me of this posting, and prompting me to
re-read it more carefully.
> ... which reveals a structure within the individual 'tokens' in
> the Voynich language.
I will try to describe my understanding of Antoine's method, and
discuss his results in relation to the crust-mantle-core paradigm.
I have only managed to read a couple of chapters of his thesis; I was
unable to fetch and/or unpack the whole work for some reason ---
probably Windows/Unix incompatibilities. My comments below are based
on his posting only; may Antoine forgive me if I got it all wrong.
Substitution patterns
Antoine's method, if I got it correctly, is to compare all words of
the same length n, separately for each n, looking for pairs of words
that differ in only one letter position.
The results of these comparisons are summarized by a
`substitution pattern' P_n, which is a permutation (p_1, ..., p_n) of the
digits from 1 to n. For example, Antoine found that
P_4 = ( 1 2 4 3 ) and P_7 = ( 1 4 2 3 6 7 5 ).
To determine the pattern P_n, Antoine examines each letter position
i in turn, from 1 to n, and counts the words w of length n for which
there exists another word that differs from w only in that letter.
[I am not entirely sure about this point; he may count instead the
*pairs* of words that differ only at that letter position.] Let's denote
these counts by s_1, ..., s_n.
For instance, from Friedman's transcription, Antoine got the following s_i:
n = 3: 194 120 158
n = 4: 397 308 195 253
n = 5: 459 263 315 208 276
n = 6: 238 143 171 164 121 170
n = 7: 81 40 58 42 38 43 43
n = 8: 8 2 9 5 8 9 5 7
I.e., among all words of length 4, there are 397 that can be turned
into another valid word by replacing the first letter; 308 that can be
so modified in the second letter; and so on.
The permutation digit p_i is then simply the rank of the
corresponding count s_i among the list s_1, ..., s_n. Thus
P_7 = ( 1 4 2 3 6 7 5 ) means that, among all words of length 7, the
first letter is the `easiest' to replace, the second letter is the
4th easiest, and so on. Here are the patterns he got:
n = 3: ( 1 3 2 )
n = 4: ( 1 2 4 3 )
n = 5: ( 1 4 2 5 3 )
n = 6: ( 1 4 2 5 6 3 )
n = 7: ( 1 4 2 3 6 7 5 )
n = 8: ( 6 8 1 2 3 4 7 5 )
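As I understand it, the whole computation fits in a few lines. The
sketch below assumes the `words counted' reading described above (s_i
counts the words that have at least one Hamming-distance-1 neighbour
at position i, not the pairs), and breaks ties in the ranking
arbitrarily:

```python
from collections import defaultdict

def substitution_pattern(words, n):
    # Keep only the distinct words of length n.
    same_len = {w for w in words if len(w) == n}
    # Bucket words by "template": the word with position i wildcarded.
    # Two words share bucket (i, t) iff they differ at most at position i.
    buckets = defaultdict(set)
    for w in same_len:
        for i in range(n):
            buckets[(i, w[:i] + "*" + w[i+1:])].add(w)
    # s[i] = number of words with at least one neighbour differing
    # only at position i (one reading of Antoine's count).
    s = [0] * n
    for (i, _), group in buckets.items():
        if len(group) > 1:
            s[i] += len(group)
    # p[i] = rank of s[i]; rank 1 is the largest count, i.e. the
    # "most substituted" position.
    order = sorted(range(n), key=lambda i: -s[i])
    p = [0] * n
    for rank, i in enumerate(order, start=1):
        p[i] = rank
    return s, p
```

On the toy vocabulary {cat, bat, can, dog} this yields s = (2 0 2)
and the pattern ( 1 3 2 ), the same shape as Antoine's P_3.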
The rules
Antoine then looked for a single formula that would account for all
these patterns. He first noted that, for n between 3 and 7,
Rule 1: p_1 is always 1.
Rule 2: p_{n-1} is always n.
In other words, the first letter is the easiest one to replace (the
"most substituted" as Antoine calls it), and the next-to-last letter
is the hardest one to replace. (The case n = 8 is an exception to
both rules; but there are too few words of 8 letters, so the ranking
is very uncertain anyway.)
Antoine then describes a procedure that will generate P_{n} from
P_{n+1}. I won't copy the procedure here, because it is somewhat
complicated, and, in my opinion, not very significant. (More on that
below.)
Justification
One justification [mine, not his] for this analysis is that it is
expected to reveal the "points of inflection" of inflected languages.
For instance, in a large sample of Italian, one will find many pairs
of words that differ in the last letter only:
rosso/rossa/rosse/rossi
ritorno/ritorni/ritorna/ritornò
etc. Thus the pattern P_n for Italian will probably end in 1, meaning
that the last position is the single one whose letter can be replaced
most easily within the language.
For Spanish and Portuguese, which usually make the plural by adding
"s", the next-to-last position is easily substituted too:
blancas/blancos
rojas/rojos
etc. Additionally, the numerous verb inflections will provide
many pairs differing in the 3rd and 4th letter from the end.
I don't dare to guess the relative ranking of those positions; but I
expect that, in any Romance language, the last three letters will be
the most substituted, i.e. will get the rankings 1..3.
If the same analysis were applied to English, I would still expect
the last few positions to be the most easily substituted, because of
pairs like
painter/painted
married/marries
However, the difference between s_n and the average s_i should be
much less for English than for Romance languages. I won't be
surprised if the positions with ranks 2 and 3 are not at the end but
somewhere else along the word.
Thus, Antoine's analysis should distinguish quite easily the Indo-European
languages from other language families with different places and methods of
inflection. It may also be able to distinguish Romance from Germanic, and
perhaps even Italian from Spanish. And it will certainly distinguish
IE languages from random gibberish (unless the author has been
careful to generate end-inflected gibberish, of course).
My comments
In my view, the most significant feature of Antoine's substitution
patterns is that the first letter of a Voynichese word seems to have
more "inflectional freedom", while the final letters are relatively
invariant. These patterns are precisely opposite to what we would
expect to see in Indo-European languages (at least Romance and
Germanic), where grammatical inflection usually modifies letters
near the end of the word.
Presumably this is what Antoine has in mind when he says that
Voynichese words are "built from synthetic rules which exclude ...
natural language". Anyway, I think that this conclusion is
unwarranted. After all, there are non-IE natural languages, which I
do not dare to mention by name 8-), that do seem to have
`substitution patterns' similar to those of Voynichese.
Thus I don't accept Antoine's conclusion that Voynichese must be an
artificial language, or at best a code based on "progressive
modification [similar to] the discs of Alberti". It
cannot be just some IE language with a funny alphabet, sure;
but we already knew that.
I find it interesting also that his analysis yields a very anomalous
pattern for n = 8, namely P_8 = ( 6 8 1 2 3 4 7 5 ). While that
pattern may be just a noise artifact, it may also be telling us that
the rare 8-letter words are mostly the result of joining a 2-letter
word to a 6-letter one.
I am not sure what to make of Antoine's rules for generating P_n
from P_{n+1}. For one thing, they seem to be a bit too complicated
given the limited amount of data that they have to explain.
Moreover, the counts s_2,.. s_{n-2} seem to be fairly similar, and
the differences seem to be mostly statistical noise; therefore,
their relative ranks do not seem to be very significant. Indeed,
applying Antoine's method to Currier's transcription we get
P_6 = ( 1 4 2 6 5 3 ), whereas from Friedman's we get
P_6 = ( 1 5 2 4 6 3 ). Moreover, the latter would change to
P_6 = ( 1 5 3 4 6 2 ) if we omitted just two words from the
input text.
But the main limitation I see in Antoine's method is that he considers
the absolute position of each letter in the word to be a significant
parameter for statistical analysis. I.e., he assumes implicitly that
an n-letter word contains exactly n "inflectional" slots, each of
them containing exactly one letter. This view seems too
simplistic when one considers the patterns of inflection of natural
languages, where each morphological "slot" can usually be filled by
strings of different lengths, including zero. To uncover the
inflection rules of English, for example, one would have to compare
words of different lengths, because the key substitution patterns
are
dog / dogs / dog's / dogs'
dance / dances / danced / dancing / dancer / dancers / ...
strong / stronger / strongest / strongly
and so on.
Another problem of Antoine's method is that the most important
structural features of words in natural languages are usually based
on *relative* letter positions, and may not be visible at all in an
analysis based on absolute positions. For example, in Spanish there
is a particularly strong alternation of vowels and consonants, so
that if words were aligned by syllables one would surely find that
the "even" letter slots have very different substitution properties
than the "odd" slots. But since Spanish words may begin with either
vowel or consonant, and may contain occasional VV and CC clusters,
the 3rd and 4th letters in a 6-letter word should be about as likely
to be VC as CV; and, therefore, will probably have very similar
substitution statistics.
Indeed, aligning words letter-by-letter is a bit like classifying
fractional numeric data like 3.15 and -0027 into classes by the
number of characters, and then analyzing the statistics of the ith
character within each class, without regard for leading zeros,
omitted signs, or the position of the decimal point. While some
statistical features of the data may still have some visible
manifestation after such mangling, we cannot expect to get reliable
and understandable results unless we learn to align the data by the
decimal point before doing the analysis.
These problems are relevant for Voynichese too, only even more so.
First, we already know that there are many potentially important
"inflections" that involve a change in length, like
<okeedy>/<qokeedy>; and yet these "inflections" will not register in
Antoine's analysis. More importantly, if we factor VMS words as
specified by the crust-mantle-core (CMK) paradigm, we find that each
CMK component can vary in length from 0 to 3 letters, more or less
independently. Moreover, the Friedman and Currier alphabets
sometimes use two or more letters to denote combinations, like EVA
<ee> or <ke>, that are probably single letters in the `true' VMS
alphabet. Thus, when Antoine sorts the Voynichese words by length,
aligns them letter-by-letter, and analyzes the properties of each
letter slot, he is analyzing a mixture of core, mantle, and crust
slots -- which are statistically as different as night and day.
Thus it is not surprising that Antoine's counts s_i, for positions i
near the middle of the word, are all pretty much the same (with
differences comparable to the statistical noise). Because of the
varying-length components, the 4th letter in a 6-letter word should
be pretty much like a letter picked at random from among letters 2..5.
Even for slots 1 and n, where we do see significant differences in
the s_i, the counts will get "blurred" because the initial CMK
components can be empty.
Possible improvements
One possible way of improving Antoine's statistics is to replace his
fixed-length letter slots, defined by absolute letter position, by
the seven slots defined by the CMK model (initial <q>, crust prefix,
mantle prefix, core, mantle suffix, crust suffix, and final group).
The result of this analysis would be a table saying, for example,
that there are NNN word pairs that differ only in the initial CMK
component (i.e. by the addition/omission of an initial <q>), and MMM
pairs that differ only in the core component (which has 16 possible
values, { k t p f ke te pe fe cth ckh cph cfh ckhe cthe cphe cfhe },
or is empty).
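For what it is worth, the tallying step is trivial once the parse is
fixed. In the sketch below, the seven slot names are my own labels,
and the words are assumed to arrive already factored into 7-tuples of
(possibly empty) strings; the factoring itself, with all the
ambiguities mentioned below, is the hard part and is not shown:

```python
from collections import Counter
from itertools import combinations

# My labels for the seven CMK slots, in word order.
SLOTS = ("q", "crust-prefix", "mantle-prefix", "core",
         "mantle-suffix", "crust-suffix", "final")

def cmk_diff_table(parsed_words):
    # parsed_words: distinct words, each pre-factored as a 7-tuple.
    # Count the pairs of words that differ in exactly one slot.
    table = Counter()
    for a, b in combinations(set(parsed_words), 2):
        diffs = [i for i in range(7) if a[i] != b[i]]
        if len(diffs) == 1:
            table[SLOTS[diffs[0]]] += 1
    return table
```

For instance, a pair differing only by an initial <q> is tallied
under the "q" slot, and a pair differing only in <k> vs <t> is
tallied under "core".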
I have tried to do some of this analysis myself, but I am still
troubled by some ambiguities of the model (what to do with the
letters <aoy>, whether final <r> is a final group like <n> or part
of the crust, etc.)
Another variant of Antoine's analysis that may be worth trying, and
which does not depend on a prior word model, is to use the `string
edit distance' instead of Hamming's. In other words, one should look
for pairs of words that differ by *insertion or deletion* of a
single letter, as well as replacement.
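The distance-1 test itself is easy to state precisely; a sketch:

```python
def one_edit_apart(u, v):
    # True iff u and v differ by exactly one insertion, deletion,
    # or replacement of a single letter (string edit distance 1).
    if u == v or abs(len(u) - len(v)) > 1:
        return False
    if len(u) > len(v):
        u, v = v, u                      # ensure len(u) <= len(v)
    i = 0
    while i < len(u) and u[i] == v[i]:   # skip the common prefix
        i += 1
    if len(u) == len(v):
        return u[i+1:] == v[i+1:]        # one replacement at position i
    return u[i:] == v[i+1:]              # one insertion into u at i
```

This would match pairs like <okeedy>/<qokeedy>, which a pure
Hamming-distance comparison of equal-length words necessarily misses.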
However, this fix addresses only the problem of length-changing
inflections like those of English. To address the misalignment
problem, we could label the slots with the letters that occur next
to them, rather than with their absolute positions in the word. We
would then get a table saying e.g. that there are 698 word pairs
that differ by editing (inserting, deleting, or replacing) a single letter
between a <k> and a <d>, and 345 that differ by a letter edit
between <.> (word space) and <k>.
Actually, we can use this criterion to define a `Voynichese word
graph': let the vertices be the words, and let the edges be the
word pairs that differ by a single letter-edit operation. It would be
very interesting to see a picture of this graph, or at least a
tabulation of its connected components, and compare the results with
the analogous data for English, Latin, Chinese (oops, pardon my
French!), etc.
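Building the graph and extracting its connected components is
straightforward, if done naively by comparing all pairs of words; the
sketch below uses union-find, and repeats the single-edit test so as
to be self-contained:

```python
from itertools import combinations

def one_edit(u, v):
    # Single insertion, deletion, or replacement (edit distance 1).
    if u == v or abs(len(u) - len(v)) > 1:
        return False
    if len(u) > len(v):
        u, v = v, u
    i = 0
    while i < len(u) and u[i] == v[i]:
        i += 1
    return u[i+1:] == v[i+1:] if len(u) == len(v) else u[i:] == v[i+1:]

def word_graph_components(words):
    # Union-find over all pairs of distinct words; quadratic, but
    # adequate as a sketch for a vocabulary-sized word list.
    words = sorted(set(words))
    parent = {w: w for w in words}
    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]   # path halving
            w = parent[w]
        return w
    for a, b in combinations(words, 2):
        if one_edit(a, b):
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb
    comps = {}
    for w in words:
        comps.setdefault(find(w), []).append(w)
    return sorted(comps.values(), key=len, reverse=True)
```

One could then tabulate the component sizes for the VMS vocabulary
and for comparison texts, and see whether Voynichese looks more like
one giant component or like many small islands.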
I hope it helps. Again, my apologies to Antoine if I misrepresented his
work.
All the best,
--stolfi