
Re: Antoine Casanova's research



    > [Adam McLean:] I have just reread Antoine Casanova's posting on
    > 6th March 2000, based on his thesis ... I have not seen on the
    > Voynich list a critical revue of Antoine's work, and I wonder
    > how his thesis is received by the main researchers on this list.
    
Thanks for reminding me of this posting, and prompting me to 
re-read it more carefully.

    > ... which reveals a structure within the individual 'tokens' in
    > the Voynich language.
    
I will try to describe my understanding of Antoine's method, and
discuss his results in relation to the crust-core-mantle paradigm.

I have only managed to read a couple of chapters of his thesis; I was
unable to fetch and/or unpack the whole work for some reason ---
probably Windows/Unix incompatibilities. My comments below are based
on his posting only; may Antoine forgive me if I got it all wrong.

Substitution patterns

  Antoine's method, if I got it correctly, is to compare all words of
  the same length n, separately for each n, looking for pairs of words
  that differ in only one letter position.
  
  The results of these comparisons are summarized by a
  `substitution pattern' P_n, which is a permutation (p_1,.. p_n) of the
  digits from 1 to n. For example, Antoine found that 
  P_4 = ( 1 2 4 3 ) and P_7 = ( 1 4 2 3 6 7 5 ).

  To determine the pattern P_n, Antoine examines each letter position
  i in turn, from 1 to n, and counts the words w of length n for which
  there exists another word that differs from w only in that letter.
  [I am not entirely sure about this point; he may count instead the
  *pairs* of words that differ only at that letter position.] Let's denote these
  counts by s_1,.. s_n. 
  
  For instance, from Friedman's transcription, Antoine got the following s_i:

    n = 3:  194 120 158
    n = 4:  397 308 195 253
    n = 5:  459 263 315 208 276
    n = 6:  238 143 171 164 121 170
    n = 7:  81  40  58  42  38  43  43
    n = 8:  8   2   9   5   8   9   5  7

  I.e., among all words of length 4, there are 397 words that remain
  valid by an appropriate replacement of the first letter; 308 words that 
  can be modified in the second letter; and so on. 
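
  A minimal Python sketch of this counting step, under the first
  reading above (counting words, not pairs). The function name and the
  toy Italian word list are made up for illustration; this is my
  reconstruction, not Antoine's code.

```python
from collections import defaultdict

def substitution_counts(words, n):
    """For each position i (0-based), count the distinct words of length n
    that have at least one other word differing from them only at position i."""
    vocab = {w for w in words if len(w) == n}
    counts = []
    for i in range(n):
        # Group words by everything except the letter at position i.
        groups = defaultdict(set)
        for w in vocab:
            groups[(w[:i], w[i + 1:])].add(w)
        # Every word in a group of size >= 2 has a partner differing only at i.
        counts.append(sum(len(g) for g in groups.values() if len(g) > 1))
    return counts

# Toy sample (hypothetical, not Voynichese data).
sample = ["rosso", "rossa", "rosse", "rossi", "russo", "tanto"]
print(substitution_counts(sample, 5))
```

  On the toy list, four words can be changed in the last letter and
  two in the second letter, so the counts are 0 2 0 0 4.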

  The permutation digit p_i is then simply the rank of the
  corresponding count s_i among the list s_1,.. s_n. Thus 
  P_7 = ( 1 4 2 3 6 7 5 ) means that, among all words of length 7, the
  first letter is the `easiest' to replace, the second letter is the
  4th easiest, and so on. Here are the patterns he got:

    n = 3:  ( 1 3 2 )
    n = 4:  ( 1 2 4 3 )
    n = 5:  ( 1 4 2 5 3 )
    n = 6:  ( 1 4 2 5 6 3 )
    n = 7:  ( 1 4 2 3 6 7 5 )
    n = 8:  ( 6 8 1 2 3 4 7 5 )
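
  The ranking step can be sketched in a few lines of Python. The
  function name is mine, and the tie-breaking rule (leftmost position
  wins) is an arbitrary assumption, since I do not know how Antoine
  handles equal counts.

```python
def substitution_pattern(counts):
    """Rank each count in descending order: 1 = most substituted position.
    Ties go to the leftmost position (an arbitrary choice)."""
    # sorted() is stable, so equal counts keep their left-to-right order.
    order = sorted(range(len(counts)), key=lambda i: -counts[i])
    pattern = [0] * len(counts)
    for rank, i in enumerate(order, start=1):
        pattern[i] = rank
    return pattern

# Counts for n = 3, 4, 5 from the Friedman table above.
print(substitution_pattern([194, 120, 158]))
print(substitution_pattern([397, 308, 195, 253]))
print(substitution_pattern([459, 263, 315, 208, 276]))
```

  For the n = 3, 4 and 5 rows this gives ( 1 3 2 ), ( 1 2 4 3 ) and
  ( 1 4 2 5 3 ), the same patterns as listed.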

The rules

  Antoine then looked for a single formula that would account for all
  these patterns. He first noted that, for n between 3 and 7,

    Rule 1:  p_1 is always 1.  

    Rule 2:  p_{n-1} is always n.

  In other words, the first letter is the easiest one to replace (the
  "most substituted" as Antoine calls it), and the next-to-last letter
  is the hardest one to replace. (The case n = 8 is an exception to
  both rules; but there are too few words of 8 letters, so the ranking
  is very uncertain anyway.)
  
  Antoine then describes a procedure that will generate P_{n} from
  P_{n+1}. I won't copy the procedure here, because it is somewhat
  complicated, and, in my opinion, not very significant. (More on that
  below.)

Justification

  One justification [mine, not his] for this analysis is that it is
  expected to reveal the "points of inflection" of inflected languages.
  For instance, in a large sample of Italian, one will find many pairs
  of words that differ in the last letter only:

    rosso/rossa/rosse/rossi 
    ritorno/ritorni/ritorna/ritornò

  etc. Thus the pattern P_n for Italian will probably end in 1, meaning that 
  the last position is the one whose letter can be replaced most easily
  within the language.  

  For Spanish and Portuguese, which usually make the plural by adding
  "s", the next-to-last position is easily substituted too:

    blancas/blancos
    rojas/rojos

  etc. Additionally, the numerous verb inflections will provide
  many pairs differing in the 3rd and 4th letter from the end.
  
  I don't dare to guess the relative ranking of those positions; but I
  expect that, in any Romance language, the last three letters will be
  the most substituted, i.e. will get the rankings 1..3.
  
  If the same analysis were applied to English, I would still expect
  the last few positions to be the most easily substituted, because of
  pairs like
  
    painter/painted
    married/marries
    
  However, the difference between s_n and the average s_i should be
  much less for English than for Romance languages. I won't be
  surprised if the positions with ranks 2 and 3 are not at the end but
  somewhere else along the word.
  
  Thus, Antoine's analysis should distinguish quite easily the Indo-European
  languages from other language families with different places and methods of
  inflection. It may also be able to distinguish Romance from Germanic, and
  perhaps even Italian from Spanish. And it will certainly distinguish
  IE languages from random gibberish (unless the author has been
  careful to generate end-inflected gibberish, of course).

My comments

  In my view, the most significant feature of Antoine's substitution
  patterns is that the first letter of a Voynichese word seems to have
  more "inflectional freedom", while the final letters are relatively
  invariant. These patterns are precisely opposite to what we would
  expect to see in Indo-European languages (at least Romance and
  Germanic), where grammatical inflection usually modifies letters
  near the end of the word.
  
  Presumably this is what Antoine has in mind when he says that
  Voynichese words are "built from synthetic rules which exclude ...
  natural language". Anyway, I think that this conclusion is
  unwarranted. After all, there are non-IE natural languages, which I
  do not dare to mention by name 8-), that do seem to have
  `substitution patterns' similar to those of Voynichese.
  
  Thus I don't accept Antoine's conclusion that Voynichese must be an
  artificial language, or at best a code based on "progressive
  modification [similar to] the discs of Alberti".  It 
  cannot be just some IE language with a funny alphabet, sure;
  but we already knew that.  
  
  I find it interesting also that his analysis yields a very anomalous
  pattern for n = 8, namely P_8 = ( 6 8 1 2 3 4 7 5 ). While that
  pattern may be just a noise artifact, it may also be telling us that
  the rare 8-letter words are mostly the result of joining a 2-letter
  word to a 6-letter one.
  
  I am not sure what to make of Antoine's rules for generating P_n
  from P_{n+1}. For one thing, they seem to be a bit too complicated
  given the limited amount of data that they have to explain.
  Moreover, the counts s_2,.. s_{n-2} seem to be fairly similar, and
  the differences seem to be mostly statistical noise; therefore,
  their relative ranks do not seem to be very significant. Indeed,
  applying Antoine's method to Currier's transcription we get 
  P_6 = ( 1 4 2 6 5 3 ), whereas from Friedman's we get 
  P_6 = ( 1 5 2 4 6 3 ). Moreover, the latter would change to
  P_6 = ( 1 5 3 4 6 2 ) if we omitted just two words from the
  input text.
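
  To illustrate how thin the margin is, here is a small Python check
  on the n = 6 Friedman counts quoted earlier. The perturbation is
  hypothetical: I simply assume that the two omitted words each
  lowered s_3 by one and touched nothing else.

```python
def substitution_pattern(counts):
    """Rank each count in descending order (1 = most substituted);
    ties go to the leftmost position, an arbitrary choice."""
    order = sorted(range(len(counts)), key=lambda i: -counts[i])
    pattern = [0] * len(counts)
    for rank, i in enumerate(order, start=1):
        pattern[i] = rank
    return pattern

friedman_6 = [238, 143, 171, 164, 121, 170]   # s_1..s_6 from the table

# Hypothetical perturbation: two fewer words counted at position 3.
perturbed_6 = list(friedman_6)
perturbed_6[2] -= 2

print(substitution_pattern(friedman_6))
print(substitution_pattern(perturbed_6))
```

  Since s_3 = 171 and s_6 = 170 differ by one word, that tiny change
  is enough to swap ranks 2 and 3.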
  
  But the main limitation I see in Antoine's method is that he considers
  the absolute position of each letter in the word to be a significant
  parameter for statistical analysis. I.e., he assumes implicitly that
  an n-letter word contains exactly n "inflectional" slots, each
  of them containing exactly one letter. This view seems too
  simplistic when one considers the patterns of inflection of natural
  languages, where each morphological "slot" can usually be filled by
  strings of different lengths, including zero. To uncover the
  inflection rules of English, for example, one would have to compare
  words of different lengths, because the key substitution patterns
  are
  
    dog / dogs / dog's / dogs' 
    dance / dances / danced / dancing / dancer / dancers / ...
    strong / stronger / strongest / strongly
    
  and so on.  
  
  Another problem of Antoine's method is that the most important
  structural features of words in natural languages are usually based
  on *relative* letter positions, and may not be visible at all in an
  analysis based on absolute positions. For example, in Spanish there
  is a particularly strong alternation of vowels and consonants, so
  that if words were aligned by syllables one would surely find that
  the "even" letter slots have very different substitution properties
  than the "odd" slots. But since Spanish words may begin with either
  vowel or consonant, and may contain occasional VV and CC clusters,
  the 3rd and 4th letters in a 6-letter word should be about as likely
  to be VC as CV; and, therefore, will probably have very similar
  substitution statistics.
  
  Indeed, aligning words letter-by-letter is a bit like classifying
  fractional numeric data like 3.15 and -0027 into classes by the
  number of characters, and then analyzing the statistics of the ith
  character within each class, without regard for leading zeros,
  omitted signs, or the position of the decimal point. While some
  statistical features of the data may still have some visible
  manifestation after such mangling, we cannot expect to get reliable
  and understandable results unless we learn to align the data by the
  decimal point before doing the analysis.
  
  These problems are relevant for Voynichese too, only even more so.
  First, we already know that there are many potentially important
  "inflections" that involve a change in length, like
  <okeedy>/<qokeedy>; and yet these "inflections" will not register in
  Antoine's analysis. More importantly, if we factor VMS words as
  specified by the crust-mantle-core (CMK) paradigm, we find that each
  CMK component can vary in length from 0 to 3 letters, more or less
  independently. Moreover, the Friedman and Currier alphabets
  sometimes use two or more letters to denote combinations, like EVA
  <ee> or <ke>, that are probably single letters in the `true' VMS
  alphabet. Thus, when Antoine sorts the Voynichese words by length,
  aligns them letter-by-letter, and analyzes the properties of each
  letter slot, he is analyzing a mixture of core, mantle, and crust
  slots -- which are statistically as different as night and day.
  
  Thus it is not surprising that Antoine's counts s_i, for positions i
  near the middle of the word, are all pretty much the same (with
  differences comparable to the statistical noise). Because of the
  varying-length components, the 4th letter in a 6-letter word should
  be pretty much like a letter picked at random among letters 2..5.
  Even for slots 1 and n, where we do see significant differences in
  the s_i, the counts will get "blurred" because the initial CMK
  components can be empty.

Possible improvements

  One possible way of improving Antoine's statistics is to replace his
  fixed-length letter slots, defined by absolute letter position, by
  the seven slots defined by the CMK model (initial <q>, crust prefix,
  mantle prefix, core, mantle suffix, crust suffix, and final group).
  
  The result of this analysis would be a table saying, for example,
  that there are NNN word pairs that differ only in the initial CMK
  component (i.e by the addition/omission of an initial <q>), and MMM
  pairs that differ only in the core component (which has 16 possible
  values, { k t p f ke te pe fe cth ckh cph cfh ckhe cthe cphe cfhe }
  or empty). 
  
  I have tried to do some of this analysis myself, but I am still
  troubled by some ambiguities of the model (what to do with the
  letters <aoy>, whether final <r> is a final group like <n> or part
  of the crust, etc.)
  
  Another variant of Antoine's analysis that may be worth trying, and
  which does not depend on a prior word model, is to use the `string
  edit distance' instead of Hamming's. In other words, one should look
  for pairs of words that differ by *insertion or deletion* of a
  single letter, as well as replacement. 
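
  A minimal Python sketch of the test for "one letter edit apart"
  (the function name is made up; this is only one straightforward way
  of checking it):

```python
def one_edit_apart(a, b):
    """True if a and b differ by exactly one insertion, deletion,
    or replacement of a single letter."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a                      # ensure len(a) <= len(b)
    # Skip the common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]    # one replacement at position i
    return a[i:] == b[i + 1:]            # one extra letter in b at position i

print(one_edit_apart("okeedy", "qokeedy"))
print(one_edit_apart("dog", "dig"))
print(one_edit_apart("dog", "cat"))
```

  Unlike the Hamming comparison, this accepts pairs of different
  lengths such as <okeedy>/<qokeedy> or dog/dogs.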
  
  However, this fix addresses only the problem of length-changing
  inflections like those of English. To address the misalignment
  problem, we could label the slots with the letters that occur next
  to them, rather than with their absolute positions in the word. We
  would then get a table saying e.g. that there are 698 word pairs
  that differ by editing (inserting, deleting, or replacing) a single letter
  between a <k> and a <d>, and 345 that differ by a letter edit
  between <.> (word space) and <k>.
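
  Here is a rough Python sketch of such a tabulation. The names are
  hypothetical, and one detail is my own assumption: when the edit
  falls inside a run of equal letters (as in <oke>/<okee>) its exact
  place is ambiguous, and the code below simply takes the longest
  common prefix first. The all-pairs loop is quadratic, so it is only
  meant for small word lists.

```python
from collections import Counter
from itertools import combinations

def edit_context(a, b):
    """If a and b are one letter-edit apart, return (left, right): the letters
    flanking the edited slot, with '.' for a word boundary. Otherwise None."""
    if a == b or abs(len(a) - len(b)) > 1:
        return None
    shortest = min(len(a), len(b))
    p = 0                                  # length of the common prefix
    while p < shortest and a[p] == b[p]:
        p += 1
    s = 0                                  # common suffix of what remains
    while s < shortest - p and a[-1 - s] == b[-1 - s]:
        s += 1
    # A single edit leaves at most one unmatched letter in each word.
    if len(a) - p - s > 1 or len(b) - p - s > 1:
        return None
    left = a[p - 1] if p > 0 else "."
    right = a[len(a) - s] if s > 0 else "."
    return (left, right)

def context_table(words):
    """Count the one-edit word pairs by the context of the edited slot."""
    table = Counter()
    for a, b in combinations(sorted(set(words)), 2):
        ctx = edit_context(a, b)
        if ctx is not None:
            table[ctx] += 1
    return table

print(context_table(["okeedy", "qokeedy", "okedy", "dog"]))
```

  On this toy list the table has one pair edited between <.> and <o>
  (the added <q>) and one edited between <e> and <d>.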
  
  Actually, we can use this criterion to define a `Voynichese word
  graph:' let the vertices be the words, and let the edges be the 
  word pairs that differ by a single letter-edit operation. It would be
  very interesting to see a picture of this graph, or at least a
  tabulation of its connected components, and compare the results with
  the analogous data for English, Latin, Chinese (oops, pardon my
  French!), etc.
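
  As a starting point, the connected components could be tabulated
  with something like the Python sketch below. It is only one possible
  approach (union-find over masked word forms), the names are made up,
  and the word list is a toy example, not real data.

```python
from collections import defaultdict

def word_graph_components(words):
    """Connected components of the graph whose vertices are the distinct
    words and whose edges join words one letter-edit apart."""
    vocab = set(words)
    parent = {w: w for w in vocab}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]   # path halving
            w = parent[w]
        return w

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    masked = {}   # (prefix, suffix) -> first word seen with that masked form
    for w in vocab:
        for i in range(len(w)):
            # Insertion/deletion edge: w minus one letter is also a word.
            shorter = w[:i] + w[i + 1:]
            if shorter in vocab:
                union(w, shorter)
            # Replacement edge: another word agrees everywhere except at i.
            key = (w[:i], w[i + 1:])
            if key in masked:
                union(w, masked[key])
            else:
                masked[key] = w

    comps = defaultdict(list)
    for w in vocab:
        comps[find(w)].append(w)
    # Largest components first, then alphabetical.
    return sorted((sorted(c) for c in comps.values()),
                  key=lambda c: (-len(c), c))

toy = ["dog", "dogs", "dig", "cat", "cot", "okeedy", "qokeedy", "zebra"]
for comp in word_graph_components(toy):
    print(len(comp), comp)
```

  This gives only the components, not the edge list; a picture of the
  graph would need the edges as well.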
  
I hope it helps. Again, my apologies to Antoine if I misrepresented his
work.  

All the best,

--stolfi