
VMs: Re: Numbercrunching "word" tuples

By now I've learnt my lesson ... almost everything I brought up has been
discussed before.
And you're so kind to me, you don't even say RTFMLA ( = read the mailing
list archive please).

I've found two of the earliest "long repeating strings seekers". Very
interesting results.
Some of the long lines will probably be broken by the mail programs ...


Subject: Voynich MS progress report
Date: Wed, 09 Aug 1995 17:30:40 +0100
From: Mike Roe <Michael.Roe@xxxxxxxxxxxx>

<snip - read on, there are veeeeery loooong repeating strings at end of
text - petr>

3.4 Repeated Phrases

In most texts, there are one or more sequences of words that occur
repeatedly. Identifying these can be a help in decipherment.

[ There is an amusing example of this in the Mayan script. There is a
  sequence of glyphs commonly found on ceramic drinking vessels. When the
  script was deciphered, a subsequence of this sequence turned out to mean
  ``painted drinking vessel'' ...]

So, what are the common subsequences in the Voynich MS?

When looking for repeated sequences, there is a danger of missing them,
because the repeats aren't exactly the same. This can arise in several ways:

(a) Transcription errors. It's hard to transcribe a script you can't read,
    and the machine-readable transcription is bound to contain quite a few
    errors. This can lead to repeats being missed because the second
    occurrence is mangled during transcription.
(b) Glyph variants. If the same character can be written several different
    ways, and the author of the text knows this (but we don't), then we can
    miss repeats because the second occurrence is written in a visually
    different but functionally identical way.
(c) The rules of the language may permit the same thing to be said in
    several different ways. Writers in most languages vary the form of
    expression of common phrases to make the text more interesting for
    the reader. [The notable exception to this is ISO International
    Standards, which try very hard to always use the same expression for
    the same concept, in order to minimise the risk of errors in
    translation. This is one of the reasons why International Standards
    are so awful to read :-) ]

In view of this, I use the following algorithm for finding text repeats:

(a) The text is reduced to a ``fingerprint''. This is an
    information-losing transformation that tries to make likely
    transcription errors or glyph variants of a text have the same
    fingerprint as the original text. This increases the chance of finding
    repeats, at the penalty of false positives: different texts that happen
    to have the same fingerprint by chance.

    The ``fingerprint'' algorithm is designed taking into account the
    form of the glyphs (and hence likely reading errors); known likely
    transcription errors (data obtained by comparing independent
    transcriptions of the same sections of the manuscript, e.g. First Study
    Group vs D'Imperio); possible sound-alike writings of a word (homophones)
    (using information from the phonetic analysis); possible
    equivalent-meaning words (from the morphological analysis).

(b) For every glyph in the text, a context record is created, containing the
    fingerprint for that glyph and the following 30 or so glyphs. [NB The
    fingerprint algorithm does not necessarily work on a glyph-by-glyph
    basis; it may take into account adjoining glyphs.] If necessary, the
    following lines of text are read to get enough context. (This avoids
    missing repeats that are broken across line boundaries.)

(c) The entire set of context records is sorted into alphabetical order.
    We don't know what the true collation order of the Voynich script is,
    for this purpose it doesn't matter. The idea is that records that start
    in the same way will be sorted together.

(d) Scan through the sorted records and find pairs of records which are the
    same in the first N characters. (N being 20 or so). These are easy to
    find, as pairs will be next to each other in the sorted sequence.

(e) From the context records of the matched pairs, go back to the original
    transcribed text to see whether this is really a match or a ``false
    drop''. If necessary, re-transcribe the lines in question from the
    microfilm to resolve this question.
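
For anyone who wants to experiment, steps (a)-(e) can be sketched in a few
lines of Python. The real fingerprint rules (glyph tables, known transcription
errors, homophones) are not given in this post, so the variant table below is
a made-up placeholder; step (e), checking hits against the microfilm, stays
manual.

```python
CONTEXT = 30   # glyphs per context record (step b)
MATCH_N = 20   # prefix length that counts as a match (step d)

# hypothetical variant table: map each glyph to a canonical form (step a)
VARIANTS = {"J": "I", "U": "V"}

def fingerprint(text):
    """Step (a): an information-losing normalisation."""
    return "".join(VARIANTS.get(ch, ch) for ch in text)

def find_repeats(glyphs):
    fp = fingerprint(glyphs)
    # Step (b): one context record per glyph position.
    records = [(fp[i:i + CONTEXT], i) for i in range(len(fp) - CONTEXT + 1)]
    # Step (c): sort, so records sharing a prefix become neighbours.
    records.sort()
    # Step (d): report adjacent records matching in the first MATCH_N chars.
    hits = []
    for (a, i), (b, j) in zip(records, records[1:]):
        if a[:MATCH_N] == b[:MATCH_N]:
            hits.append((i, j, a[:MATCH_N]))
    return hits

text = "4OFC89" * 5 + "XYZW" + "4OFC89" * 5
for i, j, s in find_repeats(text):
    print(i, j, s)
```

Sorting neighbours together is what makes step (d) cheap: every matching pair
ends up adjacent, so one linear scan finds them all.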

Preliminary results:

The text contains a sequence which is repeated not just twice, but four
or more times. Significantly, all of the occurrences are in ``Author B''.

(The following lines are longer than 80 characters, and so may get mangled
by some people's terminals/e-mail systems. Sorry if this happens!)

<f83r.7> 2OEZC8.EZCC89.4CCC89.4OF9.O4OE.RZCC89.4OFC89.4OPCC89.4OPCC89-

There are also a few twice-only repeats:


<f26r.4> 4OFC89.SCO2.9PC89.4OFC89.9PC89.SCFC89.8AM.O8AJ.2AE89-
<f81v.12>  4OE.OE.S89.ZC89.4OFC89.9PC89.SCPC89.EFC8C9.9PC89-


<snip - petr >


Date: Mon, 27 Jan 92 07:13 PST
From: wet!naga@xxxxxxxxxxxx (Peter Davidson)
Subject: Initial assault; modest gains.

Report #1 on some initial statistical investigations, by naga.

Jim Reed kindly sent me a machine-readable version of Mary D'Imperio's
transcription, which is in Currier notation.  The first thing I did was to
separate out A and B, remove the line identifications, spaces and the MS-DOS
end-of-line bytes, obtaining two files consisting purely of Voynich letters
(according to the judgement of the transcriber).  These files are

                  VOYNICH.A       33,702 bytes
                  VOYNICH.B       49,341 bytes
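
That separation step can be sketched as follows. The input format is assumed
from the examples quoted in this post (each line begins with a locator such
as <f26r.4>), and the set of "language B" folios here is a hypothetical
stand-in for Currier's actual A/B assignment.

```python
import re

B_FOLIOS = {"f26r", "f81v", "f83r"}   # hypothetical A/B assignment

def clean(lines):
    """Strip locators, spaces and dividers; split into A and B letter streams."""
    a_out, b_out = [], []
    for line in lines:
        m = re.match(r"<(f\d+[rv])\.\d+>\s*(.*)", line)
        if not m:
            continue   # skip lines without a folio locator
        folio, text = m.groups()
        # drop spaces, word dividers and end-of-line markers
        letters = re.sub(r"[ .\-=/\r\n]", "", text)
        (b_out if folio in B_FOLIOS else a_out).append(letters)
    return "".join(a_out), "".join(b_out)

a, b = clean(["<f26r.4> 4OFC89.SCO2.9PC89-", "<f83r.7> 2OEZC8.EZCC89-"])
print(b)   # -> 4OFC89SCO29PC892OEZC8EZCC89
```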

<snip - petr>

I have so far developed two kinds of program for statistical analysis of
files.  One kind does frequency analyses, such as calculating the
probability that a given letter occurs n places after some other given
letter.  I decided I needed to think more about how to distinguish
anomalies, and so turned to the other kind, which searches for, and counts,
repeated strings.

I immediately found that there were plenty of repetitions, but due to
(a) the size of the files, (b) the speed (or lack of it) of a 33 MHz 386
and (c) the likely non-optimization of my search algorithm (and other
factors) it was going to take quite a while to gather the data, and
there'd be a lot of it.

I wrote a program to (attempt to) ascertain the length of the longest
string of characters in a file which is repeated at least once.  I
discovered that in VOYNICH.A the strings:


occur at least twice (so the size of the longest repeating string is at
least 16 letters - no doubt more).  (Is ZOE the name of one of the nymphs?)

In VOYNICH.B the following strings are repeated:


so the size of the longest repeating string is at least 20 letters - again
no doubt more.
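
On a modern machine the longest-repeated-string question can be answered
directly by sorting all suffixes and taking the longest common prefix of
neighbouring ones. This is a reconstruction, not the author's 1992 program:

```python
def longest_repeat(s):
    """Return the longest substring of s that occurs at least twice."""
    suffixes = sorted(s[i:] for i in range(len(s)))
    best = ""
    for a, b in zip(suffixes, suffixes[1:]):
        # longest common prefix of neighbouring suffixes
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        if k > len(best):
            best = a[:k]
    return best

print(longest_repeat("ZC89/4OFC89/4OFC89/8AR"))   # -> C89/4OFC89/
```

Any repeated substring is a common prefix of two suffixes, and after sorting,
the pair with the longest common prefix sits next to each other, so one scan
of adjacent pairs suffices.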

So I ran my repetition-count program on VOYNICH.B looking for strings of
exactly 12 letters, and in ten minutes or so ( s l o w . . . ) it came
up with:

                    String          number of occurrences

                    /SC89/9PC89/        2
                    PC89/4OFC89/        6
                    C89-4OFCC89/        7
                    /4OFC89/8AR/        3
                    AE/SC89/2AR-        3
                    8AM/ZC89/4OF        3
                    PC89/4OFCC89        6
                    FC89/4OFC89/       20
                    CO89/4OFC89/        2
                    /SCO89/4OPCC        2
                    89/4OFO89/4O        2
                    PC89/4OPC89/        9
                    89/8AR/SC89/        2
                    AM/OFCC89/4O        3
                    OE/SC89/4OFC       10
                    C9/ZCOE/4OFC        2
                    9FCC89/SC89/        2
                    ZC89/4OFC89/       24
                    /OFC89/OFC89        3
                    /4OFC89/4OFC       24

Clearly the 89s are up to something, /SC89/ and /4OFC89/ particularly.

Clearly also some way is needed to pick out the *significant* occurrences.
(I mean *scientifically*.  We can all see that FC89/4OFC89/, ZC89/4OFC89/
and /4OFC89/4OFC stand out clearly.)
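
The repetition count for strings of a fixed length can be done in a single
pass with a dictionary, rather than repeated scanning; a sketch, keeping the
/ word dividers as in the table above:

```python
from collections import Counter

def count_ngrams(text, n=12, min_count=2):
    """Count every length-n substring; keep those occurring min_count+ times."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {s: c for s, c in counts.items() if c >= min_count}

sample = "/4OFC89" * 4 + "/SC89"
for s, c in sorted(count_ngrams(sample).items()):
    print(s, c)
```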

How many times might we *expect*, say, PC89, to occur in VOYNICH.B?
My first approach to this question was as follows:

The number of times a string of letters c1 c2 ... cn is expected to occur
is N * p(c1) * p(c2) * ... * p(cn), where the p()'s are the probabilities
of occurrences of single letters (taken from the appropriate table above)
and N is the size of the file.  (Actually it should be N-n+1, but let's not
be picky.)  For the purpose of calculating an expected value this assumes
(what is false) that the probability of occurrence of a letter is independent
of the letters occurring in its immediate vicinity.
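
A sketch of that estimate: multiply the single-letter probabilities and scale
by the file size. Since the frequency table referred to above was snipped,
the probabilities here are simply computed from the text itself.

```python
from collections import Counter

def expected_count(string, text):
    """Expected occurrences of `string` under the independence assumption."""
    freq = Counter(text)
    n = len(text)
    p = 1.0
    for ch in string:
        p *= freq[ch] / n
    # strictly this should be scaled by (n - len(string) + 1), as noted above
    return n * p

text = "4OFC89" * 100
print(expected_count("FC89", text))   # ~0.46, yet "FC89" actually occurs 100 times here
```

This toy input already shows the problem described below: a string that
occurs 100 times has an "expected" count well under 1.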

The problem with this approach is that almost *all* of the repetitions that
occur, even those occurring just twice, seem to occur far more often than
expected (according to this notion of "expected").  For example, if you
multiply the individual probabilities of the letters occurring in a string
of 12 letters, as above, and multiply this by 49,341 you generally get
something much less than 1.

Even for the shorter strings, I have to use a factor of 100 (i.e. just look
at the strings that occur 100 times more than expected) in order to
eliminate most of them, which doesn't seem quite right.

However, I have not had time to reflect properly on these matters, so I
may be missing a few things.  Comment is welcome.  Given the abundance
of data available, what to look for?  How to calculate expected occurrences
so as to compare with actual occurrences?  And perhaps Jacques can throw
some light on what those 89's are doing.

In the meantime I'll get back to what I "should" be working on (though
I admit this Voynich stuff is rather interesting).


End of selection

Petr Kazil - Urban Adventure in Rotterdam