Re: VMs: Re: Word distribution

On Saturday 06 Mar 2004 12:58 pm, Nick Pelling wrote:
> You must be extremely careful when interpreting rank frequency law graphs:
> what they're claiming is that, if you rank all the words in a text by their
> frequency, then their frequencies will generally tail off according to a
> certain kind of (logarithmically straight-line) way.

It is a power law (i.e. a straight line in a double-logarithmic plot) with an
exponent very close to -1.0.
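
To make the power-law statement concrete, here is a minimal Python sketch
that estimates the slope of log(frequency) against log(rank) by ordinary
least squares. The function name zipf_slope and the toy token list are
illustrative assumptions, not anything taken from the papers discussed here,
and a plain least-squares fit is only one rough way to estimate the exponent.

```python
import math
from collections import Counter

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    A value near -1.0 corresponds to the classic Zipf rank-frequency law.
    Needs at least two distinct words.
    """
    # Frequencies sorted from most to least common; rank 1 is the most common.
    counts = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Toy data: word i occurs 60/i times, so frequency is proportional to 1/rank.
toy_tokens = [f"w{i}" for i in range(1, 6) for _ in range(60 // i)]
print(zipf_slope(toy_tokens))
```

On a real transcription one would pass the list of word tokens instead of
the toy list.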

> However, the same is
> also broadly true of random texts (as Gabriel mentions and we know the VMs
> is, in many ways, more structured than random). This is therefore
> problematic to draw conclusions (especially as to "languageness") from. You
> must similarly be careful when interpreting number frequency law graphs.

I believe that the reasons for Zipf's law in random texts have little to do 
with the case of natural languages. Reading Wentian Li's paper(s), you will 
immediately realise that in random texts (where the space is just another 
character), the probability of finding *very* long words decays much more 
slowly than in natural languages. However, ridiculously long words do not 
happen in languages (OK, Jacques, I am prepared for some examples :-)  Shall 
I say "unlikely" instead?).
If one generates a random text and looks at the word and token length 
distributions, these are *very* different from those of a natural language. 
See figure 1b in the preprint of my Cryptologia paper, the curve called 
"Forced single spaces" (this is a random text with the same space proportion 
as the VMs).
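
The experiment described above can be sketched in a few lines of Python. This
is only an illustration under stated assumptions: the alphabet, the default
space probability p_space=0.18 and the function names are arbitrary choices,
not the values or the procedure used for the figure in the paper. Runs of
consecutive spaces are collapsed by str.split(), which may or may not match
the "Forced single spaces" construction.

```python
import random
from collections import Counter

def random_text(n_chars, alphabet="abcdefghijklmnopqrstuvwxyz",
                p_space=0.18, seed=0):
    """Random text in which the space is just another character.

    p_space is a hypothetical space proportion; set it to the proportion
    measured in the text you want to compare against.
    """
    rng = random.Random(seed)
    return "".join(
        " " if rng.random() < p_space else rng.choice(alphabet)
        for _ in range(n_chars)
    )

def length_distribution(words):
    """Relative frequency of each word length in an iterable of words."""
    lengths = Counter(len(w) for w in words)
    total = sum(lengths.values())
    return {k: lengths[k] / total for k in sorted(lengths)}

text = random_text(200_000)
tokens = text.split()                         # every occurrence counts
token_dist = length_distribution(tokens)      # token length distribution
word_dist = length_distribution(set(tokens))  # distinct words only
```

In such a text the token lengths fall off roughly geometrically, so
one-letter tokens are the most common and very long tokens remain possible,
which is not what one sees in a natural language.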

> What are Zipf's Laws all about in natural language? FWIW, I believe they
> reflect three different kinds of mechanisms, which have different
> (overlapping) degrees of usefulness (and hence frequencies):
> (1) syntactic infrastructure (words like "the", "and" etc);
> (2) global relevance (signifiers reused globally to explain/describe
> different things); and
> (3) local relevance (signifiers reused locally in a narrative to provide
> dramatic structure).

Yes, there has been some debate about this and about how to draw those 
limits. Andras Kornai has published a very interesting paper about this 
precise problem (the mid-range words); it is on his website.

> The good thing about Zipf's Laws is that they allow a kind of comparison
> between radically different texts: but the bad thing about them is they
> don't tell you about actual instance count per se, because those kinds of
> things are (for the most part) abstracted out as part of the process.

Note that one can also measure "distances" between Zipf's ranks (see Havlin's 
paper, which is referenced on my page on Zipf's laws). Although the plot shows 
rank and frequency, you know which word has which rank, and so comparison 
between ranks is possible.
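
As a rough illustration of comparing ranks between two texts, the Python
sketch below computes the root-mean-square difference in rank over the words
the two texts share. This is one simple possibility chosen for illustration;
it is not claimed to be the definition used in Havlin's paper, and the
function names ranks and rank_distance are hypothetical.

```python
import math
from collections import Counter

def ranks(tokens):
    """Map each word to its frequency rank (1 = most frequent).

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(tokens)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ordered, start=1)}

def rank_distance(tokens_a, tokens_b):
    """Root-mean-square rank difference over words present in both texts."""
    rank_a, rank_b = ranks(tokens_a), ranks(tokens_b)
    shared = rank_a.keys() & rank_b.keys()
    if not shared:
        raise ValueError("the two texts share no words")
    squared = sum((rank_a[w] - rank_b[w]) ** 2 for w in shared)
    return math.sqrt(squared / len(shared))
```

A distance of zero means the shared words appear in the same rank order in
both texts; larger values mean the orderings differ more.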

Despite all this, one has to keep in mind that one may end up with a Zipf 
distribution for a reason other than the VMs being meaningful or structured 
in a "language-like" fashion. So while it should be noted, it is not 
proof of "languageness", as you said.
Something that I am quite uneasy about is that we should expect to find some 
grammatical constructs, but the search for them has not been very successful 
(or has not been very thorough, I am not sure which).

> I stand by my assertion (though it chimes with my own experience, I don't
> believe I originated it?) that the instance count of Voynichese words seems
> generally low compared with natural languages: and I also don't believe
> that Zipf's Laws are the right way to test this assertion.

If you think a bit more about this, you will realise that the number of 
different words in a corpus which follows Zipf's law is approximately the 
number expected for that particular corpus size. In other words, if it 
follows Zipf's law, then the relative frequencies and the lexicon size are 
more or less what you would expect in other natural languages. 
As Rene pointed out, if a language follows Zipf's law then the increase of 
lexicon size with corpus size follows a particular pattern (which I seem to 
remember is also a power law, but I would appreciate being corrected if that 
is not the case).
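
The growth of lexicon size with corpus size can be measured directly. A
power-law growth of this kind is commonly known as Heaps' (or Herdan's) law.
The Python sketch below only records the growth curve and makes no claim
about its exponent; the function name lexicon_growth and the step parameter
are illustrative assumptions.

```python
def lexicon_growth(tokens, step=1000):
    """Return (tokens read, distinct words seen) pairs.

    A pair is recorded every `step` tokens and once more at the end of the
    list, so the final pair always gives the full corpus and lexicon sizes.
    """
    seen = set()
    curve = []
    for i, token in enumerate(tokens, start=1):
        seen.add(token)
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve
```

Plotting the resulting pairs on double-logarithmic axes shows whether the
growth is close to a straight line, i.e. a power law, and lets one compare
the VMs curve with those of natural-language texts of similar length.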


