Understanding the Second-Order Entropies of Voynich Text

by Dennis J. Stallings

May 11, 1998


Abstract

The anomalous second-order entropies of Voynich text are among its most puzzling features. h1-h2, the difference between the conditional first- and second-order entropies, equals H1-h2, the difference between the first-order absolute entropy and the second-order conditional entropy. h1-h2 (or H1-h2) is a theoretically significant number: it denotes the average information that the first character in a digraph carries about the second. It was therefore chosen as a simple measure of what is being sought, although the whole entropy profile of each text sample was also considered.
Tests show that Voynich text does not have its low h2 measures solely because of a repetitious underlying text, that is, one that often repeats the same words and phrases. Tests also show that the low h2 measures are probably not due to an underlying low-entropy natural language. A verbose cipher, one which substitutes several ciphertext characters for one plaintext character, can produce the entropy profile of Voynich text.



Table of Contents

Introduction
Measures of Relative Second-Order Entropy
Entropies of Voynich Texts
Verbose Ciphers
Repetitive Texts
Schizophrenic Language
Low-Entropy Natural Languages
    Japanese
    Hawaiian
Discussion of Phonemic versus Syllabic Notation
    The Size of the Character Set
    The Effect of Word Divisions
    Redundancy
    The Effect of Syllable Divisions
Final Thoughts on Low-Entropy Natural Languages
Suggestions for Further Work
Acknowledgments
References for Electronic Texts
Printed References

Introduction

William Ralph Bennett first applied the entropy concept to the study of the Voynich Manuscript in his Scientific and Engineering Problem Solving with the Computer (Englewood Cliffs: Prentice-Hall, 1976). His book has introduced many people to the VMs.
The repetitive nature of VMs text is obvious to casual examination. Entropy is one possible numerical measure of a text's repetitiousness. The higher the text's repetitiousness, the lower the second-order entropy (information carried in letter pairs). Bennett noted that only some Polynesian languages have second-order entropies as low as VMs text. Typical ciphers do not have a low second-order entropy either.
This paper examines other possible explanations for the low second-order entropy of Voynich text: a verbose cipher or a repetitious underlying text. It also examines the low-entropy natural languages Hawaiian and Japanese for further insight into the hypothesis that an underlying low-entropy natural language is responsible.

Measures of Relative Second-Order Entropy

Jacques Guy's MONKEY program was used to calculate the second-order entropies. (Note: the bug-free, "sensible" MONKEY on the EVMT Project Home Page was used; the author believes that the version of MONKEY on Garbo as of this writing has bugs.) MONKEY in its present form only takes the first 32,000 characters of a file, so some long texts were divided into portions that MONKEY could analyze separately.

The conditional entropies were used, as is customary on the Voynich E-mail list. Say that H1 is the absolute first-order entropy (computed from single-character frequencies) and H2 is the absolute second-order entropy (computed from digraph frequencies). Then h1 and h2 are the first- and second-order conditional entropies. h2 = H2 - H1: the entropy of a character conditioned on the character before it. h1 = H1, since it depends only on single characters; h1 is thus not really conditional.
The following measures were considered:

h0: zero-order entropy (log2 of the number of different characters)
h1: first-order conditional (or absolute) entropy
h2: second-order conditional entropy
h1-h2 (equivalently H1-h2): the difference between the conditional first- and second-order entropies, which equals the difference between the first-order absolute entropy and the second-order conditional entropy
As will be seen, there is a need here to compare systems with very different numbers of characters, and therefore to scale the statistics somehow to the size of the character set. h1-h2 (or H1-h2) is a theoretically significant number; it denotes the average information carried by the first character in a digraph about the second one. It is perhaps the best single, simple measure of what is being sought.
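To make these measures concrete, here is a minimal Python sketch of the kind of calculation involved (an illustration only, not MONKEY itself). It truncates the input at 32,000 characters, as MONKEY does, and derives h0, h1, h2, and h1-h2 from single-character and digraph counts; the file name in the usage comment is a placeholder.

import math
from collections import Counter

def entropy_profile(text, limit=32000):
    """Return (h0, h1, h2, h1-h2) from character and digraph counts."""
    text = text[:limit]                      # mimic MONKEY's 32,000-character limit
    singles = Counter(text)                  # single-character counts
    digraphs = Counter(zip(text, text[1:]))  # overlapping digraph counts

    n1 = sum(singles.values())
    n2 = sum(digraphs.values())

    # H1, H2: absolute first- and second-order (digraph) entropies
    H1 = -sum((c / n1) * math.log2(c / n1) for c in singles.values())
    H2 = -sum((c / n2) * math.log2(c / n2) for c in digraphs.values())

    h0 = math.log2(len(singles))  # log2 of the number of different characters
    h1 = H1                       # first-order conditional entropy equals H1
    h2 = H2 - H1                  # second-order conditional entropy
    return h0, h1, h2, h1 - h2

# Usage (the file name is a placeholder):
# h0, h1, h2, diff = entropy_profile(open("herbal_b_fsg.txt").read())
# print("h0=%.3f  h1=%.3f  h2=%.3f  h1-h2=%.3f" % (h0, h1, h2, diff))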
The percentage of the maximum absolute second-order entropy might have been used instead; one could calculate what fraction of its maximum possible H2 each alphabet actually delivers. For digraphs over an alphabet of m characters, the maximum is

H2(max) = log2(m^2)

and the percentage of that maximum is

%H2(max) = (H2 / log2(m^2)) * 100

However, H2(max) depends tremendously on m, the size of the character set chosen. For Voynich text, Currier has 36 characters and Basic Frogguy has 23. Characters that are hardly ever used have little effect on h1 and h2, but could make a tremendous difference in H2(max). Therefore, this measure was not used.
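A quick calculation shows how strongly H2(max) depends on m alone; a minimal sketch using the alphabet sizes quoted above:

import math

# H2(max) = log2(m^2) for the two Voynich alphabet sizes mentioned above
for name, m in [("Currier", 36), ("Basic Frogguy", 23)]:
    print("%-14s m = %2d  H2(max) = %6.3f bits" % (name, m, math.log2(m ** 2)))

# Currier        m = 36  H2(max) = 10.340 bits
# Basic Frogguy  m = 23  H2(max) =  9.047 bits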
To start the discussion, here are some data from the English King James Bible:

Table 1: English King James Bible - 1 Kings

Passage Beginning at    # ch.   File Size   h0      h1      h2      h1-h2
1:1                     27      32000       4.755   4.022   3.068   0.953
8:19                    27      32000       4.755   4.028   3.090   0.939
15:27                   27      32000       4.755   3.998   3.092   0.906
Average of three        27      96000       4.755   4.016   3.083   0.933

The h1-h2 range for different portions of the same text is 0.906-0.953.
And here are data on the corresponding portions of the Latin Vulgate Bible:

Table 2: Latin Vulgate Bible - 1 Kings

Passage Beginning at    # ch.   File Size   h0      h1      h2      h1-h2
1:1                     24      32000       4.585   4.002   3.309   0.692
8:19                    24      32000       4.585   3.994   3.287   0.707
15:27                   24      32000       4.585   4.005   3.304   0.700
Average of three        24      96000       4.585   4.000   3.300   0.700

The average h1-h2 is 0.700, compared with 0.933 for the English text. This is undoubtedly because English uses more combinations of two or more letters to represent single phonemes than Latin does. The range of h1-h2 for the Latin text is 0.692-0.707, narrower than for the English text.

The next table shows the h1-h2 statistic for assorted files in various languages and notations, and illustrates how it can reveal unexpected information. For instance, in phonemic notation Hawaiian and Japanese have low h2 values, approaching those of Voynich text; however, their h1-h2 values are far below those of Voynich text.

Table 3: h1-h2 Statistics for Selected Texts

File                                              # ch.   File Size   h0      h1      h2      h1-h2
Latin - Vulgate Bible, 1 Kings, first 32K         24      32000       4.585   4.002   3.309   0.692
Hawaiian (Bennett, limited phonemic)              13      15000       3.700   3.200   2.454   0.746
Hawaiian newspaper (full phonemic)                19      13473       4.248   3.575   2.650   0.925
English - King James Bible - Genesis, first 32K   27      32000       4.755   3.969   3.020   0.949
Japanese Tale of Genji - Section 1 (romaji)       22      32000       4.459   3.763   2.677   1.086
Japanese Tale of Genji - Section 1 (kana)         71      20622       6.150   4.764   3.393   1.370
Voynich Herbal-B (Currier)                        34      13858       5.087   3.796   2.267   1.529
Voynich Herbal-B (EVA)                            21      16061       4.392   3.859   2.081   1.778

Entropies of Voynich Texts

Here are entropy results for Voynich texts: a sample of Herbal-A and a sample of Herbal-B. The Herbal-A sample's h1-h2 ranges from 1.479 to 1.945, depending on which transcription alphabet is used; the Herbal-B sample's ranges from 1.529 to 1.897. All of these are far greater than the 0.93 for English and the 0.70 for Latin.

The choice of transcription alphabet also makes an enormous difference: from Currier to Frogguy the range of h1-h2 is 1.5-1.9. The direction is what one would expect. Currier is the most synthetic, while Frogguy is the most analytical, decomposing single Currier characters into several Frogguy characters; thus Currier Q = Frogguy cqpt.

Table 4: Voynich Texts

Type of Voynich Text   Transcription Alphabet   # ch.   File Size   h0      h1      h2      h1-h2
Herbal-A               Currier                  33      9804        5.044   3.792   2.313   1.479
Herbal-A               FSG                      24      10074       4.585   3.801   2.286   1.515
Herbal-A               EVA                      21      12218       4.392   3.802   1.990   1.812
Herbal-A               Frogguy                  21      13479       4.392   3.826   1.882   1.945
Herbal-B               Currier                  34      13858       5.087   3.796   2.267   1.529
Herbal-B               FSG                      24      14203       4.585   3.804   2.244   1.560
Herbal-B               EVA                      21      16061       4.392   3.859   2.081   1.778
Herbal-B               Frogguy                  21      17909       4.392   3.846   1.949   1.897

The samples of Voynich text are relatively small. The following statistics for samples of a single known Latin text give some idea of how much difference this might make.

Table 5: Texts from the Latin Vulgate Bible, 1 Kings, for Study of the Effect of Sample Size on Entropy Data (Passages All Begin at 1:1)

Passage Ending at   # ch.   File Size   h0      h1      h2      h1-h2
2:18                23      8929        4.524   3.994   3.263   0.731
4:21                24      18623       4.585   3.995   3.298   0.697
7:17                24      29647       4.585   4.003   3.309   0.694

It is doubtful whether h1-h2 or any other single measure can tell us all we want. However, the representation system is probably the heart of the issue. The following discussion of verbose ciphers is a case in point.

Verbose Ciphers

A verbose cipher, one that substitutes several ciphertext characters for one plaintext character, can produce the entropy profile seen with Voynich text. Such a system is Cat Latin C, which is applied to Latin plaintext. Vowels and consonants were added to the expansions roughly in proportion to their occurrence in Latin, which keeps h1 roughly the same as for Latin and for Voynich text in FSG. The repeated digraphs are what reduce h2 to the desired level. If a ciphertext q is followed by u, it represents plaintext qu, as in normal Latin; otherwise the q belongs to one of the consonant expansion patterns, so the scheme is unambiguous. This scheme does produce VMs-like entropies!

This table shows the Cat Latin C verbose cipher:

Table 6: Cat Latin C

Plaintext   Ciphertext
a           a
b           bqbababa
c           c
d           dqdede
e           e
f           fqfififi
g           gqgogogo
h           h
i           i
j           jqjajaja
k           k
m           mqmememe
n           nqninini
o           o
p           pqpopopo
qu          qu
r           rqrarara
s           sqsesese
t           tqtititi
u           u
v           v
w           w
x           xqxoxoxo
y           y
z           zqzazaza
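To illustrate, here is a minimal sketch of applying the Table 6 substitutions to Latin plaintext. It assumes lowercase input, handles plaintext qu before single letters, and passes through any letter not listed in the table; it is an illustration of the scheme as described, not the program actually used.

# Cat Latin C verbose cipher, following Table 6 (a sketch; plaintext is assumed
# to be lowercase, and letters not listed in the table pass through unchanged).
CAT_LATIN_C = {
    "b": "bqbababa", "d": "dqdede",   "f": "fqfififi", "g": "gqgogogo",
    "j": "jqjajaja", "m": "mqmememe", "n": "nqninini", "p": "pqpopopo",
    "r": "rqrarara", "s": "sqsesese", "t": "tqtititi", "x": "xqxoxoxo",
    "z": "zqzazaza",
}

def cat_latin_c(plaintext):
    out = []
    i = 0
    while i < len(plaintext):
        if plaintext[i:i + 2] == "qu":           # plaintext 'qu' is enciphered as itself
            out.append("qu")
            i += 2
        else:
            ch = plaintext[i]
            out.append(CAT_LATIN_C.get(ch, ch))  # vowels, h, k, etc. pass through
            i += 1
    return "".join(out)

# Example: cat_latin_c("servi") -> "sqseseseerqrararavi",
# matching the enciphered Vulgate sample shown below.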

For comparison, here are VMs results in FSG, since that character set is closest in size to the Latin alphabet.

Table 7: Verbose Cipher Compared to Voynich Text

File                                 # ch.   File Size   h0      h1      h2      h1-h2
Voynich Herbal-A (FSG)               24      10074       4.585   3.801   2.286   1.515
Voynich Herbal-B (FSG)               24      14203       4.585   3.804   2.244   1.560
Latin Vulgate, 1 Kings, 1:1 - 2:11   23      8232        4.524   3.996   3.262   0.734
Above passage, Cat Latin C           23      28754       4.524   3.873   2.278   1.595

However, it's clear that this is not the same pattern as Voynich text. It might be best to look for patterns subjectively. Here are some text samples.
The start of the Voynich Herbal-A sample file (f29v, lines 1-9), in EVA:


kshol qoocph shor pshocph shepchy qoty dy shory
ykcholy qoty chy dy qokchol chor tchy qokchody cheor o
chor chol chy choiin
tshoiin cheor chor o chty qotol sheol shor daiin qoty
otol chol daiin chkaiin shoiin qotchey qotshey daiiin
daiin chkaiin
pchol oiir chol tsho daiin sho teo chy chtshy dair am
okain chan chain cthor dain yk chy daiin cthol
sot chear chl s choly dar
 

The beginning of a Hawaiian sample file, from a Hawaiian newspaper, to be discussed later:
kepakemapa mei puke kepakemapa mei mahalo 'ia ka 'Olelo hawai'i e nA mAka' na ho'Olanani kim ma ka lA o malaki ua noa ka pAka 'o kapi'olani no ke anaina na lAkou ke kuleana 'o ka mAlama 'ana ma ka 'Olelo 'ana aku i ka 'Olelo hawai'i ma laila nO i 'Akoakoa ai ka po'e haumAna ka po'e kumu ka po'e mAkua a me ka po'e hoa o kElA 'ano kEia 'ano o ka 'Olelo hawai'i a ma laila nO ho'i i launa ai ka po'e ma o ka 'Olelo hawai'i kapa 'ia kEia lA hoihoi 'o ka lA 'ohana
Finally, the beginning of the Latin Vulgate 1 Kings in Cat Latin C:
etqtititi rqrararaexqxoxoxo dqdedeavidqdede sqseseseenqnininiuerqrararaatqtititi habqbababaebqbababaatqtititique aetqtititiatqtititiisqsesese pqpopopolurqrararaimqmememeosqsesese dqdedeiesqsesese cumqmememeque opqpopopoerqrararairqrararaetqtititiurqrarara vesqsesesetqtititiibqbababausqsesese nqnininionqninini calefqfififiiebqbababaatqtititi dqdedeixqxoxoxoerqrararaunqnininitqtititi erqrararagqgogogoo ei sqseseseerqrararavi ...
Look at these samples and think about the kind of repetition involved in each case! The "Cat Latin C" verbose cipher is clearly not the same thing as Voynichese.
Here are the entropy values for these samples:

Table 8: Statistics on Text Samples

File                                              # ch.   File Size   h0      h1      h2      h1-h2
Voynich Herbal-A (EVA)                            21      12218       4.392   3.802   1.990   1.812
Hawaiian newspaper (full phonemic)                19      13473       4.248   3.575   2.650   0.925
Latin Vulgate, 1 Kings, 1:1 - 2:11, Cat Latin C   23      28754       4.524   3.873   2.278   1.595

The author's personal opinion is that the rigid internal structure of Voynich text accounts for the low h2 measures. The majority of Voynich "words" follow a paradigm. Robert Firth (Work Note #24) and Jorge Stolfi (Voynich Page) have both identified paradigms. Captain Prescott Currier (Currier's Papers) identified several other kinds of internal structure in Voynich text.

Repetitive Texts

From time to time, some have suggested that the Voynich Manuscript is simply a very repetitious text. Here is a repetitious magical spell in Old High German:
 
         eiris sazun idisi             sazun her duoder
         suma hapt heptidun            suma heri lezidun
         suma clubodun                 umbi cuoniouuidi
         insprinc haptbandun           inuar uigandun

         phol ende uuodan              uuorun zi holza
         du uuart demo balderes uolon  sin uuoz birenkit
         thu biguol en sinthgunt       sunna era suister
         thu biguol en friia           uolla era suister
         thu biguol en uuodan          so he uuola conda
         sose benrenki                 sose bluotrenki
         sose lidirenki
         ben zi bena                   bluot zi bluoda
         lid zi geliden                sose gelimida sin

Merseburger Zaubersprüche (Magic Spells from Merseburg) in Old High German. Note: 'uu' = 'w'.
An experiment to test this idea is to take samples of known repetitious texts (food recipes, religious texts, catalogs) and compare their second-order entropies with those of known texts that should be less repetitious (prose fiction, essays).
Note that some long texts were larger than MONKEY's 32,000 character limit; in those cases MONKEY just took the first 32,000 characters. Some long texts were divided up into separate portions that MONKEY could analyze.
Jacobean English. Ever since its publication, many commentators have noted how repetitious the Book of Mormon is. The Bible itself is, of course, somewhat repetitious. A (relatively) non-repetitious text in Jacobean English is the Essays of Sir Francis Bacon.
The Book of Mormon appears to be the most repetitious: h1-h2 for the Book of Mormon excerpts ranges from 0.931 to 0.980. The King James Bible is next, at 0.904-0.983. The non-repetitious Essays of Francis Bacon range from 0.827 to 0.837. Taking averages, the difference in h1-h2 between the most repetitious text and the least is 0.951 versus 0.831, a difference of 0.120.

Table 9: Jacobean English Texts of Varying Repetition

File                             # ch.   File Size   h0      h1      h2      h1-h2
Book of Mormon - 1 Nephi         27      32000       4.755   4.033   3.090   0.942
Book of Mormon - Alma            27      32000       4.755   4.041   3.109   0.931
Book of Mormon - Ether           27      32000       4.755   4.009   3.029   0.980
King James Bible - Genesis       27      32000       4.755   3.969   3.020   0.949
King James Bible - Joshua        27      32000       4.755   4.012   3.029   0.983
King James Bible - Acts          27      32000       4.755   4.041   3.137   0.904
Francis Bacon's Essays, Part 1   27      32000       4.755   4.048   3.220   0.827
Francis Bacon's Essays, Part 2   27      32000       4.755   4.042   3.214   0.828
Francis Bacon's Essays, Part 3   27      32000       4.755   4.066   3.229   0.837

Latin (Late Classical). Samples of the Vulgate Bible and Boethius' Consolations of Philosophy were analyzed. There is little difference in the statistics between the Vulgate Bible and the presumably less repetitious Consolatio Philosophiae.

Table 10: Latin Texts of Varying Repetition

File                                               # ch.   File Size   h0      h1      h2      h1-h2
1 Kings, Vulgate, 1:1                              24      32000       4.585   4.002   3.309   0.692
1 Kings, Vulgate, 8:19                             24      32000       4.585   3.994   3.287   0.707
1 Kings, Vulgate, 15:27                            24      32000       4.585   4.005   3.304   0.700
Boethius - Consolatio Philosophiae - Books 3 & 4   25      32000       4.644   3.971   3.272   0.699

Modern English. Repetitive texts: food recipes (chicken and Cajun), a catalog of technical standards, and a Roman Catholic litany. For a non-repetitious text: a short story, "The Blue Hotel" by Stephen Crane.

The non-repetitious short story "The Blue Hotel" has an h1-h2 of 0.826, while the repetitious Roman Catholic litany has an h1-h2 of 0.968. The difference is 0.968 - 0.826 = 0.142. The other texts mostly fall in between, although the presumably repetitious Cajun recipe has an h1-h2 of 0.827, almost identical to the short story.

Table 11: Modern English Texts of Varying Repetition

File                                                             # ch.   File Size   h0      h1      h2      h1-h2
Modern English - Roman Catholic litany                           26      9492        4.700   4.071   3.103   0.968
Modern English - ISO 14000 catalog                               27      6696        4.755   4.076   3.137   0.939
Modern English - The Blue Hotel by Stephen Crane (short story)   27      32000       4.755   4.073   3.247   0.826
Modern English - Cajun recipe                                    27      27363       4.755   4.124   3.297   0.827
Modern English - Chicken recipe                                  27      18461       4.755   4.131   3.193   0.938

For comparison, here are data for Voynich texts in FSG, which has the character set closest in size to the ordinary Latin alphabet.

Table 12: Voynich Texts in FSG

Type of Voynich Text   Transcription Alphabet   # ch.   File Size   h0      h1      h2      h1-h2
Herbal-A               FSG                      24      10074       4.585   3.801   2.286   1.515
Herbal-B               FSG                      24      14203       4.585   3.804   2.244   1.560

Compare the differences due to repetition in English texts (0.968 - 0.826 = 0.142 for modern English and 0.951 - 0.831 = 0.120 for Jacobean English) with the h1-h2 values for Voynich text (1.515 and 1.560). It becomes clear that a repetitious underlying format or subject matter could not change a text in a normal European language into a Voynich text. Thus Voynich text clearly does not have its low h2 measures solely because of a repetitious underlying text, that is, one that often repeats the same words and phrases.

Schizophrenic Language

In an important paper that discusses the Voynich Manuscript, Professor Sergio Toresella says that the VMs author had a psychiatric disturbance. One of the works Toresella cites in this connection, Creativity by Silvano Arieti, discusses the distorted language of schizophrenics, though not other language phenomena.
At the Kooks Museum, in the Schizophrenic Wing, there is a sample of schizophrenic language: a transcript of flyers by Francis E. Dec, Esquire, containing two Rants.

Here is an excerpt from Rant #2:
"Computer God computerized brain thinking sealed robot operating arm surgery cabinet machine removal of most of the frontal command lobe of the brain, gradually, during lifetime and overnight in all insane asylums after Computer God kosher bosher one month probation period creating helpless, hopeless Computer God Frankenstein Earphone Radio parroting puppet brainless slaves, resulting in millions of hopeless helpless homeless derelicts in all Jerusalem, U.S.A. cities and Soviet slave work camps. Not only the hangman rope deadly gangster parroting puppet scum-on-top know this top medical secret, even worse, deadly gangster Jew disease from deaf Ronnie Reagan to U.S.S.R. Gorbachev know this oy vay Computer God Containment Policy top secret. Eventual brain lobotomization of the entire world population for the Worldwide Deadly Gangster Communist Computer God overall plan, an ideal worldwide population of light-skinned, low hopeless and helpless Jew-mulattos, the communist black wave of the future."
The samples and discussion of schizophrenic speech in Arieti resemble Francis Dec's rants in their repeated but disconnected ideas, alliteration, and so on.
MONKEY was run on the two Rants and the results were compared with examples of normal English text:

Table 13: Schizophrenic Rant Compared to Other English Texts

File                                                             # ch.   File Size   h0      h1      h2      h1-h2
Schizophrenic rant                                               27      12967       4.755   4.182   3.428   0.755
King James Bible - Genesis                                       27      32000       4.755   3.969   3.020   0.949
Francis Bacon's Essays, Part 1                                   27      32000       4.755   4.048   3.220   0.827
Modern English - Roman Catholic litany                           26      9492        4.700   4.071   3.103   0.968
Modern English - The Blue Hotel by Stephen Crane (short story)   27      32000       4.755   4.073   3.247   0.826

The second-order entropy of the schizophrenic rants is distinctly higher, and their h1-h2 distinctly lower, than those of any of the ordinary texts. As with the repetitive texts, the nature of the underlying text alone would not explain the puzzling character of VMs text.

Low-Entropy Natural Languages

One may write Japanese in Latin characters (romaji) or in syllabic scripts (hiragana and katakana, collectively the kana). In romaji, Japanese is a low-entropy language because of a relatively small phonemic inventory and severe phonotactic constraints. A Japanese syllable may begin with zero or one consonant (counting ts, ry, and ky as single consonants), has one vowel, and may end only in nothing or -n (although the following syllable's consonant may be doubled). (Japanese also distinguishes long and short vowels, which complicates this a little.)
However, these same severe phonotactic constraints mean that only a limited number of syllables is possible in Japanese, and therefore that a syllabic script such as kana is feasible. One would expect Japanese in kana to have a higher relative h2 (a lower h1-h2) than Japanese in romaji.
Hawaiian has even more severe phonotactic constraints, so one ought also to be able to write Hawaiian in a syllabic script. In Hawaiian a syllable may begin with zero or one consonant, has exactly one vowel, and may not end in a consonant at all. Hawaiian also has a much more limited phonemic inventory than Japanese. Hawaiian is especially significant because Bennett compared Voynichese to Hawaiian and noted that they had similar second-order entropies; indeed, Bennett said that some Polynesian languages are the only natural languages with second-order entropies as low as Voynichese.
Therefore, in order to gain insight on these issues, Hawaiian and Japanese are compared in syllabic as well as phonemic notation.
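To make the phonemic-versus-syllabic distinction concrete, here is a minimal sketch that segments full-phonemic Hawaiian into (C)V syllables. It illustrates the idea only, not the exact syllabic script used for the statistics below; mapping each distinct syllable to a single arbitrary character for MONKEY is left as a comment.

import re

# A Hawaiian syllable in the full phonemic notation: an optional single consonant
# (h k l m n p w or the glottal stop ') followed by exactly one vowel
# (a e i o u, with capitals marking long vowels).
SYLLABLE = re.compile(r"[hklmnpw']?[aeiouAEIOU]")

def syllabify(word):
    """Split one Hawaiian word into (C)V syllables."""
    return SYLLABLE.findall(word)

sample = "ma ka lA o malaki ua noa ka pAka 'o kapi'olani"
for word in sample.split():
    print(word, "->", "-".join(syllabify(word)))

# To build a syllabic notation for MONKEY, each distinct syllable would then be
# mapped to a single arbitrary character, keeping the spaces between words.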

Japanese

The classic Japanese novel Tale of Genji is written almost entirely in kana. Gabriel Landini kindly adapted this both into romaji and into a kana notation that MONKEY could analyze.

Table 14: Entropies of Japanese in Romaji and Kana

File                        Orthography   # ch.   File Size   h0      h1      h2      h1-h2
Tale of Genji - Section 1   Romaji        22      32000       4.459   3.763   2.677   1.086
Tale of Genji - Section 2   Romaji        20      31505       4.322   3.751   2.627   1.124
Tale of Genji - Section 3   Romaji        20      29474       4.322   3.749   2.639   1.110
Tale of Genji - Section 4   Romaji        20      32000       4.322   3.750   2.641   1.109
Tale of Genji - Section 5   Romaji        20      27064       4.322   3.744   2.630   1.114
Tale of Genji - Overall     Romaji        22      152043      4.459   3.751   2.643   1.108
Tale of Genji - Section 1   Kana          71      20622       6.150   4.764   3.393   1.370
Tale of Genji - Section 2   Kana          71      20622       6.150   4.764   3.393   1.370
Tale of Genji - Section 3   Kana          70      18574       6.129   4.709   3.410   1.298
Tale of Genji - Section 4   Kana          70      20386       6.129   4.716   3.464   1.252
Tale of Genji - Section 5   Kana          70      17096       6.129   4.698   3.362   1.337
Tale of Genji - Overall     Kana          71      97300       6.150   4.730   3.404   1.326

As one would expect, the absolute h0, h1, and h2 values for kana are much higher than those for romaji. However, the h1-h2 values are also consistently higher for kana, which one would not expect.

Hawaiian

Bennett did his Hawaiian study with a limited Hawaiian orthography that did not mark vowel length or the glottal stop. Therefore, statistics were run on Hawaiian both in limited phonemic and syllabic spellings, with long and short vowels not separated and the glottal stop not indicated, and in full phonemic and syllabic notation.

Hawaiian has the following phonemes:

Consonants: h k l m n p w ' (glottal stop)
Vowels: a e i o u A E I O U (capitals indicate long vowels)

Bennett used a "lossy" Hawaiian orthography that did not distinguish the long vowels and did not write the glottal stop (call this Hawaiian limited phonemic). He also had his own Voynich transcription alphabet. Finally, he compared only the absolute h2 values, not relative measures such as h1-h2. This is as good an illustration as any of the problems here.
Here is a sample, in Bennett's notation, of the Hawaiian newspaper text used for statistics in this paper:

ma ka la o malaki ua noa ka paka o kapiolani no ke anaina na lakou ke kuleana o ka malama ana ma ka olelo ana aku i ka olelo hawaii ma laila no i Akoakoa ai ka poe haumana ka

And here is the same text in full phonemic notation:

ma ka lA o malaki ua noa ka pAka 'o kapi'olani no ke anaina na lAkou ke kuleana 'o ka mAlama 'ana ma ka 'Olelo 'ana aku i ka 'Olelo hawai'i ma laila nO i 'Akoakoa ai ka po'e haumAna ka
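Going from the full notation to the limited one is largely mechanical: drop the glottal stops and collapse the long (capitalized) vowels into short ones. A minimal sketch of that reduction, assuming the notation described above:

def to_limited_phonemic(full_text):
    """Reduce full phonemic Hawaiian (glottal stops, capital long vowels)
    to a Bennett-style limited notation (no glottal stop, no vowel length)."""
    return full_text.replace("'", "").lower()

full = "ma ka lA o malaki ua noa ka pAka 'o kapi'olani"
print(to_limited_phonemic(full))
# -> ma ka la o malaki ua noa ka paka o kapiolani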

 

 

Here are the entropy values:

Table 15: Entropies of Hawaiian Texts in Different Orthographies

File                 Orthography        # ch.   File Size   h0      h1      h2      h1-h2
Hawaiian (Bennett)   limited phonemic   13      15000       3.700   3.200   2.454   0.746
Hawaiian newspaper   limited phonemic   13      13097       3.700   3.224   2.437   0.787
Hawaiian newspaper   limited syllabic   39      9533        5.285   3.816   2.929   0.887
Hawaiian newspaper   full phonemic      19      13473       4.248   3.575   2.650   0.925
Hawaiian newspaper   full syllabic      77      9160        6.267   4.361   3.162   1.200

And here are data for Bennett's and this paper's Voynich texts for comparison:

Table 16: Voynich Texts for Comparison with Hawaiian

Type of Voynich Text   Transcription Alphabet   # ch.   File Size   h0      h1      h2      h1-h2
Voynich (Bennett)      Bennett                  21      10000       4.392   3.660   2.220   1.440
Herbal-A               Currier                  33      9804        5.044   3.792   2.313   1.479
Herbal-A               FSG                      24      10074       4.585   3.801   2.286   1.515
Herbal-A               EVA                      21      12218       4.392   3.802   1.990   1.812
Herbal-A               Frogguy                  21      13479       4.392   3.826   1.882   1.945
Herbal-B               Currier                  34      13858       5.087   3.796   2.267   1.529
Herbal-B               FSG                      24      14203       4.585   3.804   2.244   1.560
Herbal-B               EVA                      21      16061       4.392   3.859   2.081   1.778
Herbal-B               Frogguy                  21      17909       4.392   3.846   1.949   1.897

Bennett compared his Voynich text in a 21-character transcription to Hawaiian in a 13-character orthography (including the space character). He got h2 values of 2.220 for Voynich text and 2.454 for his Hawaiian text. However, a sample of Hawaiian text in a full phonemic orthography, with 19 characters including spaces, has h2 of 2.650, even higher. A comparison of h1-h2 values shows a dramatic difference between Hawaiian and Japanese on one hand and Voynichese on the other. h1-h2 equals 1.8 for Voynichese in EVA. h1-h2 is 0.746 for Bennett's Hawaiian data, 0.925 for Hawaiian in full phonemic notation, and 1.1 for Japanese romaji. These figures are all very different from Voynichese.

Discussion of Phonemic versus Syllabic Notation

While perhaps not germane to the Voynich Manuscript problem, it is odd that h1-h2 increases from phonemic to syllabic notation, both for Japanese and Hawaiian. In syllabic notation, given the first character, the second character is more predictable than it is in phonemic notation. This is quite puzzling. How can we explain these results for Hawaiian and Japanese?

The Size of the Character Set

In going from phonemic to syllabic notation, the text becomes shorter and more information is packed into fewer characters, but that is accomplished by using a larger character set. The character sets for the syllabic notations are more than three times the size of those for the phonemic notations. The measure h1-h2 was chosen to minimize the effect of character-set size, but it is surely not entirely successful in doing so.

 

 

The Effect of Word Divisions

Perhaps one loses predictability because the number of space characters in relation to the total is greater for syllabic notation than for phonemic. If that were the case, leaving out the spaces ought to decrease h1-h2 for syllabic notation more than for phonemic notation. MONKEY runs were made leaving out the spaces to test this. However, the h1-h2 results for syllabic notation decrease less than those for phonemic notation do.
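The test itself is simple. A minimal sketch, reusing the entropy_profile function from the earlier sketch (the file name is a placeholder):

# Compare h1-h2 with and without word divisions, using the entropy_profile
# sketch given earlier (the file name is a placeholder).
text = open("genji_section1_romaji.txt").read()

with_spaces = entropy_profile(text)
without_spaces = entropy_profile(text.replace(" ", ""))

print("h1-h2 with spaces:    %.3f" % with_spaces[3])
print("h1-h2 without spaces: %.3f" % without_spaces[3])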

 

 
 
 
 

Table 17: The Effect of Word Divisions on Statistics for Japanese and Hawaiian

File                                 Orthography     Spaces Included   # ch.   File Size   h0      h1      h2      h1-h2
Japanese Tale of Genji - Section 1   Romaji          Yes               22      32000       4.459   3.763   2.677   1.086
Japanese Tale of Genji - Section 1   Romaji          No                21      26106       4.392   3.803   2.935   0.868
Japanese Tale of Genji - Section 1   Kana            Yes               71      20622       6.150   4.764   3.393   1.370
Japanese Tale of Genji - Section 1   Kana            No                70      14051       6.129   5.666   4.330   1.337
Hawaiian newspaper                   Full Phonemic   Yes               19      13473       4.248   3.575   2.650   0.925
Hawaiian newspaper                   Full Phonemic   No                18      10433       4.170   3.622   2.935   0.687
Hawaiian newspaper                   Full Syllabic   Yes               77      9160        6.267   4.361   3.162   1.200
Hawaiian newspaper                   Full Syllabic   No                76      6120        6.248   5.156   3.982   1.174

Redundancy

Gabriel Landini, who did graduate studies in Japan, noted that the redundancy of Japanese is only apparent: the language is actually rather ambiguous. In writing, this ambiguity is overcome with ideographs (kanji); in speech, it is overcome by context and by rigid structures (set phrases and expressions).

However, Jacques Guy (who holds a doctorate in Polynesian languages and was once fluent in Tahitian) notes that Tahitian, a language similar to Hawaiian, is no more ambiguous than English or French! So redundancy is not likely the explanation.

The Effect of Syllable Divisions

Could the (relatively) high h1-h2 values for syllabic Hawaiian and Japanese mean that combinations of two syllables (e.g., yama in Japanese, wiki in Hawaiian) are as repetitious and fixed as combinations of phonemes within syllables?

The phonemic-versus-syllabic problem is more complex than that. Take "yamamoto" in romaji and in kana: (ya)(ma)(mo)(to). When analyzing the second-order entropy in romaji, one is looking at the distribution of the digraphs "ya", "am", "ma", "mo", "ot", and "to", while for kana it is "(ya)(ma)", "(ma)(mo)", "(mo)(to)". For roughly half of the romaji digraphs, one is dealing with combinations of letters ("am", "ot") that are never represented in kana. So the second-order entropy of one type of text is not strictly comparable with the second-order entropy of the other. The second-order entropy of the romaji text is in principle close in meaning to the first-order entropy of the kana, but only about half of the digraphs correspond to kana.
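A small sketch makes the comparison explicit (the kana version is represented here simply as a list of syllable tokens):

word_romaji = "yamamoto"
word_kana = ["ya", "ma", "mo", "to"]        # the same word as syllable tokens

digraphs_romaji = [word_romaji[i:i + 2] for i in range(len(word_romaji) - 1)]
digraphs_kana = list(zip(word_kana, word_kana[1:]))

print(digraphs_romaji)  # ['ya', 'am', 'ma', 'am', 'mo', 'ot', 'to']
print(digraphs_kana)    # [('ya', 'ma'), ('ma', 'mo'), ('mo', 'to')]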
While the differences in statistics between syllabic and phonemic notation are interesting, they are not necessarily relevant to the Voynich Manuscript. They are chiefly interesting in raising questions about the use of the entropy concept.

 

 

Final Thoughts on Low-Entropy Natural Languages

Consider again the start of the Herbal-A sample file (f29v, lines 1-9), in EVA:
kshol qoocph shor pshocph shepchy qoty dy shory
ykcholy qoty chy dy qokchol chor tchy qokchody cheor o
chor chol chy choiin
tshoiin cheor chor o chty qotol sheol shor daiin qoty
otol chol daiin chkaiin shoiin qotchey qotshey daiiin
daiin chkaiin
pchol oiir chol tsho daiin sho teo chy chtshy dair am
okain chan chain cthor dain yk chy daiin cthol
sot chear chl s choly dar

 
And then the beginning of the Hawaiian newspaper sample file:

 

 

kepakemapa mei puke kepakemapa mei mahalo 'ia ka 'Olelo hawai'i e nA mAka' na ho'Olanani kim ma ka lA o malaki ua noa ka pAka 'o kapi'olani no ke anaina na lAkou ke kuleana 'o ka mAlama 'ana ma ka 'Olelo 'ana aku i ka 'Olelo hawai'i ma laila nO i 'Akoakoa ai ka po'e haumAna ka po'e kumu ka po'e mAkua a me ka po'e hoa o kElA 'ano kEia 'ano o ka 'Olelo hawai'i a ma laila nO ho'i i launa ai ka po'e ma o ka 'Olelo hawai'i kapa 'ia kEia lA hoihoi 'o ka lA 'ohana
One sees that the low h2's of Hawaiian and Japanese are due to their very strict consonant-vowel alternation. The EVA Voynich sample shows that the consonant-vowel alternation of Voynichese (as determined by the Sukhotin vowel-recognition algorithm) is not as strict.
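For reference, here is a minimal sketch of a common formulation of Sukhotin's vowel-identification algorithm, which works purely from adjacency counts within words. It illustrates the general method only, not the particular implementation used on Voynich text.

from collections import defaultdict

def sukhotin_vowels(text):
    """Classify characters as vowels with Sukhotin's algorithm.

    Counts adjacencies between *different* characters within words, then
    repeatedly promotes the character with the largest remaining adjacency
    sum to 'vowel' and discounts its neighbours, until no positive sum is left.
    """
    adj = defaultdict(lambda: defaultdict(int))
    for word in text.split():
        for a, b in zip(word, word[1:]):
            if a != b:                       # self-adjacencies are ignored
                adj[a][b] += 1
                adj[b][a] += 1

    sums = {c: sum(adj[c].values()) for c in adj}
    vowels = set()
    while sums:
        c = max(sums, key=sums.get)
        if sums[c] <= 0:
            break
        vowels.add(c)
        del sums[c]
        for other in sums:                   # discount adjacencies with the new vowel
            sums[other] -= 2 * adj[other][c]
    return vowels

# Usage (the file name is a placeholder):
# print(sorted(sukhotin_vowels(open("herbal_a_eva.txt").read())))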
Once again, h1-h2 equals 1.8 for Voynichese in EVA. h1-h2 is 0.746 for Bennett's Hawaiian data, 0.925 for Hawaiian in full phonemic notation, and 1.1 for Japanese romaji. These figures are all very different from Voynichese.
For these reasons, it seems unlikely that an underlying low-entropy natural language explains the low h2 measures of Voynich text.

 

 

Suggestions for Further Work

The various h2 measures are only crude, partial measures of all the factors that interest us. However, the entropy measure will continue to be useful. It would be nice to have a program that could calculate the entropies of files larger than 32K and calculate higher-order entropies more accurately.

 

 

The author believes that the "paradigms" and other structural restrictions of Voynichese explain the low h2 measures. Further study of these structural constraints will be most useful.

Acknowledgments

Many of these ideas and data were previously discussed on the Voynich E-mail list. A special thanks to Gabriel Landini and Rene Zandbergen for their assistance.

 

 

References for Electronic Texts

  1. Voynich Text. Rene Zandbergen kindly provided samples of Herbal-B and Herbal-A from voynich.now.
     Herbal-B: 26r, 26v, 31r, 31v, 33r, 33v, 34r, 34v, 39r, 39v, 40r, 40v, 41r, 41v, 43r, 43v, 46r, 46v, 48r, 48v, 50r, 50v, 55r, 55v, 57r
     Selected Herbal-A: 28v, 29r, 29v, 30r, 30v, 32r, 32v, 35r, 35v, 36r, 36v, 37r, 37v, 38r, 38v, 42r, 42v, 44r, 44v, 45r, 45v, 47r, 47v, 49r, 49v
  2. Jacobean English: Book of Mormon; Bible, KJV; Sir Francis Bacon, Essays.
  3. Late Classical Latin: Vulgate Latin Bible; Boethius, Consolatio Philosophiae, Books 3 & 4 (via Estragon or Gopher).
  4. Modern English: Catholic litany; ISO standard catalog; "The Blue Hotel", by Stephen Crane; chicken recipe; Cajun recipes, Parts 1 and 2.
  5. Japanese Text. Gabriel Landini kindly prepared this. The text is from the first four parts of the Genji monogatari [Tale of Genji, a classic Japanese novel mostly written in hiragana]: 01 Kiritsubo, 02 Hahakigi, 03 Utsusemi, 04 Yugao. The "kana" output is not kana, of course, but an arbitrary substitution for kana so that MONKEY could be applied.
  6. Hawaiian. The author prepared the Hawaiian texts. Hawaiian has the following phonemes:

     Consonants: h k l m n p w ' (glottal stop)
     Vowels: a e i o u A E I O U (capitals indicate long vowels)

     However, the difference between long and short vowels is often not indicated, and the glottal stop is often not written. Obviously both of these things need to be written, since even with them Hawaiian has a rather limited phonemic inventory! The Hawaiian text came from all the articles in one issue of a Hawaiian newspaper:

     Na Maka o Kana, Puke 5, Pepa 5, 15 Malaki, 1997

     The text was changed to the notation above. All numbers and all English, Japanese, and other foreign words were removed until the character set (the number of characters MONKEY showed) matched the Hawaiian notation. A syllabic script for Hawaiian, using characters that MONKEY recognizes, was devised.
  7. Schizophrenic Language. At the Kooks Museum, in the Schizophrenic Wing, there is a transcript of flyers by Francis E. Dec, Esquire, containing two schizophrenic Rants.
     

Printed References

  1. Arieti, Silvano. Creativity: The Magic Synthesis. New York: Basic Books, c1976. Library of Congress call number: BF408.A64.
  2. Bennett, William Ralph. Scientific and Engineering Problem Solving with the Computer. Englewood Cliffs: Prentice-Hall, 1976. [Contains a chapter on the VMs.]
  3. D'Imperio, M. E. The Voynich Manuscript: An Elegant Enigma. National Security Agency, 1978. Aegean Park Press, 1978?
  4. Toresella, Sergio. "Gli erbari degli alchimisti." [Alchemical herbals.] In Arte farmaceutica e piante medicinali: erbari, vasi, strumenti e testi dalle raccolte liguri [Pharmaceutical art and medicinal plants: herbals, jars, instruments and texts of the Ligurian collections], Liana Saginati, ed. Pisa: Pacini Editore, 1996, pp. 31-70. [Profusely illustrated. Fits the VMs into an "alchemical herbal" tradition.]

Copyright © 1998 by Dennis J. Stallings, all rights reserved.

