Voynich Reconsidered: punctuation or junk

In recent articles on this platform, and in my book Voynich Reconsidered (Schiffer Publishing, 2024), I alluded to two phenomena that I had detected in the text of the Voynich manuscript. They are as follows:
• the well-documented “marching order” of the glyphs within Voynich “words”: first identified by Mary D’Imperio around 1978 in a classified paper for the National Security Agency, and restated in quantitative terms by Massimiliano Zattera at the Voynich 2022 conference;
• what I call the “truncation effect”: the phenomenon whereby we can remove the initial and final glyphs from every Voynich “word” and apparently (at least as measured by the incidence of hapax legomena) retain any meaning that exists within the text.
These two phenomena are not necessarily related. But they might be. I have conjectured that they could be interpreted in at least the following ways:
• as evidence that the Voynich scribes mapped from letters in precursor languages to glyphs; and then, in each Voynich “word”, sorted the glyphs in a prescribed order (which, centuries later, D’Imperio would call the “five states”, and Zattera would characterise as the “slot alphabet”)
• as evidence that in each Voynich “word” (or in some of the “words”) the initial and final glyphs have no semantic meaning: that is, they are either some form of punctuation, or possibly junk.
These interpretations are not mutually exclusive. It could be that the glyphs are ordered within “words”, in some kind of sequence that, in a natural language, we would call alphabetical; and also, that some of the “initial” and “final” glyphs are meaningless fillers, or junk.
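The first interpretation, in which the glyphs of each “word” are sorted into a prescribed order, can be sketched in a few lines of Python. The order used here (plain alphabetical order of Latin letters) and the function names are assumptions made purely for illustration; they are not the “five states” or the “slot alphabet” themselves.

```python
# Hypothetical prescribed order: plain alphabetical order stands in for
# whatever sequence the scribes might have been given.
PRESCRIBED_ORDER = "abcdefghijklmnopqrstuvwxyz"
RANK = {glyph: i for i, glyph in enumerate(PRESCRIBED_ORDER)}


def apply_marching_order(word: str) -> str:
    """Reorder the glyphs of one word according to PRESCRIBED_ORDER."""
    # Glyphs missing from the order are placed last, keeping their relative order.
    return "".join(sorted(word, key=lambda g: RANK.get(g, len(RANK))))


def follows_marching_order(word: str) -> bool:
    """Check whether a word's glyphs already appear in the prescribed order."""
    return word == apply_marching_order(word)


print(apply_marching_order("score"))     # ceors
print(follows_marching_order("ago"))     # True
print(follows_marching_order("four"))    # False
```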

Glyphs as punctuation

If some of the glyphs represent punctuation in the precursor documents, we need to find some means of distinguishing them from glyphs that represent letters.

Here, frequency analysis might be a useful tool.

I am inclined again to turn to some of my favorite medieval documents: Dante’s La Divina Commedia (Italian) and Monarchia (Latin); Meshari (Albanian), Dalimilova kronika (Bohemian), the Auchinleck manuscript (English), La Farce de Maistre Pathelin (French), Cantigas d'Amigo (Galician-Portuguese), and Der Ackermann aus Böhmen (German). In these documents, punctuation accounts for between 3.2 percent and 6.9 percent of the characters.

In the v101 transliteration of the Voynich manuscript, there are 158,940 glyphs and 40,706 “words”, of which 1,948 are single-glyph “words” and 38,758 have “initial” glyphs and “final” glyphs. Thus “initial” glyphs alone account for about 24.4 percent of all the glyphs (38,758 of 158,940), which seems far too many to represent punctuation.
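The arithmetic behind this comparison can be checked with a short Python sketch. It uses only the counts quoted above and the upper end of the punctuation range observed in the medieval documents; the variable names are illustrative.

```python
# Counts quoted above for the v101 transliteration.
total_glyphs = 158_940
total_words = 40_706
single_glyph_words = 1_948

# Every "word" of two or more glyphs has exactly one "initial" glyph.
multi_glyph_words = total_words - single_glyph_words
initial_share = multi_glyph_words / total_glyphs

# Highest punctuation share found in the medieval documents (6.9 percent).
max_punctuation_share = 0.069

print(multi_glyph_words)                       # 38758
print(f"{initial_share:.1%}")
print(initial_share > max_punctuation_share)   # True
```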

To my mind, if there is punctuation in the Voynich manuscript, it resides in some smaller set of glyphs. Conceivably, certain glyphs with frequencies of between 0.1 and 2.0 percent are punctuation marks, for example the following:

[Figure not reproduced: La Divina Commedia - punctuation frequencies]
The most common punctuation marks in selected medieval documents; and some Voynich glyphs with comparable frequencies. In both cases, frequencies are expressed relative to the character total, excluding spaces. Author’s analysis.
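A frequency analysis of this kind might be sketched as follows in Python. The sketch assumes that each glyph is a single character in the transliteration, that spaces are excluded from the character total (as in the caption above), and that Unicode's punctuation categories are an acceptable stand-in for "punctuation" in the medieval texts. The function names and the sample line (with modern punctuation) are illustrative only.

```python
import unicodedata
from collections import Counter


def char_frequencies(text: str) -> dict[str, float]:
    """Frequency of each character as a percentage of all non-space characters."""
    chars = [c for c in text if not c.isspace()]
    counts = Counter(chars)
    return {c: 100 * n / len(chars) for c, n in counts.items()}


def punctuation_share(text: str) -> float:
    """Combined percentage of characters that Unicode classifies as punctuation."""
    freqs = char_frequencies(text)
    return sum(f for c, f in freqs.items() if unicodedata.category(c).startswith("P"))


def in_band(freqs: dict[str, float], low: float = 0.1, high: float = 2.0) -> dict[str, float]:
    """Characters (or glyphs) whose frequency lies between low and high percent."""
    return {c: f for c, f in freqs.items() if low <= f <= high}


sample = "Nel mezzo del cammin di nostra vita, mi ritrovai per una selva oscura."
print(round(punctuation_share(sample), 1))
print(sorted(in_band(char_frequencies(sample))))
```

Applied to a transliteration, `in_band` would list the glyphs whose frequencies fall in the 0.1 to 2.0 percent range mentioned above. It cannot say by itself which of them, if any, are punctuation.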

The glyph {2} and its variants are particularly interesting, since each of them looks like the glyph {1} with a diacritic or superscript. Perhaps the diacritic itself is the punctuation mark.

Glyphs as junk

In a previous post, I assembled my calculations of the incidence of hapax legomena (words which occur only once) in the Voynich manuscript and other medieval documents. The idea is that the incidence of hapax legomena is an indicator of the presence of gibberish (or junk). For a given length of document, the higher the incidence of hapax legomena, the more likely it is that the document is junk, or contains junk.

In that post, I reported that the v101 transliteration of the Voynich manuscript, with 40,706 “words”, had a 16.5 percent incidence of hapax legomena. I took extracts of about 40,000 words from seven medieval literary documents, in Albanian, English, Galician-Portuguese and Italian. These extracts had incidences of hapax legomena ranging from 2.7 percent to 11.8 percent.
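For readers who wish to reproduce this kind of measurement, here is a minimal Python sketch. It assumes that the incidence is the number of words occurring exactly once, divided by the total number of words in the extract; that definition is an assumption, and dividing by the number of distinct words instead would give different figures.

```python
from collections import Counter


def hapax_incidence(words: list[str]) -> float:
    """Percentage of word tokens that occur exactly once in the text.

    Assumption: the denominator is the total number of word tokens.
    """
    if not words:
        return 0.0
    counts = Counter(words)
    hapaxes = sum(1 for n in counts.values() if n == 1)
    return 100 * hapaxes / len(words)


words = "the cat sat on the mat and the dog sat too".split()
print(round(hapax_incidence(words), 1))   # 54.5
```

Since the incidence depends on the length of the document, comparisons only make sense between extracts of similar size, for example by slicing each word list to its first 40,000 items before calling the function.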

Since by this metric the Voynich manuscript was an outlier, I saw here some evidence that the manuscript might contain text without meaningful content: that is, junk.

The question is then: how do we identify junk in the Voynich manuscript?

Random or systematic junk?

It seems to me that if there is junk in the Voynich manuscript, we need to know whether it is random. An example in the English language might illustrate. Let us take the first six words of the Gettysburg Address, and insert random letters at random locations: in both cases, using the RAND() function in Excel to generate randomness. Here is one possible result:
fohurqo scgarkorltesd apnddl sdecvujenho yealrss akgjok.
Although this sequence looks like gibberish, a careful reading will extricate the words:
four score and seven years ago.
A reader of English can do this. But there is no-one who reads the Voynich script. Therefore, I think, if the Voynich text contains random sequences of junk, of random length, the only way to detect such junk is first to map the Voynich text successfully to a natural language. Thereafter, a speaker or reader of that language will have a chance of distinguishing the content from the junk.
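The Gettysburg experiment can be repeated outside Excel. The Python sketch below substitutes Python's `random` module for RAND(); the number of letters inserted per word (between one and eight) and the function names are assumptions. It also checks that the original words survive as subsequences of the noisy ones, which is what allows a reader of English to extricate them.

```python
import random
import string


def insert_random_letters(word: str, n_extra: int, rng: random.Random) -> str:
    """Insert n_extra random lowercase letters at random positions in word."""
    chars = list(word)
    for _ in range(n_extra):
        position = rng.randint(0, len(chars))   # 0 = before first, len = after last
        chars.insert(position, rng.choice(string.ascii_lowercase))
    return "".join(chars)


def is_subsequence(original: str, noisy: str) -> bool:
    """True if the letters of original appear, in order, within noisy."""
    remaining = iter(noisy)
    return all(letter in remaining for letter in original)


rng = random.Random(1863)   # fixed seed so that the run is repeatable
phrase = "four score and seven years ago".split()
noisy = [insert_random_letters(w, rng.randint(1, 8), rng) for w in phrase]
print(" ".join(noisy))
print(all(is_subsequence(w, n) for w, n in zip(phrase, noisy)))     # True

# The example string quoted in the text passes the same check.
example = "fohurqo scgarkorltesd apnddl sdecvujenho yealrss akgjok".split()
print(all(is_subsequence(w, n) for w, n in zip(phrase, example)))   # True
```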

However, there is a possibility that if there is junk in the Voynich manuscript, we can detect it before we know the precursor language or the mapping system. This possibility would arise if the junk had been inserted systematically.

Let us transport ourselves in time and space to the moment of conception of the Voynich manuscript. It could be fifteenth-century Italy, or it could be another date and place; it does not matter. There was a producer who commissioned the manuscript. This person was wealthy: sufficiently so to engage a team of scribes for a period of months or years, and to pay for the vellum from at least fifteen calfskins, and for all the other materials and supplies that the scribes would need.

The scribes would be professionals; their job was to write, for payment, whatever their clients wished to be written. They would not be authors or creators of content. Therefore, the producer had to provide either precursor documents in languages or scripts which the scribes could read, or a person who could read such documents and dictate their content to the scribes, and who would remain at the workplace with the scribes until the job was done.

If the producer wanted junk to be inserted in the manuscript, he or she would give instructions to the scribes as to how this should be done. Those instructions had to be systematic. Having given them, the producer would go away, to attend to business, or to matters of court, or whatever was his or her source of wealth or power. The scribes would go to work, and the producer would return in due course to collect the finished manuscript.

Herein, to my mind, lies our hope of simplifying or cleaning the Voynich manuscript, if such a process is required. Our task is to recreate the producer’s instructions to the scribes.

I can imagine at least three ways of instructing the scribes to incorporate systematic junk:
• to add one random glyph (or more than one, but some prescribed number) at some defined position or positions (for example, the beginning or end of every “word”, or line, or paragraph)
• to define certain glyphs as meaningless, and to insert them at prescribed positions
• to define certain glyphs as meaningless, but only when used in certain positions within the “word”, line or paragraph.
In all cases, the producer would have to instruct the scribes as to how much flexibility they were permitted; whether or not to preserve word breaks in the precursor documents; and above all, how to order the glyphs within the “words”, so as to create the “marching order” that D’Imperio would discover and Zattera would quantify.
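To make the three kinds of instruction concrete, here is a rough Python sketch that applies them to ordinary English words. The alphabet, the choice of "x" and "q" as meaningless glyphs, the choice of final position, and the function names are all hypothetical; nothing in the sketch is taken from the manuscript.

```python
import random

# Hypothetical choices for illustration only.
ALPHABET = list("abcdefghijklmnopqrstuvwxyz")
NULL_GLYPHS = ["x", "q"]      # glyphs the producer declares meaningless


def add_random_affixes(word: str, rng: random.Random) -> str:
    """First instruction: one random glyph before and one after every word."""
    return rng.choice(ALPHABET) + word + rng.choice(ALPHABET)


def add_null_final(word: str, rng: random.Random) -> str:
    """Second instruction: a designated meaningless glyph at a prescribed position (the end)."""
    return word + rng.choice(NULL_GLYPHS)


def is_junk(word: str, index: int) -> bool:
    """Third instruction: a designated glyph counts as junk only in final position."""
    return index == len(word) - 1 and word[index] in NULL_GLYPHS


rng = random.Random(0)
source = "four score and seven years ago".split()
print([add_random_affixes(w, rng) for w in source])
print([add_null_final(w, rng) for w in source])
print(is_junk("sevenx", 5), is_junk("sixq", 2))   # True False
```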

I can think of at least one test for identifying systematic junk in the Voynich manuscript. If such junk exists, removing it should preserve the meaningful content. The incidence of hapax legomena is one indicator of the presence of meaning. Therefore, we can take any suitable transliteration of the manuscript, and execute processes such as the following:
• remove all “initial” glyphs, or all “final” glyphs, or both
• remove specific glyphs: for example, all occurrences of {4o} or {9}, wherever they may occur
• remove specific glyphs, but only in specific positions in the “word”: for example, all occurrences of {9} as a “final” glyph;
and in each case, calculate the incidence of hapax legomena in the truncated document. The process which yields the lowest incidence of hapax legomena might be the cleaning process that removes the junk and leaves the content.
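The processes listed above could be run side by side along the following lines. This is a minimal Python sketch with several assumptions: the transliteration has already been split into a list of “words”; each glyph is one character, except where a glyph such as {4o} is passed explicitly as a two-character string; a “word” left empty by a removal is dropped; and the incidence of hapax legomena is taken relative to the total number of “words”. The toy tokens are invented and are not transliteration data.

```python
from collections import Counter


def hapax_incidence(words):
    """Words occurring exactly once, as a percentage of all word tokens (assumed definition)."""
    if not words:
        return 0.0
    counts = Counter(words)
    return 100 * sum(1 for n in counts.values() if n == 1) / len(words)


def strip_ends(words, initial=True, final=True):
    """Remove the first and/or last glyph of every word; words left empty are dropped."""
    start = 1 if initial else 0
    stripped = []
    for w in words:
        end = len(w) - 1 if final else len(w)
        core = w[start:end]
        if core:
            stripped.append(core)
    return stripped


def remove_glyph(words, glyph):
    """Remove every occurrence of glyph, wherever it occurs."""
    return [core for w in words if (core := w.replace(glyph, ""))]


def remove_final_glyph(words, glyph):
    """Remove glyph only where it is the last glyph of a word."""
    trimmed = (w[: -len(glyph)] if w.endswith(glyph) else w for w in words)
    return [w for w in trimmed if w]


def compare_processes(words):
    """Hapax incidence after each candidate cleaning process."""
    processes = {
        "unchanged": words,
        "no initial": strip_ends(words, final=False),
        "no final": strip_ends(words, initial=False),
        "no initial or final": strip_ends(words),
        "no {4o}": remove_glyph(words, "4o"),
        "no {9}": remove_glyph(words, "9"),
        "no final {9}": remove_final_glyph(words, "9"),
    }
    return {name: hapax_incidence(result) for name, result in processes.items()}


# Toy tokens, invented for the demonstration; not real transliteration data.
toy = ["4oab9", "cab9", "4oab", "ab9", "ab", "cab"]
results = compare_processes(toy)
for name, pct in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name:22s}{pct:5.1f}")
```

The loop prints the processes from lowest to highest incidence, so the first line is the candidate cleaning process in the sense described above.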

I am working on some tests of this nature. More later.
Published on March 29, 2024 23:45 Tags: commedia, d-imperio, dante, voynich, zattera


Great 20th century mysteries

Robert H. Edwards