Voynich Reconsidered: the Sukhotin algorithm
To my mind, the greatest mystery of the Voynich manuscript is the possibility that its inscrutable glyphs are a representation of a meaningful text in a natural human language or several such languages. As a corollary, I see the greatest challenge as the identification of that language or those languages.
In my book Voynich Reconsidered (Schiffer Publishing, 2024), I set out my approach to a possible identification of precursor languages. In particular, in several chapters, I examined the application of frequency analysis.
I use the term "frequency analysis" to refer to the well-documented phenomenon that in most phonetic languages, the frequencies of the letters have a signature distribution. If a text is sufficiently long, even if we do not know the script, we can identify the most frequent symbols; indeed, we can rank every symbol, from most frequent to least frequent. In modern Italian, for example, the most frequent letters are E, A, I, O, N and T; in medieval Italian, as per the OVI corpus, the most frequent letters are E, A, I, O, N and R.
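As a minimal sketch of this kind of ranking, the Python snippet below counts the letters in a text and lists them from most to least frequent. The function name `rank_symbols` is hypothetical, and the code assumes the text has already been transliterated into ordinary alphabetic characters; the sample line is far too short to yield a meaningful distribution and serves only to show the mechanics.

```python
from collections import Counter

def rank_symbols(text):
    """Return (symbol, relative frequency) pairs, most frequent first."""
    # Count only alphabetic characters, ignoring case, spaces and punctuation.
    counts = Counter(ch.upper() for ch in text if ch.isalpha())
    total = sum(counts.values())
    # Sort by descending count; break ties alphabetically for reproducibility.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(sym, n / total) for sym, n in ranked]

# Opening line of Dante's Commedia, used purely as a toy input.
sample = "Nel mezzo del cammin di nostra vita"
for symbol, share in rank_symbols(sample)[:5]:
    print(f"{symbol}: {share:.1%}")
```

On a corpus of realistic length, the same function would produce the kind of ranked list quoted above for modern and medieval Italian.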
In the case of the Voynich manuscript, not only do we not know the script, but we also do not know definitively where a glyph begins and ends. There are many visual forms in the manuscript that could be interpreted as one glyph or two, or even three. Furthermore, we do not know whether strings of glyphs that look like words really represent words in the presumed precursor languages. To address such uncertainties, we must be prepared to work with multiple transliterations of the manuscript: what I have called permutations of the text.
Likewise, we do not know the precursor languages. But we can work with multiple languages, and we can assign a provisional ranking to them in terms of their probability. Here I have proposed that the medieval languages associated with a series of concentric circles, centered on Italy, where Wilfrid Voynich rediscovered the manuscript, are more probable than others.
In Voynich Reconsidered, I wrote that frequency analysis, if applied systematically to a large number of permutations of text and languages, might be a basis for identifying meaning within the text. I added that this was a task for programmers with plenty of computing power: more than I possess.
Sukhotin
However, if we could classify the glyphs into those that represent vowels, and those that represent consonants, our task would be simpler by an order of magnitude. The number of permutations, at least, would be greatly reduced. In that case, we would only need to map vowel-glyphs to vowels, and consonant-glyphs to consonants. Most medieval European languages have about five vowels, and about twenty consonants (in both cases, for the moment, leaving aside accents and diacritics). In the Voynich manuscript, there are at most twenty-five distinct glyphs that account for over 95 percent of the text.
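To put a rough number on that reduction, the short calculation below compares the count of one-to-one mappings with and without the vowel/consonant classification. The figures are assumptions chosen for illustration: exactly 25 glyphs mapped to 25 letters, of which 5 are vowels and 20 are consonants.

```python
import math

glyphs, vowels = 25, 5            # illustrative round numbers, not measured values
consonants = glyphs - vowels

# Without classification: any glyph may stand for any letter.
unconstrained = math.factorial(glyphs)
# With classification: vowel-glyphs map to vowels, consonant-glyphs to consonants.
constrained = math.factorial(vowels) * math.factorial(consonants)

reduction = unconstrained // constrained   # equals math.comb(25, 5)
print(f"{unconstrained:.3e} mappings without the classification")
print(f"{constrained:.3e} mappings with it")
print(f"reduction factor: {reduction:,}")   # prints "reduction factor: 53,130"
```

Under these assumptions the classification removes a factor of 53,130 from the search, which is comfortably more than one order of magnitude. The remaining number of mappings is still far too large to enumerate, so frequency analysis would still be needed to prioritize candidates.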
The question is therefore: if we have before us an unknown script, and we make the heroic assumption that the individual glyphs have been mapped from letters in a natural language, is there an algorithm for identifying the vowels and the consonants?
Thanks to the work of the French linguist Jacques Guy, I learned that such an algorithm existed. It was published in 1962 by Boris V. Sukhotin in his article “Экспериментальное выделение классов букв с помощью ЭВМ” (“Experimental Identification of Letter Classes Using a Computer”), in Проблемы структурной лингвистики (Problems of Structural Linguistics).
Jacques Guy, writing in Cryptologia of July 1991, described Sukhotin’s algorithm as follows:
“Sukhotin … assume[s] a state of complete ignorance about the language, except that the writing system is alphabetical. … Sukhotin had observed that vowels tend to occur next to consonants rather than next to vowels.”
Guy then worked through a manual example of Sukhotin’s algorithm, applied to the word SAGITTA, and found that the algorithm correctly identified the vowels as A and I, and the consonants as G, S and T.
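To make the SAGITTA example more concrete, the sketch below tallies how often each symbol stands next to a different symbol, which is the raw material for a hand-worked pass of this kind. The function name `neighbor_counts` is hypothetical, and skipping doubled symbols such as TT is an assumption of this sketch, not something stated in the quoted passage.

```python
from collections import Counter

def neighbor_counts(words):
    """Count, for each symbol, how often it stands next to a different symbol."""
    counts = Counter()
    for word in words:
        # Walk over each pair of adjacent symbols within the word.
        for left, right in zip(word, word[1:]):
            if left != right:      # assumption: doubled symbols (e.g. TT) are not counted
                counts[left] += 1
                counts[right] += 1
    return counts

counts = neighbor_counts(["SAGITTA"])
print(counts["A"], counts["G"], counts["I"], counts["T"], counts["S"])  # prints "3 2 2 2 1"
```

In this toy input, A has the highest count and so would be the first candidate vowel. If the doubled T were counted, T would score 4 and overtake A, which is why the sketch ignores identical neighbors.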

An extract from a cartoon in Jacques Guy's article "Vowel identification: an old (but good) algorithm" in Cryptologia of July 1991. The caption reads: “The artisan hit his thumb”. Image credit: JD.
Goldsmith and Xanthos, in their article "Learning Phonological Categories" (Language, March 2009), observed that Sukhotin had made another basic assumption. They wrote:
"Sukhotin's algorithm ... relies on two fundamental assumptions: first, that the most frequent symbol in a transcription is always a vowel, and second, that vowels and consonants tend to alternate more often than not."
Member MarcoP of the Voynich Ninja forum has alerted me that Goldsmith's and Xanthos's interpretation is not quite correct. Sukhotin assumed that the most probable vowel was the symbol that most often occurred adjacent to another symbol. A symbol which is interior to a word has two neighbors; an initial or final symbol has only one; an isolated symbol has none.
I infer that the way the Sukhotin algorithm works is as follows:
• Find the symbol with the most occurrences adjacent to another; this is most probably a vowel.
• The immediately preceding and following symbols (if any) are probably consonants.
• If those symbols occur elsewhere in the text, the immediately preceding and following symbols (if any) are probably vowels.
• And so on, iteratively, until each symbol has been identified as either a probable vowel or a probable consonant.
This is of course a statistical approach, based on probabilities. It does not exclude pairs of vowels. Most phonetic languages have a few relatively common vowel pairs. For example, as we see from Stefan Trost's excellent website, in modern Italian the most frequent vowel pair is "IO", with a frequency of 1.1 percent; in classical Latin, "AE" (1.0 percent); in modern French, "AI" (1.9 percent).
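The iterative procedure listed above can be sketched in Python as follows. This is one common way of stating Sukhotin's algorithm, not necessarily the exact procedure in the 1962 article: the adjacency bookkeeping, the rule of subtracting twice the adjacency count after each new vowel, the stopping condition and the alphabetical tie-breaking are all assumptions of this sketch, and the function name `sukhotin` is hypothetical.

```python
from collections import defaultdict

def sukhotin(words):
    """Split the symbols in `words` into (vowels, consonants)."""
    # Symmetric adjacency counts; pairs of identical symbols are ignored,
    # which is equivalent to zeroing the diagonal of an adjacency matrix.
    adj = defaultdict(lambda: defaultdict(int))
    symbols = set()
    for word in words:
        symbols.update(word)
        for a, b in zip(word, word[1:]):
            if a != b:
                adj[a][b] += 1
                adj[b][a] += 1

    # Every symbol starts as a consonant, scored by its total adjacency count.
    score = {s: sum(adj[s].values()) for s in symbols}
    consonants = set(symbols)
    vowels = []

    while consonants:
        # Pick the highest-scoring remaining consonant; ties broken alphabetically.
        best = min(consonants, key=lambda s: (-score[s], s))
        if score[best] <= 0:
            break                      # nothing left that looks like a vowel
        consonants.remove(best)
        vowels.append(best)
        # Symbols adjacent to the new vowel become less likely to be vowels.
        for c in consonants:
            score[c] -= 2 * adj[best][c]

    return vowels, sorted(consonants)

print(sukhotin(["SAGITTA"]))   # prints (['A', 'I'], ['G', 'S', 'T'])
```

On the single word SAGITTA the sketch reproduces the classification reported from Guy's manual example: A and I as vowels, and G, S and T as consonants. A single word is only a toy input; on a real transliteration the word list would run to tens of thousands of tokens.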
Sukhotin was evidently working in the tradition of the Russian mathematician Andrey Andreyevich Markov, who created the concept now known as Markov chains, or Markov processes. One of Markov’s earliest works was a study, in 1913, of the frequencies of vowels and consonants in Pushkin’s novel Eugene Onegin. This pathbreaking analysis was, in part, the inspiration for the chapter in Voynich Reconsidered on mathematical approaches to the Voynich manuscript, which I titled, simply, "Markov".
Published on October 23, 2023 14:35
Tags: boris-sukhotin, goldsmith-xanthos, jacques-guy, massimiliano-zattera, stefan-trost, voynich
Great 20th century mysteries
On this platform on GoodReads/Amazon, I am assembling some of the backstories to my research for D. B. Cooper and Flight 305 (Schiffer Books, 2021), Mallory, Irvine, Everest: The Last Step But One (Pen And Sword Books, April 2024), Voynich Reconsidered (Schiffer Books, August 2024), and D. B. Cooper and Flight 305 Revisited (Schiffer Books, coming in 2026).
These articles are also an expression of my gratitude to Schiffer and to Pen And Sword, for their investment in the design and production of these books.
Every word on this blog is written by me. Nothing is generated by so-called "artificial intelligence": which is certainly artificial but is not intelligence.
- Robert H. Edwards's profile
