Ruminations on the Voynich Manuscript

As the “world’s most mysterious book,” the Voynich Manuscript is like an inkblot test: whatever people make of it tends to reveal more about them than it does about the manuscript itself.  Its study shares resonances with pareidolia, akin to seeing recognizable shapes in clouds or wood grains, and with the Law of the Instrument, according to which everything looks like a nail when your only tool is a hammer.  If you want to lay bare a person’s inner cognitive world, you could do worse than locking them away with the Voynich Manuscript for a while and seeing what hypotheses they come up with about it, what unspoken assumptions they make about it, what methods they use to attack it.

That might be why I’ve hesitated to publish anything too “serious” here about it.  There was my 2015 post “Primeval Animations and the Voynich Manuscript,” in which I go after the claim that the manuscript contains meaningful zoetrope-like animation sequences; and also my 2019 post “Griffoynich: A Real Cipher That Mimics Voynichese,” in which I introduce some new ciphers based on moving a token around a lettered game-board which each produce ciphertext that shares some—but by no means all—of the more perplexing characteristics of the Voynich Manuscript text.  Those were both relatively whimsical efforts.  But I’ve also spent odd moments for at least a dozen years past trying to make sense of the Voynich Manuscript more in earnest.  I don’t particularly feel as though I’ve ever gotten anywhere with it.  However, I’ve enjoyed reading others’ ruminations enough that I decided it was time to “give back” by sharing my own; and then, when I set about writing them up, the results turned out to be so long that if they were a piece of fiction, they’d qualify as a novella rather than a short story.

So here they are.  I’m sure I must have reinvented at least one wheel; if so, feel free to give or claim credit where credit is due in the comments section.  I’ve taken advantage of Jason Davies’ convenient Voyage the Voynich Manuscript site for consulting facsimiles and René Zandbergen’s (“ZL”) transcription for statistical analysis.  The present blog post could easily have been seven separate posts rather than just one, but out of consideration for subscribers who may not be so much into this sort of thing, I’ve combined them all to avoid clutter.  Here are brief summaries of the seven parts, in case you want to gauge whether anything below is worth your while to read:

  1. Minims, Curvelets, Flourishes: an alternative approach to defining Voynichese graphemes.
  2. Ambiguities and the Digital Principle: the harder it is to tell “glyph” types apart, the less likely the difference between them is to be meaningful; implications for Voynichese.
  3. The Hatchmark Skeleton Hypothesis: preliminary attempt at a “word paradigm” based on the alternative graphemes discussed in #1; comparison and contrast with word paradigms proposed by Jorge Stolfi, Emma May Smith, etc.
  4. Word Breaks, Line Breaks, Paragraph Breaks, Labels: mid-line word breaks can be predicted with 95% accuracy based on simple rules of glyph adjacency; spacing between certain glyph pairs is consistently inconsistent or ambiguous; contrast among glyph frequencies at the beginnings and ends of mid-line words, lines, paragraphs, and isolated labels suggests that all four are quite distinct, and that labels don’t categorically resemble mid-line words.
  5. Patterns of Similarity and Repetition: pairs and groups of identical or “similar” words (e.g., Timm pairs and Jackson sequences) can be analyzed in terms of patterned cycles in which word elements repeat semi-independently from one another; words are on average most similar to words separated from them by one intervening word; patterns of preference exist not only among single-glyph word-break combinations (as found by Smith and Ponzi) but also among double-glyph word-break combinations and among words that begin or end with particular glyphs or glyph pairs.
  6. Currier Languages and Line Positions: the aforesaid patterns vary significantly but don’t disappear when analysis is limited to Currier A or Currier B, or even to a narrower “dialect” such as Biological-B or Herbal-A; numerous common word features show definite statistical preferences for positions earlier or later in lines.
  7. Numerical Manipulations: out-on-a-limb speculation into how Voynichese might have worked as a numerical cipher in conjunction with an ordinary late medieval counting-board.

Some of my remarks will presume an early fifteenth-century date for the Voynich Manuscript, although I’m aware this is still disputed by some.  But then again, what isn’t?

§ 1

Minims, Curvelets, Flourishes

One crucial question with the Voynich Manuscript is how to identify its graphemes—the basic contrastive units or building-blocks of its writing system.  Here I’ll propose that some units that are usually taken to be single glyphs can better be tackled as combinations of simpler graphemes.  (In what follows, I’ll use the word “glyph” to refer to the larger units that have more traditionally been the objects of analysis and discussion.)  Some of the names I’ve given to my proposed graphemes will be more familiar than others: minim, curvelet, hatchmark, anticurvelet, bar, plume, leg, and flourish with subcategories head, foot, loopback, loopdown, and tail.  One advantage I see in this breakdown is that it classifies a number of seemingly rare glyphs as nothing more than unusual combinations of the same pieces and parts used to write common glyphs.

Let me be clear that I’m not proposing some steganographic approach to Voynichese in which its meaning is hidden in impracticably minute subtleties of form, as William Romaine Newbold did; or even one that’s picky about graphical nuance.  On the contrary, I speculate that the script ought to have been easy to parse for someone familiar with the system behind it, whereas some widely-accepted glyph types seem to me as though they would have been difficult for any reader to distinguish from one another reliably.  Besides, the observation that Voynich glyphs can be decomposed into smaller elements is nothing new.  It dates back at least to the seminal work of Prescott Currier, who introduced his discussion of it in a well-known 1977 presentation with the statement: “I’ve looked at most of these letters under a magnifying glass, so I think I know how they were all actually made.”  So there’s nothing revolutionary about the decomposition itself, parts of which have also been anticipated by Michael Winkelmann’s “Harmonie der Glyphenfolgen” and Brian Cham and David Jackson’s curve-line system, although I didn’t discover either until my own efforts were pretty far along.

The de facto transcription standard for the Voynich Manuscript is known as EVA, and I’ll be using it here myself when discussing particular strings of text for consistency with other Voynichological writing, even though it doesn’t fit my own hypotheses about how the script works all that well.  And with that in mind, I’d like to introduce each of my proposed graphemes not only in terms of what I see it as corresponding to in the original script, but also in terms of how EVA handles it.

A curvelet (which Currier calls a “c”-curve, and Cham and Jackson just a “curve”) is transcribed e:

[Image: voynich-characters-E]

A pair of curvelets joined at the top with a bar is transcribed ch:

[Image: voynich-characters-CH]

If a curvelet is closed with an anticurvelet (which Currier characterizes as its “mirror image”) to form a reasonably rounded circle, it’s transcribed o:

[Image: voynich-characters-O]

But if it’s closed with a straighter minim (which Cham and Jackson call a “line,” the second of two “base forms”), and particularly by one that extends further downward than needed to close the circle, it’s transcribed a:

[Image: voynich-characters-A]

When minims appear separately by themselves, they’re transcribed i; and if a minim has a foot-flourish, a larger curve running up and leftward from its base or foot, the combination is transcribed n.  The i and n are easier to illustrate in context than in isolation, so the top row below shows three examples of aiin, and the bottom row below shows three examples of ain.  Note that in each case the stroke closing the a and the stroke opening the n closely resemble the intervening minims, for instance as to slope.

[Image: voynich-characters-AIIIN+AIIN]

If the same minim that closes the a also has a foot-flourish, the combination is transcribed u (which is rare, but does turn up):

[Image: voynich-characters-U]

If a foot-flourish is attached to a curvelet rather than to a minim, it’s transcribed b (also quite rare).  Here’s an example in context as oeeeb; like minims, curvelets are often found repeated in this fashion:

[Image: voynich-characters-OEEEB]

Several other glyph-type pairs can be interpreted as cases of a particular kind of flourish being added respectively to a curvelet and to a minim, analogous to the pairing of n and b.  For example, one pair of glyph types adds a head-flourish extending up and leftward from the top or head, taking a variety of shapes and extending a variety of distances, but always—I think—plainly distinct from a foot-flourish.  If a head-flourish is added to a curvelet, it’s transcribed s:

[Image: voynich-characters-S]

But if it’s added to a minim (in which case it’s often shifted a bit downward from the head), it’s transcribed r:

[Image: voynich-characters-R]

Another pair of glyph types continues that flourish into a closed figure-eight loop to produce a loopback-flourish.  If a loopback-flourish is added to a curvelet, it’s transcribed d:

[Image: voynich-characters-D]

If it’s added to a minim, it can be transcribed j (very rare).  Here are two examples in context as ajam and aiijy:

[Image: voynich-characters-AJAM+AIIJY]

Another pair of glyph types extends the figure-eight pattern into a tail running below the line, often without closing the bottom part of the loop, producing a loopdown-flourish.  (A loopback-flourish ends by looping back to the stroke to which it’s attached, while a loopdown-flourish ends by looping down.)  If a loopdown-flourish is added to a curvelet, the combination is transcribed g:

[Image: voynich-characters-G]

In the case of a minim, it’s transcribed m:


If a curvelet is closed with a tail running from the top down below the line and leftward, producing a tail-flourish, it’s transcribed y:

[Image: voynich-characters-Y]

If a minim has a similar down-and-leftward flourish, crossing over the minim, which I’ll also call a tail-flourish, it’s transcribed l.


It’s less obvious that y and l constitute a pair than is the case with n/b, s/r, g/m, and d/j.  However, Currier and others likewise identify them as counterparts, and if other glyph types exist in pairs, it makes sense to suppose that these likely do as well.  I’ll consider this issue further below.  Cham and Jackson refer to all four flourish types as “tail modifiers,” but I find it useful to call the different types by different names, and to me a “tail” implies a stroke descending below the line.
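The pairings described so far can be summarized as a lookup table.  What follows is only a minimal sketch of that idea in Python, not a transcription standard: the table GRAPHEMES and the function decompose are illustrative names introduced here, the grapheme labels are the ones proposed above, and anything not yet covered (gallows, q, and so on) is simply passed through in angle brackets.

```python
# Illustrative mapping from EVA glyphs to the graphemes proposed above.
# The names GRAPHEMES and decompose are hypothetical, chosen for this sketch.
GRAPHEMES = {
    "e": ["curvelet"],
    "ch": ["curvelet", "bar", "curvelet"],
    "o": ["curvelet", "anticurvelet"],
    "a": ["curvelet", "minim"],
    "i": ["minim"],
    "n": ["minim", "foot"],
    "b": ["curvelet", "foot"],
    "u": ["curvelet", "minim", "foot"],
    "s": ["curvelet", "head"],
    "r": ["minim", "head"],
    "d": ["curvelet", "loopback"],
    "j": ["minim", "loopback"],
    "g": ["curvelet", "loopdown"],
    "m": ["minim", "loopdown"],
    "y": ["curvelet", "tail"],
    "l": ["minim", "tail"],
}


def decompose(word):
    """Rewrite an EVA word as a flat list of proposed graphemes."""
    graphemes = []
    i = 0
    while i < len(word):
        # Treat the digraph "ch" as one unit; everything else is one letter.
        key = "ch" if word.startswith("ch", i) else word[i]
        # Glyphs outside the table are kept visible rather than guessed at.
        graphemes.extend(GRAPHEMES.get(key, ["<" + key + ">"]))
        i += len(key)
    return graphemes


print(decompose("aiin"))
print(decompose("oeeeb"))
```

Written this way, aiin comes out as a curvelet followed by four minims and a foot-flourish, which makes the shared strokes of a, i, and n explicit instead of hiding them inside three different letters.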

As with the foot-flourish on the rare u, some other types of flourish can also be found attached to minims that are also connected to preceding curvelets.  The last part of the following word from f107r shows this happening with a loopdown-flourish, producing a kind of fusion of a and m:

In 1977, Currier concluded his discussion of the formal composition of glyphs by remarking: “All this indicates to me that considerable thought was put into how this mess was made up.  We have the fact that you can make up almost any of the other letters out of these two symbols and e; it doesn’t mean anything, but it’s interesting.”  But why shouldn’t it mean something?  As Currier himself acknowledges, the constraints on which glyphs can appear next to each other appear to follow these same distinctions of form, which suggests that the presence of minims or curvelets in a glyph might not just be an arbitrary feature of it, like the minims of uncial or Gothic script, but might have some significance in its own right.

All the letters containing an initial “c”-curve are also the only letters that can be preceded in the same word by the little letter that looks like “c,” e.g. edy, eeedy. On the other hand, the letters l and r (which have very high frequencies) can never be preceded by e, ever; they are instead preceded by a.

Currier may have overstated this point, since major Voynich transcriptions show a very few cases of el and er; but even if the transcriptions are right, these combinations would still be rare enough to leave no doubt as to the statistical significance of their rarity.  Cham and Jackson took matters a step further in their curve-line system by positing a hypothesis of “glyph affinity”: that, within a word, line-glyphs must be next to other line-glyphs and curve-glyphs must be next to other curve-glyphs, with an a required at any transition.  They tested this hypothesis statistically and found that “conforming” sequences outnumber “nonconforming” ones by a ratio a little greater than 100:1.  Further refinements of their hypothesis followed, which I’ll discuss below, together with earlier investigation along similar lines by Winkelmann.  Despite her skepticism about the curve-line hypothesis specifically, Emma May Smith concedes that “there is an underlying nature of those strokes which is unexplained.”
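To make the idea of “conforming” and “nonconforming” sequences concrete, here is a rough sketch of how such a tally could be set up.  It is not Cham and Jackson’s actual procedure or glyph classification: the sets CURVE and LINE are assumptions based on the curveletar and minimar pairs proposed above, and a, o, the gallows, q, and the letters of ch and Sh are all treated as neutral, which is a simplification.

```python
# Assumed classes, following the curveletar/minimar pairs proposed in this post.
CURVE = set("esdgyb")   # glyphs built on a curvelet
LINE = set("inrlmj")    # glyphs built on a minim


def nonconforming_pairs(word):
    """List adjacent curve/line glyph pairs with no neutral glyph between them."""
    pairs = []
    previous = None  # (glyph, class) of the last classified glyph, if adjacent
    for glyph in word:
        if glyph in CURVE:
            glyph_class = "curve"
        elif glyph in LINE:
            glyph_class = "line"
        else:
            # Neutral glyphs (including a) break adjacency in this sketch.
            previous = None
            continue
        if previous is not None and previous[1] != glyph_class:
            pairs.append(previous[0] + glyph)
        previous = (glyph, glyph_class)
    return pairs


for word in ["edy", "daiin", "dal", "el", "er"]:
    print(word, nonconforming_pairs(word))
```

Under these assumptions edy, daiin, and dal produce no nonconforming pairs, while el and er each produce one, matching the combinations Currier said should never occur.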

Still, the major transcription schemes, including EVA, have treated combinations of minim+flourish (which I’ll call minimars) and curvelet+flourish (which I’ll call curveletars) as single glyphs, which effectively conceals their shared features.  The Frogguy transliteration alphabet decomposes and into a minim plus a flourish (cg, ig) but doesn’t extend this principle to the other flourishes, at least in its published version, despite a possible reference to them: “coming soon: the four flourishes.”  Meanwhile, some other systems have gone in the opposite direction by handling n, in, iin, iiin as independent glyphs, based on a conviction that they’re likely to represent discrete values and that such pre-filtering is therefore advantageous.  To a point, of course, it shouldn’t matter how glyphs are transcribed to make them computer-processable; for analysis it would be easy enough to replace every case of n (for example) with something like i+.  But in practice I suspect transcription conventions constrain thought more than we’d like to admit, as the transcriptions end up serving as surrogates for the original not just practically, but cognitively as well.  Even Winkelmann, Cham, and Jackson write in terms of discrete i- or line-glyphs and e- or curve-glyphs that correspond to EVA categories.

But to continue: some further glyph types that contain one or more taller vertical strokes fully above the base line, or ascenders, are known in the Voynich community as “gallows.”  If two of these vertical strokes, or legs, appear side by side and are connected by a horizontal stroke at the top with a loop on both sides, this is transcribed t:

[Image: voynich-characters-T]

If there’s a loop only on the right-hand side, it’s transcribed k:

[Image: voynich-characters-K]

If the stroke at the top isn’t connected to a second vertical leg, but to a leg that runs halfway down and then crosses leftward over the first leg, it’s transcribed p if there are two loops at the top (as with t),

[Image: voynich-characters-P]

or f if there’s a loop only on the right-hand side (as with k):


The formal distinction between t/k (with second leg downwards) and p/f (with second leg crossing back leftwards) closely resembles the one between the loopdown-flourish and the loopback-flourish.  To highlight this similarity, I’ll refer to the former as loopdown gallows and the latter as loopback gallows.  There are also some other gallows of different design, although these are all significantly less common than the four types I’ve described.

The various gallows often appear inserted into, or superimposed onto, the bar of a ch “pedestal” or “bench,” in which case the letter used to transcribe the gallows by itself is capitalized and placed between the c and h; thus, we get cKh:






and cPh:


In this last case, the third specimen is most typical, while the first two illustrate some of the range of variation that may or may not be significant.  If the bar is surmounted by a plume instead of by a gallows, it’s transcribed Sh (ideally with a capital S):


The plume resembles the head-flourish, although its curve is typically narrower and sharper along the horizontal axis, with a more developed downwards thrust, forming an arch or a partial oval or circle.  It can be found in a variety of positions within the larger bar-and-curvelets structure: sitting above, just touching, or crossing the bar, but anywhere along its length, sometimes touching the first curvelet, sometimes the second, and sometimes at the head, sometimes at the foot, sometimes in between.  The gallows, too, are inserted sometimes midway along the bar but sometimes very close to—or even into—the curvelet at either end, and can float above the bar, just touch the bar, or cross through the bar.  As we’ll see, bars can be found in other situations as well, even though they appear mostly as ch and Sh.  Some other letters besides s can also be capitalized to show that they represent glyphs linked forward by a bar.   Any letters for glyphs connected in any way, with or without a bar (which I’d consider an important distinction), can also be placed in curly brackets {}.  But even just with Sh, there has been much debate over whether each of these formal distinctions carries meaning, and the Zandbergen transcription tries to distinguish between Sh (plume associated with first curvelet) and {ch’} (plume associated with second curvelet).
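Since these conventions mean that one glyph can occupy several characters of transcription, any analysis has to split a transcribed word into glyph tokens before counting anything.  Here is a minimal sketch of one way to do that, assuming only the conventions just described (a capitalized gallows between c and h, ch and Sh as benches, and curly brackets around connected glyphs); the pattern TOKEN and the function tokenize are illustrative names, and a real transcription file will contain further notation this doesn’t handle.

```python
import re

# Alternatives are tried left to right: bracketed groups, benched gallows,
# plain benches (ch, Sh), and finally any single character.
TOKEN = re.compile(r"\{[^}]*\}|c[KTPF]h|[cS]h|.")


def tokenize(word):
    """Split a transcribed word into glyph tokens under the assumed conventions."""
    return TOKEN.findall(word)


for word in ["qokeedy", "cKhy", "Shedy", "{ch'}ol"]:
    print(word, tokenize(word))
```

Whether a token such as cKh should then be counted as one glyph or as a gallows plus a bench is exactly the kind of decision the transcription can’t make for us.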

Another glyph looks like a number four with its leg running down below the base line, formally similar to the left halves of p and t.  This is transcribed q, and it’s generally connected to another following character, usually o.  Here are some specimens of the common qo combination:


This covers all the most commonly recognized glyphs, although there are other “rare” glyphs besides.

One school of thought holds that formal similarities among Voynich glyphs ought to be ignored for purposes of textual analysis—that the kind of breakdown I’ve presented above might help explain how the glyph system was initially designed, but that it has no bearing on the glyphs’ subsequent use for conveying information.  Rather than breaking glyphs into even smaller pieces, the argument might go, the task at hand should be to identify larger functional units such as or and ol.  I agree that there seem to be larger functional units at play, but I’m not sure those functional units necessarily coincide with breaks between glyphs.  So, for instance, or and ol could both consist of oi (representing one value) plus a flourish (representing another value).  That’s one possibility I mean to hold open here.

§ 2

Ambiguities and the Digital Principle

Much Voynich research centers on the statistical analysis of words, understood as glyphs or glyph sequences with spaces on both sides of them.  (I’ll refer to these units here as “words” too, without meaning to imply anything about their significance.)   In a majority of cases, it’s an easy judgment call as to whether spacing is present, but often enough it isn’t, with my favored Zandbergen transcription indicating questionable spaces with a comma, which I’ll call “comma breaks.”  Moreover, there are also some well-known passages in which spaces seem to break up obvious patterns in ways that hint that they’re not meaningful, or maybe even that they’re being used as a form of obfuscation.  A frequently-cited example appears at the top of f15v:

[Image: voynich-opening-of-f15v]

Based on apparent word divisions, the first line seems to begin “poror orShy choiin,” and the second line “?chor or oro r aiin” (or “raiin”), where ? represents an atypical glyph (assigned number 138 in “extended EVA”); but at the same time there’s a conspicuous pattern of repeating or that sometimes spans a word break: “p or or or Shy choiin” and “?ch or or or or aiin.”  Moreover, it’s sometimes unclear here where a word break falls, since some glyphs are separated by a gap midway between a typical word break and a typical distance between glyphs in the same word.  In the second line above, for example, should we read “raiin” or “r aiin”?  Judging from the existence of such ambiguities, I doubt spaces introduce any information that’s required for reading.  That’s not to say they aren’t patterned, however, since they plainly are; or that they don’t mean anything.  By way of analogy, the spaces in “ab ra ca dab ra” might reveal something even if “abracadabra” would be read in exactly the same way, while “abr ac ada br a” might be impermissible or unlikely.
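For statistical work, the practical consequence of comma breaks is that every word count depends on how they’re read.  The sketch below shows the two extreme readings side by side.  It assumes a line written with a period for each definite space and a comma for each questionable one; the sample string is made up to mirror the “r aiin” versus “raiin” case and is not copied from any transcription file, and split_words is an illustrative name.

```python
def split_words(line, commas_are_breaks):
    """Split a transcribed line into words under one reading of comma breaks."""
    if commas_are_breaks:
        # Read every questionable space as a real word break.
        line = line.replace(",", ".")
    else:
        # Read every questionable space as no break at all.
        line = line.replace(",", "")
    return [word for word in line.split(".") if word]


sample = "chor.or.oro.r,aiin"  # hypothetical encoding, for illustration only
print(split_words(sample, commas_are_breaks=True))
print(split_words(sample, commas_are_breaks=False))
```

Running any word-level statistic under both readings is a cheap way to check whether a result depends on the ambiguous spaces.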

As a further point of context, it’s worth noting that word breaks were more flexible in general during the early modern period than they are today.  Thus, in transcribing early modern European legal documents, I routinely run into uncertain word breaks or specific sequences that are sometimes written together and sometimes apart, with no apparent difference in context or meaning, even within the same document (e.g. ledict versus le dict in French; mitnamen versus mit namen in German).  Similar uncertainties persist in twenty-first-century orthography, although to a more limited extent: witness lawnmower versus lawn mower, homepage versus home page, healthcare versus health care, etc.  Meanwhile, there’s no consensus among linguists even today as to what constitutes a word break in spoken language; except for orthographic convention, there are arguably no “objective” breaks between words in the first place that writing can reflect correctly or incorrectly.  Handwriting is more suited than movable type to treating word breaks in terms of a continuum, with some spaces being more definite (more conspicuously widened) than others.  Even if word breaks in the Voynich Manuscript were to correspond to conventional word breaks in some natural language, then, we shouldn’t expect them to be any tidier than they are in other manuscripts from the same period.

Meanwhile, some patterns of “ambiguous” spacing are consistent enough across multiple instances to suggest that they actually have some significance in themselves.   A case in point is the permutations of the glyphs o, l, r spaced out more widely than usual, but not always widely enough to establish them as obviously separate words, in the last several folios of the manuscript.  The surrounding contexts display some similarities too.  It seems nuances of spacing might reveal more than just locations of word breaks.  I’ll be developing this point further below, with statistical support.

Some other recognizable units of text that are larger than the word seem less susceptible to ambiguity.  In particular, there’s the “line,” a continuous horizontal string of glyphs; and there’s the “paragraph,” a vertically stacked group of lines.  Lines often appear to skip over parts of illustrations, such as plant stems, to fill up any available space along whatever horizontal line they occupy.  Every now and then it’s unclear whether lines really continue across a whole page or are grouped into separate blocks, as on f75r and f83r; but usually lines are easy enough to identify as such.  Paragraphs consist of sequences of fully filled-out lines ending, often, with a partially filled line that might not begin at the left margin, and sometimes also with a change in vertical spacing.  Overall, there’s less often room for doubt about what “counts” as a paragraph or a line than what “counts” as a word.

Meanwhile, there are also continuous circles of glyphs, often concentric, often with marks that seem to mark starting-points; radii with glyphs in them; and numerous “labels,” which is a catch-all term for bits of seemingly isolated text.  How isolated these last really are is hard to gauge; often enough they’re arranged across a page or around a circle in a way that might or might not constitute a sequence.  Some analyses go straight for the labels in hopes of identifying what’s being labeled and, hence, what Voynichese word corresponds to some illustrated plant or star or whatever.  For myself, I’ve tended to segregate labels from paragraphic text for analysis, and then to focus mainly on paragraphic text.

There are other graphical groupings that were likely unintentional, and that only occasionally reveal themselves, but that may still offer clues about the workings and rhythms of the writing system.  Consider the detail from f107v shown below.  The lines of text are conspicuously disjointed, as though someone had been writing individual words or word pairs below the previous without caring how or whether they lined up with each other, e.g., okain.cheor, then (returning after a pause) olkaiin.oain; orain, then (in a separate gesture) oloeeey.qokain.

For another example, here’s the last paragraph on f104v, in which the last line in particular seems to consist of three discrete chunks:

The graphical chunking here is all the more striking because it happens to coincide with a morphological similarity: okeey.okeey is an identical word repetition, while okeeey.qokeeey contains two words that are identical but for the addition of q to the second; and both of these word pairs are written on the same line (albeit with an unusually large space between them), while the preceding word pair, which doesn’t feature the oke(e)ey sequence common to the other two pairs, is vertically offset from them.

This last observation points to a phenomenon that’s often remarked upon but challenging to address quantitatively: the frequent side-by-side repetition of strikingly “similar” words.  Sometimes words repeat identically, as with okeey.okeey; and sometimes the exact same word repeats three or even four times in a row.  Identical repetitions like these aren’t the problem; they can be easily identified.  But judging “similarity”—as with okeeey.qokeeey—requires a more nuanced yardstick, since it’s not an all-or-nothing distinction.  How “similar” are word combinations such as okaiir.ykaiilgocholy.kchos, aiin.dair.air, and ykedy.qokeedy.qokedy.qokeg?  If such repeating structures are somehow significant in Voynichese—and it’s hard to imagine how they could not be—it’s still difficult to define clearly what counts as one, much less to set a computer to work on them.  At issue is how much similarity, and of what kinds, is statistically significant, as opposed to the result of a limited and structurally constrained character set.  I’ll present some ideas and experiments further along in this post.
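One generic yardstick for graded similarity is edit distance: the number of single-glyph insertions, deletions, and substitutions needed to turn one word into another.  The sketch below is a stand-in offered for orientation, not the measure used in the experiments later in this post; it works on raw EVA letters, so it counts ch as two characters, and the function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def adjacent_distances(sequence):
    """Edit distances between neighboring words in a period-separated string."""
    words = sequence.split(".")
    return [edit_distance(x, y) for x, y in zip(words, words[1:])]


print(adjacent_distances("okeey.okeey"))     # identical repetition: [0]
print(adjacent_distances("okeeey.qokeeey"))  # one added q: [1]
print(adjacent_distances("ykedy.qokeedy.qokedy.qokeg"))
```

A raw distance like this still leaves the harder question open, namely how small a distance between neighbors is surprising given how constrained Voynichese words are to begin with.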

There are also tantalizing signs of vertical patterning.  Sometimes the irregular graphical chunking of words in one line is mirrored in the lines below it, giving the impression of a secondary organization into vertical columns.  Meanwhile, the words in one line are sometimes strikingly “similar” to the words immediately above them.  In our previous example from f104v, the words immediately above okeeey.qokeeey.okeey.okeey are okeey.okeeedy.qokeey.okeeey—but that distinctive oke(e)ey sequence doesn’t appear anywhere else in the preceding line.  It’s not uncommon to find such parallels between adjacent lines that seem just too nice to chalk up to mere coincidence, even though it’s hard to define their nature other than by saying vaguely that parts of lines “look similar.”  Since transcriptions are organized by line, vertical patterning is more resistant to study than horizontal patterning.

And then there’s the matter of chronic ambiguities in the script itself, as opposed to ambiguities in how bits of text are arranged and grouped and repeated.

Most of the minimar-curveletar pairs I’ve proposed are visually obvious as pairs.  The least obvious is y/l, at least if we limit ourselves to the most typical forms of these glyphs.  However, it’s not hard to find examples with flourishes that fall between a typical y flourish and a typical l flourish, suggesting that these span a potentially infinite continuum rather than being neatly divisible into two types, and hence providing evidence that these all belong to one category that permits variation in form independent from the curveletar-minimar distinction.  The following two examples are both from f116r, illustrating a form of flourish more often associated with y attached to a minim as l (at least, as I’d define it):

This line of reasoning can be generalized as the digital principle: if it’s important for the reader to be able to distinguish any two graphemes from one another, we should find them written in ways that fall into clearly-bounded categories, like digital data, without lots of intermediate forms that can’t easily be categorized.  Note that the term “digital” can refer to any quantity of discrete values and isn’t limited to binary systems with just two possible values.

Sums of money as expressed in a French accounting document from 1463 (author’s collection). The amounts indicated are 229 livres, 3 sous, 4 deniers; 68 livres, 15 sous; and 45 livres, 16 sous, 8 deniers.

Anyone who has engaged much with other early modern or late medieval handwriting might object that it’s typically full of ambiguous forms that can only be resolved from context, such that the digital principle shouldn’t be expected to pertain to a fifteenth-century manuscript.  For instance, a reader might need only to be familiar with the English language to recognize an ambiguously-scrawled pair of minims as u in out, but as n in ant.  Under these conditions, handwriting in which u morphs into n by imperceptible degrees might still work perfectly well in practice.  Common words tend to take on the most stylized forms, in which individual letters might be impossible to make out.

But Voynichese just doesn’t seem to me as though it’s behaving that way.  Even though the script looks as though it was written fairly quickly, we don’t see anything that looks like a stylized version of a common and easily-recognizable word.  Everything looks about as carefully written as everything else.

In terms of precision, Voynichese reminds me most of the sums of money I’ve seen written in Roman numerals in early modern and late medieval documents (see the example on the right).  The forms of L, V, X, I, and so forth vary quite a lot, to the point that if you didn’t understand the system you probably couldn’t categorize the numerals correctly.  But there’s still no room for uncertainty as to which is which.  After all, if there were any confusion between V and X, or between II and III, the document wouldn’t be able to do what it was supposed to do.  Of course, the existence of a tightly constrained data structure helps reduce ambiguity.  In the second line of my example, LXVIIILXVS, the first L is the numeral 50, but the second L is an abbreviation for livres, which we know because the order of monetary units is always livres, sous, deniers; the next unit shown is sous; the monetary unit is always explicitly written; and LXVIIILXV doesn’t make sense as a number.  Even so, whoever wrote these sums seems to have taken the extra precaution of writing L for 50 with a curved bottom and L for livres with a sharply angled bottom (perhaps marking it as an abbreviation in anticipation of £).  Somehow Voynichese feels similarly textured to me, though I’d be hard-pressed to articulate exactly why.
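The way the data structure removes ambiguity from LXVIIILXVS can be spelled out mechanically.  The sketch below tries every way of reading a string as numeral-plus-unit groups and keeps only the readings that obey the constraints just listed.  Several things in it are assumptions made for illustration: numerals are treated as purely additive (symbols never increase in value from left to right), S for sous is taken from the example, D for deniers is a guess, and the names additive_value and parse_sum are hypothetical.

```python
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def additive_value(numeral):
    """Value of a purely additive numeral, or None if it isn't one."""
    values = [ROMAN.get(symbol) for symbol in numeral]
    if not values or None in values:
        return None
    # Assumption: a larger symbol never follows a smaller one.
    if any(a < b for a, b in zip(values, values[1:])):
        return None
    return sum(values)


def parse_sum(text, units=("L", "S", "D")):
    """All readings of text as (value, unit) groups with units in fixed order."""
    if not text:
        return [[]]
    readings = []
    for i in range(1, len(text)):
        if text[i] in units:
            value = additive_value(text[:i])
            if value is not None:
                # Later groups may only use units that come after this one.
                later_units = units[units.index(text[i]) + 1:]
                for rest in parse_sum(text[i + 1:], later_units):
                    readings.append([(value, text[i])] + rest)
    return readings


print(additive_value("LXVIIILXV"))  # None: not a sensible number
print(parse_sum("LXVIIILXVS"))      # [[(68, 'L'), (15, 'S')]]
```

Only one reading survives, 68 livres and 15 sous, even though the letter L appears twice with two different functions.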

But some widely drawn distinctions among Voynich glyph types would run afoul of the digital principle.  Consider the distinction between a and o.  In the passage below from f15v (poror.orShy.choiin according to Zandbergen), the second and third putative tokens of o are closed with nicely rounded anticurvelets, but the first and fourth are both closed with strokes angled like the subsequent minims, which fits the formal description of a I presented earlier.

Throughout the Voynich manuscript, some glyphs clearly match the shape we associate with a or the shape we associate with o, but others seem to fall somewhere in between the two.  The Zandbergen transcription uses the convention [x:y] to indicate cases of uncertainty between any two readings, and a quick text search reveals that out of 847 total bracketed notations […], 103 are [o:a] and 62 are [a:o], comprising 165 cases (19.5%) in all.  Meanwhile, it can also be hard to tell apart from or o, since the form of y itself ranges from a paradigmatic o that continues its anticurvelet into a tail-flourish to a paradigmatic a that bends into a tail-flourish.  Counts for relevant uncertainties in the Zandbergen transcription are 26 [o:y], 25 [y:o], 20 [a:y], and 8 [y:a].  If we add these to the uncertainties between and o, we get 244 (28.8%).  Thus, in over a quarter of all passages the Zandbergen transcription flags as uncertain, the sole issue is mutual ambiguity among these three specific glyph types.  Ambiguous cases such as these are not, I think, just a practical nuisance modern-day transcribers have to endure if they want to convert the Voynich Manuscript into a computer-analyzable form.  Rather, they seem liable to have caused difficulties in reading even for someone versed in the system unless the distinctions between these characters weren’t meaningful, could be worked out from context, or were based on some formal criteria that aren’t yet clear.
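For anyone who wants to repeat the tally, here is a minimal Python sketch of the kind of text search described above. It assumes only that uncertain readings appear as [x:y] in the transcription file; the function names and the toy line are my own inventions for illustration, and the `reported` figures are simply the counts quoted in the preceding paragraph rather than anything the code derives from the manuscript.

```python
import re
from collections import Counter

# Matches a two-way uncertainty notation such as [o:a]. The exact file
# format of the transcription is an assumption here.
UNCERTAIN = re.compile(r"\[([^\[\]:]+):([^\[\]:]+)\]")

def count_uncertain(text):
    """Tally [x:y] notations exactly as written (order of readings preserved)."""
    return Counter(f"{x}:{y}" for x, y in UNCERTAIN.findall(text))

def share(counts, glyphs, total):
    """Percentage of `total` notations whose two readings both fall within `glyphs`."""
    hits = sum(n for key, n in counts.items()
               if set(key.split(":")) <= set(glyphs))
    return round(100 * hits / total, 1)

# Invented toy line, not real manuscript text.
toy = "d[o:a]iin.ch[a:o]l.[y:o]keey.[ch:ee]dy"
toy_counts = count_uncertain(toy)

# Counts quoted in the paragraph above, out of 847 bracketed notations.
reported = Counter({"o:a": 103, "a:o": 62, "o:y": 26,
                    "y:o": 25, "a:y": 20, "y:a": 8})
print(share(reported, "ao", 847))   # a/o ambiguity only
print(share(reported, "aoy", 847))  # a/o/y ambiguity together
```

Keeping [o:a] and [a:o] as separate keys preserves whatever preference the transcriber may have meant to signal by the order of the two readings; the `share` helper ignores that order when grouping.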

Counting passages the Zandbergen transcription flags as uncertain may help us measure the relative ambiguity of glyph types, but I believe the amount of uncertainty is higher than this might suggest, since there are plenty of cases I’d personally consider ambiguous for which this transcription gives just one reading without comment.  Here’s an example from the first line of folio 1r:

The Zandbergen and Takahashi transcriptions represent this as ataiin.  But is that first glyph an a closed by a minim that’s longer and more curved than usual, a y with a stubby tail-flourish, or an o with an extended anticurvelet?  To my eye, it isn’t obviously any one of these three things more than the others.  The Jorge Stolfi transcription gives the same word as ytaiin.  And here’s the very first word on f1r, potentially the first word ever written in Voynichese:

The Zandbergen and Takeshi Takahashi transcriptions represent this as fachys.  However, the closing stroke of the alleged is unusually rounded, more like the anticurvelet of o or even the tail-flourish of y.  Moreover, the alleged ch appears unusually compressed—usually it occupies the width of two ordinary curvelet-based characters, but here its second half is scaled like a mere flourish and might even be interpreted as a minim closing a curvelet, which would by definition make it another a, albeit very different from the first one.  The Jorge Stolfi transcription has fyays.  (Actually fya!ys, but the ! serves only to preserve interlinear alignment with fachys.)  Although the Zandbergen transcription doesn’t treat this case as ambiguous, it does report eleven other cases of uncertainty between ch and a, as well as twenty-nine between ch and ee, ten between and ee, and one between a and ei.  If it had been necessary to discriminate between ao, y, ch, and ee in order to look words up in a code-book (to cite just one scenario), it’s hard to imagine how a reader wouldn’t have been stumped by what appears on the page on a fairly regular basis.

My classification of glyphs and graphemes has so far hinged on a fundamental distinction between minims and curvelets.  These forms are often straightforward enough to tell apart, but sometimes they’re not.  Consider this snippet from f7r (I’m interested for the moment in only the top and bottom words but wanted to preserve their positioning relative to each other on the page):

The bottom word seems unambiguous enough as oaiir, but the top word is more challenging to resolve.  The first glyph would ordinarily be transcribed o, and the last three characters as ees, but the intervening glyph looks a bit odd, if it should even be regarded as a single glyph.  It might be held to fall within the range of variation for ch, which is how William F. Friedman’s “first study group” categorized it; but with its unusual closure at bottom and top it also resembles a, and except for its formal contrast with what follows, it could also be taken as ee—in fact, the Zandbergen transcription reads this word without comment as oeeees.  Meanwhile, the sequences in the top and bottom words seem to run parallel to each other: o, then a curvelet closed with a distinctive hatchmark, then two more identical hatchmarks, then a fourth hatchmark augmented with an upward flourish.  The only difference is in the shape of the hatchmarks, which appear to be curvelets in one case and minims in the other.

The EVA transcriptions ochees/oeeees vs. oaiir obscure what I take to be a visually compelling similarity in pattern.  Something like o(////) vs. o(\\\\) would make it readily apparent but would also entail dividing up the characters in a very unconventional way.

There are other cases in which curvelets behave in ways that are atypical of them but that parallel common minim behavior.  Consider the group of three similar strokes towards the end of this sequence on f4v.

This would read olain if the three strokes were minims, but they’re pretty definite curvelets, which results in the rare character at the end and that other peculiar structure in the middle that’s hard to classify but could be ee, ch, or a (the Zandbergen transcription has oleeeb).  Rare or unique words ending in can sometimes be turned into more common words by changing their hatchmark-curvelets (but not their forward-linked connecting curvelets) into minims, as for example with cheeeb (f20r), which becomes aiiin.

And then consider the string from f107v shown below.  The first two hatchmarks are what I’d consider paradigmatic minims, and the fourth is a paradigmatic curvelet, but the third seems to take a third form that isn’t quite one or the other, even if most transcribers might grudgingly accept it as a minim for want of another choice.


On f6v, we find another sequence that seems to contain three differently-formed hatchmark types: a curvelet with a connecting bar followed by two curvelets, two minims, and two of something that resembles the “third form” mentioned above, with a head-flourish at the end.

If a paradigmatic hatchmark-curvelet is shaped like a C and a paradigmatic minim is shaped like a backslash (\), this “third form” looks more like an L.  Examples of it can, I think, be spotted elsewhere, as on f83v, line 2:

This sequence might be parsed in traditional terms as chedy qokedy, but the two e characters aren’t entirely convincing as curvelets.  Occam’s razor nudges me away from positing a third distinct category of hatchmark, but at the same time I feel that the cases I’ve presented should complicate any straightforward binary distinction between minims and curvelets, or between characters built upon them in turn, such as and s.  It will be recalled that Cham and Jackson propose that minimars and minims are distinct glyphs that have an “affinity” for one another, and that curveletars and curvelets are distinct glyphs that have a similar “affinity.”  But based on the foregoing examples, I’m more inclined to suggest that successive hatchmarks tend to share similar forms, including strange forms, and to be patterned in certain ways, but not rigidly so; that flourishes are separate graphemes added to hatchmarks according to other patterns; and that any “affinity” among minimars and minims (on the one hand) and among curvelets and curveletars (on the other) is an epiphenomenal consequence of the interplay between hatchmarks and flourishes.  In other words, I don’t think that has an affinity with a discrete glyph b, or that has an affinity with a discrete glyph n.  Rather, I suspect that eb and in arise (as opposed to en and ib) as a natural consequence of a foot-flourish being added after ee and ii respectively.

As we’ve seen, additional marks can be attached to the bar of ch: the plume that creates Sh, and the gallows that treat it as a “pedestal” for cPh, cTh, cKh, and cFh. Moreover, we sometimes find more than two curvelets bound together with a bar, like this (f82r):

EVA transcribes a curvelet linked both backwards and forwards by a bar with a capital H, as well as providing the convention of placing connected glyphs within brackets {…}.  A gallows or plume (or both) can sometimes be found inserted into one of these larger structures and, in these cases, generally seems to be associated with the space between two of the constituent curvelets rather than with the whole bar.  If there are two connected curvelets to the right, the first is often written oddly, as though to link or subordinate it to the second.

Sometimes the final curvelet in such a structure also has the closing stroke ordinarily associated with o (an anticurvelet), y (a tail-flourish), or (a minim).

Sometimes the initial curvelet is connected to one or two legs of a gallows but not through it to anything on the other side—something I think of as a “broken” bar.  The third example below, with some additional weirdness, is from f24v.  I suspect that the rare addition of a flourish to a stubby horizontal line, as seen here, may serve to associate it with a curvelet or minim to which it couldn’t be attached directly without causing confusion.  Let’s call a detached horizontal line like this one an insertion-bar.

The line that connects qo behaves like a bar in some of these same respects.  It can hold a plume, which occupies the same range of continuously variable positions relative to the rest of the structure that we see with Sh.  It can also hold a gallows, in which case it often seems to be “broken,” and not to continue through to the o.

Sometimes a bar starts from a minim rather than a curvelet, with the intent often appearing to be to link an initial ai or oi across a gallows to a terminal ey, eey, or eeey.  (EVA seeks to accommodate these situations by capitalizing linked-forward versions of i, o, a, and y as I, O, A, and Y.)   The specimens below are from f105v, f30v, f80v, and f105v (with apologies for my inconsistency in labeling; I’ve been preparing illustrations piecemeal for some time).


Similar linkages can be found between a minim and a following curvelet without a gallows inserted between.  Whether there’s a gallows or not, the stroke at the start of the linkage would often be ambiguous in form—that is, neither an obvious curvelet nor an obvious minim—if examined out of context.

In these more elaborate structures, the minim or minims culminating in a bar are always preceded by or o, and in the example shown above from f30v, the bar extends through both the and the following minim, suggesting that it may apply to the whole preceding glyph sequence, and perhaps that bars in general can link larger units than just the glyphs to which they’re directly attached.  In one case (on f115v), we can see an even more complex set of linkages in which an extra, higher bar extends from the second curvelet in the first ch to the second leg in the gallows portion of cTh.

Judging from the examples presented so far, the bar seems to be tying elements of the script together into significant units, e.g., these two or three curvelets are somehow associated with each other.  Why that would have been helpful is, of course, an open question.  Bars can be found beginning and ending at curvelets, minims, and gallows legs, in pretty much any conceivable configuration—nothing seems to be ruled out on that front.

But they can even be connected to nothing, as in this other example from f113r, which closely parallels the preceding one.  That complicates an identification of them as linkages, except maybe by analogy with some expression such as “in- and out-boxes and -folders.”

The rare glyph x, sometimes likened to a “picnic table,” appears to be another barred structure, but its bar is typically raised to a higher level than the bar of ch, occupying the same vertical position as the loopback stroke of p (see first example below from f46r); if we count the base line, one almost gets the impression here of a four-line staff.  The element on the left of the bar sometimes appears to be a curvelet, and might even be part of a curvelet sequence, but sometimes it instead looks like a straight line slanted in the opposite direction from a minim, which we could call an antiminim by analogy with my term “anticurvelet.”  The combination of an antiminim with a minim is also found without a bar, producing the rare glyph transcribed (see example below from f3r, or pairings of and side by side on f57v).

Could the higher bar in the complex structure on f115v represent another instance of the kind of elevated bar we see in x?


Another complication is the “split gallows,” in which the legs are anchored to separate spots in the text, and often in ch pedestals.  Here are a couple specimens from f8v and f8r, just to illustrate:

Most speculation I’ve seen (here, for example) centers on figuring out linear sequences to which passages with split gallows could be equivalent.  For example, a split gallows might be an abbreviated or fancy way of writing two identical gallows, so that the example on the right above would be equivalent to ctho.cthey.  But I hate to shrug split gallows away like that: maybe their weirdness offers a window onto some equally weird structural principle.  What could gallows be doing in general that would cause them occasionally to point to separated spots in the text rather than just to one spot?  What broader logic could have allowed a reader to grasp intuitively what was going on in such cases?

Meanwhile, gallows legs can be found inserted into a greater variety of contexts than I’ve yet covered.  Here’s a sampling of some of the more unusual leg placements with gallows both split and unsplit.  Note the points where legs appear to be anchored in o, e, and the flourish of s, and where the two legs of straddle an e between them.  These examples remind me of nothing so much as proofreader’s marks, extraneous to the text itself but simultaneously capable of marking it up in meaningful ways.

Many other curious situations arise, but one final one I’d like to bring up here in particular appears on f54r.  A tail-flourish descending from one line has run into a minim on the line below, intersecting with it in a place where it would interfere with adding a top-joining flourish.  But the writer seems to have wanted to add such a flourish anyway and to have resorted to an unusual method of doing so.

The unique glyph that follows is assigned number 165 in “extended EVA,” and I tentatively interpret it as a head-flourish added to an insertion-bar that terminates with something resembling a tail-flourish or loopdown-flourish or the final leg of a loopdown gallows.  I’m pretty sure we’re seeing an effort here to improvise a workaround for a peculiar case of crosstalk between two lines of text, likely to express chor.chor.dam.  If so, this may shed light on the dynamics of the writing process.  It’s hard to picture the writer ignoring the crosstalk while writing choi if it had already been apparent how the word would conclude.  The behavior we see would make more sense if the word had been written in multiple stages, with choi coming first, at a time when the writer wasn’t yet sure whether this would be further elaborated into chor, chol, choiin, etc.  The crosstalk would only have been a potential problem until it became clear that the word would need to be chor (or whatever).  Another possibility we might consider is that the whole text was written in multiple passes, and that choi was written before the tail-flourish on okol, but a multiple-pass writing process wouldn’t seem to fit the evidence furnished by changes of ink (as on f104v), which implies that each new word was ordinarily written from start to finish only after the word before it was complete.

Gordon Rugg is a leading proponent of the theory that the Voynich Manuscript is a hoax, arguing that the patterns others have identified in its text could have arisen as a byproduct of a process of generating gibberish by moving templates with holes cut into them (“Cardan grilles”) around a grid filled with meaningless glyph combinations and copying the contents of the exposed cells.  But it would be hard to reconcile the actual graphical anomalies of the text with a process of copying discrete glyph groups from a template.  I won’t say that no programmatic hoaxing method could conceivably have produced the kinds of passage we’ve been examining, but Rugg’s Cardan grille mechanism seems less suited to illuminating them than others that spring to mind involving protocols for recording the results of dice throws and such.  Rugg’s work is more nuanced than his critics give him credit for, but I think it would be strengthened by considering alternative hoaxing mechanisms.  His devotion to one specific solution, and his increasingly elaborate efforts to “patch” it, might even constitute a bias blind spot.

§ 3

The Hatchmark Skeleton Model

I decided to try constructing a Voynichese “word grammar” sensitive to some of the points laid out above—one based on interactions among minims, curvelets, and flourishes as separable elements rather than on whole glyphs.  I haven’t developed it as fully or methodically as some other word grammars have been developed, but just far enough to get a sense for directions in which it might lead and advantages or disadvantages it might have relative to other approaches.  One difference—I don’t know whether it’s an advantage or a disadvantage—is that I feel as though I’ve spent more time wrestling with uncommon forms than common ones, since I’m interested here in working out what’s possible within the system, no matter how rare, and not only what happens most often.

I began by looking for a common denominator among single-glyph words.  Many glyph types appear in isolation in unusual contexts such as rings or columns, where their behavior might not be subject to ordinary constraints.  But only a few glyph types appear by themselves as words in paragraphic text, which is what I’ve focused on for present purposes.  Single-glyph words are particularly prone to ambiguous spacing, so I don’t want to put too much stock in exact frequency counts.  However, the most common single-glyph word is pretty clearly s, followed by r, y, l, and o, and then (substantially rarer than the foregoing) and g.  These all consist of a single hatchmark with a flourish or, in the case of o, an anticurvelet (which I’ll provisionally classify as another kind of flourish).  This led me to speculate that the shortest valid word might consist of one hatchmark plus one flourish.

Many other short words can be analyzed similarly as a flourish added not just to one hatchmark, but to the last in a series of hatchmarks—generally at least three of them.  This observation gave me the idea of defining a “turn” as the combination of a flourish, the hatchmark to which it’s attached, and any preceding hatchmarks that don’t have flourishes; and also of proposing a rule (subject to testing) that a word shouldn’t end in mid-turn but needs to be closed with a flourish.  I’ll refer to this last as the rule of the terminal flourish.  Within a series of hatchmarks making up a turn as I’ve described, curvelets can be added to other curvelets, but minims—if there are more than one before the flourish—seem to require an initial curvelet to form the ligature a = e+i.  Examples: ees, ar, aiin.  There are almost always at least two minims in such cases, but in rare instances where there’s only a single minim bearing the flourish, it may be written separately, e.g., er (in cher), el (in qokeel).

Curvelet pairs can be linked together as well, usually as ch with a bar.  The conditions for this linkage aren’t immediately obvious; it’s not the case that the first two curvelets of a sequence are always bar-linked, for example, although that arrangement is especially common.  The linkage might help somehow with parsing, loosely analogous to commas in long numbers, but there are enough cases where it’s hard to decide whether it’s present or not that I’m skeptical about it conveying essential information (invoking the digital principle).  Curvelets are sometimes found linked at the bottom instead, in twos or threes, which may serve the same purpose or a similar one.  Meanwhile, I take the fact that similar bar-structures can be found linking other glyph types as an argument in favor of treating ch as at least containing ee, rather than as an independent glyph type that just happens to be made up of similar strokes.  For the moment, I’ll consider ch to be a scribal variant of ee, and other bar-linked or connected structures to be scribal variants of the glyphs that seem to appear in them, holding open the question of what the linkages are accomplishing.

Insofar as ch lacks a flourish, it shouldn’t constitute a valid turn by itself (as I’ve defined things so far), although the final curvelet of ch can receive a flourish to produce rare forms such as {co} and {cy}.  Otherwise, if ch were fully interchangeable with ee, the model I’ve been developing predicts it should always be followed by a curvelet.  That is, not only should we see che… rather than chi…, but also chy and chs rather than chl and chr.  In practice, chy and chs are significantly more common than chl and chr, but the latter sequences exist.  They don’t exist in quite the quantities the Zandbergen transcription would suggest; among other things, their kchrrr (f27v) is to my eye an unambiguous kchsss.  But chl and chr do exist, which might reflect an option of writing e+e+i as ch+i rather than e+a or {ca}.  If so, ch wouldn’t be fully interchangeable with ee, since the final of ee needs to combine into if it’s followed by i; but it could still be equivalent to ee.

If there are multiple successive curvelets and multiple successive minims within a turn, the rule—with few exceptions, to be considered later—is for curvelets to come first, e.g. cheear, cheaiin.  I’ll call this the rule of curvelet priority.

I suspect and may be equivalent.  That’s in part because they’re so often hard to tell apart graphically; in part because they turn up in some of the same positions, e.g. in {co} and {ca} and in oiin and aiin, making it hard to see how ambiguous cases could have been resolved from context; and in part because sometimes ends a word, which it shouldn’t be able to do—according to my model so far—if it’s simply a ligature of two hatchmarks.  One difference between words containing o and a is that they statistically favor different positions within lines, as I’ll discuss below; and there are other statistical differences besides, e.g., ot is far more common than at, which would still demand explanation.  But if they’re otherwise equivalent (and that’s the hypothesis I’m going to run with for the moment), that would seem in turn to make o, like a, equivalent to e+i, with an anticurvelet being equivalent to a minim.  It would also extend flourish status to the closing stroke of a.  The consequent blurring of the boundary between hatchmarks and flourishes makes me uneasy: I’d be more comfortable if these were clearly separate categories.  One way to theorize the pattern would be to suggest that and function differently depending on whether or not they’re followed by minims, such that the closing stroke behaves as the first minim in a group when there are other minims immediately following, but as a flourish when there aren’t—and never as both a minim and a flourish simultaneously.  Another would be to treat the stroke consistently as an extensible flourish: one that permits extension as the first member of a minim sequence but isn’t always actually extended.  Yet another would be to say that o/a is always a ligature of two hatchmarks but that the rule about a turn needing to end with a flourish doesn’t apply to it for some reason.

Since I’m unsure which interpretation (if any) will fare better over the long run, I’m going to sidestep the issue by referring to the closing stroke of o/a as a semiflourish, with its status at least temporarily unresolved.  Its shape often anticipates the form of the minims that follow, if any, which is usually what gives rise to a; but it isn’t required to do so.  A following minim might need only to be written in such a way as to suggest an association with a semiflourish, whether that means taking a typical minim shape or mimicking the quirky form of some more specific semiflourish.  That would make the following two words equivalent.


But where then to draw the line between a/o and ch?  Can these be found morphing one into the other by imperceptible degrees?  Is the sequence shown below also equivalent to the two preceding?

I’m torn over this point, but for the moment I’m going to posit that they’re not equivalent, and that what distinguishes ch is the presence of a bar that holds its two curvelets at arm’s length, as it were, while fuses its parts tightly together.  One of my reasons is that the rare forms {co}, {ca}, and {oh} seem to entail inserting a/o into a ch-like barred structure, which I find intuitively hard to square with ch itself being equivalent to a/o.  Larger barred structures imply that ai, oi, aii and aiii can also fill the same slot as the in ch.   But I’ll admit that these cases could reflect a recursive structure, with one linkage containing another equivalent linkage; and if ai were at least analogous to che, that peculiar double linkage seen on f115v might be understood as reflecting the same conditions (and the same ambivalence about precisely what’s connected to what) as a bar extending forward from ai or oi.


What else can be said about the behavior of the semiflourish?  Much as o/a is often followed by bare minims, it’s also often followed by bare curvelets—oe, oee, och, oeee, oche, oech, ochee, oeech, oeeee, oeche, ae, aee, ach, ache—topping out with the same quantity as we see with minims (maximum oiiii).  An o, it seems, can end a word (cho), or be extended by minims (choiin) or by curvelets (choees), and if a curvelet sequence after o also ends with o, the latter o can be extended too (oeos, oeain).  We sometimes also find o/a doubled (e.g., chool, oal, oaiin, ooeeor) or even tripled ({ck}ooaiin on f8r, oaorar on f16v).  Setting aside any distinction between o and a, sequences of bare curvelets seem to be able to alternate repeatedly and optionally with o, oo, and ooo, while at least one o is required before bare minims.

Many words contain more than one flourish, even if we don’t count semiflourishes.  One way to analyze them would be to treat them as multiple successive turns concatenated together: for example, chedy might be parsed ched+y: four curvelets with a loopback-flourish (first turn), then one curvelet with a tail-flourish (second turn).  This approach entails nothing more than dividing strings into chunks, which makes for gratifyingly straightforward computational attacks and is (perhaps for that reason) especially popular.  But chedy could also be analyzed as a one-turn word cheey with an “extra” flourish attached to one of its curvelets, or as chey with a whole d (curvelet+flourish) inserted, or as chee “closed” with two different flourishes rather than just one: che(+d)(+y) = ched+chey, or in any number of other ways that will suggest themselves.  For a variety of reasons, I think some of these other possibilities are worth investigating too, although it can be a bit more involved to do so.

Let’s consider an interesting behavior displayed by the head-flourish (seen in r and s).  Like all flourishes, it can end a turn, but if it’s added to a minim sequence (rather than to a single minim in isolation), the minim sequence can continue beyond it in seemingly uninterrupted fashion, e.g., dairin and chariin (f49v).

This is quite unusual—one of very, very few situations where we find bare minims following immediately after anything but o/a.  When is word-initial (and so not itself preceded by o or a), it’s never followed by bare minims; this only seems to occur with minim sequences that would also be valid without the head-flourish.  For example, dairin coexists with daiiin, as does dariin, which shifts the head-flourish to a different minim.  We have darin (f50r) and dairn (f102v1) complementing the ubiquitous daiin.  We also find dairl (f44v) and daiil (f78r); and dairg (f8r) and daiig (f52v).  Some words ending in rg don’t have equivalents attested with ig, such as sarg (but no saig) or org (but no oig); but the latter wouldn’t violate any expected patterns, and arguably similar words do exist (e.g., sain, aig).  These cases could also be parsed as insertions of the whole combination r (minim+flourish), e.g., dairin as a modification of daiin.  But all in all, I find it simplest to account for these cases in terms of something being added to an already-available word structure, and not in flowchart-like terms of rules governing which glyphs can appear before or after which other glyphs.

The model I’ve been piecing together so far requires one flourish to be attached to the last hatchmark in a turn.  But in the situation I’ve just described, if I’m sizing it up correctly, the turn has not only its expected flourish at the end, but another one too—a head-flourish—attached to an earlier hatchmark in the sequence.  Maybe it matters which hatchmark it occupies, such that darin and dairn have distinct meanings.  Or maybe it doesn’t, in which case adding a head-flourish to daiiin could produce dariin, dairin, or daiirn indiscriminately, with the head-flourish to be understood as “going” with the whole sequence.  The behavior of the head-flourish might parallel that of the similar-looking plume added to ch, the placement of which might be significant, but which might instead “go” with the ch curvelet pair as a whole, judging from the seemingly infinite continuum of possible positions.  The curveletar equivalent to is s, which might behave similarly as well.  For example, the unique words chesey (f30r) and cheesy (f105r) could result from adding a head-flourish to particular spots in cheeey.
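To make the idea of "adding a head-flourish to an already-available word" concrete, here is a small Python sketch that lists every word obtainable by flourishing exactly one bare hatchmark. The function name and its defaults are hypothetical, and the sketch works purely on EVA strings: it assumes that i becomes r for minims and that e becomes s for curvelets, as proposed above, and it makes no claim about which of the resulting forms are actually attested.

```python
def flourish_variants(word, bare="i", flourished="r"):
    """Return every word obtained by turning one bare hatchmark into its
    head-flourished counterpart (i -> r for minims, e -> s for curvelets)."""
    return [word[:k] + flourished + word[k + 1:]
            for k, glyph in enumerate(word) if glyph == bare]

print(flourish_variants("daiiin"))            # minim sequence
print(flourish_variants("daiin"))
print(flourish_variants("cheeey", "e", "s"))  # curvelet sequence
```

A list like this could then be checked against a transcription to see which positions the head-flourish really occupies and how often, which bears on whether its placement within the sequence is significant.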

Other types of flourish might be understood as displaying similar behaviors.  Consider the loopback flourish.  The minimar is very rare, so the fact that there seems to be no evidence for it comparable to dairin for r (which is itself relatively rare) might just come down to probabilities.  But the corresponding curveletar d could often be interpreted in terms of a flourish added midway through a curvelet sequence, suggesting cheeey as a potential underlying structure for chdeey (unique on f104v), chedey (unique on f111r), and cheedy (51 tokens).

The foregoing speculation led me to consider redefining a turn as any sequence of hatchmarks (including o/a as a curvelet-minim ligature) within a word that follow the expected order (curvelets, then minims), together with any other flourishes attached to them.  The turn would then have a skeleton such as cheeaiii (= eeeee+iiii) plus an ornamentation composed of complementary flourishes to flesh it out (with cheeaiii potentially becoming cheeaiin with its one obligatory final flourish, or chedaiin with a second flourish added in).  There might be one set of rules governing the construction of hatchmark sequences, and then another set of rules governing their ornamentation—for example, any skeleton that ends with a bare or might call for a flourish at the end, while skeletons ending in a semiflourish don’t need one.

Some word skeletons consist only of minims.  There are the self-standing words l and r, as we’ve already seen.  But lr also turns up as a self-standing word (7 tokens), together with unique words rl, lm, llm.  There seems to be no problem with having multiple minims in a row, with no preceding curvelet, as long as all the minims have flourishes.  It’s only when there are bare minims that the rule requiring an initial curvelet appears to kick in.  Thus, rl is valid, but il would evidently not be without prefixing a curvelet as al or ol.  The more bare minims are present, the more likely the final flourish is to be a foot-flourish, as in aiiin.

Bearing such phenomena as split gallows in mind, I’ve been inclined to treat gallows in general as anchored to particular points in, around, and between words, rather than as wholly “parts” of words, such that inserting legs between the two curvelets in ch could create cTh, cKh, etc. without theoretically interrupting the adjacency of the two curvelets.  For example, we could parse the unique word cPhesaiin on f1r—


—as the skeleton cheeaiii (=eeeeeiii) with (1) a final foot-flourish: cheeaiin; (2) a medial head-flourish added to the last free curvelet before the a: chesaiin; and (3) a gallows inserted between the first two curvelets: cPhesaiin.  The placement of gallows can be analyzed further in various ways.  For example, is followed by much more often—15.2% of the time in paragraphic text, ignoring word breaks—than it is by any other gallows, with coming in next at 2.3%; and gallows types vary greatly in their likelihood of being followed by e, which is 39.4% for k, 28.1% for t, 0.9% for f, and 0.5% for p.  Those are pretty widely recognized patterns.  But we could also take a word without a gallows, e.g., chey, and see whether it can also be found with gallows attached or inserted in each possible position, e.g., kchey (17), cKhey (20), chkey (7), cheky (54), cheyk (no occurrences); and try to infer rules about gallows placement from that.
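The position-by-position check described above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the function name gallows_variants is hypothetical, ch is the only barred pair it handles, and the counts are the ones quoted above for chey rather than anything recomputed from a transcription.

```python
def gallows_variants(word, gallows="k"):
    """Insert one gallows at every possible position in `word`.

    Assumption: a gallows that lands between c and h is written as a
    capital letter (cKh), following the notation used in this post.
    """
    variants = []
    for pos in range(len(word) + 1):
        inside_ch = (
            0 < pos < len(word)
            and word[pos - 1] == "c"
            and word[pos] == "h"
        )
        glyph = gallows.upper() if inside_ch else gallows
        variants.append(word[:pos] + glyph + word[pos:])
    return variants


# Token counts quoted in the text for chey plus k (not recomputed here).
quoted_counts = {"kchey": 17, "cKhey": 20, "chkey": 7, "cheky": 54}

for variant in gallows_variants("chey", "k"):
    # Variants missing from the table are reported as zero occurrences.
    print(variant, quoted_counts.get(variant, 0))
```

Run against a real word-frequency table in place of quoted_counts, the same loop would show which insertion points are attested for any gallows-free word.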

Such an approach could accommodate many words (though probably not all) with gallows at the beginning or in the middle of words, which would appear to be valid without the gallows (witness kchey, cKhey, chkey, cheky).  However, it works less well with the relatively few words that end in gallows, which are sometimes interpreted as closing brackets for “Neal Keys,” based on Philip Neal’s observation that these “tend to occur rather more than half way across the first line of a page.”  In some of the latter cases, a word-final gallows is found following a hatchmark with a tail-flourish (y, l) or a semiflourish (o, a), either of which could end a turn or word on its own—curiously, a final gallows never comes after r, s, or d, even though these glyphs may be found before gallows earlier in words.  But there are also cases in which the final gallows follows e, i, or ch, e.g., chef, dait, chk.  In those cases, removing the gallows would yield che, dai, ch.

Words that don’t end with flourishes—something which my rule of the terminal flourish predicts shouldn’t exist!

I’ve been holding out: there are, in fact, about as many words that end with these glyphs, without closing flourishes—e.g., qokeee, qokaiii, ypch—as end with these same glyphs plus a gallows.  Not many, mind you; we’re talking around 0.2% of words total in paragraphic text.  Both types of word rarely appear at line breaks and never at the ends of paragraphs, which might give me an excuse to dismiss them as the result of spurious word breaks, although the breaks seem graphically distinct.  Other researchers have seemed content to attribute similarly small percentages of inconvenient data to scribal errors or errors in transcription.  But for the moment I just want to acknowledge this troublesome phenomenon and give it a name for future reference.  I’ll call words such as qokeee, qokaiii, and ypch “plain danglers,” and words such as chef, dait, and chk “gallows-danglers,” the idea being that they leave a plain hatchmark “dangling” at the end of a word.  I hope that any mental images the term “gallows-dangler” might conjure up can be appreciated as a little gallows humor.
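The two labels just defined lend themselves to a mechanical test.  Here is a minimal Python sketch, assuming the glyph spellings used in this post (gallows as k, t, p, f; bare endings as e, i, or ch).  The function name classify_dangler is hypothetical, and the rule is only one reading of the definitions above.

```python
GALLOWS = set("ktpf")


def ends_bare(text):
    # A "bare" ending is an unflourished hatchmark: e, i, or the pair ch.
    return text.endswith(("e", "i", "ch"))


def classify_dangler(word):
    """Return 'plain dangler', 'gallows-dangler', or None."""
    if ends_bare(word):
        return "plain dangler"
    if word and word[-1] in GALLOWS and ends_bare(word[:-1]):
        return "gallows-dangler"
    return None


for w in ["qokeee", "qokaiii", "ypch", "chef", "dait", "chk", "daiin"]:
    print(w, classify_dangler(w))
```

Words that end in a gallows after y, l, o, or a fall through to None here, since those endings could close a turn on their own.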

With this modified definition of a turn, the majority of words could be analyzed as single turns, progressing consistently from curvelets to minims and ending with flourishes, and so conforming to both my rule of the terminal flourish and my rule of curvelet priority.  Separating skeletons from “ornamental” flourishes would also invite types of analysis that might not otherwise have suggested themselves.  For example, each of the last few examples I’ve illustrated in facsimile (dairin, chariin, darin, cPhesaiin) terminates with a foot-flourish and has a head-flourish somewhere near the middle.  Is that a common pattern of ornamentation?  Could it mean the same thing in each case?  Is there a limited set of common patterns that would accommodate most words?

But the step of dividing the text for analysis into a skeleton layer, a gallows layer, and an ornamentation layer isn’t quite as straightforward as I may have made it seem.  For one thing, the semiflourish of o/a remains tricky to classify as either a hatchmark or a flourish.  I’m inclined to put it consistently in the skeleton layer, but I’m uneasy about this.  Regardless of where we assign it, though, its interplay with the gallows layer seems to conform at least loosely to expectations.  So, for instance, okool, otoar, okoaly, shotokody could be analyzed as oool, ooar, ooaly, shooody (further cases of triple o/a) with inserted gallows.  But I’m pretty sure quadruple o/a never, ever occurs—at least in the paragraphic text I’ve studied—either with or without one or more inserted gallows.
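To make the layering concrete, here is a rough Python sketch that peels the gallows layer and then the ornamentation layer off a word.  The mapping of flourished glyphs to bare hatchmarks (s, d, y, g, b to e; r, l, n, m, j to i) is an assumption pieced together from the glyph classes discussed in this post, the function names are hypothetical, and o/a are simply left in the skeleton, which sidesteps rather than settles the uneasiness just mentioned.

```python
GALLOWS = set("ktpfKTPF")

# Assumed mapping from flourished glyphs to the bare hatchmark underneath:
# curveletars reduce to e, minimars reduce to i.
BARE = {
    "s": "e", "d": "e", "y": "e", "g": "e", "b": "e",
    "r": "i", "l": "i", "n": "i", "m": "i", "j": "i",
}


def strip_gallows(word):
    """Remove the gallows layer, including gallows inserted into ch."""
    return "".join(g for g in word if g not in GALLOWS)


def skeleton(word):
    """Remove gallows, then reduce every flourished glyph to e or i."""
    plain = strip_gallows(word).replace("Sh", "ch")  # drop the plume
    return "".join(BARE.get(g, g) for g in plain)


print(strip_gallows("okool"), strip_gallows("shotokody"))
print(skeleton("cPhesaiin"), skeleton("chedaiin"))
```

Under these assumptions cPhesaiin, chedaiin, and cheeaiin all reduce to the same skeleton, cheeaiii, which is the kind of grouping the layered analysis is meant to expose.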

And then there’s the matter of q.  As I observed earlier, the bar after q exhibits behavior similar to that of the bars in other barred structures, including ch, such as allowing insertion of a gallows or plume.  The glyph at the right end of the bar is usually o, but even when it’s not, it’s always a curvelet or curveletar, which is another characteristic shared in common with other barred structures: qe (or e.g. qKh with inserted gallows) seems analogous to ch, qs to Sh, qch to {chh}, qa to {ca}, qy to {cy}, and the usual qo to {co}.  And yet there are some differences too.  In other cases where three elements are linked by a bar, the bar is generally continuous.  But the rare glyph combination qch doesn’t have a single continuous bar; instead, the bar from q strikes in the middle, and then a second bar links the top of to h.

Another difference is that the elements at either end of other barred structures (most often e) are easily recognizable as glyphs that can also be found commonly standing on their own.  Not so with q.  The most striking resemblance in its case is with the first leg of and k.  That element seems to be able to form part of other structures besides, generally with its “bar” terminating in some sort of more or less loopy flourish.

Although q differs from these other cases in being written on the same line as the hatchmarks rather than above it, I’m inclined to treat it similarly as something inserted (almost always before initial o) in the same “layer” as gallows, such that oeedy, qoeedy, qokeedy, qoekedy, toeedy, oteedy, qoekedy, and so on differ from each other only in their gallows-layer markup, and from ochdy, qopchdy, qocKhdy, and so on only in additional variation between ch and ee.

I’ve defined a “turn” as beginning with curvelets (if any) and progressing towards minims (if any).  But words can also contain minims followed by curvelets, violating my rule of curvelet priority—at least if words are the unit to which that rule pertains.  I’m not sure they should be, and in fact I’ll argue further along that they shouldn’t; but for the moment, since this is after all supposed to be a “word grammar,” I’ll proceed as though they are.

Some words begin with one or more minims followed by one or more curvelets, with the minims almost always ornamented as or l, e.g., laiin, raiinldy, rchl; or as some combination of those glyphs if there are more than one, e.g., lram, rlaiin.  These might be analyzed as compounds of the self-standing words l, r, lr, and rl, and they also look suspiciously like sequences found written apart or with ambiguous spacing, e.g. o.l.r.aiin (f115v).  Indeed, the glyphs and seem to be magnets for spacing anomalies, as though something about them made it less clear than usual whether they should stand on their own or belong to the previous or following word.  I’ll have more to say about this later, and I believe it’s an important point.

Meanwhile, any minim with a flourish in mid-word has the potential to “reset” the hatchmark sequence to curvelets, effectively concatenating two or more turns together, each of which may still conform individually to the rule of curvelet priority, e.g. choraiin (chor, one turn + aiin, second turn).  Only a head-flourish (r) seems to allow bare minims to follow, as in dariin, and then only rarely.  Otherwise, individual minims or minim clusters in mid-word or at the end of a word are often preceded by a/o, but sometimes not—we find dr (pdrairdy), dl (dchodl), sl (qsls), yl (olyly), yr (chetyry)—and and may follow one another immediately or repeat as lr (e.g., cholraly), rl (airorlchy), ll (pollain), rll (orllory), while nl is occasionally found too (ainanl)—any hatchmarks follow beyond that invariably being curvelets.  When a gallows is inserted after a word-initial minim, l overwhelmingly predominates over r, and there are also several words beginning {iKh} or {iTh}, with a bar.  This distribution of ornamentation is similar to what we see with word-initial curvelets, where overwhelmingly predominates over and numerous words also begin cKh, cTh, etc., which hints that the same processes might be at work in both cases, regardless of whether minims or curvelets are involved.  Two ultra-rare cases of word-initial minims without ornamentation or bars appear in qopor.iirchal (f115v) and checKhed.iir (f111r).

We might speculate that word-initial ii is analogous to word-initial ee, which isn’t quite as rare but still far less common than initial ch.

It may also be significant that the apparently word-initial cases of ii are both preceded by words ending in specific glyphs that, on rare occasions, appear before bare minims in mid-word: (as we’ve already seen) and (as I’ll discuss below).

Occasionally transitions from bare minims to curvelets appear deeper within a word.  We often find these in configurations similar to those associated with word-initial minims that do have flourishes, i.e., with a gallows at the transition as in daiikey (f69r), taipar (f50r), raikchy (f107v); with a gallows and a bar at the transition; or just with a bar at the transition.

The first parts of these words resemble plain danglers and gallows-danglers.  Meanwhile, the linkage {ih} might reflect the same impulse to form ligatures at transitions between hatchmark types that gives rise to a.  Granted, {ih} can be found with a gallows inserted, while a doesn’t seem to invite this possibility; but we do find ultra-rare cases of {cTi} (f29v) and o{cKi} (f53r), which could be interpreted as attempts to insert gallows into a and oa.  Thus, one function of the bar might be to bridge transitions between unlike hatchmark types.

Sometimes we simply find a transition from minims to curvelets with no further ado—again, with a first part resembling a dangler.  It’s hard to assess this phenomenon statistically based on existing transcriptions because of their different standards for identifying glyph types.  In particular, cases of is reported in Zandbergen often turn out, to my eye, to be unambiguous cases of ir.  But there are also instances in which a sequence of minims definitely transitions to a curvelet shape before hitting final s, including a case we examined earlier from f107v:


As with is, reported cases of iy sometimes seem to be a consequence of defining based on flourish shape rather than on whether the flourish is added to a curvelet or a minim.  In many cases I’ve looked at, the hatchmark to which the tail-flourish is attached looks like a plausible repetition of the preceding hatchmarks, whether they’re solid minims or something more ambiguous, like the “third form.”  But not always.  And reported cases of ie generally look to be valid, with a sharp distinction between the two glyph types.  The final minim before a curvelet sometimes looks as though it’s linked to the curvelet at the bottom, with both ie and iy, which might serve the same function as a bar linkage, or at least a similar function.

Reports of io and id likewise tend to be borne out by further examination.  Most words containing these sequences are unique, but an exception is daiidy (6).

In connection with this last case, it’s worth observing that a loopback-flourish is hardly ever added to a minim.  It seems to favor curvelets (with d overwhelmingly predominating over j) to an even greater degree than a foot-flourish favors minims (with n overwhelmingly predominating over b).  Thus, some extenuating factor might be blocking the form daiijy, even if this would otherwise be expected to arise.  It’s also worth noting that we can find similar words to daiidy with a flourish added to the last bare minim: the word dairdy exists as a label (fRos), while daindl (f104v), daildain (f114r) and opdaild (f114r) are found in paragraphic text.  Similarly, the only thing separating oiees from olees, or daiiy from dairy, is a word-internal flourish.  Once again, I sense a possible connection here with danglers, as though something occasionally (albeit rarely) allows a flourish to be dropped from the end of a turn.

Every now and again, d is followed immediately by a sequence of bare minims, although there are only a handful of unambiguous cases: otaldiin (f55r), lklor.diiin,olkain (f105v), and then if we take word breaks with a grain of salt, qocKhey.d,iirain (f108r) and checKhed.iir (f111r).  In other reported instances, the first stroke after the d is slightly curved, hinting at the possibility of a poorly formed and disconnected a.

Since dain and daiin are so common, I’d be inclined to want to read these last cases that way were it not for the few truly unambiguous cases of di.

But there seems to be no getting around the existence of di, which implies to me that d, like a/o, satisfies some necessary condition for a bare minim sequence to follow.  One explanation might be that is an ornamented form not of e, but of a/o.  Among other points in favor of this view, shows a similar range of graphical variation to a/o; see J. K. Petersen on its “straight” and “rounded” variants.  Based on shape, one plausible guess would be that it combines s with an anticurvelet (or o with a head-flourish, which amounts to the same thing).  Imagine drawing and then adding an anticurvelet without lifting the pen and you’ll get the idea.  Following the thread further, we could posit that j similarly combines r with an anticurvelet, and that the relationship of d and j to o is analogous to the relationship of and g to y.

The last point would also hint that serves some parallel function to o/a, with which—as I’ve already noted—it can invite confusion.  The fact that o/a, y, and its minimar counterpart are the only glyphs that can appear before a word-final gallows that isn’t a gallows-dangler likewise hints at some common property that tail-flourishes and semiflourishes share.  I’ve been referring to the closing stroke of and as a semiflourish because it seems to do double duty as a minim when it’s followed by other minims, and otherwise to have a special (if somewhat mysterious) relationship with any hatchmarks that follow it.  The form is found mostly in cases where minims follow, while the form appears to be more neutral.  Perhaps the tail-flourish that closes y (and m and g) is an equivalent flourish that instead emphasizes its status as a flourish—that is, as a terminal flourish that should not be associated with whatever follows for purposes of parsing.  This would seem consistent with the well-known preference of and for ending lines, the fact that so many paragraphs end in y, and the prevalence of y (and its complementary minimar l) in segments that seem to behave as autonomous “prefixes.”

One final rare situation I’d like to mention here involves bare minims appearing after ok, olkot, etc.  These aren’t problematic in terms of their hatchmark order if we analyze gallows as inserted secondarily into existing skeletons, e.g., oti as t into oiolki (olkiir, polkiin) as into oli (skeleton oii), but doesn’t otherwise appear before bare minims, so that part is unexpected.

So how might the kind of approach I’ve been exploring (not yet very far, I’ll admit) fit in among other efforts to construct Voynichese grammars or word morphologies?  Is it compatible with other models that show promise?  Or would it not play well with anything else?

The prior analysis that seems most closely to resemble mine was put forward by Michael Winkelmann back in 2005 under the title “Die Harmonie der Glyphenfolgen” (which he renders in English as “The Harmonic Structure of the Glyph Sequences”).  He divides glyphs into an “i-class” (i, n, r, l, m), an “e-class” (e, c, h, o, d, s, g, b), “gallows,” two “exceptional glyphs” (q, a), and “ligatures” (ch, Sh, {ih}, cTh, etc.), and on that basis proposes the following seven “laws of harmony”:

  1. An e-class glyph can be followed by another e-class glyph, a ch ligature, a gallows, or (as a form of i).
  2. An i-class glyph can be followed by another i-class glyph or an {ih} ligature.
  3. An i-class glyph other than or l ordinarily ends a word.
  4. Words never end with a bare e or i, but “always with a more complex character from these glyph classes [immer mit einem komplexeren Zeichen aus diesen Glyphenklassen].”
  5. The “joker” character can behave as either an i-class glyph or e-class glyph.
  6. If an i-class glyph is followed by an {ih} ligature, the word can continue with e-class glyphs.
  7. Gallows “count” as e-class glyphs and are almost always either surrounded by e-class glyphs or embedded in an {ih} ligature.

I only discovered Winkelmann’s work after I’d already written up my own “word paradigm,” but I cheerfully acknowledge that he’d beaten me to some of these conclusions.  His rule #4 is equivalent in effect to my rule of the terminal flourish, although it’s differently theorized.  Several different anomalies seem to be subsumed in his rules #5 and #6, but these pertain mainly to situations I’d instead interpret as a “turn” ending and another “turn” beginning.  Winkelmann reports that 92% of words follow his harmonic rules.  He writes that he’d also considered an eighth rule to accommodate the common oi, which he otherwise interprets as an exception due to scribal or transcription error; I’ve taken things in a different direction, of course, with my proposed equivalence of a and o, although I’m still uneasy about it.  Word-internal r strikes me as a fairly common exception to Winkelmann’s rule #3.

Another analysis in a similar spirit has been put forward by Brian Cham and David Jackson, who lay out the following rules in their “Introduction to the Curve-Line System” (2014):

  • Within a word, “curve-glyphs” (e, o, ch, Sh, cKh, cTh, cFh, cPh, d, g, y, s, b) and “line-glyphs” (i, r, j, m, n) can’t appear next to each other, except as further specified.
  • Curve-glyphs can be followed by line-glyphs only with a transitional a in between.
  • Any glyphs that aren’t curve-glyphs or line-glyphs are “invisible” or “transparent” to the aforementioned rules.
  • The bigrams ar, or, al, and ol behave as curve-glyphs.
  • Word-initial and behave as “invisible” characters.

Cham and Jackson report that 96.63% of the words they studied conform to these rules, but their published analysis stops there.  They don’t propose any further specifics for how words are put together within the constraints they propose, such that the curve-line system—like Winkelmann’s laws of harmony—doesn’t really aspire to be a full-fledged word morphology, despite the boldness of their claims: “It is not just pattern of the text, it is the pattern,” they write.  That said, their classification of glyphs generally matches mine, except that I’ve put in the same category with a.  It’s interesting to see that they concluded that ar, or, al, and ol should be treated as curve-glyphs, meaning that they appear next to other curve-glyphs.  In my own model, these are the most common ways in which a turn containing minims can end and “reset” the hatchmark sequence to curvelets.  Their treatment of word-initial and as “invisible” is also consistent with my own observation that one or more minims with flourishes can appear at the beginning of a word and can be followed by curvelets.  And their whole notion of “invisibility” meshes well with my treatment of gallows as inserted into existing word structures.

When it comes to more fully developed morphologies, Jorge Stolfi has proposed several, including one based on a planetary structure metaphor, such that glyphs belonging to given “densities” are usually strung together in the order in which someone would encounter them while passing through a planet from one side to the other.

  • Core: t, p, k, f, cTh, cPh, cKh, cFh
  • Mantle: ch, sh, ee
  • Crust: d, l, r, s, n, x, i, m, g, q

In building further on this model, Stolfi treats the glyphs a, o, and y as modifiers of the glyphs to their right, or as separate glyphs when they appear at the ends of words, and he groups (as distinct from ee) with whatever glyph is to its left, although he’s open about these choices being somewhat arbitrary, intended mainly to enable him to parse words for statistical analysis.  The word paradigm or grammar he came up with takes some careful study to grasp as originally presented, but here’s my attempt to summarize it (with apologies for anything I may have misrepresented or oversimplified)—

  [Figure: attempt at an even more simplified graphic representation of the Stolfi word paradigm, with elements grouped for maximum overlap between alternative paths.]

  • Either crust only: {Q} + {1-5 × OR} + {{O}+Final}

  • Or a combination of crust with mantle and/or core:
    1. Crust prefix: {Q} + {1-2 × OR}
    2. Either:
      • A mantle only: {OE} + {MtS}
      • Or a core, or combination of mantle and core:
        • Mantle prefix: {OCH} + {OEE} + {OE}
        • Core: {Y} + [t, p, k, f, cTh, cPh, cKh, cFh] + {OE}
        • Mantle suffix: {MtS}
    3. Crust suffix: {1-4 × OR} + {{OR}+Final}

—where curly brackets indicate optional units, square brackets indicate a choice among elements, and parentheses merely indicate grouping; and where the capitalized elements are defined as follows:

  • Final = [Y, A+m, A+IN]
  • OR = {1-2 × O} + R
  • IN = (1-3 × i) + N
  • Q = {Y} + q
  • OCH = {Y} + ch
  • OE = {o} + e
  • OEE = {o} + ee
  • A = [a, o]
  • N = [n, r, l, m, s]
  • O = [o, a, y]
  • R = [d, l, r, s, n, x]
  • Y = [y, o]
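As a sanity check on the notation, the crust-only branch can be transcribed into a regular expression.  The Python sketch below follows my summary above rather than Stolfi’s original grammar; the variable and function names are hypothetical, optional repetition counts such as {1-5 × OR} are read as “zero to five,” and the mantle and core branches are left out because MtS isn’t spelled out in the definitions listed here.

```python
import re

# Character classes, copied from the definitions above.
Y = "[yo]"
O = "[oay]"
R = "[dlrsnx]"
A = "[ao]"
N = "[nrlms]"

# Composite elements.
Q = f"(?:{Y}?q)"                   # Q  = {Y} + q
OR = f"(?:{O}{{0,2}}{R})"          # OR = {1-2 x O} + R
IN = f"(?:i{{1,3}}{N})"            # IN = (1-3 x i) + N
FINAL = f"(?:{Y}|{A}m|{A}{IN})"    # Final = [Y, A+m, A+IN]

# Crust only: {Q} + {1-5 x OR} + {{O}+Final}
CRUST_ONLY = re.compile(f"{Q}?{OR}{{0,5}}(?:{O}?{FINAL})?")


def is_crust_only(word):
    """True if a non-empty word fits the crust-only branch as summarized."""
    return bool(word) and CRUST_ONLY.fullmatch(word) is not None


for w in ["daiin", "dar", "qol", "chedy", "daiiiin"]:
    print(w, is_crust_only(w))
```

Words containing ch, e, or gallows fail this branch by design, since they would need the mantle or core paths.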

Stolfi also provides frequency statistics for each of the alternatives within his paradigm as a measure of relative “normalness,” as well as classifying all words that don’t conform to expectations into categories:

  • Grove Words: words that are “normal” except for a gallows added at the beginning, reflecting John Grove’s observation that many paragraph-initial words can be analyzed this way.
  • Multiple: Words made up of two or more “normal” words concatenated together.
  • Weird: including iir, dairin, diin, reversions from minims to curvelets such as aiy, and the various types of word I call “danglers,” e.g., okai, aiip, dait.

The crust and mantle parts of Stolfi’s paradigm correspond to my hatchmark skeletons and flourishes, despite the difference in logic, and my general impression (without going into the weeds) is that they’re complementary rather than contradictory.  On the other hand, the complexity of the core-and-mantle part of Stolfi’s paradigm strikes me as due largely to its flowchart-like logic of glyph-by-glyph string concatenation, which forces it to handle the existence of (say) kchey, cKhey, chkey, and cheky by stating that k can be preceded by ch or che and/or followed by chey, ey, or y, and that cKh can be followed by ey, rather than by stating instead that k can be attached to chey in four different positions (although, to be fair, Stolfi does separately analyze patterns in the latter way; they just don’t feed back into the paradigm itself).  The words that I wrestled with most are the same ones Stolfi also classifies as “weird.”  Other flowchart-like word generators have been proposed as well (by e.g., Elmar Vogt; Mike Roe; and Sean Palmer and Philip Neal, both as summarized by Nick Pelling, who also provides one of his own), but I haven’t looked at them as closely.  I probably should.

Some other important work in a similar spirit—but with important differences—has been presented by Emma May Smith on her “Agnostic Voynich” blog over the course of several years past.  In what follows, I’ll try to summarize some of her major conclusions, with apologies once again for any misrepresentation or oversimplification.  The thread begins with two hypotheses supported by a premise that common words extended by common “suffixes” ought to produce comparably common words.

  • “The Equivalence of [a] and [y]” (2015): The characters a and y are equivalent but constrained by context.  At the end of words we find it as y, in the middle of words as a, and at the beginning of words as y unless it’s followed by i, l, m, r, or n, in which case it takes the form a, likely “conditioned by the presence of an [i] stroke in the character directly following.”  Thus, qoky and qokaiin are common, but not qok, because qokaiin would be not qok+aiin, but qoky+iin.
  • “The Existence of [y] deletion” (2015): Words that end in ey, if extended by an addition that wouldn’t cause the y to take the form a, omit the y.  For example, chey+r = chear, chey+l = cheal, but chey+dy = chedy, chey+ty = chety.
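The two hypotheses amount to a small rewrite rule for joining a word to an extension, which can be sketched in Python.  This is my own paraphrase rather than anything Smith published as code: the function name extend is hypothetical, and stems ending in y but not ey are joined unchanged before non-i-stroke glyphs because the summary above doesn’t say what happens to them.

```python
I_STROKE_GLYPHS = set("ilmrn")


def extend(stem, suffix):
    """Join stem and suffix under the a/y equivalence and y deletion."""
    if stem.endswith("y") and suffix:
        if suffix[0] in I_STROKE_GLYPHS:
            # y takes the form a before a glyph containing an i stroke.
            return stem[:-1] + "a" + suffix
        if stem.endswith("ey"):
            # y is deleted after e when it cannot become a.
            return stem[:-1] + suffix
    return stem + suffix


for stem, suffix in [("chey", "r"), ("chey", "l"), ("chey", "dy"),
                     ("chey", "ty"), ("qoky", "iin")]:
    print(stem, "+", suffix, "=", extend(stem, suffix))
```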

These two hypotheses provide a thoughtful and appealing explanation for patterns of similarity among common words.  I believe my model can accommodate those patterns too, albeit somewhat differently, as reflecting various ways of completing words that would otherwise end in a bare curvelet (e).  Thus, chee could terminate with a tail-flourish as chey, or with a loopdown-flourish as cheg, or it could be set up for potential extension as chea(i….) or cheo(i….), or it could receive a different type of flourish as ches or ched and optionally continue from there, and a gallows might be inserted besides (chey maybe yielding cheky; chea- maybe yielding cheka; etc.).  This approach has the advantage, such as it is, of not needing y to be deleted or changed into a.  A final curvelet (e) would be consistently present in each case, but with other graphemes attached to it.

Next on Smith’s blog comes a model for “Low Level Word Structure” (2015), aiming to describe typical or common words:

  • The glyphs y/a and o are “primes,” and every “syllable” includes one of them by definition.  A word can be decomposed into one or more “sections” a.k.a. “bodies,” each consisting of a prime (which might be a deleted y) plus any preceding non-prime glyphs; and a “tail” consisting of any glyphs after the final prime.  A prime is the only character type that may be preceded by an e sequence; if an e sequence appears to be followed by anything else, it’s to be understood as implicitly followed by a deleted y.  The prime (plus any preceding e sequence) may also be preceded by ch or Sh.  The section as described so far can be extended further left by:
    • k, t, d, s, l, r before ch/Sh and/or an e sequence, or immediately before a prime
    • p, f before ch/Sh with or without an e sequence following, or immediately before a prime
    • cKh, cPh, cTh, cFh before an e sequence only (but not ch/Sh), or immediately before a prime
    • Another ch or Sh can then precede cKh, cPh, cTh, cFh; d and sometimes s; and rarely k, t, p, f (usually when these are not also followed by ch).
    • can substitute for an sequence after ch, Sh, cKh, cPh, cTh, cFh if the prime is y.
    • can appear at the start of a section before k, t, or d.
    • may appear before in the first section of a word.
    • The form of tail is constrained by the final prime:
      • is dropped when followed by or s; and it changes to when followed by l, r, m, n.
      • is followed by d, s, l, r, m.
      • An sequence can appear immediately before l, r, m, n (with second thoughts about l).
      • Occasionally there are two closing glyphs, mainly ls.

Smith turns to “primes” much as I originally turned to flourishes, i.e., as delimiters for breaking words into smaller units for analysis, a process she calls “syllabification.”  The majority of single-glyph words, including s, r, and l—which are central to my own model—don’t contain primes and so can’t be “syllabified,” but Smith observes that these “make up only a small number of the total words” and is content to set them aside as exceptions.  Neither Smith’s model nor mine should allow words ending in e: mine because they don’t end with flourishes (making them “danglers”), hers because e can appear only before a prime, and a following y would be deleted only if followed by another glyph, which obviously doesn’t apply at the end of a word.  Smith’s rules for the placement of gallows relative to ch, Sh, and e sequences are complicated in the same way as Stolfi’s, but some of her other observations about the length of e sequences (e.g., that e sequences tend to be shorter after ch, Sh, cTh) seem consistent with the possibility of analyzing underlying sequences separately from where gallows appear within them.  Smith’s word structure is mostly compatible with a curvelets-before-minims sequence insofar as bare minims usually follow one of Smith’s “primes” (o or a), but are rarely followed by one, which forces them into the final “tail” segment.  Meanwhile, l and r are the only minimars Smith associates with bodies rather than tails, and she provides some special rules for and d, flagging situations as tricky that have called for special treatment in my model as well.

Next comes a piece on “High Level Word Structure” (2015), proposing:

  • A word may consist of a single “Free” section (followed optionally by a tail) but may also have a “Fore” section and/or an “After” section.  After sections take only the prime y, with dy being most common (others are ly, ry, ldy, possibly chy).  They tend not to be followed by a tail, but occasionally they are, as e.g. with words ending with the After section dar (=dy+r).  Fore sections can take either y or o as a prime.  They may be preceded by an e sequence and/or ch/Sh, but this is common only with words that don’t also have After sections.  By the time of this article, Smith is positing y deletion after ch/Sh as well as e.

The closest equivalent I have to this is my rule of curvelet priority, but it’s getting more difficult here to translate between the specifics of models based on different delimiters—primes for Smith, flourishes for me.  Smith’s point that After sections don’t usually appear together with Fore sections containing e sequences or ch/Sh would seem consistent with curvelet sequences having a finite overall length, independent from how they’re elaborated.  Unless I’m mistaken, there are words that an analysis on Smith’s terms would show to contain After sections with prime o, such as kararo (ky+ry+ro, f106v), but it’s true that none of them are common.

In another post from 2016, Smith introduces the “Weak String”—originally dubbed the “semi-circle string,” to judge from the URL—consisting of permutations of ch, Sh, e, ee, o, and y, in which or y appears furthest left (can be a, or deleted); then an optional or ee to the right; and then an optional ch or Sh to the right of all that.  The possibilities are thus o, y, eo, ey, eeo, eey, cho, chy, cheo, chey, cheeo, cheey, Sho, Shy, Sheo, Shey, Sheeo, Sheey.  In terms of my own model, these all appear consistent with the first pair of curvelets in a sequence being optionally linked as ch.  However, Smith would parse ech as ey+ch, and eech as eey+ch, such that the apparent inability of to precede ch is a necessary consequence of the “y deletion” hypothesis.  By the same token, nothing can follow after in Smith’s model either, with e.g. oeey being parsed as o+eey.  There’s certainly a pattern here in need of explanation, and Smith’s analysis seems to handle it comfortably, although I’m still unsure whether it brings meaningful order to the pattern or only provides a convenient rationale for reducing its complexity.
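The eighteen possibilities listed above are just the combinations of three independent choices, which a few lines of Python can enumerate.  The function name weak_strings is hypothetical, and the sketch covers only the surface forms named here, not the a-variant or deleted-y cases.

```python
from itertools import product


def weak_strings():
    """Enumerate the Weak String forms listed in the text."""
    linked = ["", "ch", "Sh"]   # optional ch or Sh
    e_runs = ["", "e", "ee"]    # optional e or ee
    primes = ["o", "y"]         # obligatory o or y
    return [c + e + p for c, e, p in product(linked, e_runs, primes)]


forms = weak_strings()
print(len(forms))
print(", ".join(forms))
```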

Smith’s paradigm accounts more neatly than mine for some situations, and mine (I think) accounts more neatly than hers for others.  We each started by choosing something in the script as a central organizing principle: the primes o, a, y in her case, flourishes and hatchmarks in mine.  We each then had to introduce some more abstract processes along the way, such as deletion in her case or the behavior of the “semiflourish” in mine, to bring order to the patterns we were seeing.  Together with Stolfi, we tend (mostly) to agree on which words stand out as anomalous, even if our reasons for identifying them as anomalous are different.

I don’t want to put my modest “hatchmark skeleton” hypothesis on a par with the models developed by Smith or Stolfi, which are far more thoroughly and methodically developed.  But neither do I find that analyzing Voynichese in terms of hatchmarks and flourishes is an obvious dead end, or that developing a model built upon them demands a greater degree of intellectual contortion than other models.  Patterns emerge that seem no more or less significant than those exposed by other models, and that have only about the same number of exceptions.  I don’t see the proposition that curvelets and minims are structurally significant elements as incompatible in practice with “serious” word paradigms, even if I’m not quite there yet.

Granted, I spent some time earlier mustering evidence that curvelets and minims are sometimes graphically indistinct from one another.  But my model doesn’t necessarily require that curvelets and minims need to be clearly differentiated.  It only stipulates that they make up a backbone for other elements of the script to flesh out, and that they usually follow a particular pattern (my rule of curvelet priority).  If we sometimes see a curvelet where we’d expect a minim, or something that isn’t clearly one or the other, we should certainly take note, but we can also quarantine the anomaly into one layer (the “skeleton” layer), rather than letting the difference between cheoiees (f24r) and cheoeees (f6r) complicate every part of our analysis at once.  For better or worse, other models don’t share this flexibility.

More recently, Smith has introduced “A New Word Structure” (2019), which is supposed to supersede much of her earlier models, and which can reportedly accommodate the 350 most common words in the Voynich Manuscript:

  • A “tail” is now called a “coda,” but the main new development is “Body Rank Order.”  Bodies can be classified into three ranks, numbered 1-3, such that when two or three bodies are combined into a single word, each successive body must fall into a higher-numbered rank or (less often) stay the same.  The most common bodies are ranked as follows (remember that in Smith’s model y will become a or be deleted under certain conditions):
    • Rank 1: qo, o, y, so, cho, chy, Sho, Shy, cheo, chey, Sheo, Shey, cheeo, cheey, Sheey
    • Rank 2: do, lo, ro, sy, eey, ko, to, po, ky, ty, py, keo, teo, key, tey, kcho, tcho, pcho, kchy, tchy, pchy, kshy, keeo, teeo, keey, teey, keeey, kchey, kShey, tcheo, tchey, pchey, cTho, cThy, cKhy, cThey, cKhey, lky, lchy, lkeey, lchey, lShey
    • Rank 3: dy, ldy, ly, ry
    • Exceptions include chy and (1st category) at the end of a word; lo (2nd category) found out of order; or a coda found in the middle of a word.  In a follow-up piece (2019), Smith also expresses discomfort with the “stay the same” provision, which is needed mainly to accommodate words starting ych, ySh, which are predominantly line-initial.  She suggests that the addition of y to existing words may be a “transformation” tied to some context that arises at line starts and in other limited conditions, in line with her broader Transformation Theory, which proposes that words routinely undergo contextual transformations that disguise them and that the discovery of these rules of transformation is a crucial task.

This new model identifies a restricted set of specific “bodies,” each consisting of a prime preceded by non-primes, into which many words can be broken.  These are exactly the kinds of building blocks many researchers have been hoping to identify, whether as the syllables of a linguistic plaintext, as Smith herself would prefer; as the stuff of a verbose cipher, as other authorities such as Nick Pelling suppose (although Pelling favors other groupings, such as or and ot); or as the meaningless building-blocks of a hoax à la Gordon Rugg.  My model doesn’t have anything quite like this.  But then again, I’m not convinced that all Voynichese actually invites this kind of “chunking.”

§ 4

Word Breaks, Line Breaks, Paragraph Breaks, Labels

An intriguing article by Emma May Smith and Marco Ponzi entitled “Glyph combinations across word breaks in the Voynich Manuscript” appeared in Cryptologia 43:6 (2019): 466-485.  It’s a study of the glyph pairs straddling word breaks—but not line breaks—in practice corresponding to glyphs surrounding dots in the Takahashi transcription.  So, for example, given the two adjacent words otchdy.qodor, the object of interest would be the “word-break combination” y_q.  Smith and Ponzi find that counts of actual word-break combinations differ significantly from the counts that would be expected between randomly shuffled words based on the frequencies of their beginning and ending glyphs.  Some combinations are far more frequent than expected, including r_a (308 expected vs. 940 actual), s_a (65 vs. 300), o_r (15 vs. 77) and o_l (43 vs. 144), while others are far less frequent, such as n_k (172 vs. 30) and r_k (168 vs. 51).  Smith and Ponzi further identify two groups of glyphs, Strong (k, t, p, d, s, r, q, n, plus possibly cKh, cTh) and Weak (y, o, a, Sh, ch), in which opposites attract, such that Strong-Weak and Weak-Strong combinations are more common than Strong-Strong or Weak-Weak ones, with the Strong glyphs rejecting each other more strongly and the Weak glyphs rejecting each other more weakly.  (This is the basis on which Smith calls the “Weak String” the “Weak String.”)  They report further that the glyph is Strong at the beginning of words and Weak at the end of words.
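
For readers who want to see how an “expected versus actual” comparison of this kind can be set up, here is a minimal Python sketch.  It is not Smith and Ponzi’s code, and the function name break_combinations is hypothetical.  It assumes each line is a string of words separated by dots, and it treats every character as one glyph (so ch and Sh are not merged, a simplification).  The expected count for a combination is the number of word-final tokens of the first glyph times the number of word-initial tokens of the second, divided by the total number of word breaks.

```python
from collections import Counter

def break_combinations(lines):
    """Compare actual word-break combinations with shuffled-word expectations.

    lines: strings of words separated by '.' (line breaks are not counted).
    Returns (actual, expected), both keyed by (ending glyph, starting glyph).
    """
    ends, starts, actual = Counter(), Counter(), Counter()
    for line in lines:
        words = [w for w in line.split(".") if w]
        for left, right in zip(words, words[1:]):
            ends[left[-1]] += 1
            starts[right[0]] += 1
            actual[(left[-1], right[0])] += 1
    total = sum(actual.values())
    # Expectation if endings and beginnings were paired at random.
    expected = {
        (e, s): ends[e] * starts[s] / total
        for e in ends
        for s in starts
    }
    return actual, expected

# Toy input only; real use would feed in a full transcription.
actual, expected = break_combinations(["otchdy.qodor.ar", "dar.aiin.qoky"])
print(actual[("r", "a")], expected[("r", "a")])
```

A ratio of actual to expected well above 1 marks a combination that is favored across word breaks, and a ratio well below 1 marks one that is avoided.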

What we make of these findings will hinge on how we understand word breaks themselves, so I’d like to revisit that issue here, starting with the following question: Do spaces ever convey any information that could be required for interpreting the text, or are they wholly predictable?

A brief digression on computational methods is in order.  I’ve done most of the statistical analysis I’ll be presenting here using custom Python scripts I’ve written with specific questions in mind.  Typically I’ll import the Zandbergen transcription, and usually just the paragraphic text, resolving all glyphs flagged as ambiguous with whatever the first alternative is that isn’t ?—thus, [a:o] gets resolved as a, but [?:ch] gets resolved as ch.  I also replace curly brackets {} with capitalization for representing linked glyphs, substitute something else for Sh to avoid confusion with s, and make other minor practical tweaks.  It’s always important to bear in mind that the results of computational analysis will be more precise than accurate due to vagaries of transcription.  If I learn from a query that there are 1,430 occurrences of the glyph sequence daiin, I can be confident this is a common word, but I shouldn’t count on there being exactly 1,430 occurrences of it.  Computational attacks on a transcription can be relied upon for identifying common features, but not for distinguishing rare phenomena from phenomena that never, ever occur.  When questions of the latter sort are the focus of attention, I’ll generally search the transcription for any reported cases of an unusual or unexpected glyph combination (such as iy) and then check each one individually in facsimile.  But when I’m doing large-scale statistical analysis, I generally won’t.  So there’s always a margin of error to be considered.
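
The cleanup steps just described might look something like the following sketch.  It is an illustration and not the script I actually run: the function name resolve is hypothetical, it assumes alternatives are written as [x:y] and linked glyphs as {..}, and it simply upper-cases whatever sits inside curly brackets, which is cruder than the capitalization convention used elsewhere in this post.

```python
import re

def resolve(text):
    """Resolve [x:y] alternatives and {..} linked glyphs in a transcription string."""

    def pick(match):
        # Take the first alternative that is not "?".
        for option in match.group(1).split(":"):
            if option != "?":
                return option
        return "?"  # every alternative was "?", so keep the uncertainty

    text = re.sub(r"\[([^\]]*)\]", pick, text)
    # Assumed convention: the contents of curly brackets become capitals.
    return re.sub(r"\{([^}]*)\}", lambda m: m.group(1).upper(), text)

print(resolve("d[a:o]iin"))   # daiin
print(resolve("[?:ch]ol"))    # chol
```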

Likelihood of a space (.) between adjacent glyph pairs, with rows representing first glyphs, columns representing second glyphs, and a scale running from +1 (always a word break) to -1 (never a word break).

Now back to the question of word breaks.  392 two-element permutations of the set {p, f, t, k, d, j, m, n, g, s, r, y, l, o, a, e, i, ch, Sh, cPh, cFh, cTh, cKh, q, x} can be found occurring in paragraphic text—not counting line breaks or breaks caused by intruding illustrations—comprising 149,999 total adjacencies of which 119,968 are word-internal; 27,578 are at word breaks; and 2,453 are ambiguous (shown with a comma).  For the moment, let’s disregard the comma breaks, leaving us with a total of 147,546 adjacencies with reportedly clear status.  If we do so, 44,296 adjacencies (30%) are between glyphs that only appear together, while 1214 (0.8%) are between glyphs that only appear apart, so for those glyph combinations we could predict actual word breaks with perfect accuracy.  Another 65,525 adjacencies (44%) occur between glyphs that appear together at least 90% of the time, while another 11,534 (7.8%) occur between glyphs that appear apart at least 90% of the time.  Only 27,388 adjacencies (18.5%) appear between glyphs for which we can be less than 90% certain as to whether they will occur at a word break or not.  There are 3,014 cases of word-internal adjacencies for glyph pairs that are apart a majority of the time, and there are 4,154 cases of word breaks between glyph pairs that appear together a majority of the time—or 7,168 total cases that run counter to expectations, comprising just 4.8% of unambiguously classified adjacencies.  There are also eight instances of glyph pairs that split evenly between cases with breaks and cases without breaks.  With a little further abstraction, the following rules for inserting spaces will still correctly predict the status of all but 7,196 (4.9%) actual adjacencies as reported in the Zandbergen transcription (excluding cases reported as ambiguous):

  • Before q
  • After g, m, n
  • After except before y, i
  • After before ch (with or without inserted gallows), Sh, d, l 
  • After except before t, k
  • After before r, o, d, ch (with or without inserted gallows), Sh
  • Between repetitions of o, s, l

If we were to remove all reported word breaks (.) from paragraphic text in the Zandbergen transcription, these rules would allow us to reconstruct them with slightly over 95% accuracy.  Even if we disregard adjacencies to and i as too “easy,” this figure drops only to 93.2%.
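
To make the bookkeeping behind these figures concrete, here is a rough Python sketch of how adjacencies can be tallied and how well a simple “go with the majority” rule per glyph pair would predict them.  The names glyphs, adjacency_counts, and majority_rule_accuracy are hypothetical, the list of multi-character glyphs is an assumed inventory, and the majority rule is a stand-in for, not a copy of, the hand-abstracted rules listed above.

```python
import re
from collections import defaultdict

# Assumed inventory of multi-character glyphs, longest first for greedy matching.
MULTI = ["cPh", "cFh", "cTh", "cKh", "ch", "Sh"]
TOGETHER, APART, AMBIGUOUS = 0, 1, 2

def glyphs(word):
    """Split a word into glyphs, matching multi-character glyphs greedily."""
    out, i = [], 0
    while i < len(word):
        for m in MULTI:
            if word.startswith(m, i):
                out.append(m)
                i += len(m)
                break
        else:
            out.append(word[i])
            i += 1
    return out

def adjacency_counts(lines):
    """Map (first, second) -> [together, apart, ambiguous]; lines are not joined."""
    counts = defaultdict(lambda: [0, 0, 0])
    for line in lines:
        prev, status = None, TOGETHER
        for piece in re.split(r"([.,])", line):
            if piece == ".":
                status = APART
            elif piece == ",":
                status = AMBIGUOUS
            else:
                for g in glyphs(piece):
                    if prev is not None:
                        counts[(prev, g)][status] += 1
                    prev, status = g, TOGETHER
    return counts

def majority_rule_accuracy(counts):
    """Share of unambiguous adjacencies predicted by each pair's majority behavior."""
    right = total = 0
    for together, apart, _ambiguous in counts.values():
        right += max(together, apart)
        total += together + apart
    return right / total if total else 0.0

counts = adjacency_counts(["otchdy.qodor", "ol.chedy,qokain", "olchedy"])
print(counts[("l", "ch")], majority_rule_accuracy(counts))
```

Because the majority rule is fitted to the same text it is scored on, its accuracy is an upper bound for any rule set that looks only at the two glyphs on either side of a potential break.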

But even this may not fully reflect the lack of information being encoded by word spaces.  After all, the distinction between a definite word break (.) and an uncertain word break (,) is always a subjective judgment call on the part of the transcriber, and spaces in the manuscript itself can themselves be more or less pronounced, as I’ve pointed out.  Bearing that in mind, it’s likely no coincidence that the glyph pairs with the highest incidences of unexpected behavior also tend to have relatively high proportions of ambiguous word breaks represented in the Zandbergen transcription, i.e., what I call “comma breaks,” which I’ve excluded from my calculations so far, but which I’d now like to bring to the fore as an additional data point.  The following eighteen glyph pairings—which represent all those for which there are at least 100 cases of unexpected spacing—account for 71.2% of total unexpected spacing behavior and 59.8% of total ambiguous spacing, but only 15.7% of all adjacencies.

  • l_ch: 948 apart, 710 together, 132 (7.4%) ambiguous
  • r_a: 631 apart, 573 together, 248 (17%) ambiguous
  • l_o: 847 apart, 499 together, 72 (5%) ambiguous
  • y_k: 635 together, 431 apart, 96 (8.3%) ambiguous
  • l_d: 520 apart, 415 together, 82 (8%) ambiguous
  • r_o: 1180 apart, 324 together, 74 (4.7%) ambiguous
  • l_Sh: 509 apart, 293 together, 95 (10.6%) ambiguous
  • y_t: 530 together, 290 apart, 67 (7.6%) ambiguous
  • y_ch: 1500 apart, 229 together, 52 (2.9%) ambiguous
  • s_o: 336 together, 201 apart, 29 (5.1%) ambiguous
  • l_k: 1019 together, 194 apart, 147 (10.8%) ambiguous
  • s_a: 515 together, 184 apart, 97 (12.2%) ambiguous
  • r_y: 206 together, 180 apart, 34 (8.1%) ambiguous
  • y_d: 1101 apart, 152 together, 59 (4.5%) ambiguous
  • r_ch: 932 apart, 119 together, 53 (4.8%) ambiguous
  • l_s: 131 together, 108 apart, 25 (9.5%) ambiguous
  • o_l: 5158 together, 107 apart, 89 (1.7%) ambiguous
  • l_y: 384 together, 101 apart, 18 (3.5%) ambiguous

Just a few glyphs monopolize the initial position in this list (l, r, y, o, s), while other glyphs appear repeatedly in second position.  However, specific glyph pairs also display tendencies distinct to themselves.  For instance, we can contrast the percentages of adjacencies with comma breaks for all glyph pairs ending in _t (2.3%), _o (1.1%), and _k (3.6%), with those for the specific pairs y_t (7.6%), y_o (0.6%), and y_k (8.3%).  From this, we can see that after y, spacing before t and k becomes more ambiguous than usual, but spacing before o becomes less ambiguous than usual.  Meanwhile, y_t and y_k both appear in the above list as pairings associated with a large number of unexpected word breaks, but the treatment of y_o is far more consistent (2259 cases apart, 38 cases together).

The bottom line is that there appears to be a correlation between ambiguous spacing and inconsistent spacing between given glyph pairs.  The more often we encounter graphically uncertain spacing between two glyph types, the more likely we are also to find a balanced mix of cases in which the same two glyphs are written clearly together and clearly apart.
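
One way to put a number on that impression is to correlate, across glyph pairs, the share of comma breaks with how evenly the clear cases split between together and apart.  The sketch below is only an illustration under my own assumptions: the function names are hypothetical, “inconsistency” is defined as the minority share of clear cases, and the four sample pairs are taken from the list above, which was itself selected for unexpected behavior, so the result demonstrates the method without testing the claim.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

def ambiguity_vs_inconsistency(counts, min_total=10):
    """counts maps a glyph pair to (together, apart, ambiguous)."""
    ambiguity, inconsistency = [], []
    for together, apart, ambiguous in counts.values():
        clear = together + apart
        if clear == 0 or clear + ambiguous < min_total:
            continue  # too little data to say anything about this pair
        ambiguity.append(ambiguous / (clear + ambiguous))
        inconsistency.append(min(together, apart) / clear)
    return pearson(ambiguity, inconsistency)

sample = {
    "r_a": (573, 631, 248),
    "o_l": (5158, 107, 89),
    "y_ch": (229, 1500, 52),
    "l_ch": (710, 948, 132),
}
print(ambiguity_vs_inconsistency(sample))
```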

From the foregoing, I draw several conclusions:

  1. Word breaks are mostly predictable based on simple rules of glyph adjacency.  Below is a comparison of actual reported word breaks (top) to predicted word breaks (bottom) for one more or less randomly chosen paragraph from f80r.  My rules predicted six word breaks not actually found in the manuscript and put one word break in a different place.
  2. Word breaks vary in their degree of certainty.  I don’t just mean that they vary in terms of how easy it is for transcribers to identify them.  I mean that the writing system itself appears to be more ambivalent about putting breaks between some glyph pairs than others.  The glyph pairings which my rules predicted wrongly in the above example are l_ch, l_k, l_o, l_Sh, r_ch, and n_o.  All but the last appear in my list of the eighteen glyph pairings responsible for the most cases of “unexpected” treatment.  We might hypothesize that there are contexts in which these glyphs should be written together and others in which they should be separated, such that the issue is unpredictability.  But these same glyph combinations are also among those which the Zandbergen transcription flags most frequently with a comma as ambiguously spaced.  And there are other passages in which identical glyph sequences are reportedly spaced just as my rules predicted: r.chedy (f81v), dain.ol (f75r), okal.ol (f84v), sol.chey (f80r).  I infer that the writer somehow found it more compelling to leave spaces between some pairings of glyphs than others, and that ambivalence manifested itself not only through inconsistency in the total presence or absence of spaces, but also through indecisive partial spacing.

    Percentage of ambiguous spaces (comma breaks) shown in the Zandbergen transcription for adjacent glyph pairs with rows representing first glyphs, columns representing second glyphs, and all pairings with fewer than ten reported occurrences excluded.

  3. I believe a reasonable hypothesis about the distribution of word breaks in paragraphic text would be that the data is continuous in principle but has been subjected to a set of rules in which certain combinations of glyphs demand spacing, other combinations resist it, and yet other combinations fall into a liminal category.  As Zandbergen observes, “It is as if this were similar to the rules in the Arabic script that certain characters cannot be connected to the next character.”  And yet there’s nothing obvious about the forms of the Voynich glyphs that would appear to be preventing linkages, as is the case with the Arabic script.  If certain glyph combinations require or resist spacing, then, it seems probable that spacing rules are based on distinctive roles of glyphs in the logical structure of the writing system, and that the writing system itself features a hierarchy of primary and secondary groupings among elements that may or may not correspond to chunks of underlying “content.”  It’s notable that many of the most ambiguous spacing contexts follow a head-flourish (r, s) or tail-flourish (l, y), with the minimars (r, l) tending towards greater ambiguity than the curveletars (s, y).  Could this statistic reflect—and so help to reveal—something about the functions of head-flourishes, tail-flourishes, minims, and curvelets within the writing system?
  4. If the above hypothesis is correct, then the “word-break combination” statistics studied by Smith and Ponzi should reflect the interplay of the rules for space insertion (including degree of ambiguity) with the frequencies of particular glyph adjacencies in the text irrespective of breaks.  Thus, if n and k are rarely adjacent in the text, then n_k will be infrequent irrespective of how many words might otherwise end in n or begin with k.  Meanwhile, if r and a are frequently adjacent in the text, then r_a will be frequent—to whatever degree this glyph combination calls for a space—irrespective of how few words might otherwise end in r or begin with a.  Taking all paragraphic text and disregarding word breaks, I count 1,454 tokens of ra and 33 of nk.  The probability of any randomly-chosen glyph being followed by a is 8.6%; and by k, 6.1%; but for any given token of r, the probability of it being followed by a is 23.6%, while for any given token of n, the probability of it being followed by k is only 0.6%.  I don’t see any compelling reason to treat such probabilities within words and across word breaks as separate phenomena.
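
The contrast drawn in the last point, between how often a glyph follows anything at all and how often it follows one particular glyph, can be computed along these lines.  This is a minimal sketch with a hypothetical function name, and it assumes the text has already been flattened into a single list of glyphs with word breaks removed.

```python
from collections import Counter

def follow_probabilities(glyph_seq):
    """Return two lookups: P(next = b) and P(next = b | current = a)."""
    pairs = Counter(zip(glyph_seq, glyph_seq[1:]))
    firsts = Counter(glyph_seq[:-1])   # glyphs that have a successor
    seconds = Counter(glyph_seq[1:])   # glyphs that have a predecessor
    total = len(glyph_seq) - 1

    def overall(b):
        return seconds[b] / total if total > 0 else 0.0

    def conditional(a, b):
        return pairs[(a, b)] / firsts[a] if firsts[a] else 0.0

    return overall, conditional

# Toy sequence only; the real input would be all paragraphic text.
overall, conditional = follow_probabilities(list("raranara"))
print(overall("a"), conditional("r", "a"), conditional("n", "k"))
```

If the conditional figure for a pair sits far from the overall figure, the pair’s frequency at word breaks will be skewed in the same direction whether or not the break itself carries any information.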

Although individual glyph adjacencies seem to be a good predictor of word breaks, it’s worth looking as well at the ambiguity and inconsistency associated with specific word pairs.  I count fifty-two specific word pairs that appear at least three times joined together and three times separated by a space (as always, in the Zandbergen transcription):

  • o_l (354 together, 5 apart, 5 ambiguous)
  • o_aiin (23 together, 3 apart, 3 ambiguous)
  • y_daiin (17 together, 3 apart)
  • r_aiin (53 together, 9 apart, 7 ambiguous)
  • r_ain (18 together, 6 apart, 4 ambiguous)
  • r_ol (16 together, 4 apart, 4 ambiguous)
  • r_al (14 together, 3 apart)
  • s_aiin (117 together, 25 apart, 23 ambiguous)
  • s_or (43 together, 9 apart, 1 ambiguous)
  • s_ar (54 together, 7 apart, 5 ambiguous)
  • s_chey (6 together, 3 apart)
  • s_chol (3 together, 5 apart, 1 ambiguous)
  • l_chedy (104 together, 5 apart, 1 ambiguous)
  • or_aiin (29 together, 26 apart, 23 ambiguous)
  • ar_aiin (7 together, 12 apart, 9 ambiguous)
  • or_ain (11 together, 4 apart, 7 ambiguous)
  • ar_ain (3 together, 5 apart, 2 ambiguous)
  • or_ar (6 together, 8 apart)
  • ar_ar (3 together, 6 apart, 1 ambiguous)
  • ar_am (6 together, 4 apart)
  • or_ol (7 together, 10 apart)
  • ar_ol (6 together, 7 apart)
  • or_al (8 together, 5 apart, 2 ambiguous)
  • ar_al (15 together, 10 apart, 2 ambiguous)
  • or_chey (3 together, 5 apart, 2 ambiguous)
  • ol_r (4 together, 3 apart)
  • ol_s (17 together, 6 apart)
  • al_s (3 together, 4 apart)
  • ol_ol (13 together, 4 apart)
  • al_ol (7 together, 3 apart)
  • ol_aiin (41 together, 5 apart, 6 ambiguous)
  • ol_oiin (3 together, 4 apart, 1 ambiguous)
  • ol_daiin (9 together, 8 apart, 3 ambiguous)
  • ol_kaiin (27 together, 3 apart, 6 ambiguous)
  • ol_chey (22 together, 6 apart, 2 ambiguous)
  • ol_chdy (6 together, 4 apart)
  • ol_cheol (7 together, 4 apart, 3 ambiguous)
  • ol_cheey (11 together, 9 apart, 2 ambiguous)
  • ol_chedy (32 together, 16 apart, 6 ambiguous)
  • ol_cheedy (3 together, 3 apart, 2 ambiguous)
  • qol_chedy (11 together, 12 apart, 1 ambiguous)
  • chor_aiin (3 together, 3 apart, 2 ambiguous)
  • chor_or (3 together, 3 apart)
  • chol_daiin (3 together, 24 apart, 4 ambiguous)
  • chol_or (5 together, 4 apart, 1 ambiguous)
  • dol_or (3 together, 3 apart)
  • dal_dy (17 together, 3 apart)
  • dar_y (17 together, 3 apart)
  • sol_chey (3 together, 3 apart, 1 ambiguous)
  • sar_aiin (3 together, 3 apart, 3 ambiguous)
  • sar_ol (3 together, 3 apart)
  • otar_ar (4 together, 7 apart)

I’ve spot-checked these findings against the Takahashi transcription, and although it doesn’t flag ambiguous word breaks as the Zandbergen transcription does, the together-versus-apart ratios appear to be comparable, so the points I’ve been raising shouldn’t be due to any idiosyncratic judgment about word breaks on Zandbergen’s part.  Only two of the word pairs listed above contain gallows, otar_ar and ol_kaiin, and these barely meet the three-tokens-each threshold (other pairs with gallows can also be identified below the threshold).  Numerous pairs vary only in alternating between a and o.  The first element always either ends in r or l or consists of a single glyph r, l, s, o or y.  The pairs with a single-glyph first element tend strongly to favor appearing together without a break.  My impression—not quantitatively confirmed—is that the longer an element, the more likely it is to be separated from its neighbor by a space.

Except for cases where the first word is s, o, or y, each of the word pair sequences listed above is split along what my hatchmarks-and-flourishes paradigm would identify as a turn boundary: the first word invariably ends in a minim with a flourish, and the next word invariably begins with a curvelet.  By contrast, Smith’s primes-based paradigm would often decompose the “together” versions into bodies and codas that don’t coincide with the breaks in the “apart” versions, e.g., do+lo+r versus do+l and o+r, cho+lda+iin versus cho+l and da+iin, and o+lka+iin versus o+l and ka+iin.  Now, if word breaks are an arbitrary artifact of the writing system, then it’s entirely possible that there’s a meaningful unit lo in dol.or, lda in chol.daiin, and lka in ol.kaiin.  But if the paradigm itself presupposes that words are coherent units, then such an argument might not be attractive.

The main argument I’ve seen put forward against word breaks being arbitrary rests on a purported similarity between paragraphic words and labels.  Zandbergen writes:

[A]n argument in favour that the word spaces are real is formed by the labels. The label words, which are found standing alone, also occur in the running text separated by spaces, and only very rarely with a space in between. This qualitative argument still needs to be confirmed quantitatively, though.

Smith and Ponzi echo this view in their Cryptologia article (p. 481):

[S]ingle-word labels in diagrams and next to illustrations follow the same general structure as the main text. They often begin with [o] and end with [y] and other parts of word structure, such as words beginning [ok, ot] or ending [dy], are shared between labels and the main text. As labels are strings of glyphs that have no immediate relationship with other text, they must be regarded as complete in themselves and not divisions of longer text.

Percentages of beginning and ending glyphs for different text units.  Single-glyph labels and words are excluded.  Mid-line statistics are for mid-line words, not mid-line word breaks, and first and last words of paragraphs are excluded from the statistics for lines.  Comma breaks (,) have been counted as word breaks. Twenty-four single-word lines at ends of paragraphs are reported only among line-initial words but would never affect paragraph-final statistics by more than half a percent.

A close correspondence between labels and the words in paragraphic text would indeed be compelling evidence that the latter represent discrete chunks of content.  But labels actually display quite different statistical properties from those of words in paragraphic text, making the average word (in the latter context) significantly unlike the average label.  Differences of a similar kind were the original basis for Prescott Currier’s well-known claim that the “line is a functional entity,” which he justified mainly on the grounds that “[t]he frequency counts of the beginnings and endings of lines are markedly different from the counts of the same characters internally” (here, on p. 5); and yet frequency counts for labels diverge from those for line-internal words just as starkly as those for the beginnings and endings of lines.  The most common beginning glyph for both labels and mid-line words is o, but it begins 62.5% of labels and only 21.4% of mid-line words.  The figures for other initial glyphs are likewise highly divergent—see the chart to the left, which presents statistics for labels, mid-line words, lines, and paragraphs.  Each of these units displays its own unique pattern, and the quantities of data involved seem too great for the discrepancies to be dismissed as mere noise.

Currier’s suggestion that lines are functional entities draws further support from Elmar Vogt’s study of word lengths within them.  Vogt reports that the first word of a line tends to be longer than average (the “first word effect”), while the second word of a line tends to be shorter than average (the “second word effect”).  I ran my own statistics to double-check Vogt’s findings, calculated in terms of EVA characters and with all comma breaks counted as word breaks.  (I haven’t explored whether ambiguity varies by line position, but that might be worth doing.)  The result was that the first words in lines (but not in paragraphs) have an average length of 5.25 characters, mid-line words have an average length of 4.96, and last words in lines (but not in paragraphs) have an average length of 4.61.  These figures resemble those reported by Ger Hungerink, albeit with some variation due possibly to use of different transcriptions.  Notwithstanding Vogt’s description of a “second word effect,” I came up with an average length of 4.98 for second words in lines and an average length of 4.96 for all other mid-line words, which doesn’t seem statistically significant.  Meanwhile, the first words in paragraphs have an average length of 6.42 characters, while the last words in paragraphs have an average length of 5.49 characters—higher than the average for mid-line words in both cases.  And paragraphs have their own distinctive patterns of beginning and ending glyphs besides, most famously their tendency to begin with gallows.
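
A length comparison of this kind is simple to reproduce.  Here is a minimal sketch, with a hypothetical function name, that assumes each line has already been split into a list of words and measures length in raw characters.  Unlike my own figures above, it makes no attempt to exclude the first and last words of paragraphs; the caller would need to filter those out beforehand.

```python
def mean_lengths_by_position(lines):
    """lines: one list of words per line. Returns mean length by line position."""
    buckets = {"first": [], "middle": [], "last": []}
    for words in lines:
        if len(words) < 2:
            continue  # a one-word line is both first and last, so skip it
        buckets["first"].append(len(words[0]))
        buckets["last"].append(len(words[-1]))
        buckets["middle"].extend(len(w) for w in words[1:-1])
    return {
        position: (sum(lengths) / len(lengths) if lengths else None)
        for position, lengths in buckets.items()
    }

# Toy input only.
print(mean_lengths_by_position([
    ["qokeedy", "ol", "dar"],
    ["shedy", "qokain", "chey", "am"],
]))
```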

Some Voynichologists treat mid-line words and labels alike as “normal” words, speculating that these are somehow being modified at paragraph beginnings and the beginnings and ends of lines to produce the statistical differences we see in those contexts.  Relevant hypotheses include the putative addition of a gallows to the first word in a paragraph (creating “Grove Words”) and of extra glyphs to the first words of lines (see, e.g., Emma May Smith’s suggestion that strong line-initial preferences of ych, ySh, dch, and dSh could point to some process whereby y or d is added to line-initial occurrences of words ordinarily starting in ch and Sh).  But if mid-line words and labels aren’t really all that similar to each other, then it becomes less defensible to treat such words as a norm by which other words can be evaluated.

The most frequent self-standing label words can also be found in paragraphic text, but not in remotely proportionate quantities.  Below I list the twenty-six most common label words (i.e., all the ones that turn up as labels three or more times) together with the number of tokens of each found in paragraphic text.  There’s not much of a correlation, and the similar structure of most of the common label words is noteworthy as well, with words beginning ot- or ok- forming the overwhelming majority (on which feature of “labelese” see also Nick Pelling).  It’s hard for me to imagine there being much more information here than “Figure 1,” “Figure 2,” and so forth.

  • 8 label tokens: otaly (10), otedy (140)
  • 6 label tokens: okal (133), okary (6), oky (91)
  • 5 label tokens: okaly (16), okeody (24), okol (62), otchdy (23)
  • 4 label tokens: okedy (108), okody (5), oteedy (90), otol (65), otoly (4)
  • 3 label tokens: dary (17), okaiin (204), okaldy (7), okam (29), okchdy (17), okeos (4), okor (29), otal (125), otaldy (4), otody (12), otoldy (6), qokal (191)

Curiously enough, the structure of these labels is a good predictor of whether they will also appear frequently as paragraphic words or not.  Words that are common both as labels and as paragraphic words tend to end in -edy (otedy, okedy, oteedy) or -l (okal, okol, otal, otol, qokal), while words that are common as labels but less often found as paragraphic words tend instead to end in -ly (otaly, otoly, okaly), -ldy (okaldy, otaldy, otoldy), -ody (okeody, okody, otody), -chdy (otchdy, okchdy), or -ry (okary, dary).

Meanwhile, the most frequently-encountered words in paragraphic text never appear as labels in any way that could confirm their status as discrete units.  Thus, daiin, ol, chedy, aiin, or, chol, and chey—the seven most common words in paragraphic text—don’t exist as self-standing labels, although there are plenty of labels of comparable length, and although these same glyph sequences can be found as parts of longer labels, often unique, such as daiinoty, choleey, and cheys.  Other labels contain word breaks that, in the case of paragraphic text, would fall into the ambiguous/inconsistent category, such as okal.ary, okeey.ary, otal.arar, okol.shol.dy, and otaim,dam.alam.  I suppose it’s conceivable that daiin, ol, chedy, and so forth encode common words that wouldn’t be expected to stand by themselves as labels, such as “and,” “if,” and “is,” although if so, it would also be necessary to account for the structural similarity of the words that do frequently appear as labels.  But in general, I don’t find that labels confirm the “real-word-like” status of mid-line words nearly as much as has been supposed.

Another argument I’ve seen put forward in favor of identifying mid-line words as “real” words is that they’re so often found repeated in different contexts.  But if we disregard spaces, the most common five-EVA-character sequence is actually the non-word edyqo (1511), not the word daiin (1364); and the non-word dyqok (1261) is likewise more common than the next most common five-EVA-character word, chedy (1224).  Granted, both edyqo and dyqok could also be part of the larger non-word sequence edyqok (970); but daiin itself could be assimilated to ydaiin (401), odaiin (325), daiinch (264), daiino (288), ldaiin (223), daiind (162), etc., and if we treat the terminal flourish as a separable element, daiin is less common than daiii (which, in my own analysis, also forms part of daiir, daiil, and daiim).  I’m not sure it’s the case that glyph or grapheme sequences with spaces around them repeat more consistently on the whole than other glyph or grapheme sequences—that might be true, but I’d want to see some confirmation of it if so.  (For what it’s worth, the longest exactly-repeating glyph sequence, irrespective of word breaks, appears to be olchedyqokainolsheyqokain, which turns up once as ol.chedy.qokain.olshey.qokain on f80v and once as olchedy / qokain.olshey.qokain across a line break on f75v.)
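
Counting character sequences without regard to spaces takes only a few lines.  The sketch below uses a hypothetical function name, counts raw EVA characters, and treats both dots and commas as spaces to be removed.  It does not join lines together, so a sequence that runs across a line break, like the long repeat just mentioned, would be missed.

```python
from collections import Counter

def top_ngrams(lines, n=5, k=3):
    """Return the k most common n-character sequences, ignoring word breaks."""
    counts = Counter()
    for line in lines:
        stripped = line.replace(".", "").replace(",", "")
        for i in range(len(stripped) - n + 1):
            counts[stripped[i:i + n]] += 1
    return counts.most_common(k)

# Toy input only.
print(top_ngrams(["chedy.qokedy.qokain", "shedy.qokal"], n=5, k=2))
```

Running the same count a second time on whole words only (splitting on breaks and keeping words of exactly n characters) gives the figures to set beside these.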

One possibility I’ve considered briefly is that labels might be following the same rules as lines, and that those rules manifest themselves differently based on the length of the text to which they’re being applied, with labels being generally shorter.  However, the twenty-four single-word lines found in paragraphic text don’t show the same statistical properties as labels.  Most notably, only four of them (16.7%) begin with o, which comes close to the expected 15.4% proportion for lines but is nowhere near the 62.5% expected of labels.  Lines, mid-line words, and labels all behave quite differently from one another.

If we question the status of mid-line words as “normal” words, that might lead us to reevaluate the “line as a functional unit” proposition, which is so deeply ingrained among Voynichologists as to have spawned the initialism “LAAFU.”  It may be that lines are simply more prone to break at meaningful division points than words are.  I tried applying the same method to line breaks that Smith and Ponzi applied to mid-line word breaks, comparing actual line-end/line-start combinations with what we’d expect from randomly shuffled lines.  There are far fewer line breaks than word breaks, so there’s more random noise, but the finding that stood out as most likely to be significant was that the combination m_q was only 58.3% as common as predicted.  Thus, if one line ends in m, the next line is rather less likely to begin with qo than we’d expect from a random shuffling of actual paragraph-internal line ends and line beginnings.  The most extreme pair in the opposite direction is n_Sh, which is 140.6% as common as expected.  These divergences are less extreme than many of those Smith and Ponzi identified for word breaks.  Overall, then, line breaks conform to predictions more closely than word breaks do, which suggests that individual lines might be self-contained units to a degree that individual words are not.
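The comparison of observed and expected line-break combinations can be sketched as follows.  This is an illustration under stated assumptions rather than the procedure actually used: instead of literally shuffling lines, it uses the closed-form expectation that shuffling converges on (end-glyph frequency times start-glyph frequency), the function name `break_ratios` is hypothetical, and the toy lines are invented placeholders, not Voynichese.

```python
from collections import Counter

def break_ratios(paragraphs):
    """paragraphs: list of paragraphs, each a list of non-empty line strings.
    Returns {(end_glyph, start_glyph): observed / expected} for line breaks
    inside paragraphs, where 'expected' assumes ends and starts are independent."""
    ends, starts, pairs = Counter(), Counter(), Counter()
    for lines in paragraphs:
        for upper, lower in zip(lines, lines[1:]):
            e, s = upper[-1], lower[0]
            ends[e] += 1
            starts[s] += 1
            pairs[(e, s)] += 1
    total = sum(pairs.values())
    ratios = {}
    for e in ends:
        for s in starts:
            expected = ends[e] * starts[s] / total
            ratios[(e, s)] = pairs[(e, s)] / expected
    return ratios

# Invented toy paragraph of four lines, only to show the shape of the output.
toy = [["am", "qa", "am", "sa"]]
for combo, ratio in sorted(break_ratios(toy).items()):
    print(combo, round(ratio, 3))
```

A ratio of 0.583 for the pair ("m", "q") would correspond to the 58.3% figure quoted above; with so few line breaks, low-count pairs will be noisy.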

Rows of self-standing labels or continuous text?

The textual unit that seems safest to regard as self-contained is the paragraph.  Lines can follow other lines, words can follow other words, and even labels sometimes look as though they could be continuous with each other (see illustration on right).  But it’s generally pretty clear where paragraphs begin and end, with seemingly little chance of them “spilling over” into anything else.  So the ways in which paragraphs begin and end could potentially be more revealing than the ways in which words or lines begin and end.

For example, only one single paragraph seems to end with a gallows—the final two words of its bottom line are cheedy.ot, on f105r.  Meanwhile, that same paragraph also has a peculiar partial right-justified line written above it, and the paragraph that follows next appears to have been written in a different ink.  Could it be that the next paragraph was written first, with the preceding paragraph filled in later, such that the writer ran out of space and needed to place the final words of the paragraph above it, in which case the gallows wouldn’t truly end the paragraph?  Or does this odd example offer unique evidence that it’s possible for a paragraph to end with a gallows, forcing us to grapple with whatever implications that might have for how we understand gallows and paragraphs more generally?

We can also make a quick case study of the beginning glyphs of the first paragraphs on pages reported in the Zandbergen transcription.  These start with (42.2%), (20.4%), (17%), (6.3%), rare but unsplit gallows types (2.4%), q (1.9%), split gallows with first leg preceded by or embedded in ch (3.9%), and “ordinary” gallows preceded by or embedded in ch (3.4%).  In the last category, the cases reported to begin ok- seem legitimate on f30r (which has two paragraphs, the second beginning op-), f38v, and f66v, as does the one beginning cPhy on f65v, while those reported to begin ot- all involve potentially extenuating circumstances: f70r2 (text at far right of a fold-out plate), f84v (could be interpreted as a top-of-page label, while the first full row of text begins kor), and fRos (embedded in a rosette, not actually page-initial).  The four cases of page-initial q are tightly segregated: two are on the same fold-out leaf in the pharmaceutical section (qoar on f89r1, qokcheody on f89r2), while the other two are in the biological or balneological section (qetedy on f77v, qosheedy on f82r); and all but the last are preceded by rows of labels.  Once we’ve eliminated all the aforementioned cases, we’re left with just five reported exceptions (2.4%).  Three of the pages are devoted more to charts or drawings than to paragraphic text, and two of those are also second parts of fold-out leaves (f68r1, f85r2, f67r2), which whittles down our list of straightforward exceptions to just two pages of paragraphic text, both in the pharmaceutical section.  These are the examples from f88r and f99v shown in facsimile above, and preceded by what have been assumed to be rows of labels, the first of which begins ot- in both cases.  Could these “labels” really be the first lines of the paragraphs?  Maybe not, since page-initial ot- is always a bit doubtful elsewhere, and not all other top rows of labels in the pharmaceutical section follow a similar pattern.  But overall, the deeper we dig into individual cases and exclude those with extenuating circumstances, the stronger the pattern seems to be that if a page begins with paragraphic text, a gallows needs to be involved in launching it.

§ 5

Patterns of Similarity and Repetition

If there’s an elephant in the Voynich Room, it’s the existence of such conspicuously repetitive passages as qokeedy.qokeedy.qokeedy.qotey.qokeey.qokeey.otedy,qotaiin (f108v) or chol.cphol.shol.shol.qockhol.chor.chol.sho (f100r).  Some have speculated that these passages encode content that is itself repetitive in structure, with proposals ranging from Roman numerals to glossolalia to charm formulae.  But whatever its cause, the repetition of “similar” words is so pervasive in the Voynich Manuscript that it’s hard to imagine how it could not be fundamental to its nature.

One investigator who has looked this repetition square in the eye without flinching is Torsten Timm, whose intriguing hypothesis—first presented here, with further statistical support here, and elaborated yet further as the “self-citation hypothesis” in a Cryptologia article coauthored with Andreas Schinner here—is that the Voynich Manuscript was created mostly by copying words from earlier points, often altering them according to one of the following processes:

  • Varying the number of hatchmarks (e.g., in, iin, iiin; e, ee, eee).
  • Substituting a for o (except after q or before k), or substituting o and y as the first and last sign only.
  • Substituting gallows glyphs for one another.
  • Substituting ee or sh for ch.
  • Substituting among chk, cKh, eke.
  • Substituting among r, m, and either n or l.
  • Substituting r or d for s.
  • Adding a prefix such as qo, l, or ch (and then often changing following ch or d to k).
  • Dropping a character.
  • Inserting or removing a space.

As Timm observes, the dynamics of copying could account for vertical patterning (with words being copied from one, two, or three lines above the current line, often offset slightly in the direction of writing) and for the appearance of similar but unusual words near each other (such as those containing x).  Timm proposes daiin, ol, and chedy as the three basic words from which all other words derive, such that the greater the “edit distance” of any word from one of these, the less common it will be.  However, resemblances of other kinds can be found among words similarly positioned relative to each other.  Consider this example from f7r, in which the top and bottom words don’t conform to any of Timm’s specific rules for transformation.


Or this example from f83v, where an atypical hatchmark appears twice in a row in similar contexts:

Or this vertical pair in which it’s difficult to decide whether we’re seeing ed or with a loopback flourish (f75r):

Some might take these last bits of evidence as clinching Timm’s case, concluding that not only were words copied, but even the idiosyncratic style in which they were written.  But I’m not so sure.  Here’s another case (on f103r) in which two oddities turn up in close proximity:

Both words begin qok, but the first has ee joined to chy in a peculiar manner reminiscent of a, while the second appears to add a tail-flourish directly to k.  It seems reasonable to speculate that these two oddities stem from the same cause, whatever that may be, and that they make whatever sense they make together rather than singly.  It would be difficult—though maybe not impossible—to explain the juxtaposition of the two anomalies in terms of copying.  Rather, I want to suggest that there’s likely to be something significant about the patterned structure qokX.qokY, in which both the repeated qok and an interdependence between X and Y are somehow implicated.  And if a structure like that can be significant here, why not elsewhere too?  Here’s another comparable case in which the element che repeats twice (cheX.cheY), once ending with an unusual placement of p, the other terminating in a version of the uncommon glyph x.

It strikes me that there are at least four distinct contexts in which “similar” words could be investigated:

  1. Similarity among horizontally adjacent words, i.e., “repetitions.”
  2. Similarity among vertically adjacent words.
  3. Similarity among words that otherwise appear “near” each other, whether in the same line, on the same page, on nearby pages, or in the same “section.”
  4. Similarity among all extant words without regard for where they appear.

Timm addresses all four situations, but he pays particular attention to categories two and three, perhaps because he considers them especially supportive of his hypothesis.  He devotes only a brief section of his study to the first category, i.e., “similarly spelled consecutive glyph groups” (p. 23), although he’s very interested in examining similar groupings of consecutive words.  Similarity among consecutive words may interest him less because they wouldn’t require copying from previously written text to explain them; the writer could simply have held the first word in memory while writing the second one.

There has been some effort to lay out a typology for manifestations of “similarity” in Voynichese.  In a blog post of 2015, David Jackson proposed to define Jackson sequences as “fragments of sentences in which a word appears to be repeated several times with slight modifications,” in contradistinction to Timm’s Pairs, or “two very similar words appearing within the same paragraph, usually with an additional suffix or prefix.”  On the other hand, the phrase “Timm’s Pair” or “Timm Pair” has also been used to refer specifically to cases of vertical word similarity or to pairs of similar word sequences, such as shedy.qokedy.qokeedy.qokedy.chedy.okain.chey and shedy.qokedy.qokeedy.qokeedy.chedy.raiin.chey (both on f84r), so this terminology could get a bit confusing.  Moreover, I’m not convinced that Jackson’s characterization of his eponymous sequences as an individual word “evolving” or “morphing” quite captures the essence of what’s going on, although it’s certainly not a bad one.  To me, other sequences I’d place in the same category Jackson seems to have in mind—he cites, e.g., oteedy.okeey.qokeedy.olkeedy.oteey—look much more like components of words repeating or alternating or succeeding one another at different intervals within larger structures that span multiple words, and not like a word repeatedly being transformed.

If we take a line such as (f86v, my own reading from the facsimile) and stack the words vertically—see the figure on the right—the pattern of repeating elements becomes easier to grasp.  Let’s call this a “cycle chart.”  (I’ve been filling handwritten pages with charts like these during odd moments for at least a decade, but I don’t think I’ve seen one published anywhere before.)  There are four tokens here of aiin, four tokens of y, three tokens of ar, and one token of am, with t marking the start of the whole sequence (word-initially) and then each division between y and aiin/ar elements (word-internally).  The aiin segments cluster at the beginning, the ar segments come afterwards, the am segment follows at the end, and the y segments are spread out—albeit unevenly—across the sequence.  There are clearly at least three “columns” here, which we might say correspond to a nucleus that’s always present, a gallows that’s sometimes present, and a prefix that’s also sometimes present.  Treating this as a sequence of modifications of the opening word taiin wouldn’t do justice to the patterning of the other elements.  (The “columns” of the cycle chart would, I think, fit decently into most word models; in Stolfi’s paradigm, for instance, the prefix would correspond to Q, the gallows to its unique position, and the nucleus to an amalgamation of MtS+OR+Final.)

The line we’ve just examined provides an unusually clear example of a pattern that can also be recognized elsewhere (at least subjectively).  Below are cycle charts for three more entire lines—and, needless to say, I’m only guessing at which parts of words belong to which parts of a cycle.  I’ve occasionally split a word into two lines or placed two words on the same line, and I’ve color-coded a few segments to highlight elements that seem at first glance to diverge from the usual pattern, but in ways that may themselves be patterned. (f70r2)
(f40r) (f55r2)

The prefix alternates among y, o, qo, and—in one case—lo, although the l might conceivably carry over from a preceding cycle.  The gallows is or k, but consistent throughout each example, and it almost always comes between the prefix and the suffix.  The nucleus is most often ai or oi plus a flourish, but sometimes additional hatchmarks are introduced (aiii, oeee), and it’s sometimes preceded by a segment (d, ched, chd).  The place in this pattern of y—when not followed by a gallows—is less clear from the evidence immediately at hand, although it seems to occur where the cycle “loops around,” such that the cycle itself might not have a sharply defined beginning or end.  In passing, I want to draw attention to the juxtaposition of oees and aiir, which vary only in alternating between curvelets and minims.
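As a rough illustration of the three “columns,” a first-pass split could be automated with a regular expression.  This is a sketch only: the pattern `CYCLE` and the function `cycle_columns` are hypothetical, the prefix inventory (qo, lo, o, y) is taken from the paragraph above, a prefix is recognized only when a lowercase gallows (t, k, p, f) follows it, and the hand-drawn charts involve judgment calls (splitting or joining words) that no such rule reproduces.

```python
import re

# Optional (prefix + gallows) group, then everything else as the nucleus.
CYCLE = re.compile(r"^(?:(qo|lo|o|y)?([tkpf]))?(.*)$")

def cycle_columns(word):
    """Split a word into (prefix, gallows, nucleus); missing slots become ''."""
    prefix, gallows, nucleus = CYCLE.match(word).groups()
    return (prefix or "", gallows or "", nucleus)

# Sequence quoted in the text from f75r, stacked vertically like a cycle chart.
for w in "oty.okar.otar.olkain.oltedy".split("."):
    print("{:>2} {:>1} {}".format(*cycle_columns(w)))
```

Words such as olkain and oltedy fall wholly into the nucleus column under this rule, which is one place where a hand-built chart would probably make a different choice.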

These few examples aren’t fully representative of their type.  Other similar sequences conspicuously repeat or alternate segments (e.g., qokedy.qokeedy.qokedy.lkedy.lSheedy, f76r), while others conspicuously repeat the element cho (e.g., otchol,ocThol.chol.chol.chody, f15v), and gallows can vary as well (e.g., oty.okar.otar.olkain.oltedy, f75r).  I assume David Jackson would accept all these as examples of Jackson sequences, on the grounds that they consist of what could be taken as “similar” words.

But an equally distinctive piece of the pattern seems to be the clustering of nucleus-only cycles, as in or (probably)  I doubt we should draw a sharp distinction between sequences in which these appear as separate words and others such as araral (f113v), otarar (f116r), kar.arar (f43r), aiin.arary (f58r), chor.or.oro.r,aiin (f15v), and daiin.ololdy.dal (f23r), in which similar components—often displaying similar repetitions—are written together without word breaks.  These appear, after all, to be elements of the same kind I identified earlier as most likely to appear with inconsistent or ambiguous spacing—or_ar, r_aiin, and the like.  Consider a word such as araral: the space-insertion rules I presented earlier would break araral into, but the glyph pair r_a is 17.1% ambiguously spaced, and we find a high level of inconsistency with the specific word pairs ar_ar (3 together, 6 apart, 1 ambiguous) and ar_al (15 together, 10 apart, 2 ambiguous).  For my cycle charts, I’ve already been disregarding—or at least mistrusting—some word boundaries (specifically, in the cases of yol, aloees, and otaldal).  But at the same time, nucleus-only cycles seem to belong to the same larger pattern as the other repetitions in Jackson sequences, as though they result when the prefix and/or gallows elements “drop out” for a few cycles, often to reappear later.  We might be able to attribute much of the inconsistency over spacing we’ve seen to ambivalence specifically over whether nucleus-only cycles should or should not be separated.

On this basis, I want to suggest that Jackson sequences and repeated elements within individual words such as araral are likely to be two manifestations of a single underlying pattern.  The cyclical nature of word structure is already evident from many word paradigms, as is the point that some slots are optional—filled in some words, but empty in others.  Not only individual words, but also whole lines of text could be analyzed in terms of such cycles; and one curious feature is that elements that fill slots in the cycle often fill them not just for one iteration of the cycle, but for several in close succession.  They might repeat identically twice or more in a row, or two or more “similar” elements might alternate.  The nucleus slot might be filled several times with nothing in the prefix or gallows slot; but when the latter slots are finally filled, it will often be with one or both of the same elements that had filled them before the interruption.   The elements that fill different slots appear to repeat or alternate independently from one another.  They might all synchronize the same way multiple times to produce identical words, but often they don’t.  As a result, some cases of patterning will be more obvious than others.  It’s hard to miss the repetition with something like oteedy.okeey.qokeedy.olkeedy.oteey or araral.  But it might also be worthwhile to create and study cycle charts for sequences such as the following, where patterns of some kind seem to be present but stand out less clearly:

okaiin.shar.yky.oky.kair.daldy.dalor.cheol.dal (f45r)
{i’h}y.otam (f58v, my transcription from facsimile)
okeal.okar.aralor,om (f108v)

Some of the Latin-alphabetic text on f116v and f17r shares a remarkably similar structure as well (six+marix+morix+vix, etc.), although as far as I can tell, it doesn’t recapitulate any specific patterns from elsewhere in the manuscript, as it might if it were the work of someone long ago vainly attempting to decipher a particular passage through simple substitution.  I can’t help but wonder whether these mysterious snippets might have been drawn up to illustrate some tangled logic of Voynichese text structure for someone not yet familiar with Voynichese: “See, if you were doing this same clever thing with ordinary Latin script, you’d start by….”

I’ve expressed my doubts about labels being straightforwardly comparable to mid-line words, but they do sometimes seem to contain cyclical structures and might usefully be analyzed as self-standing versions of those.  Some consist of unusually long sequences of elements that look like nucleus-only cycles without spaces, e.g., ararchodaiin (f89r1), adaShgasain (fRos), and chosaroShol (f100r); while others resemble miniature Jackson sequences, e.g., otaim,dam.alam (f65r); ykchy.kchey.ykchys (f67r); oteoeey.otal.okealar (f70v1); ykar.ykaly (f67r); otaiin.otain (f72r1).  I drew a contrast above between common label words that also turn up frequently in paragraphic text and others that don’t.  A common denominator among most of the ones that don’t is that they feature an “extra” nucleus-only cycle at the end, as with otoly (otol+y), okaldy (okal+dy), otary (otar+y).

On top of all this, there’s also the phenomenon of vertical patterning to be reckoned with.  Here’s a cycle chart for the last two lines on f104v, placed side by side, to illustrate how one case of subjectively noticeable vertical patterning (highlighted in green) fits into its context.  I’m least sure what to do with the word ycheeo—maybe that should be on a line by itself, aligned with the other tokens of y—but seems to be chronically difficult, which hints that its positional ambiguity could be a significant data point rather than a mere analytical nuisance.

Analyzed in this way, the pattern looks as though cyclical elements might just be synchronizing similarly twice at the same position in two adjacent lines, which could be coincidental.  But Timm presents numerous examples of comparable vertically-positioned similarities, and if we’re going to take patterns of similarity seriously at all, I guess we ought to take these seriously too.

The kinds of patterns a cycle chart reveals might be challenging to detect by computer, but one thing that’s downright easy to identify computationally is identical side-by-side word repetitions.  And one thing we can say about those is that they favor a middle position in a line over the beginning and end.  Disregarding comma breaks, I count 43 cases spanning the second and third non-final words of a line, 45 cases spanning the third and fourth non-final words, and 37 cases spanning the fourth and fifth non-final words; but only 14 cases spanning the first and second words, 21 spanning the last two words in a line, and 6 spanning line breaks.  Triple and quadruple identical word repetitions are always found in mid-line, and never right at the beginning or end:

  • chol.chol.chol (f8v, words 2-4 of 6)
  • okaiin.okaiin.okaiin (f40r, words 6-8 of 9)
  • chol.chol.chol (f47r, words 4-6 of 8)
  • qokedy<->qokedy.qokedy.qokedy (f75r, words 3-6 of 8, split by an illustration)
  • qokedy.qokedy.qokedy (f79v, words 3-5 of 8)
  • ytaiin.ytaiin.ytaiin (f86v, words 2-4 of 8)
  • sheol.sheol.sheol (f104v, words 3-5 of 12)
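Finding such runs and their positions takes only a few lines of code.  Here is a minimal sketch, with a hypothetical function name `repetition_runs`; it assumes a line is a string with “.” as the word break, drops comma breaks (joining the words on either side), and does not handle line breaks or the `<->` illustration marker.  The sample line is an invented placeholder built around a chol triple, not a transcription of f8v.

```python
def repetition_runs(line, min_run=2):
    """Return (word, first_position, last_position, words_in_line) for each run
    of identical adjacent words; positions are 1-based."""
    words = line.replace(",", "").split(".")
    runs = []
    i = 0
    while i < len(words):
        j = i
        # Extend the run while the next word is identical to the current one.
        while j + 1 < len(words) and words[j + 1] == words[i]:
            j += 1
        if j - i + 1 >= min_run:
            runs.append((words[i], i + 1, j + 1, len(words)))
        i = j + 1
    return runs

# Invented line: the triple occupies words 2-4 of 6.
print(repetition_runs("daiin.chol.chol.chol.dar.ol"))
```

Tallying the first position of each run across all lines would give position counts comparable to those listed above.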

But sometimes these strings of identical repetitions are parts of larger strings of not-quite-identical repetitions, such as chol.chol.chol.chor.  If we search for side-by-side word pairs in which part of a word is repeated (as reckoned in EVA characters), we can find quite a lot more of them.  Two figures are given for each generic pattern below, the first ignoring all comma breaks as spurious, the second treating them as actual word breaks; and in both cases allowing pairs to cross line boundaries but not paragraph boundaries.  Repeated elements are a, b, while unrepeated elements are X, Y.

  • Identical word repetitions: a.a (258 / 292)
  • One word has a prefix: a.Xa (332 / 485); Xa.a (327 / 419)
  • One word has a suffix: a.aX (128 / 144); aX.a (90 / 140)
  • One word has an infix: ab.aXb (213 / 224); aXb.ab (224 / 242)
  • Two different prefixes: Xa.Ya (2574 / 2815)
  • Two different suffixes: aX.aY (1936 / 2130)
  • Two different infixes: aXb.aYb (505 / 483)
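The seven generic patterns can be told apart mechanically from shared prefixes and suffixes.  The sketch below is one possible classifier, not necessarily the one behind the counts above: the names `classify_pair` and `pattern_counts` are hypothetical, and since a pair can fit more than one pattern (okedy.otedy shares both a beginning and an ending), the order of the tests is an assumption, with the most specific pattern tried first.

```python
from collections import Counter

def common_prefix_len(u, v):
    """Length of the shared beginning of two strings."""
    n = 0
    for x, y in zip(u, v):
        if x != y:
            break
        n += 1
    return n

def classify_pair(w1, w2):
    """Assign an adjacent word pair to one generic pattern (assumed precedence)."""
    if w1 == w2:
        return "a.a"
    if w2.endswith(w1):
        return "a.Xa"
    if w1.endswith(w2):
        return "Xa.a"
    if w2.startswith(w1):
        return "a.aX"
    if w1.startswith(w2):
        return "aX.a"
    p = common_prefix_len(w1, w2)                # shared beginning
    s = common_prefix_len(w1[::-1], w2[::-1])    # shared ending
    if p and s and p + s >= min(len(w1), len(w2)):
        # The shorter word is fully covered, so the longer one inserts an infix.
        return "ab.aXb" if len(w1) < len(w2) else "aXb.ab"
    if p and s:
        return "aXb.aYb"
    if s:
        return "Xa.Ya"
    if p:
        return "aX.aY"
    return "none"

def pattern_counts(words):
    """Tally patterns over all adjacent pairs in a word sequence."""
    return Counter(classify_pair(a, b) for a, b in zip(words, words[1:]))

# Sequence cited earlier in the text.
print(pattern_counts("oteedy.okeey.qokeedy.olkeedy.oteey".split(".")))
```

Because of the precedence choice, counts from this sketch would not be expected to match the figures above exactly.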

Collectively, these seven patterns account for around 20% of total word adjacencies.  It’s true that the repeated part is often smaller than the variable part; thus, some examples of aX.aY are okchdy.otar, dcheckey.daiin, and ydam.yteoor, with each pair sharing only a single glyph in common—arguably not all that “similar.”  Even so, some potentially interesting patterns emerge from the data.  Among other things, pairs of adjacent words are almost exactly three times as likely to vary by one adding something extra at the beginning than by one adding something extra at the end.  But if we analyze such phenomena at the word level, our treatment of ambiguous spaces can have a big impact on our results.  For example, if we disregard comma breaks, it’s about 1.5 times as common for something to be added to the end of the second word in a pair (128 cases) as to the end of the first word in a pair (90 cases), but if we treat all comma breaks as real word breaks, the difference vanishes, presumably due to the frequency of such sequences as okal.o,kaiin and lchedy.l,keedy.

Within generalized patterns of variation such as ab.aXb, specific values of X also stand out as particularly common, largely independent of the values for a/b.  I count twenty-one specific patterns with at least twenty-five occurrences (disregarding all comma breaks, but otherwise proceeding as before):

  • One word inserts e, e.g., choty.chotey (100): ~*.~e* (46); ~e*.~* (55)
  • One word begins o, the other qo, e.g., okedy.qokedy (84): ~.q~ (44); q~.~ (40)
  • One word ends r, the other l, e.g., chor.chol (69): ~r.~l  (43); ~l.~r (26)
  • One word contains k, the other t, e.g., otar.okar (64): ~t*.~k* (33); ~k*.~t* (31)
  • One word inserts d, e.g., chedy.chey (53): ~d*.~* (31); ~*.~d* (22)
  • One word begins qok, the other ch, e.g., qokol.chol (52): qok~.ch~ (24); ch~.qok~ (28)
  • One word begins qok, the other Sh, e.g., qokol.Shol (52): qok~.Sh~ (15); Sh~.qok~ (37)
  • One word begins qok, the other ot, e.g., qokol.otol (43): qok~.ot~ (26); ot~.qok~ (17)
  • One word inserts i, e.g., dain.daiin (42): ~*.~i* (23); ~i*.~* (19); all have n for * except for four tokens of ar.air 
  • One word ends iin, the other l, e.g., daiin.dal (41): ~iin.~l (24); ~l.~iin (17)
  • One word begins o, e.g., okedy.kedy (39): o~.~ (23); ~.o~ (16)
  • One word ends iin, the other r, e.g., daiin.dar (38): ~iin.~r (17); ~r.~iin (21)
  • One word ends aiin, the other y, e.g., daiin.dy (33): ~aiin.~y (16); ~y.~aiin (17)
  • One word begins qoke, the other ch, e.g., qokeedy.chedy (33): qoke~.ch~ (15); ch~.qoke~ (18)
  • One word begins ch, e.g., or.chor (31): ~.ch~ (15); ch~.~ (16)
  • One word begins d, e.g., ol.dol (31): ~.d~ (13); d~.~ (18)
  • One word begins ch, the other Sh, e.g., chol.Shol (31): ch~.Sh~ (15); Sh~.ch~ (16)
  • One word begins ot, e.g., chol.otchol (30): ~.ot~ (14); ot~.~ (16)
  • One word begins qoke, the other Sh, e.g., qokeedy.Shedy (27): qoke~.Sh~ (7); Sh~.qoke~ (20)
  • One word ends ol, the other y, e.g., chol.chy (26): ~ol.~y (20); ~y.~ol (6)
  • One word begins ch, the other cTh (i.e., inserting t), e.g., chol.cThol (25): ch~.cTh~ (17); cTh~.ch~ (8)

One advantage in trying to detect such things algorithmically is that it helps expose recurring patterns in which adjacent pairs of words may not look particularly “similar” but still display identical relationships to each other as other pairs of adjacent words.  One such example is ~*aiin.~c*hy, with four cases: chotaiin.chocThy (applied to cho/t); chekaiin.checKhy (applied to che/k); okaiin.ocKhy (applied to o/k); and Shekaiin.ShecKhy (applied to She/k).

But after taking the trouble to slog through all that, I suddenly found myself wondering whether there was actually any statistical significance to side-by-side word similarities, or whether the patterns I was finding simply reflected patterns of similarity among all words regardless of position.  In an effort to answer that question, I came up with a not-very-sophisticated metric for word similarity by taking the quantity of shared EVA characters, minus one for each matching “chunk” beyond one, divided by the mean quantity of EVA characters per word.  So, for example, dain and daiin share four EVA characters distributed over two matching chunks dai and n (4-1=3), and the average word length is 4.5, so 3/4.5 = ~0.6667.  This may not be as robust a metric as Timm’s measure of “edit distance,” but it should still track similarity to a reasonable degree.  For adjacent words in paragraphic text (n, n+1), the average similarity calculated in this way—disregarding comma breaks—is 0.2737.  For all word types in paragraphic text, it’s 0.2239, and for all word tokens, it’s 0.2371.  Thus, adjacent paragraphic word tokens are systematically more similar to each other than they are to the universe of paragraphic word tokens and word types throughout the whole manuscript.  But if we compare the similarity of side-by-side words with that of other nearby words, a more nuanced picture emerges, as shown in the graph below.
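The metric described above can be approximated with Python’s standard `difflib`.  This is a sketch under assumptions: the text doesn’t say how matching “chunks” were found, so `SequenceMatcher.get_matching_blocks()` is swapped in as one plausible way to find them, and the function names `similarity` and `mean_similarity_at_lag` are hypothetical.  It does reproduce the worked example: dain and daiin share the chunks dai and n, giving (4 − 1) / 4.5.

```python
from difflib import SequenceMatcher

def similarity(w1, w2):
    """(shared characters - (chunks - 1)) / mean length of the two words."""
    matcher = SequenceMatcher(None, w1, w2, autojunk=False)
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    shared = sum(b.size for b in blocks)
    penalty = max(len(blocks) - 1, 0)   # minus one per matching chunk beyond the first
    mean_len = (len(w1) + len(w2)) / 2
    return (shared - penalty) / mean_len

def mean_similarity_at_lag(words, lag):
    """Average similarity between word n and word n+lag (lag >= 1)."""
    pairs = list(zip(words, words[lag:]))
    if not pairs:
        return 0.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

print(round(similarity("dain", "daiin"), 4))

# Sequence quoted earlier from f84r; a real run would use the whole text.
words = "shedy.qokedy.qokeedy.qokedy.chedy.okain.chey".split(".")
print([round(mean_similarity_at_lag(words, x), 3) for x in (1, 2, 3)])
```

Plotting `mean_similarity_at_lag` for a range of lags over the full word sequence would give a curve of the kind described next.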

Measurement of average similarity between words n and n+x for x in the range 1-20.  Blue: comma breaks disregarded (= fewer but longer words).  Orange: comma breaks treated as real word breaks (= more but shorter words).

There’s a sharp peak at n+2, which shows that horizontal pairs of words separated by one intervening word are actually more similar on average than words that are immediately next to each other.  This is followed by a periodic rise and fall at an interval close to the average length of a line.  Such a period would be consistent with vertically patterned word similarity within lines, and if we extend our x axis out to n+300, the periodic rise-and-fall quickly fades away, as we’d expect from cumulative inconsistency in line lengths blurring the pattern.

Measurement of average similarity between words n and n+x for x in the range 1-300.  Same key as before for blue and orange.

Meanwhile, we also see a gradual decrease in similarity that continues until it eventually reaches the background level we’d expect for any two randomly-chosen word tokens.  Thus, words tend to be more similar to words that are only hundreds of words away from them than they are to words that are thousands of words away from them—a finding consistent with Timm’s.

But similarity doesn’t taper off at a uniform rate based on the distance between individual words; it’s linked further to page and section boundaries.  Below is a graph in which both the x and y axes represent the pages of the Voynich Manuscript laid out in order and pixel brightness represents the average similarity of all words on the page represented by the column to all words on the page represented by the row (with some contrast enhancement to make the patterns easier to see).

The bright diagonal line from upper left to lower right represents the similarity among the words on each page, which is stronger than could be accounted for just by the fact that each word is being matched once against itself.  The two darkest lines in the middle of the range correspond to f57v with its circles of odd and single glyphs and f65r, which contains just three words.  The bright more-or-less-solid square in the upper left corner represents the initial part of the herbal section, and the smaller bright square about a third of the way from the center into the lower right quadrant represents the biological or balneological section.  Prescott Currier famously identified two different “languages” in the Voynich Manuscript with different characteristics, now known as Currier A and Currier B.  The larger bright square in the upper left corner of my graph corresponds to fairly pure Currier A, while the smaller bright square in the lower right quadrant corresponds to fairly pure Currier B.  The intervening bands of brightness and darkness correspond to individual pages that lean strongly towards Currier A or Currier B in contrast to the surrounding environment.  This presents another vantage point on similarity by page and by section.  And if you don’t want to put too much stock in my measurement of similarity, here’s a similar graph showing the proportion of identical words shared between each pair of pages.  The patterns seem roughly the same, though a bit harder to make out.  (A somewhat similar figure by René Zandbergen may be found here, as well as a more recent one based on a decomposition of the text into bigrams here.)
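A page-by-page comparison of shared vocabulary can be sketched as a simple matrix.  The text doesn’t define “proportion of identical words shared,” so this example assumes one reading of it: the number of word types two pages have in common divided by the number of word types found on either (a Jaccard index).  The function name `shared_word_matrix` is hypothetical and the three toy “pages” are invented from words quoted in this post.

```python
def shared_word_matrix(pages):
    """pages: list of word lists, one per page, in manuscript order.
    Returns a square matrix of shared-word-type proportions (Jaccard index)."""
    vocab = [set(page) for page in pages]
    n = len(vocab)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            union = vocab[i] | vocab[j]
            if union:
                matrix[i][j] = len(vocab[i] & vocab[j]) / len(union)
    return matrix

# Invented toy pages; real input would be the words of each folio side.
pages = [["chol", "chor", "daiin"],
         ["chol", "daiin", "shol"],
         ["qokedy", "chedy"]]
for row in shared_word_matrix(pages):
    print([round(v, 2) for v in row])
```

Rendering the matrix as pixel brightness, with rows and columns in page order, would give a picture laid out like the graphs described above.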

Meanwhile, some asymmetries in the order of words in general patterns can be found echoed in asymmetries among individual word pair frequencies.  For example, the broader asymmetry in ~r.~l (43) versus ~l.~r (26) appears to manifest itself consistently in chor.chol (7) versus chol.chor (4), (11) versus (4), or.ol (9) versus ol.or (2), and so on.  In such cases, the order of the variable part in a pairing of similar words appears to be significant.  Moreover, this may not be merely a matter of adjacent word pairs.  If (*) represents an intervening word, ~r.(*).~l (27) is also more common than ~l.(*).~r (19).  The numbers of tokens involved are low enough that the difference might admittedly just be noise.  But in cases of asymmetrical similarities like this one, it would still be worth examining whether certain elements are more likely to come after certain other elements than before them even when one or more words intervene.
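Order asymmetries of this kind, with or without intervening words, could be tallied along the following lines.  This is a minimal sketch: the function name `count_suffix_swap` and its `gap` parameter are hypothetical, the shared part of the two words is assumed to be non-empty, and the word stream is an invented toy sequence rather than manuscript text.

```python
def count_suffix_swap(words, first, second, gap=0):
    """Count positions where a word ending in `first` is followed, after `gap`
    intervening words, by a word with the same stem ending in `second`."""
    count = 0
    for w1, w2 in zip(words, words[gap + 1:]):
        if w1.endswith(first) and w2.endswith(second):
            stem1 = w1[:-len(first)]
            stem2 = w2[:-len(second)]
            if stem1 and stem1 == stem2:
                count += 1
    return count

# Invented toy stream, only to show how the two orders are compared.
words = "chor.chol.or.ol.chol.chor".split(".")
print(count_suffix_swap(words, "r", "l"), count_suffix_swap(words, "l", "r"))
print(count_suffix_swap(words, "r", "l", gap=1))
```

Setting `gap=1` corresponds to the ~r.(*).~l case, and larger gaps would test whether the asymmetry persists at greater distances.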

When an element is repeated across a number of cycles, this can look like a longer sequence of words all sharing the same prefix or suffix, which gives us another relatively easy-to-detect phenomenon to track.  I don’t want to make too much here of exact counts, since much hinges on vagaries of word division and glyph differentiation; this is, to be sure, another situation where precision outpaces accuracy.  But with that caveat, here’s a compilation of the longest strings of sequential words I was able to find (while ignoring comma breaks) ending with:

  • -y = 18 (f76v): salkchdy.chey.lcheedy.lchedy.qoteedy/sol,shey.qotedy.chey.dytey.teedy.lchey.qokedy.chedy.lal,chedy.lchedy/dchedy.qokeedy.qotchy
  • -n = 10 (f25r): chan.chaiin.qotchain/qotcheaiin.dchain.cthain.daiin.daiin.cthain.qotaiin
  • -in = 9, as above minus first word
  • -l = 7 (f42r): chol.chhol.chal/shol.chol.chol.shol
  • -r = 5 (four cases); –= 5 (five cases); –= 3 (three cases, two on f90r); –= 3 (f104v: qoteeo.lo.ycheo); –d = 2 (15 cases); –= 2 (f39v, shek.chek); –= 2 (f27v, pochof.chof)
  • -edy = 10 (f76v): cheedy.otedy.cThedy.otedy.qoteedy.shcThedy.qoeekeedy.deedy/tchedy.lsheedy
  • -ey = 6 (f107v): y,cheey.qokeey.okeoteey.qokey.qokey.qokeey
  • -ol = 6 (f88r): qoekol.qoekol.qocKhol.okol.cheol/dsheol
  • -ar = 5 (f115r):
  • -or = 4 (f15v): otchor.chor.chor.ytchor
  • -al = 4 (f33r): okal.otal.chdal.shekal; (f66v): dal.ko,dal.chekal.dal

And here are the longest strings of sequential words I was able to find beginning with:

  • ch- = 11 (f49v, disregarding vertical column of glyphs to left of main text): chey.chor.chokaly.chor,dal<->chaiin,d<->/chol.chor.ches.chkalchy.chokeeoky<->chokoran
  • o- = 9 (f24v): otol.otaiir<->otchos.okchom.okcho/ocThol.odchees.oesearies.okam
  • d- = 7 (f37v): daiin.dy/dshor.dytory.dshor.daiin/dchor
  • qo- = 7 (f111r): qokeedy.qokedy.qoteedy.qod/qocKhedy.qokeechy.qotchey
  • Sh- = 5 (f44v): Shaiin.Shor.Shorody.Shky.Sho
  • yt- = 4 (f67r): ytor.ytar.ytchor.ytaiin
  • ot- = 4 (f112r): oteedy.oteedy.otaiin.oty
  • lk- = 4 (f113r): lkeches.l,keeol.lkcheol.lkchedy
  • ol- = 4, two cases, (f81r): ol.ol.olaiin.ol; (f75v): ol.olain.olkey.olshed
  • a- = 4, three cases, (f86v3):; (f104r):; and possibly a longer sequence on f112r, depending on how we resolve ambiguous cases of o/a:
  • da- = 4, two cases, (f38v): daiin.daiiin<->dain.dain; (f89r1): daiin.dam/dals,al.dal
  • ok- = 4, two cases, (f99v): okeeol.okeor.okal.okaiin; (f40r): okaiin.okar.oky.okoldy
  • yk- = 3 (f86v5): ykaiin.ykeey.ykal
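
For anyone who wants to repeat this kind of search, here is a minimal Python sketch of one way to find the longest run of consecutive words sharing a beginning or ending.  It is not the script used for the counts above; the function name longest_run is made up for illustration, and the sample is the f15v sequence quoted earlier with two hypothetical neighbouring words added on either side.

```python
def longest_run(words, predicate):
    """Return (length, start_index) of the longest run of consecutive
    words for which predicate(word) is true."""
    best_len, best_start = 0, 0
    run_len, run_start = 0, 0
    for i, word in enumerate(words):
        if predicate(word):
            if run_len == 0:
                run_start = i          # a new run begins here
            run_len += 1
            if run_len > best_len:
                best_len, best_start = run_len, run_start
        else:
            run_len = 0                # the run is broken
    return best_len, best_start


# The f15v "-or" sequence, padded with two hypothetical neighbours.
sample = "daiin otchor chor chor ytchor dam".split()
length, start = longest_run(sample, lambda w: w.endswith("or"))
print(length, sample[start:start + length])
```

Swapping in a different predicate, such as lambda w: w.startswith("ch"), would cover the word-beginning cases.  How the word list is built (whether comma breaks count, how ambiguous glyphs are resolved) will still drive the exact counts.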

The beginnings and endings with the highest sequential repetition counts tend also to have numerous other high sequential repetition counts as well.  So, for example, there’s just one eleven-word sequence in which every word begins with ch-, but there are also two more eight-word sequences, chokeos<->chees.chr.cheaiin/chokoiShe.chor.cheol.chol (f2v) and{ck}y<->chor.cheain.char.cheeky.chor (f8v); one other seven-word sequence, chy/chy.cho,keesy.chy.chy.cheky<->chochy (f47v); and four other five-word sequences.  Similarly, there’s just one eighteen-word sequence in which every word ends with –y, but there’s also a separate sixteen-word sequence qokeedy.oteey.qokedy.qokeedy.cheedy.qokeedar,oldy/tolshey.ochey.qokeey.qotedy.choteey.qokey.cholkeedy.lkedy.lchedy.qokeey (f108v); two more fourteen-word sequences shedy.dal,keedy.rshedy/sokeedy.qokeedy.oteedy.qoky.dy, (f75r) and se[?]y.qokedy.ychdykchdy.<->qokedy.qokedy.shekchy.cheky.daly/chdchy.ytcheeky.ypchedy.schdy.<->ytedy.{cThh}y (f41r); three more thirteen-word sequences; three more twelve-word sequences; and so on.

Here it’s worth pausing to ask whether such sequences might just be coincidental, and not due to any interdependence among words with shared characteristics.  Given how common it is for words to begin or end with particular glyphs, and given the length of the Voynich Manuscript, should we expect occasionally to see several such words turn up in a row just by chance?  If the probability that any individual word will begin ch- were 0.16, independent from whatever the surrounding words might be, then the probability of two words beginning ch in a row should be that value squared, or ~0.026; the probability of three words in a row starting that way should be that value cubed, or ~0.004; and so on.  The probability of eleven words in a row beginning ch would be 0.16 to the eleventh power, or ~0.00000000176.  Determining how likely this would be to occur over the course of the Voynich Manuscript is analogous to the problem of calculating the odds of getting twenty heads in a row in a million coin tosses, discussed here by Mark Nelson (see also here).  The relevant equation appears to be p^w × fib_n(w+2), where p is the proportion of words with a given feature; n is the number of words in a row with that feature; w is the total number of sequential words; and fib_n(x) is the xth n-step Fibonacci number.  If we were to take eleven sequential repetitions of ch- (each at 16.0%), and a rough figure of 32,500 for the total number of words (although we should perhaps factor in page and paragraph breaks), we’d end up needing to divide two integers each thousands of digits long.  Those numbers are a little too daunting for me to crunch.  Maybe one of my readers will be kind enough to do so.
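
One way to sidestep the enormous integers is to track probabilities directly rather than count arrangements.  The Python sketch below is a swapped-in technique, not the Fibonacci formula itself: it steps through the text one word at a time, keeping the probability of each possible "current streak" length, and adds up the chance that a full run is completed.  It assumes every word independently has the feature with the same probability and treats the text as one unbroken sequence (no page or paragraph breaks); the function name prob_run_at_least is made up for illustration.

```python
def prob_run_at_least(n_words, run_len, p):
    """Probability of at least one run of run_len consecutive "successes"
    somewhere in n_words independent trials, each succeeding with
    probability p."""
    # state[k] = probability that no full run has occurred so far and the
    # current trailing streak of successes has length k (0 <= k < run_len).
    state = [0.0] * run_len
    state[0] = 1.0
    hit = 0.0
    for _ in range(n_words):
        hit += state[-1] * p                 # streak of run_len - 1 is completed
        new = [0.0] * run_len
        new[0] = sum(state) * (1.0 - p)      # a failure resets the streak
        for k in range(run_len - 1):
            new[k + 1] = state[k] * p        # a success extends the streak
        state = new
    return hit


# Figures from the text: eleven ch- words in a row, p = 0.16, ~32,500 words.
print(prob_run_at_least(32_500, 11, 0.16))
```

As a sanity check, two heads in a row within three fair coin tosses comes out at 3/8, which matches counting the cases by hand.  For the ch- figures, a back-of-envelope estimate of roughly w × p^n × (1 − p) suggests a result on the order of 5 in 100,000, so an eleven-word run would be quite unexpected under these independence assumptions.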

But there’s another way to assess whether repetition is significant: i.e., calculating whether actual repetitions deviate from figures we’d expect from random shuffling of words, much like the Smith-Ponzi study of word breaks discussed earlier.  I count 31,863 word adjacencies in paragraphic text (ignoring comma breaks and line breaks).  Since first and last words in paragraphs have only one adjacency rather than two, statistics for the first and second words in adjacency pairs will vary.  So, for example, the first word begins with a in 1,295 cases (~4.1% of the total), while the second word begins with a in 1,332 cases (~4.2% of the total), which is not quite the same.  Still, these figures are reasonably close, such that we can say that around 4.1% of words involved in these adjacencies begin with a (the average (1,295+1,332)/2 = 1,313.5 divided by 31,863).  But there are 170 cases in which both words begin with a, which means that if one word begins with a, the likelihood of the other word beginning with a is 170/1,313.5 = 12.9%.  In other words, the odds of any given word beginning with a are about 4.1%, but if one word begins with a, the odds of the next word also beginning with a are over three times higher.  Here’s a list of comparable figures for all glyphs with which two sequential words begin in at least 35 instances (a threshold I’ll be applying in a number of cases to follow as well):

  • a-: 4.1% versus 12.9% (=314%)
  • ch-: 16.4% versus 18.1% (=110%)
  • d-: 9.5% versus 14.2% (=150%)
  • l-: 3.9% versus 10.9% (=282%)
  • o-: 21.4% versus 25.8% (=120%)
  • q-: 16.4% versus 20.4% (=124%)
  • s-: 3.5% versus 5.4% (=155%)
  • Sh-: 8.9% versus 10.1% (=125%)
  • y-: 5.1% versus 7.7% (=153%)
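
The arithmetic behind each line of this list can be spelled out in a few lines of Python.  This sketch simply reworks the a- figures quoted above (31,863 adjacencies; 1,295 and 1,332 words beginning with a in first and second position; 170 pairs where both do); the variable names are made up for illustration.

```python
total_adjacencies = 31_863
first_a, second_a = 1_295, 1_332   # a- words in first / second position
both_a = 170                       # pairs in which both words begin with a

mean_a = (first_a + second_a) / 2          # 1,313.5
baseline = mean_a / total_adjacencies      # overall share of a- words
conditional = both_a / mean_a              # share of a- words with an a- neighbour
ratio = conditional / baseline

print(f"{baseline:.1%} versus {conditional:.1%} (={ratio:.0%})")
# prints: 4.1% versus 12.9% (=314%)
```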

In every case, the likelihood of a word being followed by another word beginning with the same glyph is greater than the overall percentage of words beginning with that glyph.  Here are similar figures for word-ending glyphs, again presenting all word-final glyphs with at least 35 pairings of two adjacent words:

  • -l: 14.7% versus 18.6% (=127%)
  • -m: 2.7% versus 5.1% (=189%)
  • -n: 17.4% versus 17.3% (=99.5%)
  • -o: 2.4% versus 6.9% (=285%)
  • -r: 14.4% versus 18.3% (=127%)
  • -s: 3.0% versus 6.7% (=226%)
  • -y: 42.5% versus 47.6% (=112%)

Words ending in -n are somewhat unusual in not seeming to have any effect on each other: a word ending in -n is no more or less likely to be followed by another word ending in -n.  And yet this ending is one of the most prolific repeaters, with an instance of ten words ending with -n in a row.

As we’ve seen, Smith and Ponzi found that words appear together preferentially based on combinations of the last glyph of the first word and the first glyph of the second word.   But it appears from the foregoing that words also appear together preferentially based on the first glyphs of both words being the same, as well as on the last glyphs of both words being the same.  Moreover, these latter patterns of preference can be shown to extend even deeper into words.  For example, each of the following ending glyph pairs shows a significantly stronger inclination to repeat than the single final glyph by itself (-r, -l, -y), with figures presented here in the same way as above:

  • -or: 5.7% versus 10.6% (=184%)
  • -ol: 9.0% versus 15.9% (=177%)
  • -ar: 6.8% versus 11.0% (=163%)
  • -al: 5.1% versus 8.2% (=161%)
  • -dy: 19.0% versus 31.0% (=163%)
  • -ey: 11.3% versus 18.3% (=163%)

Nor are these patterns limited to glyphs or glyph combinations that repeat across consecutive words, as in the cases we’ve examined so far.  Pairs of different glyphs can be shown to behave similarly.

To investigate, I wrote a script that extracts the first and last glyphs of words, including the combinations ch, sh, cTh, cKh, etc., and that flags single-glyph words with an asterisk, e.g., r*, as an index of each word’s “type,” e.g., whether it belongs to the category of words beginning ch, the category of words ending y, and so on.  It then calculates the likelihood of each second word type occurring after any given token of each first word type, as well as the likelihood of each first word type occurring before any given token of each second word type: for example, for any word beginning with a, it reports what the odds are that the preceding word, or the following word, will begin with d or r or whatever.  I’ve also been applying my usual 35-token threshold when assessing results.  I’ve tried to limit the pool of words being studied in each case to just those words that have the appropriate relationships.  So, for instance, if I want to calculate how often words starting with p “should” appear after other words, I won’t include paragraph-initial words starting with p in my totals, since those aren’t part of the pool of words that are preceded by other words.  Otherwise, my figures are based on all paragraphic text; include pairs of words separated by line breaks; ignore comma spaces; and count single-glyph words separately rather than as words that “begin” or “end” with s, o, etc.
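
For readers who would like to try something similar, here is a rough Python sketch of the approach just described.  It is a reconstruction under assumptions, not the script itself: the list of multi-character glyphs in COMPOUNDS, the function names, and the toy paragraph are all hypothetical, and the 35-token threshold is left out for brevity.  Pairs are only formed inside a paragraph, so paragraph-initial words never count as "preceded" by anything.

```python
from collections import Counter

# Multi-character EVA units treated as single glyphs (an assumed list).
COMPOUNDS = ("cTh", "cKh", "cPh", "cFh", "ch", "Sh", "sh")


def first_glyph(word):
    for unit in COMPOUNDS:
        if word.startswith(unit):
            return unit
    return word[0]


def last_glyph(word):
    for unit in COMPOUNDS:
        if word.endswith(unit):
            return unit
    return word[-1]


def word_type(word, position):
    """First or last glyph of a word, with '*' marking single-glyph words."""
    glyph = first_glyph(word) if position == "first" else last_glyph(word)
    return glyph + "*" if glyph == word else glyph


def following_probabilities(paragraphs, position="first"):
    """P(type of next word | type of this word) over adjacent pairs."""
    pair_counts = Counter()
    first_counts = Counter()
    for words in paragraphs:
        types = [word_type(w, position) for w in words]
        for a, b in zip(types, types[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}


# Toy paragraph (hypothetical word order, for illustration only).
probs = following_probabilities([["qokeedy", "Shedy", "qokedy", "s", "chedy"]])
print(probs[("Sh", "q")], probs[("q", "Sh")])
```

The "before" probabilities can be had the same way by dividing each pair count by the count of the second type instead of the first.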

I have statistics on hand for all extant combinations, but here I’d like to draw out just two representative examples, one for word-initial glyphs, one for word-final glyphs.

  • 16.5% of words begin with q.  But we can contrast the probability of a word beginning with q after a word beginning with Sh (30.2%), q (20.4%), l (19.2%), ch (16.4%), o (14.7%), t (14.4%), y (14.3%), p (14.2%), d or k (13.3%), cTh (12.0%), r (11.7%), s (9.2%), or a (6.6%); or before a word beginning with r (27.1%), q or l (20.3%), ch (18.6%), Sh (17.3%), (16.7%), (15.3%), (14.9%), (11.2%), or (10.1%), or or cTh (8.9%).  Or considering matters in terms closer to the Smith-Ponzi study, we could say that, for example, pairs of words in the form Sh…_q… are 183.7% as common as expected based on overall word frequencies, while pairs of words in the form a…_q… are only 40.2% as common as expected.
  • Just over 42% of words end with y.  But we can contrast the probability of a word ending with y after a word ending with (47.7%), (43.7%), (41.8%), (39.5%), r (37.3%), (34.6%), (31.1%), (29.9%), or the single-glyph word (24.9%); or before a word ending with (47.4%), (46.4%), (41.4%), (38.5%), (36.5%), (36.1%), (36.0%), (35.6%), (34.4%), or the single-glyph word (32.2%).

In general, the frequencies of pairings of word-initial glyphs seem to diverge more widely from expectations than pairings of word-final glyphs do, or at least the extremes are greater.  But once again, these patterns can also sometimes be shown to extend deeper into words.  Here are a few representative cases in which words that differ by a glyph one space in from the end or beginning show notable statistical differences.  (I’ve counted one- and two-glyph words separately from longer words beginning with a glyph pair, e.g., the word ar is not considered to begin or end with ar.)

  • Just under 16% of words begin with qo.  But we can contrast the probability of a word beginning with qo appearing after a word beginning with She (37.2%), che (21.2%), Sho (18.1%), and cho (11.2%); or before a word beginning with che (20.1%), She (18.6%), cho (14.9%), and Sho (11.6%).
  • Just over 18.5% of words end with dy.  But we can contrast the probability of a word ending in dy after a word ending with ar (18.9%) or or (11.9%); or before a word ending with ar (17.4%) or or (11.0%).

Patterns among word-break combinations of the kind studied by Smith and Ponzi likewise seem to extend deeper into words.  To examine this, I first recalculated the Smith-Ponzi word-break combination statistics for single glyphs and came up with extremely similar results—not identical, but I’m using the Zandbergen transcription rather than the Takahashi transcription as they did, so minor differences are only to be expected.  Then I ran a similar set of statistics for glyph pairs, i.e., the last two glyphs before a break and the first two glyphs after a break, comparing actual pair combinations against quantities we’d predict from random shuffling.  As before, I count single and two-glyph words separately from longer words beginning with a glyph pair.  For each of the following selected word-break combinations, I’ve listed the total quantity of tokens and its percentage of the predicted quantity.

  • l_d (532) = 147.0%: ol_do (53) = 262.8%, al_da (119) = 156.0%, ol_da (214) = 151.7%
  • y_t (288) = 146.1%: ey_te (46) = 287.1%, dy_te (39) = 153.5%
  • y_r (226) = 161.0%: ey_ra (51) = 256.6%, dy_ra (40) = 126.9%
  • y_k (419) = 126.1%: ey_ke (79) = 250.3%, ey_ka (35) = 159.7%, dy_ke (38) = 75.9%
  • y_l (740) = 152.0%: ey_lch (83) = 241.8%, ey_lk (119) = 241.5%, dy_lch (91) = 167.1%, dy_lk (82) = 104.9%
  • n_a (221) = 94.3%: in_ai (36) = 35.6%, in_al (43) = 143.3%
  • y_ch (1542) = 69.8%: dy_cho (137) = 52.0%, ey_che (171) = 59.7%, ey_cho (106) = 63.8%, dy_che (357) = 78.5%, dy_chc#h where # is a gallows (63) = 83.5%, ky_che (45) = 103.0%, chy_cho (39) = 106.0%, dy_chd (48) = 106.4%, ky_cho (38) = 149.9%
  • y_q (3440) = 170.1%: dy_qo (1882) = 213.6%, ey_qo (912) = 164.2%, Shy_qo (50) = 142.3%, cKhy_qo (88) = 131.6%, cThy_qo (62) = 125.9%, chy_qo (143) = 116.3%, ty_qo (43) = 88.2%, ky_qo (74) = 87.4%

In each of these cases, the single-glyph word break combination differs in frequency from expectations by some amount (as Smith and Ponzi will have calculated), but double-glyph word break combinations that “contain” the same single-glyph combinations show mutually divergent or even opposed tendencies.  It’s true that the smaller the quantity of tokens, the larger the impact random noise in the data can have, and the less reliable the percentages will be as evidence of patterning, although my 35-token threshold serves as a modest safeguard.  But when we see a substantial contrast between dy_qo based on 1882 tokens and ey_qo based on 912 tokens, there can be little doubt about these figures exposing something real and in need of explanation.
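
The comparison of actual against predicted quantities can be sketched in Python as follows.  This version assumes the prediction is the usual independence estimate (count of the ending × count of the beginning ÷ total adjacencies), which is what random shuffling converges on; that is an assumption about the method, and the function name break_ratios and the toy data are hypothetical.

```python
from collections import Counter


def break_ratios(pairs, min_tokens=35):
    """pairs: one (ending_of_first_word, beginning_of_second_word) tuple
    per word adjacency.  Returns {(end, start): (observed, observed/expected)}
    for every combination with at least min_tokens tokens."""
    total = len(pairs)
    end_counts = Counter(end for end, _ in pairs)
    start_counts = Counter(start for _, start in pairs)
    result = {}
    for (end, start), observed in Counter(pairs).items():
        if observed < min_tokens:
            continue
        # Expected count if endings and beginnings paired up independently.
        expected = end_counts[end] * start_counts[start] / total
        result[(end, start)] = (observed, observed / expected)
    return result


# Toy data (hypothetical counts), with the threshold lowered to 1.
toy = [("y", "q")] * 3 + [("y", "ch")] + [("l", "ch")] * 2
for combo, (count, ratio) in sorted(break_ratios(toy, min_tokens=1).items()):
    print(combo, count, f"{ratio:.1%}")
```

Feeding in the last two glyphs and first two glyphs instead of single glyphs gives the double-glyph version, at the cost of thinner counts per combination.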

Smith and Ponzi state (on p. 484 of their Cryptologia article) that their favored explanation for preferential single-glyph word-break combinations lies in a phonological process such as Celtic initial-consonant mutation or the elision of word-final vowels before word-initial vowels in Italian.  It becomes more difficult, though perhaps not impossible, to account for preferential double-glyph word-break combinations in the same way: for example, the endings -ey and -dy would need to have different phonological implications.  And when we see comparable patterns of preference among words that both begin or end with particular glyphs, a straightforward phonological explanation for those strikes me as untenable, short of anagramming or some other dubious complication.  Which isn’t to say that word-break combinations and these other combinations necessarily stem from the same cause, or that some grammatical principle couldn’t be in play in the latter.  But phonology doesn’t seem likely as a comprehensive explanation for all preferential patterns among words, which makes it less attractive as an explanation for some of them.

It appears that every word has a number of simultaneous statistical “entanglements” (for want of better terminology):

  • The first glyph is entangled with the last glyph of the preceding word.
  • The last glyph is entangled with the first glyph of the following word.
  • The first glyph is entangled with the first glyphs of the preceding and following words.
  • The last glyph is entangled with the last glyphs of the preceding and following words.
  • The above observations apply to glyph pairs, and perhaps even larger groupings, as well as to single glyphs.

Granted, these findings all pertain to words, and I’ve argued above against taking words too seriously as discrete objects of study.  But if Voynichese text is instead structured cyclically, with a penchant for “holding over” certain elements from cycle to cycle, that might explain the common statistical preference of the same beginning and ending glyphs to repeat from word to word, e.g., two or more words in a row beginning with qo or ending with y.  Meanwhile, the other preferential patterns we’ve been examining could illuminate as-yet-unexplored relationships among different elements that are more or less likely to fill slots near each other in the cycle.

And if we’re sufficiently doubtful about the autonomy of words, so might preferential combinations of beginning and ending glyphs within the same word—a factor I don’t think I’ve seen addressed in any word grammar.  If we study these using the Smith-Ponzi method of comparing predicted frequencies against actual frequencies, applying the usual 35-token threshold, we once again find that some combinations are more common than expected, the extremes being a…m (259%), sh…o (234%), d…n (206%), and a…n (198%); while others are rarer than expected, the extremes being a…y (31%), sh…n (39%), s…y (50%), and q…m (51%).  These figures don’t show any obvious correlation with those for corresponding word-break combinations—e.g., the “inverse” of a…m (259%) is m_a (49%)—but I suppose there’s no reason why they should, given that elements seem to vary considerably in their tendency to repeat from word to word.  So this provides yet another set of unwieldy data points to be reckoned with.

§ 6

Currier Languages and Line Positions

There are, I’ll admit, other factors that could have a bearing on which words appear next to which other words with what frequencies.

As I mentioned earlier, Prescott Currier identified two “languages” on different folios of the Voynich Manuscript (or, more properly, on different vellum membranes, since a single folded membrane typically makes up two folios or four pages) now conventionally referred to as Currier A and Currier B, with distinctly different statistical patterns of distribution for words and other characteristics.  Nick Pelling gives a nice overview of properties associated with them here.

For one of my own first attempts to study the differences between the two languages, I generated word counts from the Zandbergen transcription for all paragraphic text, organized into Currier A and Currier B groups based on the transcription’s own classifications, excluding any words affected by comma breaks or intruding-illustration breaks.  29% of the total word count appears in Currier A sections, and 71% in Currier B sections.  Thus, any word that occurs with equal frequency in both languages should have 71% of its tokens appear in Currier B sections.  If this figure is greater than 71%, a word is disproportionately more common in Currier B; if less than 71%, it is disproportionately more common in Currier A.
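
The tally described here is simple enough to sketch in a few lines of Python.  The function name currier_b_share and the input format (one (word, language) pair per token) are assumptions made for illustration, and the toy tokens are hypothetical rather than drawn from the transcription.

```python
from collections import Counter


def currier_b_share(tokens):
    """tokens: iterable of (word, language) pairs, language being 'A' or 'B'.
    Returns (overall share of tokens in B, {word: share of its tokens in B})."""
    totals = Counter()
    in_b = Counter()
    n_all = n_b = 0
    for word, language in tokens:
        totals[word] += 1
        n_all += 1
        if language == "B":
            in_b[word] += 1
            n_b += 1
    return n_b / n_all, {word: in_b[word] / totals[word] for word in totals}


# Hypothetical tokens, for illustration only.
toy = [("chol", "A"), ("chol", "A"), ("chol", "B"),
       ("qokeedy", "B"), ("qokeedy", "B"), ("daiin", "A"), ("daiin", "B")]
overall, by_word = currier_b_share(toy)
for word, share in by_word.items():
    leaning = "B" if share > overall else "A" if share < overall else "neither"
    print(f"{word}: {share:.0%} in B (baseline {overall:.0%}), leans {leaning}")
```

A word's share is only meaningful relative to the overall baseline, which is why the function returns both.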

The nascent word grammar I outlined earlier makes much of single-glyph words, so one point that immediately caught my attention was the strong association of some of these with one or another Currier language.  The figures are:

  • l (86%), y (66%), (61%), (64%), (31%), (7%)

Also central to my word grammar is the “turn” consisting of curvelets culminating in o/a followed by one or more hatchmarks with a closing flourish.  So I was interested to see that the two languages consistently favor such “turns” beginning in different ways.  An asterisk below marks a word that appears exclusively in the language specified, and I’ve also indicated which words have only a few tokens.

  • ch = Currier A: cho (27%), chan (19%), chain (50%), chaiin (35%), choiin (19%), chor (21%), char (58%), chos (44%), chom (17%), chol (33%), chy (33%)
  • d = Currier A: do (19%), dan* (8 tokens), dain (62%), daiin (44%), doiin (19%), daiiin (42%), dor (38%), dair (61%), doiir* (5 tokens), dam (54%), dom* (6 tokens), dal (63%), dol (42%), dy (59%)
  • s = Currier B: sain (90%), saiin (75%), sar (75%), sair (85%), sol (79%)
  • r = Currier B: raiin (92%), rar (93%), ror*, ram (83%), ral*, rol (93%)
  • l = Currier B: lo*, laiin*, lar* (6 tokens), lor (91%), lol (96%)
  • [0] = Currier B: ain (95%), aiin (83%), aiiin (93%), am (82%), ar (86%), aiir*, al (89%), ol (82%)

I’m struck also by the extent to which these figures run parallel to the preferences of single-glyph words, which—as I’ve pointed out—are often linked to instances of ambiguous or inconsistent spacing.  Thus, l as a self-standing word is found mostly in Currier B, and so are words in the series beginning with l: lo, laiin, etc.  Conversely, as a self-standing word is found mostly in Currier A, and words in the series beginning with favor Currier A as well.  The frequencies of s / s- and r / r- fall in between, in the same order, although s by itself favors Currier A while s- favors Currier B.  Meanwhile, other series show comparable preferences, with cheo-, chod-, cTh-, cKh-, dch- favoring Currier A and ted-, ked-, chd-, ched-, lch-, lk- favoring Currier B (and these are by no means exhaustive lists).

All this might give us pause to reevaluate some of what I discussed earlier.  Most of the other statistics I’ve presented so far treat the manuscript as though it were a homogeneous whole.  But since it’s admittedly not homogeneous, and certain words or features are known to be more common in parts written in one or the other Currier language, we should expect some combinations to be more or less common in practice than overall quantities would predict, based on that factor alone.  Could the distinction between Currier A and Currier B end up explaining away all the elaborate statistical anomalies I described above?  For instance, might words beginning cTh be more likely to appear next to each other only because they overwhelmingly favor Currier A?

I don’t think so.  Smith and Ponzi have shown that the preferential word-break combination patterns they studied occur in both Currier A and Currier B, although not identically in both; for example, n_ch deviates from expectations in Currier B but not Currier A (see pp. 471-472 in their Cryptologia article).  And when I’ve made my own studies based exclusively on sections classified in the Zandbergen transcription as Currier A or Currier B, I’ve found likewise that the anomalies persist, even though they can vary in magnitude between Currier languages.  For example:

  • Pairs of words in the form Sh…_q… are 183.7% (but 160.9% in Currier A and 190.0% in Currier B) as common as expected based on individual word frequencies, while pairs of words in the form d…_q… are only 80.9% (but 109.0% in Currier A and 77.1% in Currier B) as common as expected.

A difference is present in both languages, but it’s stronger in Currier B.  Or let’s examine a few word-break combinations:

  • y_q: Currier A (425) = 152.6% vs. Currier B (3015) = 165.1%; overall (3440) = 170.1%
  • dy_qo: Currier A (104) = 215.2% vs. Currier B (1777) = 189.8%; overall (1882) = 213.6%
  • ey_qo: Currier A (134) = 158.0% vs. Currier B (778) = 160.2%; overall (912) = 164.2%

I’m not sure why the overall total for dy_qo (1882) is one higher than the sum of Currier A and Currier B (1777+104=1881)—some rarely-triggered defect in my Python scripting?  Even so, in this case, the frequency of dy_qo appears to exceed expectations more than the frequency of ey_qo in both languages, but more strongly in Currier A.  It seems that preferences among word-break combinations extend beyond single glyphs at the ends and beginnings of words even when only material from one or the other language is being analyzed.

But the manuscript can be broken down into yet more granular groupings than Currier A and Currier B, a point Gordon Rugg illustrates here by applying Search Visualizer to display the distribution of dam, qo, dy, and ol.  Zandbergen has broken the text down here into groupings Herbal-A, Pharma-A, Herbal-B, Stars-B, Stars-Bio, Biological-B; and more recently here into ten distinct categories based on bigram distribution, including an identification of languages “C” and “D” as well as the intriguing observation that the Currier A and Currier B “clusters” are bridged by material on foldout bifolios.

So could the statistical anomalies we’ve been exploring be accounted for by variations among these more granular groupings?

To explore this possibility, I turned to what I gather consensus holds to be the most homogeneous section in the whole manuscript—the Biological-B section comprising Quire 13, spanning f75 through f84—and ran some of the same analyses as before, keeping the 35-token threshold (although this presents a higher “bar” the fewer overall word tokens we’re dealing with).  Word breaks à la Smith and Ponzi still show definite preferences, with r_a (389.0%), n_ch (208.4%), r_Sh (203.5%) at one extreme and y_ch (51.5%), y_Sh (44.7%), l_q (39.6%) at the other.  Pairs of word beginnings and word endings continue to show preferences as well, although oddly enough, the tendency of the same glyphs to recur most preferentially at the end or beginning of consecutive words virtually disappears.  A couple case studies:

  • A little over 26% of words begin with q.  But we can contrast the probability of a word beginning with q after a word beginning with Sh (46.1%), y (33.6%), l (30.6%), ch (30.3%), q (24.6%), d (22.4%), o (20.8%), s (17.1%); or before a word beginning with r (36.7%), ch (35.3%), Sh (29.4%), l (26.4%), d (26.3%), o (25.5%), q (24.6%), (23.5%).  Sh and r continue to occupy the respective top positions, with Sh- before q- continuing to be the most strongly divergent from expectations.
  • About 55.3% of words end with y.  But we can contrast the probability of a word ending with y after a word ending with d or n (58.7%), r (56.7%), l (56.0%), y (55.1%); or before a word ending with (63.9%), (62.3%), (57.7%), s (55.4%), (54.9%), (54.6%), r (53.9%).  This may not seem like much of a spread, but it still shows features in common with the results of the earlier studies, e.g., and n continue to be the most common other glyphs pairing with y, while scores higher in such pairings as the ending of the first word than the second word.

Here, for good measure, is another set of results based on an analysis of just the first three quires of the Voynich Manuscript, which make up a solid block of Currier A.

  • 8.8% of words begin with q.  But we can contrast the probability of a word beginning with q after a word beginning with q (14.6%), d (11.6%), o (8.4%), ch (7.0%); or before a word beginning with q (14.6%), ch (11.0%), d (9.3%), o (7.1%).
  • Just over 32% of words end with y.  But we can contrast the probability of a word ending with y after a word ending with y (39.6%), n (33.2%), o (29.5%), (28.3%), l (26.4%); or before a word ending with y (39.3%), n (33.1%), o (32.2%), r (26.7%), l (24.6%).

Once again, we see distinct preferences.  But they differ from the preferences we find when we analyze the whole manuscript, or when we analyze just the Biological-B section.  So variation among sections of the manuscript doesn’t explain away the patterns of preference we’ve been examining, but at the same time it still appears to affect them and, hence, to complicate their study.

Another complicating factor might be variation among words based on their positions within lines.

As can be seen in one of the charts I shared above, lines are more likely to begin with d, y, o, and s, and to end with m, than mid-line words are.  The glyph is about equally common at the beginnings of lines and mid-line words, but as I observed earlier, lines ending with are only about half as likely to be followed by lines beginning with as we’d expect from their respective quantities—based on a study of the manuscript as a whole, and without paying attention to any of the distinctions between sections we’ve just been examining.

But it’s not merely that particular glyphs are more commonly found next to line breaks, or in preferential line-break combinations; line-initial words also have other distinctive properties besides.  As we’ve already seen, they’re longer on average (5.25 EVA characters) than mid-line words (4.96 EVA characters).  There has been some effort to analyze the first glyphs of lines as separate elements that aren’t equivalent to the first glyphs of mid-line words (a line of investigation associated with Philip Neal, although I’ve only seen his observations reported at second hand and so don’t feel I’m in a position to do justice to them).  On the other hand, the word-length statistics for line-initial and mid-line words differ by only 0.29 of an EVA character, which wouldn’t be consistent with a breakdown of all line-initial words as mid-line words plus extra “line prefixes.”  Meanwhile, line-final words (with an average length of 4.61 EVA characters) differ by 0.35 in the opposite direction, which just about makes up the difference, although there appears to be no correlation between specific instances of shorter line-ending words and longer line-initial words.  Closer examination shows that two- and three-character words in particular make up a considerably higher proportion of line-final words than they do of words in other line positions.  Single-character words seem to make up an equal proportion of all line positions, while words with eight or more characters seem slightly to favor beginnings and ends of lines over mid-line positions.

Word length proportions by line position, as reckoned in EVA characters

I wasn’t able to confirm Elmar Vogt’s “second word effect,” according to which the second words of lines are shorter than average.  But that’s not to say there aren’t distinctive patterns associated with the second words in lines.  Emma May Smith reports that—in Quire 20, comprising the Stars section at the end of the manuscript—there are not only certain words that strongly favor a line-initial position (sain, saiin, sar, etc.), but certain other words that strongly favor the second position (Shey, Sheol, ain, etc.).

One difficulty in the way of studying word positions within lines is that the quantity of words per line varies quite a lot, so that the fifth word (for example) will be near the beginning of some lines and near the end of others.  So to factor out variation in line lengths, I associated each word with a fractional value between 0 (first word in line) and 1 (last word in line).  If I then take the average value for some word or other feature, a higher number will indicate a preference for the ends of lines, while a lower number will indicate a preference for the beginnings of lines.  For this particular study I’ve ignored comma breaks.
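
Here is a minimal Python sketch of that positional measure.  It assumes the fractional value is the word's index divided by the number of words in the line minus one, which is one natural reading of "0 for the first word, 1 for the last" but not necessarily the exact formula used; it also skips one-word lines, another assumption.  The function name and the toy lines are hypothetical.

```python
from collections import defaultdict


def mean_line_positions(lines):
    """lines: list of lines, each a list of words.
    Returns {word: average fractional position}, where 0 is the first
    word of a line and 1 is the last."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for words in lines:
        if len(words) < 2:
            continue  # a one-word line is both first and last, so skip it
        for index, word in enumerate(words):
            sums[word] += index / (len(words) - 1)
            counts[word] += 1
    return {word: sums[word] / counts[word] for word in sums}


# Hypothetical lines, for illustration only.
toy_lines = [["Shol", "chol", "daiin"],
             ["Shol", "daiin", "chol", "dam"]]
for word, position in sorted(mean_line_positions(toy_lines).items()):
    print(f"{word}: {position:.3f}")
```

To average over a feature rather than a single word (all words ending -m, say), the same loop works with the feature used as the dictionary key instead of the word.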

Some initial glyph combinations turn out to show a preference for the beginnings of lines, as with ySh- (0.049), ych- (0.099), dSh- (0.134), and oy- (0.194); while others show a preference for the ends of lines, as with ld- (0.850), ll- (0.768), and lt- (0.769).  The single-glyph words g and m are always line-final, and words ending -g average 0.870, while words ending -m average 0.859, which comes as no great surprise.

Meanwhile, some morphologically related series of words show consistent deflections in average line position, such as this:

  • Sheol (0.378), cheol (0.421)
  • Sheor (0.361), cheor (0.444)
  • Shey (0.451), chey (0.466)
  • Sho (0.328), cho (0.438)
  • Shol (0.373), chol (0.414)
  • Shor (0.318), chor (0.421)
  • Shear (0.493), chear (0.522)
  • ShecKhy (0.417), checKhy (0.513)
  • Shdy (0.491), chdy (0.581)
  • Shedy (0.481), chedy (0.555)

The variant with Sh consistently has an earlier average position within lines than the variant with ch.  And I haven’t cherry-picked results; these are all the cases I’ve examined.
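One rough way to put a number on "consistently" is a sign test over the ten pairs just listed.  The sketch below uses only the averages given above and asks how often ten fair coin tosses would all land the same way.  It treats the pairs as independent, which morphologically related words drawn from the same text almost certainly are not, so the result is best taken as an illustration rather than a significance claim.

```python
from math import comb

# (Sh variant, ch variant) average line positions, copied from the list above.
pairs = {
    "Sheol/cheol": (0.378, 0.421),
    "Sheor/cheor": (0.361, 0.444),
    "Shey/chey": (0.451, 0.466),
    "Sho/cho": (0.328, 0.438),
    "Shol/chol": (0.373, 0.414),
    "Shor/chor": (0.318, 0.421),
    "Shear/chear": (0.493, 0.522),
    "ShecKhy/checKhy": (0.417, 0.513),
    "Shdy/chdy": (0.491, 0.581),
    "Shedy/chedy": (0.481, 0.555),
}

n = len(pairs)
sh_earlier = sum(1 for sh, ch in pairs.values() if sh < ch)

# Chance of at least `sh_earlier` agreements out of n if each pair were a fair
# coin toss (assumption: pairs are independent, which is doubtful here).
p_one_sided = sum(comb(n, k) for k in range(sh_earlier, n + 1)) / 2 ** n

print(sh_earlier, n, p_one_sided)  # 10 10 0.0009765625
```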

J. K. Petersen has observed that words ending in an, ain, aiin, aiiin tend to appear at a later position in lines than words ending in on, oin, oiin, oiiin.  My measurements confirm this, presenting a distinct contrast between words ending in (or consisting wholly of) aiiin (0.586) and oiiin (0.412); aiin (0.487) and oiin (0.331); ain (0.489) and oin (0.363); and an (0.693) and on (0.433).  But a like contrast extends further to words ending in (or consisting wholly of) ar (0.502) versus or (0.403); al (0.548) versus ol (0.440); as (0.562) versus os (0.412); am (0.884) versus om (0.778); ay (0.532) versus oy (0.365); and a (0.564) versus o (0.391).  Words ending oiiin, oiin, oin, on, or, ol, os, om, oy, and o tend to appear closer to the beginnings of lines than words ending aiiin, aiin, ain, an, ar, al, as, am, ay, and a, not necessarily by much, but with remarkable consistency.  Individual words follow the same pattern, e.g., char (0.479) versus chor (0.421); chal (0.557) versus chol (0.414); chaiin (0.542) versus choiin (0.446); dar (0.540) versus dor (0.408); dal (0.642) versus dol (0.488); daiin (0.498) versus doiin (0.312), etc.  The same distinction appears to affect appearances of a/o deeper within words: thus, the words ry and ary appear exclusively at the ends of lines, and the average for all other words ending -ary is 0.858; but the word ory averages 0.812, and the average for all other words ending -ory is 0.786.  We can even contrast longer words beginning al- (0.640) and ol- (0.560); ar- (0.719) and or- (0.609).  The a variant always has a later average position than the o variant, however slight the difference.

Another curious pattern is that common elements tend to have later average positions in lines as self-standing words than they do as the endings of other words.  Thus, we can contrast the word dy (0.716) with words ending -dy (0.506); ar (0.566) with -ar (0.494); al (0.589) with -al (0.543); ol (0.504) with -ol (0.431); or (0.455) with -or (0.395); chy (0.504) with -chy (0.453); chol (0.414) with -chol (0.373).  Much of the ain/oin series follows this pattern as well, but with the rare exceptions ain (0.456) versus -ain (0.491), and aiiin (0.560) versus -aiiin (0.597).  By the same token, words beginning with q tend to have earlier average positions in lines than their q-less equivalents.  Witness the averages of all words beginning with (or consisting wholly of) qol (0.451) versus ol (0.545); qod (0.461) versus od (0.539); qok (0.472) versus ok (0.534); qot (0.512) versus ot (0.590); qop (0.575) versus op (0.608).  This same distinction also plays out through numerous specific word pairs, such as these:

  • qokain (0.483), okain (0.521); qotain (0.541), otain (0.628); qodain (0.389), odain (0.610)
  • qokaiin (0.457), okaiin (0.482); qotaiin (0.515), otaiin (0.598); qodaiin (0.451), odaiin (0.517)
  • qokal (0.498), okal (0.557); qotal (0.589), otal (0.625); qodal (0.422), odal (0.439)
  • qokar (0.470), okar (0.528); qotar (0.593), otar (0.611); qodar (0.494), odar (0.595)

Meanwhile, individual word pairings don’t seem much affected; for example, qokain.okain (5) and okain.qokain (6) occur with near-equal frequency.  These patterns emerge only when we view things in the aggregate.  Some mysterious factor in the text creation process has resulted in these words beginning with qo- appearing, on average, a little earlier in lines than the equivalent words beginning with o-; and in the other similar patterns noted above; and doubtless in others besides.

If we dig deeper into individual cases, we find that differences in line-position preference can play out differently.  In each of the following bar charts, the leftmost column represents first words in lines, the rightmost column represents last words in lines, and the columns in between represent groupings of fractional values (the first and last of which are usually poorly represented because only lines with more than ten words can yield fractional values between 0 and 0.1 or between 0.9 and 1).  Sometimes the most conspicuous difference lies in a stronger tendency of one word to appear line-initially, as with qokaiin (blue) versus okaiin (orange):

And sometimes it lies in a stronger tendency of one word to appear line-finally, as with qodain (blue) versus odain (orange):

But sometimes the difference hinges more on the interior of the line, as with qodaiin (blue) and odaiin (orange), where the latter word shows a stronger tendency to appear very near (but not quite at) the end of a line (as well as to begin or end a line, but those two extremes cancel each other out when it comes to calculating the overall “score”):

Here’s a similar chart for ar (blue) and or (orange), where the first two-thirds of the line mostly favors or, while the last third favors ar:

And here’s a chart for all words beginning with—or consisting wholly of—qot (blue) and ot (orange), in which ot- is more common in every position but the first, but in which its proportion is conspicuously higher in the latter half of the line:

We can also plot absolute word numbers instead of groupings of fractional positions relative to line length.  The results tend to look smoother that way, and if any significant patterns are associated specifically with the second, third, etc. words in lines, rather than with proximity to the beginning or end, this will do a better job of exposing them.  Here’s a chart for qot- and ot- organized in that alternative way, and I’m not sure at this point which approach is more informative.
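The two ways of tabulating positions described above can be sketched as follows.  This is a hedged illustration rather than the code behind the charts: it assumes period-separated words with comma breaks deleted, uses ten interior groupings by default, and skips one-word lines.  The function names and toy lines are invented.

```python
from collections import Counter


def split_words(line):
    # Assumption: period-separated words, comma breaks deleted.
    return [w for w in line.replace(",", "").split(".") if w]


def fractional_columns(lines, matches, n_bins=10):
    """Return n_bins + 2 counts: line-initial, n_bins interior groupings
    of fractional position, then line-final."""
    counts = [0] * (n_bins + 2)
    for line in lines:
        words = split_words(line)
        last = len(words) - 1
        if last < 1:
            continue  # assumption: skip one-word lines
        for index, word in enumerate(words):
            if not matches(word):
                continue
            if index == 0:
                counts[0] += 1
            elif index == last:
                counts[-1] += 1
            else:
                # Integer arithmetic for floor(index / last * n_bins).
                counts[1 + (index * n_bins) // last] += 1
    return counts


def absolute_columns(lines, matches):
    """Count matches by absolute word number (1 = first word in the line)."""
    counts = Counter()
    for line in lines:
        for number, word in enumerate(split_words(line), start=1):
            if matches(word):
                counts[number] += 1
    return counts


# Invented toy lines, not quotations from the manuscript.
toy_lines = ["qotar.daiin.chol.otar", "daiin.qotal.chol.dar.otal", "otar.qotar"]
print(fractional_columns(toy_lines, lambda w: w.startswith("qot")))
print(absolute_columns(toy_lines, lambda w: w.startswith("ot")))
```

Dividing each column by the total number of words falling in that column would turn the raw counts into proportions, which matters when comparing a common word with a rare one.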

In any case, the statistical differences I’ve been describing don’t seem to be merely a consequence of differences in line-initial and line-final word frequencies.  Rather, their effects can be found spread out across whole lines.  And the existence of such line-position preferences adds more fodder to the LAAFU (“line as a functional unit”) fire, although the techniques I’ve outlined so far are admittedly better at detecting anomalous patterns than they are at analyzing or characterizing them further.

Still, what could be causing line-position preferences?  Timm would presumably argue that they arose because words in particular line positions were routinely being copied and modified from words in similar line positions.  However, some of the preferences involve word elements he associates with the modifications themselves.  As far as I’m aware, the self-citation hypothesis doesn’t provide a clear mechanism for that.  I suppose we could hypothesize that both the “main” words and the “modifications” to them were being copied from similar line positions simultaneously, or that modifications were constrained somehow by position.  But that starts to sound rather cumbersome for a hoaxing strategy, which is what Timm concludes the self-citation process must have been.

I’ve speculated that lines are more self-contained than mid-line words but less self-contained than paragraphs.  I’ve also speculated that Voynichese text is structured cyclically; that the elements in given slots can repeat or “drop out” over the course of multiple cycles; and that different slots within the cycle operate somewhat independently of each other, as can be illustrated by “cycle charts” like the ones I shared earlier or by other visualizations such as these:

This kind of structure can be fairly obvious when it gives rise to repetitive-looking word pairs, extended Jackson sequences, or internally repetitive words such as araral.  But I suspect it could be the organizing principle behind all Voynichese text, and the underlying cause of the word structure.  If so, the line might be a “functional unit” in the sense that it forms a larger meta-cycle of this kind, made up (like a Jackson sequence) of smaller word-sized cycles, such that similar situations tend to arise at similar points within its arc, giving rise to positionally distinctive word structures.

When I laid out my hatchmark-centric word paradigm above, I proposed that o and a were equivalent, with the latter more common before minims (e.g., aii…).  Since then, we’ve seen a number of contexts in which these two glyphs could plausibly be interchangeable, e.g., common label-words such as okal / okol and otaly / otoly, and glyph combinations with frequently ambiguous or inconsistent spacing such as r_a / r_o, s_a / s_o.  But words containing o and a also display different line-position preferences; specifically, o prefers an earlier position than a.  Maybe that shows that I’m wrong after all about a and o being equivalent.  But the two glyphs truly can be hard to tell apart, so perhaps the statistical difference instead reflects something in the dynamics of composing a line of text.  My proposed rule of curvelet priority stipulates that minims should follow after curvelets; and while I introduced it as part of a “word grammar,” it could just as well be proposed as a rule of cycles within lines.  I’ve also suggested that words were sometimes written in multiple stages, as for example with the skeleton choi being augmented by a flourish (maybe upon further reflection) into the finalized word chor.  As a corollary to the rule of curvelet priority, minims might have been less anticipated earlier in a line, such that the writer would more often have written o there by default without yet knowing what would come next, even if it might turn out to be ii…; but more anticipated later in a line, such that the writer would more often have written aii… there fluently from the start.  Or something in the process of composing each line might have become easier as it went along, building up momentum as with a partly-solved puzzle, such that its pieces could have been written in more fluent chunks, and less haltingly, right up until the last handful of glyphs fell into place at the right margin.

If words with particular features favor particular positions in lines, that would naturally result in preferential patterns of word pairing as well, although when I briefly tried analyzing word-pair combinations in just the middle parts of lines in an effort to remove line position as a factor, quirky patterns continued to appear.

The foregoing may have some further implications for understanding labels.  As I observed earlier, words that are common as labels but less common as paragraphic words tend to end in -ly (otaly, otoly, okaly), -ldy (okaldy, otaldy, otoldy), -ody (okeody, okody, otody), -chdy (otchdy, okchdy), or -ry (okary, dary).  The endings -ody and -chdy show no strong line-position preferences, but the other endings in this group all show a decisive preference for the ends of lines (perhaps from the same cause that puts them at the ends of labels):

  • ry as self-standing word, 1; ory, 0.813; all words ending -ry including the foregoing, 0.790
  • ldy as self-standing word, 0.924; oldy, 0.787; all words ending -ldy including the foregoing, 0.721
  • ly as self-standing word, 0.806; oly, 0.869; all words ending -ly including the foregoing, 0.723

Meanwhile, words that are relatively common both as labels and as paragraphic words tend to end in -edy (otedy, okedy, oteedy) or -l (okal, okol, otal, otol, qokal).  The first of these endings, -edy, is notorious as a distinctive feature of Currier B, and Currier A folios never contain labels of this type.  There don’t appear to be any comparably common Currier A label types that are absent from Currier B—or at least I don’t spot anything obvious.  However, within Currier A, words that are common both as labels and as paragraphic words would seem to be few indeed.

Another positional detail that’s sometimes commented upon is Philip Neal’s observation, quoted earlier, that words ending in a gallows “tend to occur rather more than half way across the first line of a page.”  The text sequences between paragraph-initial gallows and these gallows are sometimes referred to as Neal Keys, although Neal himself writes that he’s “not committed to the idea that such gallows end a key sequence” but believes “that these gallows are evidence that the text displays a kind of periodicity above the level of individual words.”   How strong is this pattern?  I count 126 paragraphic words ending in gallows, of which 24 (about 19%) appear in the first line of the first paragraph.  Only about 5% of all paragraphic words appear in the first line of a first paragraph on a page, so 19% does indeed constitute a noteworthy preference.  On the other hand, the line position average for these examples is 0.380, with some being line-initial but none being line-final (also true of 14 additional cases in the first lines of later paragraphs); and even if we disregard line-initial cases, the line position average is 0.480; none of which translates into “rather more than half way across the first line.”  So I’m not sure about that part.

§ 7

Numerical Manipulations

I haven’t tried developing my hatchmark-centric word paradigm any further than I have because I just don’t feel the analysis of discrete words is all that promising.  But I’m not sure what the best alternative to a word paradigm would be.  I’ve been toying with the idea of using the annual cycle as a metaphor for the structure of Voynichese text, much as Stolfi used a core-mantle-crust metaphor for one of his.  Curvelet sequences could be associated with planting, and minim sequences with harvesting.  The loopdown flourish (y, l) could be associated with winter.  Much as the new year falls during the winter season, word breaks occur in winter, so that “winter glyphs” tend to appear near word boundaries, either at the beginnings or ends of words.  Gallows could be associated with the summer solstice, which may occur before, during, or after planting.  Successive seasons can see the same events happening for two or more years in a row, or events can alternate as with the rotation of crops, or maybe following an even longer schedule as with cicada broods.  Sometimes there’s a bad year with no harvest, but there’s scarcely ever a harvest without any planting.  Sometimes the summer solstice isn’t celebrated, in which case the following new year is less likely to be acknowledged as such.  It’s a fun metaphor, and it might even work decently in practice, exposing “entanglements” across cycles even more starkly than word-based analysis has.

Of course, I still couldn’t say what any of this means.  But I’d like to wrap up this post by indulging in a bit of more out-on-a-limb speculation, even if that means showing my hand as far as what kind of solution I’d prefer.

I suggested earlier that Voynichese feels textured to me much like the sums of money I’ve seen written in medieval and early modern Roman numerals, and after all the foregoing, I still think it’s likely that there’s something fundamentally numerical about it.  Among other things, some sort of numerical scheme would play well with Stolfi’s observations that the lengths of Voynichese words (not of word tokens, but of distinct word types) have a binomial distribution that “suggests that the length of a word chosen at random from the lexicon is 1 plus the sum of nine random binary variables,” and that the probabilities of elements such as gallows or ch in any given word seem to be independent of one another.  It would also fit nicely with Timm’s observation (here, p. 6) that “in most cases all conceivable spelling permutations of a glyph group exist,” since it’s arguably more plausible for the existence of 260 reliably to predict the existence of 261, 262, 263, 264, 265, 266, 267, 268, and 269, than it is for the existence of up, sup, cup, cut, shut, but, hut, nut reliably to predict the existence of ut, sut, shup, bup, hup, nup.  Others have suspected a numerical basis for Voynichese as well; for one nice example, see J. K. Petersen’s reasoning here.
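To make the quoted model concrete: if a word type’s length is 1 plus the number of successes among nine independent binary variables, then the lengths run from 1 to 10 with binomial weights.  The sketch below tabulates that distribution.  The quotation doesn’t say how likely each variable is to come up 1, so the default p = 0.5 is purely an assumption for illustration, and none of the printed figures should be mistaken for Stolfi’s measurements.

```python
from math import comb


def length_distribution(n_vars=9, p=0.5):
    """P(length = 1 + k), where k counts successes among n_vars independent
    binary variables that each come up 1 with probability p (p is assumed)."""
    return {
        1 + k: comb(n_vars, k) * p ** k * (1 - p) ** (n_vars - k)
        for k in range(n_vars + 1)
    }


dist = length_distribution()
mean_length = sum(length * prob for length, prob in dist.items())
for length, prob in dist.items():
    print(length, round(prob, 4))
print("mean length:", mean_length)  # 1 + 9 * 0.5 = 5.5 under the assumed p
```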

The numerical solutions I’ve seen proposed elsewhere have centered on speculation about code-books or tables in which words, syllables, or whatever could have been looked up.  However, the Wikipedia article on the Voynich Manuscript states that “book-based ciphers would be viable for only short messages, because they are very cumbersome to write and to read.”  That statement is followed by a “citation needed” notice, so perhaps it’s not trustworthy, but it rings true, at least for anything with more independently-assigned values than a numerical cipher syllabary.

Medieval finger-counting as depicted in Filippo Calandri, Aritmetica (Florence, 1491/92), copy auctioned by Sotheby’s.

That issue of cumbersomeness is worth taking seriously.  But some other processes that could appear cumbersome to us in the twenty-first century might not have seemed so in the fifteenth, when prevailing methods for manipulating numbers were quite different from anything we’re familiar with today.  It’s worth considering what skill sets and points of reference would have been commonly available for crunching numbers quickly and efficiently at the time the Voynich Manuscript was created.

Finger-counting then followed protocols inherited from classical antiquity that may now seem unintuitive—see the plate to the left—but that functioned in tandem with widespread methods of finger-computation that are now apparently difficult to reconstruct.  Or at least they are for ancient times, to judge from secondary literature; I’m not sure whether better documentation might be available for late medieval or early modern finger-calculation techniques.

From Gregor Reisch, Margarita Philosophica (1503), Houghton Library, Harvard University.

Meanwhile, more elaborate calculations were made by moving coin-like jetons around counting-boards marked out with horizontal lines that stood respectively for ones, tens, hundreds, and thousands, while a single jeton in the spaces between would stand for five, fifty, or five hundred—see the illustration to the right.  Note that the lines and spaces in this scenario correspond neatly to Roman numerals, the lines representing I, X, C, M, and the spaces V, L, D.  However, a similar arrangement could be used for reckoning sums of money with non-decimal bases, as with 1 livre or pound = 20 sous or shillings, 1 sou or shilling = 12 deniers or pence.  Before jetons, pebbles (calculi) had been used as counters instead, giving rise to the very term calculation.  Sections of the table separated by vertical lines could be assigned variously to operands or results for addition, subtraction, multiplication, or division, with more or less elaborate procedures taught for carrying out each operation.  During the same period, some mathematically sophisticated games were also played for learning or leisure, such as the board game Rithmomachy.  We shouldn’t underestimate the ease with which at least some fifteenth-century writers or readers would have been able to carry out basic mathematical operations on their hands, on counting-boards, or in their heads.
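The correspondence between board positions and Roman numerals can be sketched in a few lines of Python.  This models only the decimal layout described above, with lines for I, X, C, M and spaces for V, L, D; it says nothing about the procedures actually taught for moving jetons around, and the function names are made up for illustration.

```python
# Board positions from top to bottom: lines (M, C, X, I) and the spaces
# between them (D, L, V), each with the value of a single jeton placed there.
POSITIONS = [("M", 1000), ("D", 500), ("C", 100), ("L", 50),
             ("X", 10), ("V", 5), ("I", 1)]


def lay_out(number):
    """Return how many jetons sit on each line or space to represent number."""
    board = {}
    for label, value in POSITIONS:
        board[label], number = divmod(number, value)
    return board


def read_off(board):
    """Total value of the jetons on the board."""
    return sum(board[label] * value for label, value in POSITIONS)


board = lay_out(1491)  # an arbitrary example number
print(board)  # {'M': 1, 'D': 0, 'C': 4, 'L': 1, 'X': 4, 'V': 0, 'I': 1}
print(read_off(board))  # 1491
```

Laid out this way, a space never holds more than one jeton and a line below the thousands never more than four, which is the same economy that the numerals V, L, and D provide in writing.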

During the same period, everyday usage of Roman numerals differed from both ancient practice and the “restored” practice that came out of the Renaissance.  Although v and l were generally used for 5 and 50, the numerals xx and c were often deployed as though they were units on a par with livres or sous.  Thus, the number eighty was conventionally written iiijxx in late medieval and early modern French documents, reflecting the linguistic expression quatre-vingt, i.e., “four score.”  In page numbering, I’ve seen iiijxxix (99) followed by jc (100), while years were routinely written beginning mil iiijc for 1400 or mil vc for 1500.  In these contexts, iiij behaved less like a “classic” Roman numeral than like the Arabic numeral 4, with its “place value” depending simultaneously on its position in the sequence and on the following superscript.  One occasionally finds secondary references to other “medieval Roman numerals” besides, such as A = 5, S = 7, O = 11, but I never seem to run into those in practice.

Now-familiar conventions of mathematical writing such as plus (+) and minus (-) signs had yet to take root in the early fifteenth century, but there were some other specialized notations that handled these operations in ways other than spelling them out in words or drawing on the conventions of addition and subtraction built into Roman numerals themselves (XI = X plus I, while IX = X minus I).  David A. King briefly describes one in The Ciphers of the Monks (2001), at pp. 335-6: the “Röderzeichen” used since at least the fourteenth century to mark volumes on wine barrels, as reckoned in primary units (such as aimes) and secondary units (such as setiers), where sub-units to be added were written as attachments below the main notation and sub-units to be subtracted were written as attachments above:

Unless I’m missing something, there doesn’t seem to be any straightforward glyph-by-glyph mapping between Voynichese and “standard” Roman numerals that makes any kind of consistent sense.  But even so, Voynichese could still reflect fairly ordinary techniques of calculation and number-writing.

Jetons: Burgundian Netherlands, ca. 1400; Tournai, second half of 15th century; Nuremberg, ca. 16th century (author’s collection).

Consider a hypothetical system in which vowels were assigned to the base-seven numbers 1 = a, 2 = e, 3 = i, 4 = o, 5 = u, 6 = y, and consonants to the base-seven numbers 10 = b, 20 = d, 30 = f, 40 = g, 50 = h, 60 = k, 100 = l, 110 = m, 120 = n, 130 = p, 140 = r, 150 = s, 160 = t, 200 = z.  An initial ciphertext number 53 would represent HI; the reader would place five jetons in a sevens (7¹) row and three in a ones (7⁰) row on a counting-board.  Say the next ciphertext number is 25; the reader might add two jetons to the sevens row and five to the ones row, resulting in a total of seven jetons in the sevens row and eight in the ones row.  Seven of the eight jetons in the ones row would be converted into one jeton in the sevens row, which would now contain eight jetons; and seven of those jetons would be converted into one jeton in the forty-nines (7²) row, leaving one jeton in each row and resulting in the base seven number 111, representing MA.  This process may sound complicated, but anyone who used a counting-board on a regular basis should have been able to do it almost without thinking.  The techniques and tools involved are precisely the same ones somebody would have used to add sums of money; no special device such as the Alberti cipher disk of 1470 would have been needed.  It’s easy to imagine the scheme being elaborated further to allow for subtracting rather than adding, clearing the table and starting over, enciphering just a consonant or a vowel, reversing the order (MA to AM), adding a final consonant to the syllable from a limited set of options, and so on.  The ciphertext notation would have needed conventions for representing quantities of jetons, their place values (corresponding to lines on the board), whatever operations were permitted (addition, subtraction, clearing, etc.), and breaks between expressions.  Instead of five i’s making a v and two v’s making an x, the notation might instead have seven i’s make an e, closely reflecting the structure of the cipher itself.
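Here is a minimal sketch of the reader’s side of that hypothetical scheme, covering only the basic “add and carry” operation and none of the elaborations.  The board is modeled as a list of jeton counts per row (ones, sevens, forty-nines), a carry exchanges seven jetons in one row for a single jeton in the next row up, and the function names are invented.  Because the consonant values 10, 20, … 60, 100, 110, … 200 are simply the first fourteen multiples of seven written in base seven, the consonant can be looked up from the running total divided by seven and the vowel from the remainder.

```python
VOWELS = "aeiouy"              # base-seven 1, 2, ..., 6
CONSONANTS = "bdfghklmnprstz"  # base-seven 10, 20, ..., 60, 100, 110, ..., 160, 200


def add_to_board(board, number):
    """Return a new board after adding a base-seven number such as "25".

    board[0] counts jetons in the ones row, board[1] in the sevens row, etc.
    """
    digits = [int(d) for d in reversed(number)]
    rows = list(board) + [0] * max(0, len(digits) - len(board))
    for row, digit in enumerate(digits):
        rows[row] += digit
    row = 0
    while row < len(rows):
        if rows[row] >= 7:
            # Exchange every seven jetons for one jeton in the next row up.
            if row + 1 == len(rows):
                rows.append(0)
            rows[row + 1] += rows[row] // 7
            rows[row] %= 7
        row += 1
    return rows


def read_syllable(board):
    """Read the board as consonant + vowel (either may be absent)."""
    total = sum(count * 7 ** row for row, count in enumerate(board))
    consonant, vowel = divmod(total, 7)
    letters = CONSONANTS[consonant - 1] if 1 <= consonant <= len(CONSONANTS) else ""
    letters += VOWELS[vowel - 1] if vowel else ""
    return letters.upper()


board = add_to_board([], "53")
print(board, read_syllable(board))  # [3, 5] HI
board = add_to_board(board, "25")
print(board, read_syllable(board))  # [1, 1, 1] MA
```

Totals beyond 206 in base seven fall outside the table of letters, and this sketch just returns no consonant for them; a real scheme would presumably have cleared the board or subtracted before that point.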

In actual Voynichese, we find four bare hatchmarks in a row nine times with curvelets, including oeeees (f7r) and okeeeey (f86v6), and once with minims, in oiiiin (f77r).  If we count the final glyph in each case as a hatchmark plus a flourish, these words would actually contain five hatchmarks in a row; and if we interpret o as containing the first hatchmark, then there would be six in a row.  In connection with ch, Sh, it seems as though the same upper limit might pertain to curvelets on one or another side of the bar; thus, we find cheee (as in cheeey), but never cheeee; Sheee (as in Sheeey), but never Sheeee; eeech (uniquely in qokeeechy), but never eeeech.  More curvelets can also appear in a row if flourishes are added internally (e.g., dchedshey, chey,dychy), and more yet if gallows are inserted (e.g., s,ypchey,pchedy, dchedy,tchddy); and that’s based only on an analysis of words (disregarding comma breaks), not text irrespective of word breaks.  But wherever we conclude that the number of hatchmarks in a group tops out, that number could be significant as one less than the number of smaller units required to make up one larger unit (which would then have been written in some other way).

Or it might just correspond to the number of entries or columns (or whatever) in some kind of key.  But the points I’d like to make are (1) that Voynichese could be fundamentally numerical even if it doesn’t use a standard number system; and (2) that it’s worth bearing in mind the standard methods for manipulating numbers in the fifteenth century, such as performing calculations on a counting-board, regardless of whether there’s any evidence that they were ever applied to cryptography.

I’m torn over whether Voynichese could plausibly have been written “cumulatively”—that is, with the interpretation of each new segment depending on what had gone before, and—in the case of numbers—with a need to keep a running tally.  The counting-board scenario I described above fits this description, as do the checkerboard ciphers I wrote about in an earlier post.  Whatever system was used would also have had to accommodate isolated labels with nothing preceding to build on, but maybe that could account for the distinctive properties of “labelese” (ok-, ot-, etc.).  Columns (f49v, f76r) and rings (f57v) of isolated glyphs are admittedly another kettle of fish altogether, and I don’t have any related suggestions for what to do with those.  But a cumulative process could also have caused patterns within longer segments, such as paragraph-initial gallows or the various line-position preferences we’ve looked at, especially if an effort were made to zero things out between lines to mitigate the impact of mistakes.  It might have taken more time on average to build up to certain word elements, for example.

On the other hand, all that pervasive repetition we examined earlier doesn’t seem to be a very good fit for a scenario in which the reader needs to add each new element in the text to a running tally.  Instead, it often looks more as though the text itself could be presenting a running tally from cycle to cycle, such that the signal might lie in the differences between the contents of each cycle, something I suppose could also have been worked out easily enough with jetons on a counting-board by following the usual protocols for subtraction.  (I see that there’s been some prior discussion of the differential encoding idea here.)  Repeating one element of a cycle identically, or dropping it from one or more consecutive cycles, could have conveyed a distinctive meaning as well.  That would be one way to explain the repetition, and through it, perhaps the logic of the system as a whole.
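The two readings can be set side by side in a short sketch.  In the “cumulative” reading, the written values are increments that the reader adds to a running tally; in the “differential” reading, the written values are themselves the tally, and the reader recovers the signal by subtracting each value from the next.  The numbers are invented, and treating each cycle as a single integer is a drastic simplification made only to show that the two operations are mirror images of one another.

```python
from itertools import accumulate


def cumulative_reading(written):
    """Treat the written values as increments and return the running tally."""
    return list(accumulate(written))


def differential_reading(written):
    """Treat the written values as a running tally and return the increments
    (the first value is kept as-is, since nothing precedes it)."""
    return written[:1] + [b - a for a, b in zip(written, written[1:])]


increments = [38, 19, 0, 5]  # invented values; 0 stands for an exact repetition
tally = cumulative_reading(increments)
print(tally)                        # [38, 57, 57, 62]
print(differential_reading(tally))  # [38, 19, 0, 5]
```

Note that an element repeated identically from one cycle to the next shows up as a zero under the differential reading.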

I’m not completely happy with either hypothesis, but for me the patterns described above still hint at something vaguely along these lines.

6 thoughts on “Ruminations on the Voynich Manuscript”

  1. Hi, Patrick:

    Wanted to make sure to thank you for sharing all of these ideas. It is obvious that this was a ton of work and having it available as a resource is much appreciated. After I get through all of it, I’ll come back with any questions — but wanted to be sure to express my thanks now so you could know that others working on the Voynich appreciate your observations and contributions.



  2. Hello,

    This is Brian Cham, one of the authors of the Curve-Line System article. This is a really great and comprehensive article, though I think it would have been easier to digest if it were posted as separate parts. You have really gone into a lot of fine details and interesting speculation. You have covered a lot of things that overlap with our article but you have also gone further in some areas. You seem to have rediscovered some of the patterns of our system but expressed in a different way, for example the idea that certain glyphs can “reset” the word back to curve bases, or that “o” and “a” are equivalent.

    About your comments on our system: Our bold claim (“THE system”) refers to how the curve-line pattern seems to be intentionally designed rather than an arbitrary statistical finding. It doesn’t refer to the comprehensiveness of the system, which we admit is very loose. This is both good and bad: The laws we proposed are extremely elegant in their simplicity (it’s by far the shortest list of rules out of the word systems you mention) but it also doesn’t go very far in predicting entire word construction. We actually had further thoughts to build an entire model in a sequel article but life got in the way so these thoughts remain in a rough brainstorm. A lot of these are similar to your thoughts here, like the frequencies of certain glyphs, bases and tails (to use our terminology) in certain positions and in relation to each other, and we had tentative answers to some of your questions as well.

    If you’re interested, what you call a “cycle chart” is close to what we had in mind. Our approach is actually what David Jackson had in mind when he coined the “Jackson sequence” and wrote about paper volvelles (have you seen that post? They look quite similar to your charts). So it seems like our paths have converged again. We used an unsupervised algorithm that builds up something like your “cycle chart” but using the entire manuscript instead of just specific lines, and it could also distinguish functional bigrams/trigrams and context-specific glyphs (e.g. word-initial “ol-” is distinct from word-final “-ol”). We tested and compared with natural languages and it correctly identified things like the Spanish “ll”, the fact that natural languages have two functional glyph categories (vowels and consonants), which glyphs go in which category, and the fact that “y” in English can be both. It mainly confirmed that “Voynichese” is far more systematic than any natural languages, it seems to have *five* functional glyph categories, and many glyphs are essentially distinct when in different contexts, which is what trips up other word models because they get conflated every time. Alas, our algorithm and analysis is still too tentative to publish any reliable details (for example we needed to investigate Currier Language A/B), but that’s where we were going in 2015.

    An interesting feature of your article is that it focuses on the weirdest, least common and most anomalous words and glyphs in the manuscript to think about a “big tent” model that can explain absolutely everything. Some feel that this is the best way to go because the strangest features demand the most rigorous explanations. There were times when we were at a loss to explain certain words. The problem is that every conceivable pattern or finding has so many exceptions, so poking around anomalies can be quite insightful. Some of your findings are fascinating but also troubling to me, as it’s difficult to speculate about any system that can actually explain everything at once. The big ideas about cycle- or state-based systems are interesting to ponder though.

    Kind regards,
    Brian Cham

    • Many thanks for reaching out! I realize that what people publish about their investigations into the VM is only the tip of the iceberg, which is why I was careful to write only in terms of where your “published” analysis ended. I’m intrigued by the sneak preview of your follow-up work, and I hope you’ll have the opportunity to fill in whatever gaps need filling before you feel it’s ready for public consumption.

      I did see David Jackson’s blog post of 2014 suggesting the paper volvelle as a mechanism for generating Voynichese:

      Of course, my “cycle chart” isn’t a generative mechanism like this—it’s meant to expose relationships among sequential words, more like diagramming a sentence. But in the scenario Jackson describes in that post, I agree that the columns of a “cycle chart” would correspond to the rings of the volvelle. A volvelle hypothesis would also seem to me to fit the odd cyclical patterns we see better than Rugg’s Cardan grille hypothesis does, particularly if some rings could stay in place while others moved and not all values had to be copied during each “turn.” Still, I think some of my objections to the Cardan grille hypothesis could also apply to the volvelle hypothesis, if I understand the latter correctly (recognizing that it might have nuances I don’t suspect). One is that a significant amount of Voynichese resists “chunking” into discrete glyphs or glyph groups that could have been copied straightforwardly from the contents of cells. The “entanglement” of elements also seems hard to explain in terms of independently-moving rings, although Jackson seems to suggest that a volvelle was being used in tandem with Timmian self-citation, which might let it off the hook for explaining every pattern on its own.
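
      The volvelle variant described above (some rings staying put while others move, and not every value being copied on each “turn”) can be sketched roughly as follows. This is only an illustration under my own assumptions: the ring contents are arbitrary placeholder letters rather than anything taken from Jackson’s post or from the manuscript, and the names `RINGS`, `p_move`, and `p_copy` are hypothetical.

```python
import random

# Hypothetical ring contents: arbitrary placeholder letters, NOT glyphs
# taken from Jackson's post or from the manuscript. An empty string
# stands for a blank cell on a ring.
RINGS = [
    ["p", "q", "r", ""],  # outer ring
    ["a", "e", "o"],      # middle ring
    ["x", "y", "z", ""],  # inner ring
]


def turn(offsets, rng, p_move=0.5):
    """Advance each ring by one cell with probability p_move; the rest stay put."""
    return [
        (offset + 1) % len(ring) if rng.random() < p_move else offset
        for offset, ring in zip(offsets, RINGS)
    ]


def read_word(offsets, rng, p_copy=0.8):
    """Read across the aligned cells, copying each ring's value with probability p_copy."""
    return "".join(
        ring[offset]
        for ring, offset in zip(RINGS, offsets)
        if rng.random() < p_copy
    )


def generate(n_turns, seed=None):
    """Turn the volvelle n_turns times and collect the non-empty readings."""
    rng = random.Random(seed)
    offsets = [0] * len(RINGS)
    words = []
    for _ in range(n_turns):
        offsets = turn(offsets, rng)
        word = read_word(offsets, rng)
        if word:  # skip turns on which nothing was copied
            words.append(word)
    return words


print(" ".join(generate(20, seed=3)))
```

      Because a ring that stays in place repeats its value in the next word, output of this kind tends to show runs of partially matching words, and each ring lines up with one column of a “cycle chart.”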

      I’m not sure what kind of “hoax” solution I’d find convincing, since just about any randomizable factor I can think of could also carry real information. But if I had to propose a hoaxing algorithm of my own, I think it would involve generating random content first and then transcribing it into Voynichese graphemes as a second step. In my post, I mentioned “protocols for recording the results of dice throws and such.” What I had in mind was a process something like this: (1) roll several dice; (2) record the outcome of the roll according to some mysterious convention involving hatchmarks and flourishes; (3) depending on the outcome, re-roll some of the dice, but not necessarily all of them; (4) depending on the outcome of the second roll, record part of the cumulative result, but not necessarily all of it; (5) repeat steps 3 and 4 as many times as desired. This approach might seem overly complicated, but I suspect it could have been faster and less tiring in practice than generating gibberish with a Cardan grille or volvelle. It’s also just one possibility, of course; some comparable scenario in which a whole line was set up at once (maybe stored using jetons on a table), transcribed, and then modified to generate additional lines might account better for Timmian phenomena and line-position preferences. But if the script meant something “real” to at least this extent, then the unusual (non-“chunkable”) written forms we see could be explained as a consequence of “real” unusual outcomes that strained the limits of the notation. Otherwise I wouldn’t see any alternative but to treat these as willful exceptions—cases where the writer departed from standard procedure just to mix things up. And doing too much of that would have defeated the presumptive point of resorting to a hoaxing algorithm in the first place.
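
      Steps (1) through (5) can be made concrete with a short sketch. The rules here for which dice get re-rolled and how much gets recorded are assumptions chosen purely for illustration, and the `SYMBOLS` table is a placeholder standing in for whatever “hatchmarks and flourishes” the real convention might have used.

```python
import random

# Placeholder notation: one arbitrary symbol per die face. These are NOT
# Voynichese glyphs, only stand-ins for "hatchmarks and flourishes".
SYMBOLS = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e", 6: "f"}


def notate(dice, positions):
    """Write down the faces of the dice at the given positions as one 'word'."""
    return "".join(SYMBOLS[dice[i]] for i in positions)


def generate_words(n_dice=4, n_turns=8, seed=None):
    """Return the list of 'words' produced by steps (1) through (5)."""
    rng = random.Random(seed)
    # (1) roll several dice
    dice = [rng.randint(1, 6) for _ in range(n_dice)]
    # (2) record the outcome of the whole first roll
    words = [notate(dice, range(n_dice))]
    for _ in range(n_turns):  # (5) repeat steps 3 and 4
        # (3) assumed rule: re-roll the dice showing even faces;
        #     if none are even, re-roll only the first die
        rerolled = [i for i, face in enumerate(dice) if face % 2 == 0] or [0]
        for i in rerolled:
            dice[i] = rng.randint(1, 6)
        # (4) assumed rule: an even total records the full cumulative state,
        #     an odd total records only the dice that were just re-rolled
        positions = range(n_dice) if sum(dice) % 2 == 0 else rerolled
        words.append(notate(dice, positions))
    return words


print(" ".join(generate_words(seed=1)))
```

      Dice that are not re-rolled carry their faces over into the next full record, so successive “words” share material without any cell-by-cell copying from a table.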

      • Hello Patrick,

        “The “entanglement” of elements also seems hard to explain in terms of independently-moving rings, although Jackson seems to suggest that a volvelle was being used in tandem with Timmian self-citation, which might let it off the hook for explaining every pattern on its own.” – Yes, we think that would explain nearly everything.

        “I’m not sure what kind of “hoax” solution I’d find convincing, since just about any randomizable factor I can think of could also carry real information.” – One of my upcoming ideas is an investigation into this very problem. In fact, I think it’s more lucrative than the curve-line system, which is another reason that was put on hold.

        By the way, have you ever read Green Eggs and Ham? That’s a real-life example of a meaningful text that contains the repetition and cycling patterns….

        Kind regards,
        Brian Cham

      • One problem I have with combining the volvelle and self-citation hypotheses is that, to my mind, they seem to offer competing explanations for many of the same phenomena rather than each filling in gaps left by the other — but I’ll withhold judgment until such time as all the details of your work are “out there.” (I would read them in a box, I would read them with a fox, I would read them in the rain, I would read them on a train!)
