simoncozens.github.io

Simon Cozens technical blog

How I implemented the Universal Shaping Engine in 200 lines of code

I’ve recently been working on implementing an OpenType shaping engine in Python. I’ll explain more about why another time but the short answer is (1) for Flux to visualize layout rules (including activating individual lookups and rules) without writing everything out to disk each time, and (2) so FEE plugins can reason about the effect of previous rules while working out what they want to do.

My shaping engine is partially complete, passing 78% of the Harfbuzz test suite (many of the remaining fails are niggly things like whether to hide or delete ignorable glyphs). But for the past couple of days I have been implementing the complex shapers, and I want to tell you a bit about what that means.

Simple and complex shaping

OpenType shaping is the way that a computer takes a font and a piece of text and marries them together: working out what glyphs to use for each of the Unicode codepoints in the text, and then applying the substitution and positioning rules in the font to effect ligatures, kerning, and so on.

That’s a very high level overview and there are lots of little details, but for simple scripts (like the one you’re reading), that’s as far as it goes.

Most scripts, however, need a bit of help when moving from the Unicode world of characters to the OpenType world of glyphs. This “help” is the logic provided by complex shapers; there are a bunch of “complex shapers” as part of an OpenType shaping engine, each handling text in a different script or family of scripts.

So, for example, if your text is in Arabic, it will come in as a series of codepoints which don’t contain any “topographic” information: the string ججج is made up of the same Unicode code point three times (U+062C ARABIC LETTER JEEM). But it needs to comes out as three different glyphs, one for “initial jeem”, one for “medial jeem” and one for “final jeem”. In this case, there’s a part of the shaping engine which specifically knows how to help process Arabic, and it goes through the Unicode input annotating it with what position the glyphs need to be in. It knows how Arabic “works”: it knows that if you have جاج (JEEM ALIF JEEM), the first JEEM goes in initial form because it links to the ALIF but the second JEEM stays how it is because the letter ALIF does not join to its left.

Other scripts require different kinds of help to move from the Unicode world to the OpenType world. The way that Unicode defines the encoding of scripts is sometimes a little bit different from the order that those scripts are written in. As a simple example, the Devanagari sequence कि (“ki”) is encoded with the consonant ka (क) first and then the vowel i (ि) second. But visually - when you type or print - the vowel needs to appear first. So the shaping engine has to again “help” the font by reordering the glyph sequence. It’s a lot easier to lay out iMatra-deva ka-deva than it would be to be handed ka-deva iMatra-deva as a straight Unicode-to-glyph conversion, and then be left having to shuffle the glyphs around in your font’s OpenType rules.

Notice also that when I showed you the vowel i on its own like this - ि - it was displayed with a dotted circle. The vowel mark can’t normally appear on its own - it needs to be attached to some consonant - so I have typed something that is orthographically impossible. To denote the missing consonant and to try and display something sensible, the shaping engine has inserted the dotted circle; that’s another job of the complex shaper. It knows what is a valid syllable and what isn’t, and adds dotted circles to tell you when a syllable is broken. (So if you ever see a dotted circle in the printed world, the input text was wrong.)

Syllabic shapers

Three of the complex shapers - Arabic, Hebrew and Thai - have very specific script logic embedded in them. But the others - the Indic shaper which deals with the scripts derived from Sanskrit taken as a whole (Bengali, Devanagari, Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, Telugu, Sinhala), the Myanmar shaper and the Khmer shaper - have the same basic structure:

  • Find any sequences which are orthographically totally incorrect, such as if the user tried to add a vowel mark onto an independent vowel.
  • Break the incoming text into valid syllables according to the orthographic rules of the script, as best we can; create “invalid syllables” for any bits of text that didn’t fit.
  • Add dotted circles to the start of the invalid syllables.
  • For each syllable, reorder the individual glyphs from Unicode encoding order to visual display order.
  • Apply the appropriate set of OpenType features from the font in the right order.
  • Possibly do another round of reordering at the end.

To break the text into syllables, you look at the Unicode Character Database for each codepoint, and specifically at the field called Indic Syllabic Category, which tells you what type of “thing” each codepoint represents. (consonant, vowel, mark, halant/virama, etc.) You may need to do one or two fixups for some characters - Unicode calls U+1CF5 VEDIC SIGN JIHVAMULIYA a Consonant With Stacker but for our purposes it’s just a plain old Consonant. Then you use a state machine to build these syllabic categories into a full syllable:

(This is the state machine for a Khmer consonant cluster. I cheated a little by not expanding the matra group. The real thing’s even more complicated…)

As an old-school Perl hacker, I’m used to making state machines out of regular expressions, so that’s what we do here: build up a series of tests to see whether a combination of syllabic categories matches our syllable types.

The reordering code is a very hairy breeding ground for bugs and off-by-one errors, but that’s the main part of the complex syllabic shaper. To work out how to reorder things, you look at the Indic Positional Category (which tells you where each thing lives - before the base consonant, under the base consonant, after the base consonant, etc.) - turn that into a number, follow a few language-specific rules and exceptions (some scripts have the base moved to the end, some at the beginning, etc.), and then sort the elements of each syllable by their positional numbers.

The Universal Shaping Engine

So yesterday I used the UCD and a set of regular expression state machines to implement Indic, Khmer and Myanmar complex shaping, and today I worked on the Universal Shaping Engine. What’s that? Well, Andrew Glass at Microsoft realised that it was inconvenient to keep adding new complex shapers encoding particular script knowledge every time that a new script was encoded in Unicode, and worked out a sort of hyper-generic syllable pattern that works for a wide variety scripts, driven by another category entry in the UCD.

That takes care of the syllable-finding code, but what to do about the reordering logic? How do we do that in a script-agnostic way? This is where it gets really clever: the shaper asks the font which characters should be reordered. What do we mean “ask the font?” Well, the engine shows the font some characters from the syllable and tests whether particular features would apply. For example, if there’s a pre-base substitution rule in the pref feature which matches a vowel (even if it’s sub foo by foo;) then that vowel is recorded as being a pre-base vowel, and that gets moved to before the base consonant. There’s a slightly more complex rule for moving reph characters around, but the whole thing turns out not to be too difficult at all.

And because it follows the same basic functionality as the other syllable-based complex shapers, much of what it does is handled by a common base class. This means that my implementation of the USE complex shaper is fairly small - about 190 lines of Python - and hopefully not too difficult to understand.

I’ve learned a lot implementing these shapers, and hopefully now you have too!

Written on December 29, 2020