A Duffer's guide to Fontconfig and Harfbuzz

I’m working on switching the font shaping part of SILE to use Harfbuzz instead of Pango because reasons, and have found myself a bit hampered by the lack of useful documentation. To be fair, if you actually build HB from source you get an auto-generated API reference, but there’s nothing really explaining how to go from a string of characters to a set of glyph positioning information, which is a shame because that is what Harfbuzz is for.

The first obstacle I hit when moving off Pango is that Pango allows you to talk about font names, whereas when dealing with Harfbuzz directly you have to locate and open the font file yourself; the usual way to do this is with Fontconfig. The Fontconfig developer documentation similarly provides a reference to the API functions, but nothing about going from a font description to a file name, which is a shame because that is what Fontconfig is for.

So I had to go around and gather various bits of information from mailing list posts and StackExchange questions and… here is one solution to the problem of turning text into positioning information, and turning font descriptions into filenames. I don’t claim it’s the best, but it works.

First let’s work on the Fontconfig side. Fontconfig works by means of patterns: you specify the properties you want, and it searches your font database for the fonts which match best. The most obvious property is the font’s family name, but you can match on other things as well. To help us, I’m going to declare a structure which encodes all the font-related options we want to specify, and we’ll use the members of this structure to drive Fontconfig and Harfbuzz:

typedef struct {
  char* family;
  char* lang;
  double pointSize;
  int weight;
  int direction;
  int slant;
  char* style;
  char* script;
} fontOptions;

We’ll go through the members in turn later on but for now, here’s an example of a font described in that structure:

fontOptions f = {
  .pointSize = 12,
  .lang = "en",
  .family = "Gentium Book Basic",
  .script = "Latn",
  .direction = HB_DIRECTION_LTR,
  .weight = 200,
};


Now let’s start writing a function to turn a font description, from the above structure, into a font pathname.

#include <stdlib.h>   /* malloc */
#include <string.h>   /* strlen, strcpy */
#include <fontconfig/fontconfig.h>
static char* get_font_path(fontOptions f) {
  FcResult result;
  FcChar8* filename;
  char* filename2;
  int id;
  FcPattern* matched;

The first thing we do is to create a new Fontconfig pattern which is going to store our match information. We load it up with the font family name and point size. Fontconfig has a family of typed functions for adding clauses to a match. We have to convert our strings to special Fontconfig strings, but otherwise this is straightforward:

  FcPattern* p = FcPatternCreate();

  FcPatternAddString (p, FC_FAMILY, (FcChar8*)(f.family));
  FcPatternAddDouble (p, FC_SIZE, f.pointSize);

Now we will add the slant (roman/italic/etc.) and weight requirements to the pattern:

  if (f.slant)
    FcPatternAddInteger(p, FC_SLANT, f.slant);
  if (f.weight)
    FcPatternAddInteger(p, FC_WEIGHT, f.weight);

Possible values of FC_SLANT are FC_SLANT_ROMAN, FC_SLANT_ITALIC and FC_SLANT_OBLIQUE. Possible values of FC_WEIGHT will do your head in. Here is a conversion table between CSS and Fontconfig font weight constants:

Name	CSS	Fontconfig
Thin	100	FC_WEIGHT_THIN (0)
Ultralight	200	FC_WEIGHT_ULTRALIGHT (40)
Light	300	FC_WEIGHT_LIGHT (50)
Normal	400	FC_WEIGHT_NORMAL (80)
Medium	500	FC_WEIGHT_MEDIUM (100)
Demibold	600	FC_WEIGHT_DEMIBOLD (180)
Bold	700	FC_WEIGHT_BOLD (200)
Ultra bold	800	FC_WEIGHT_ULTRABOLD (205)
Heavy	900	FC_WEIGHT_HEAVY (210)

So just divide by five and… no, wait.
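
Since there is no arithmetic shortcut, a lookup is the simplest way through. The following is a minimal sketch rather than part of the original code: the name css_to_fc_weight is made up for illustration, and the numbers are the values of the FC_WEIGHT_* constants written out as plain integers so that the snippet compiles without the Fontconfig headers. In real code you would use the constants themselves. Recent Fontconfig releases also offer FcWeightFromOpenType, which does a similar conversion; check that your version has it before relying on it.

```c
/* Hypothetical helper: map a CSS weight (100..900) to a Fontconfig weight.
   The input is rounded to the nearest hundred and clamped to the range. */
static int css_to_fc_weight(int css_weight) {
  /* Values of FC_WEIGHT_THIN .. FC_WEIGHT_HEAVY, in CSS order 100..900. */
  static const int fc[] = {0, 40, 50, 80, 100, 180, 200, 205, 210};
  int index = (css_weight + 50) / 100 - 1;
  if (index < 0) index = 0;
  if (index > 8) index = 8;
  return fc[index];
}
```

With this, a weight coming from a stylesheet can be fed to the pattern as FcPatternAddInteger(p, FC_WEIGHT, css_to_fc_weight(700)).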

Anyway, now we have a pattern which describes the font that we want: its name, weight, point size, and slant. Next, we ask Fontconfig to fall back to some default fonts if it doesn’t find the one that we’re after. We do this by adding more family names to the same pattern. Fontconfig prefers the earlier names, so if it can’t match “Gentium Book Basic”, it will try these in order:

  FcPatternAddString (p, FC_FAMILY,(FcChar8*) "Times-Roman");
  FcPatternAddString (p, FC_FAMILY,(FcChar8*) "Times");
  FcPatternAddString (p, FC_FAMILY,(FcChar8*) "Helvetica");

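For my purposes this is enough to ensure a match; for yours it might not be. Now that we have a pattern, let’s match it against our font database. The Fontconfig documentation says to run the pattern through FcConfigSubstitute and FcDefaultSubstitute first, so that the user’s configuration rules and the default values are applied before matching:

  FcConfigSubstitute (NULL, p, FcMatchPattern);
  FcDefaultSubstitute (p);
  matched = FcFontMatch (NULL, p, &result);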

matched is also an FcPattern, but will be filled with information about the matched font. We can get the information out with similar FcPatternGet... functions:

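  if (!matched || FcPatternGetString (matched, FC_FILE, 0, &filename) != FcResultMatch) {
    /* Nothing usable came back: release what we allocated before bailing out. */
    if (matched) FcPatternDestroy (matched);
    FcPatternDestroy (p);
    return NULL;
  }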

We could have set the FC_FILE property in our pattern, but that would be dumb because that’s what we’re trying to find out. Instead, we get it, into the &filename pointer. This pointer is allocated by Fontconfig and lasts for the lifetime of the pattern, so we’re going to make a copy of it, and then release the patterns we allocated:

  filename2 = malloc(strlen((char*)filename) + 1);  /* +1 for the terminating NUL */
  strcpy(filename2, (char*)filename);
  FcPatternDestroy (matched);
  FcPatternDestroy (p);
  return filename2;
}

So at this point we can go from our font description structure to a filename. Hooray! Except—Harfbuzz expects that fonts come from Freetype, so you need to get Freetype up and running. We’ll then turn the font description into a filename, turn that into a Freetype font structure, then turn that into a Harfbuzz font structure:

#include <ft2build.h>
#include FT_FREETYPE_H
#include FT_GLYPH_H
#include FT_OUTLINE_H

#include <hb.h>
#include <hb-ft.h>

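    int device_hdpi = 72;
    int device_vdpi = 72;
    FT_Library ft_library;
    FT_Face ft_face;
    FT_Error ft_error;
    hb_font_t *hb_ft_font;
    char *font_path;

    /* Keep the calls outside assert() so they still run when NDEBUG is defined. */
    ft_error = FT_Init_FreeType(&ft_library);
    assert(!ft_error);
    font_path = get_font_path(f);
    assert(font_path != NULL);
    printf("Found font: %s\n", font_path);
    ft_error = FT_New_Face(ft_library, font_path, 0, &ft_face);
    assert(!ft_error);
    /* The character size is given in 1/64ths of a point. */
    ft_error = FT_Set_Char_Size(ft_face, 0, (FT_F26Dot6)(f.pointSize * 64), device_hdpi, device_vdpi);
    assert(!ft_error);

    hb_ft_font = hb_ft_font_create(ft_face, NULL);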

Freetype, bless its heart, uses 1/64th of a point as its fundamental unit of type size. You also need to tell it what DPI your output device is going to be at. I'm using printer's points, so I configure for 72dpi square pixels.
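
To make the 1/64 business concrete, here is a small sketch of the conversion in both directions. The helper names are my own invention, not Freetype API; Freetype calls this format 26.6 fixed point (26 bits of integer, 6 bits of fraction) and types it as FT_F26Dot6, for which a plain long stands in here so that the snippet compiles without the Freetype headers. The rounding assumes non-negative sizes.

```c
/* Hypothetical helper: convert a size in points to 26.6 fixed point,
   rounding to the nearest 1/64. Assumes points >= 0. */
static long points_to_26_6(double points) {
  return (long)(points * 64.0 + 0.5);
}

/* Hypothetical helper: convert a 26.6 fixed-point value back to points. */
static double from_26_6(long units) {
  return units / 64.0;
}
```

So a 12pt font is passed to FT_Set_Char_Size as 768, and at 72dpi a value of 64 coming back from Freetype is exactly one point, which is also one pixel.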

Next up, we create a buffer for Harfbuzz to do its string work in, and set up the various properties we know about the text:

    buf = hb_buffer_create();
    if (f.script)
      hb_buffer_set_script(buf, hb_script_from_string(f.script, strlen(f.script)));
    if (f.direction)
      hb_buffer_set_direction(buf, f.direction);
    if (f.lang)
      hb_buffer_set_language(buf, hb_language_from_string(f.lang,strlen(f.lang)));


Harfbuzz would like to know three things: what script this is, so that it can use script-specific shaping where necessary; what direction the script goes in; and what language the text is written in. There are a lot of potential values here. Language should be an ISO 639 language code such as "en" (hb_language_from_string accepts BCP 47 tags); direction should be one of HB_DIRECTION_LTR, HB_DIRECTION_RTL, HB_DIRECTION_TTB (top to bottom), or HB_DIRECTION_BTT.

There is a huge number of Harfbuzz scripts, but the one you're going to use most is "Latn" (the four-letter ISO 15924 code for Latin), or HB_SCRIPT_LATIN if you want to pass that to hb_buffer_set_script directly instead of converting a string. For completeness, a full list of script strings and tags follows at the end of this post.
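
It helps to know why the exact four letters matter. A Harfbuzz tag is just four bytes packed into a 32-bit integer, and HB_SCRIPT_LATIN is the tag made from the letters L, a, t, n. The sketch below imitates that packing with stand-ins of my own (MAKE_TAG and tag_from_string are illustrative names, not Harfbuzz API) so that it compiles without the Harfbuzz headers; it only looks at the first four bytes and pads short strings with spaces, which is how I understand hb_tag_from_string to behave.

```c
#include <stdint.h>

/* Stand-in for Harfbuzz's HB_TAG macro: pack four bytes into 32 bits. */
#define MAKE_TAG(a, b, c, d)                                         \
  ((uint32_t)(((uint32_t)(uint8_t)(a) << 24) |                       \
              ((uint32_t)(uint8_t)(b) << 16) |                       \
              ((uint32_t)(uint8_t)(c) << 8) |                        \
              (uint32_t)(uint8_t)(d)))

/* Hypothetical helper: build a tag from the first four bytes of a string,
   padding strings shorter than four bytes with spaces. */
static uint32_t tag_from_string(const char *s) {
  char c[4] = {' ', ' ', ' ', ' '};
  for (int i = 0; i < 4 && s[i] != '\0'; i++)
    c[i] = s[i];
  return MAKE_TAG(c[0], c[1], c[2], c[3]);
}
```

The upshot is that "Latn" packs to the same value as the Latin script constant, whereas a friendly name like "latin" is cut down to "lati" and matches nothing.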

Now the buffer knows what it's dealing with. Let's get to the meat of the work: laying out the UTF-8 string into glyphs and then shaping those glyphs for a given font:

    hb_buffer_add_utf8(buf, text, strlen(text), 0, strlen(text));
    hb_shape(hb_ft_font, buf, NULL, 0);


Everything stays in the buffer, but we can extract it like so:

    glyph_info   = hb_buffer_get_glyph_infos(buf, &glyph_count);
    glyph_pos    = hb_buffer_get_glyph_positions(buf, &glyph_count);


glyph_info and glyph_pos are arrays with glyph_count entries, indexed from 0 to glyph_count - 1. The thing you'll want to get out of glyph_info[i] is the codepoint member, which after shaping holds the glyph's ID in the font; undoubtedly you'll be passing that to whatever is rendering this text. You also probably want to know where to render it: glyph_pos[i] gives you x_advance and y_advance, which are how the rendering pen should move after rendering this glyph, and x_offset and y_offset, which are where the glyph should be positioned relative to the pen (usually zero). These are given in Freetype units, 64ths of a point.
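
The pen arithmetic is easier to see written out. This is a rough sketch under a couple of assumptions: glyph_pos_sketch is a stand-in of my own with the same four fields as hb_glyph_position_t (so the snippet compiles without Harfbuzz), point_sketch and place_glyphs are likewise made-up names, and the values are taken to be in 64ths of a point as described above.

```c
/* Stand-in mirroring the four fields read from hb_glyph_position_t. */
typedef struct {
  int x_advance, y_advance, x_offset, y_offset;
} glyph_pos_sketch;

typedef struct {
  double x, y;
} point_sketch;

/* Work out where each glyph is drawn, in points, starting with the pen at
   (0, 0). out must have room for count entries. Returns the final pen
   position, i.e. the total advance of the run. */
static point_sketch place_glyphs(const glyph_pos_sketch *pos,
                                 unsigned int count, point_sketch *out) {
  point_sketch pen = {0.0, 0.0};
  for (unsigned int i = 0; i < count; i++) {
    /* The offset nudges this glyph only; it does not move the pen. */
    out[i].x = pen.x + pos[i].x_offset / 64.0;
    out[i].y = pen.y + pos[i].y_offset / 64.0;
    /* The advance moves the pen on for the next glyph. */
    pen.x += pos[i].x_advance / 64.0;
    pen.y += pos[i].y_advance / 64.0;
  }
  return pen;
}
```

In real code you would loop over glyph_pos directly and hand glyph_info[i].codepoint to the renderer at each computed position.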

If you need height and depth information for the glyph, then you need to go back to Freetype and ask it:

void calculate_extents(box* b, hb_glyph_info_t glyph_info, hb_glyph_position_t glyph_pos, FT_Face ft_face) {
  const FT_Error error = FT_Load_Glyph(ft_face, glyph_info.codepoint, FT_LOAD_DEFAULT);
  if (error) return;

  const FT_Glyph_Metrics *ftmetrics = &ft_face->glyph->metrics;
  b->width = glyph_pos.x_advance /64.0;
  b->height = ftmetrics->horiBearingY / 64.0;
  b->depth = (ftmetrics->height - ftmetrics->horiBearingY) / 64.0;
}


That's everything I needed to put text into glyphs and glyphs into boxes. The collected code can be found here.


And now, the Harfbuzz script list:

Zyyy: HB_SCRIPT_COMMON
Zinh: HB_SCRIPT_INHERITED
Zzzz: HB_SCRIPT_UNKNOWN
Arab: HB_SCRIPT_ARABIC
Armn: HB_SCRIPT_ARMENIAN
Beng: HB_SCRIPT_BENGALI
Cyrl: HB_SCRIPT_CYRILLIC
Deva: HB_SCRIPT_DEVANAGARI
Geor: HB_SCRIPT_GEORGIAN
Grek: HB_SCRIPT_GREEK
Gujr: HB_SCRIPT_GUJARATI
Guru: HB_SCRIPT_GURMUKHI
Hang: HB_SCRIPT_HANGUL
Hani: HB_SCRIPT_HAN
Hebr: HB_SCRIPT_HEBREW
Hira: HB_SCRIPT_HIRAGANA
Knda: HB_SCRIPT_KANNADA
Kana: HB_SCRIPT_KATAKANA
Laoo: HB_SCRIPT_LAO
Latn: HB_SCRIPT_LATIN
Mlym: HB_SCRIPT_MALAYALAM
Orya: HB_SCRIPT_ORIYA
Taml: HB_SCRIPT_TAMIL
Telu: HB_SCRIPT_TELUGU
Thai: HB_SCRIPT_THAI
Tibt: HB_SCRIPT_TIBETAN
Bopo: HB_SCRIPT_BOPOMOFO
Brai: HB_SCRIPT_BRAILLE
Cans: HB_SCRIPT_CANADIAN_SYLLABICS
Cher: HB_SCRIPT_CHEROKEE
Ethi: HB_SCRIPT_ETHIOPIC
Khmr: HB_SCRIPT_KHMER
Mong: HB_SCRIPT_MONGOLIAN
Mymr: HB_SCRIPT_MYANMAR
Ogam: HB_SCRIPT_OGHAM
Runr: HB_SCRIPT_RUNIC
Sinh: HB_SCRIPT_SINHALA
Syrc: HB_SCRIPT_SYRIAC
Thaa: HB_SCRIPT_THAANA
Yiii: HB_SCRIPT_YI
Dsrt: HB_SCRIPT_DESERET
Goth: HB_SCRIPT_GOTHIC
Ital: HB_SCRIPT_OLD_ITALIC
Buhd: HB_SCRIPT_BUHID
Hano: HB_SCRIPT_HANUNOO
Tglg: HB_SCRIPT_TAGALOG
Tagb: HB_SCRIPT_TAGBANWA
Cprt: HB_SCRIPT_CYPRIOT
Limb: HB_SCRIPT_LIMBU
Linb: HB_SCRIPT_LINEAR_B
Osma: HB_SCRIPT_OSMANYA
Shaw: HB_SCRIPT_SHAVIAN
Tale: HB_SCRIPT_TAI_LE
Ugar: HB_SCRIPT_UGARITIC
Bugi: HB_SCRIPT_BUGINESE
Copt: HB_SCRIPT_COPTIC
Glag: HB_SCRIPT_GLAGOLITIC
Khar: HB_SCRIPT_KHAROSHTHI
Talu: HB_SCRIPT_NEW_TAI_LUE
Xpeo: HB_SCRIPT_OLD_PERSIAN
Sylo: HB_SCRIPT_SYLOTI_NAGRI
Tfng: HB_SCRIPT_TIFINAGH
Bali: HB_SCRIPT_BALINESE
Xsux: HB_SCRIPT_CUNEIFORM
Nkoo: HB_SCRIPT_NKO
Phag: HB_SCRIPT_PHAGS_PA
Phnx: HB_SCRIPT_PHOENICIAN
Cari: HB_SCRIPT_CARIAN
Cham: HB_SCRIPT_CHAM
Kali: HB_SCRIPT_KAYAH_LI
Lepc: HB_SCRIPT_LEPCHA
Lyci: HB_SCRIPT_LYCIAN
Lydi: HB_SCRIPT_LYDIAN
Olck: HB_SCRIPT_OL_CHIKI
Rjng: HB_SCRIPT_REJANG
Saur: HB_SCRIPT_SAURASHTRA
Sund: HB_SCRIPT_SUNDANESE
Vaii: HB_SCRIPT_VAI
Avst: HB_SCRIPT_AVESTAN
Bamu: HB_SCRIPT_BAMUM
Egyp: HB_SCRIPT_EGYPTIAN_HIEROGLYPHS
Armi: HB_SCRIPT_IMPERIAL_ARAMAIC
Phli: HB_SCRIPT_INSCRIPTIONAL_PAHLAVI
Prti: HB_SCRIPT_INSCRIPTIONAL_PARTHIAN
Java: HB_SCRIPT_JAVANESE
Kthi: HB_SCRIPT_KAITHI
Lisu: HB_SCRIPT_LISU
Mtei: HB_SCRIPT_MEETEI_MAYEK
Sarb: HB_SCRIPT_OLD_SOUTH_ARABIAN
Orkh: HB_SCRIPT_OLD_TURKIC
Samr: HB_SCRIPT_SAMARITAN
Lana: HB_SCRIPT_TAI_THAM
Tavt: HB_SCRIPT_TAI_VIET
Batk: HB_SCRIPT_BATAK
Brah: HB_SCRIPT_BRAHMI
Mand: HB_SCRIPT_MANDAIC
Cakm: HB_SCRIPT_CHAKMA
Merc: HB_SCRIPT_MEROITIC_CURSIVE
Mero: HB_SCRIPT_MEROITIC_HIEROGLYPHS
Plrd: HB_SCRIPT_MIAO
Shrd: HB_SCRIPT_SHARADA
Sora: HB_SCRIPT_SORA_SOMPENG
Takr: HB_SCRIPT_TAKRI

Written on September 11, 2014

simoncozens.github.io

Simon Cozens technical blog
