Unicode character categories: HarfBuzz Manual

Unicode character categories

Shaping models are typically specified with respect to how scripts are defined in the Unicode standard.

Every codepoint in the Unicode Character Database (UCD) is assigned a Unicode General Category (UGC), which provides the most fundamental information about the codepoint: whether the codepoint represents a Letter, a Mark, a Number, Punctuation, a Symbol, a Separator, or something else (Other).

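These seven groups are the "Major" categories. Each codepoint is further assigned to a "minor" category within its Major category, such as "Letter, uppercase" (Lu) or "Letter, modifier" (Lm). The two-letter abbreviation encodes both: the first letter names the Major category and the second names the minor category.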

Shaping models are concerned primarily with Letter and Mark codepoints. The minor categories of Mark codepoints are particularly important for shaping. Marks can be nonspacing (Mn), spacing combining (Mc), or enclosing (Me).
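As a rough illustration of the Major and minor categories described above, the following sketch reads the UGC of a few codepoints using Python's standard `unicodedata` module. It does not use HarfBuzz itself. The helper name `general_category` and the sample codepoints are choices made for this example only.

```python
import unicodedata

def general_category(ch):
    """Return (major, code) for one character, e.g. ('L', 'Lu').

    The first letter of the two-letter UGC code is the Major category;
    the full code identifies the minor category within it.
    """
    code = unicodedata.category(ch)
    return code[0], code

# Sample codepoints chosen for illustration: two Latin-range letters,
# one Devanagari letter, and one mark of each minor Mark category.
samples = ["A", "\u02B0", "\u0915", "\u094D", "\u093F", "\u20DD"]

for ch in samples:
    major, code = general_category(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: major={major} code={code}")
```

In this sample, U+094D (Devanagari virama) is nonspacing (Mn), U+093F (Devanagari vowel sign I) is spacing combining (Mc), and U+20DD (combining enclosing circle) is enclosing (Me).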

In addition to the UGC property, codepoints in the Indic and Southeast Asian scripts are also assigned Unicode Indic Syllabic Category (UISC) and Unicode Indic Positional Category (UIPC) properties that provide more detailed information needed for shaping.

The UISC property sub-categorizes Letters and Marks according to common script-shaping behaviors. For example, UISC distinguishes between consonant letters, vowel letters, and vowel marks. The UIPC property sub-categorizes Mark codepoints by the relative visual position that they occupy (above, below, right, left, or in multiple positions).
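To make the two properties concrete, here is a minimal sketch of a UISC/UIPC lookup for a handful of Devanagari codepoints. Python's standard library does not expose these properties, so the table below is transcribed by hand and is an assumption of this example; the authoritative values live in the UCD files `IndicSyllabicCategory.txt` and `IndicPositionalCategory.txt` and should be checked there. The names `INDIC_PROPS` and `indic_properties` are hypothetical.

```python
# (UISC, UIPC) pairs for a few Devanagari codepoints, hand-transcribed for
# illustration. Verify against the UCD data files before relying on them.
INDIC_PROPS = {
    "\u0905": ("Vowel_Independent", "NA"),      # LETTER A
    "\u0915": ("Consonant", "NA"),              # LETTER KA
    "\u0902": ("Bindu", "Top"),                 # SIGN ANUSVARA
    "\u093C": ("Nukta", "Bottom"),              # SIGN NUKTA
    "\u093F": ("Vowel_Dependent", "Left"),      # VOWEL SIGN I
    "\u0940": ("Vowel_Dependent", "Right"),     # VOWEL SIGN II
    "\u0941": ("Vowel_Dependent", "Bottom"),    # VOWEL SIGN U
    "\u0947": ("Vowel_Dependent", "Top"),       # VOWEL SIGN E
    "\u094D": ("Virama", "Bottom"),             # SIGN VIRAMA
}

def indic_properties(ch):
    """Return (UISC, UIPC) for ch, falling back to the UCD defaults."""
    return INDIC_PROPS.get(ch, ("Other", "NA"))

for ch in INDIC_PROPS:
    uisc, uipc = indic_properties(ch)
    print(f"U+{ord(ch):04X}: UISC={uisc:<18} UIPC={uipc}")
```

Note how the two vowel signs U+093F and U+0940 share the same UISC value but differ in UIPC: one is rendered to the left of its base consonant and the other to the right.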

Some complex scripts require that the text run be split into syllables. What constitutes a valid syllable in these scripts is specified in regular expressions, formed from the Letter and Mark codepoints, that take the UISC and UIPC properties into account.
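The idea of a syllable expression can be sketched as follows. This is a deliberately simplified pattern for a few Devanagari codepoints, not the grammar that HarfBuzz or any shaping specification actually uses, and it draws only on UISC-style classes (a fuller grammar would also consult UIPC). The class letters, the `CLASS` table, the `SYLLABLE` pattern, and the function `split_syllables` are all assumptions made for this example.

```python
import re

# Hypothetical one-letter classes, loosely following UISC:
# C = consonant, V = independent vowel, N = nukta, H = virama (halant),
# M = dependent vowel sign (matra), B = bindu. Anything else maps to X.
CLASS = {
    "\u0915": "C", "\u0926": "C", "\u0928": "C", "\u0939": "C",
    "\u0905": "V",
    "\u093C": "N",
    "\u094D": "H",
    "\u093E": "M", "\u093F": "M", "\u0940": "M", "\u0941": "M", "\u0947": "M",
    "\u0902": "B",
}

# Simplified syllable: zero or more consonant+virama units, a final
# consonant, then either a trailing virama or optional matras and bindu;
# or an independent vowel with an optional bindu.
SYLLABLE = re.compile(r"(?:CN?H)*CN?(?:H|M*B?)|VB?")

def split_syllables(text):
    """Split text into syllables by matching the pattern over class letters."""
    classes = "".join(CLASS.get(ch, "X") for ch in text)
    syllables = []
    pos = 0
    while pos < len(text):
        match = SYLLABLE.match(classes, pos)
        # Characters the pattern cannot start on become one-character runs.
        end = match.end() if match else pos + 1
        syllables.append(text[pos:end])
        pos = end
    return syllables

# HA + vowel sign I, then NA + virama + DA + vowel sign II.
word = "\u0939\u093F\u0928\u094D\u0926\u0940"
print(split_syllables(word))
```

Because every codepoint maps to exactly one class letter, match offsets in the class string line up with offsets in the original text, which keeps the splitting logic short.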