My heartfelt thanks to all the colleagues who replied to my query.

Best wishes,

Marco Franceschini

On 14 Jan 2014, at 14:10, Dominik Wujastyk wrote:

Although the Unicode standard treats the precomposed and decomposed forms as canonically equivalent, I would recommend precomposed (that is, NFC, Normalization Form Canonical Composition). At the top of my XeLaTeX files, for example, I routinely say "\XeTeXinputnormalization=1", which means the output PDFs contain precomposed characters, whatever I do in my input file. I think you should not pay too much attention to the current (dis)abilities of various word-processors. In the Big World, and in the future, precomposed is what makes sense. The task of intelligent printing, searching and sorting -- of searching for ss and also finding ß, for Mueller and finding Müller -- is most appropriately located in the rendering and search/sort routines, not in the encoding of the text. Actually, properly Unicode-compliant text processing utilities are required to handle all NF(K)C and NF(K)D forms without blinking. Also, W3C normalization requires NFC. So if a text is going to be rendered on a website, it should be in NFC (or written as character entity references, which look nice but are normally horrible to work with).

See also question two, in the Unicode normalization FAQ: "NFC is the best form for general text, since it is more compatible with strings converted from legacy encodings."
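As a minimal sketch of what converting to NFC means in practice, assuming Python 3 and its standard unicodedata module (neither is mentioned in this thread, and the helper name to_nfc is made up for illustration):

```python
import unicodedata

def to_nfc(text):
    """Return text in Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)

# Two canonically equivalent spellings of "a with macron".
decomposed = "a\u0304"   # LATIN SMALL LETTER A + COMBINING MACRON
precomposed = "\u0101"   # LATIN SMALL LETTER A WITH MACRON

# As raw code point sequences the two strings differ ...
print(decomposed == precomposed)          # False

# ... but after normalization to NFC they are identical.
print(to_nfc(decomposed) == precomposed)  # True

# Text that is already precomposed is left unchanged.
print(to_nfc(precomposed) == precomposed) # True
```

Whatever the keyboard emits, a step like this can bring a text to NFC afterwards, much as the XeTeX setting above does for the PDF output.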



Dr Dominik Wujastyk
Department of South Asia, Tibetan and Buddhist Studies,
University of Vienna,
Spitalgasse 2-4, Courtyard 2, Entrance 2.1
1090 Vienna, Austria
Adjunct Professor,
Division of Health and Humanities,
St. John's Research Institute, Bangalore, India.

On 14 January 2014 01:01, Marco Franceschini wrote:

Dear friends,

I’m devising a keyboard layout (on OS X) for the Italian "physical" keyboard. It allows the user to type all the combinations of a base character with one or more diacritics that are used for the transliteration of many Indian scripts, as well as Arabic and Perso-Arabic scripts, in conformity with the main standards and transliteration schemes used in scholarly publications. I’m using Ukelele for this purpose.

My keyboard layout makes extensive use of dead keys: it allows the user to combine up to three diacritics with one base character, so that she/he can add Vedic tone signs (represented by grave/acute or vertical stroke above/underbar) to the transliterated text. Diacritics can be typed in any order, and the base character must be typed after them. The complete list of the allowed combinations is available here:

My question is: should I encode the output as precomposed characters (or, where no fully precomposed character is available, as a precomposed character plus added combining diacritics), or should I use combining characters throughout (that is: sequences of the codes of all the glyphs that constitute the final character)?
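To make the two options concrete, here is a small sketch, assuming Python 3 and its standard unicodedata module (not something used in this thread; the helper name codepoints is made up for illustration). It takes "a" with macron and acute, for which Unicode has no single precomposed character, and prints both encodings. It also illustrates a point that may matter when diacritics can be typed in any order: normalization reorders marks that sit in different positions (for example one below and one above), but not two marks that both sit above the letter.

```python
import unicodedata

def codepoints(text):
    """List the code points of text as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text]

# 'a' + COMBINING MACRON (U+0304) + COMBINING ACUTE ACCENT (U+0301)
typed = "a\u0304\u0301"

# Option 1: precomposed as far as possible (NFC).
print(codepoints(unicodedata.normalize("NFC", typed)))
# ['U+0101', 'U+0301']  -> a-with-macron, then a combining acute

# Option 2: combining characters throughout (NFD).
print(codepoints(unicodedata.normalize("NFD", typed)))
# ['U+0061', 'U+0304', 'U+0301']

# A dot below and a macron above may be entered in either order:
# normalization sorts them and both yield U+1E5D (r with dot below and macron).
r_macron_first = unicodedata.normalize("NFC", "r\u0304\u0323")
r_dot_first = unicodedata.normalize("NFC", "r\u0323\u0304")
print(r_macron_first == r_dot_first == "\u1e5d")  # True

# Two marks above the letter are NOT reordered, so these stay distinct:
macron_then_acute = unicodedata.normalize("NFC", "a\u0304\u0301")
acute_then_macron = unicodedata.normalize("NFC", "a\u0301\u0304")
print(macron_then_acute == acute_then_macron)     # False
```

On this reading, a layout that accepts diacritics in any order would probably need to emit marks of the same position in one fixed order, whichever encoding is chosen.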

My keyboard is based on the “Italiano - Pro” keyboard layout that comes with OS X, in which just a few combinations of a base character+diacritic are provided. With a few exceptions, they are not used in the transliteration of Indian/Arabic scripts, but they are widely used in Italian (e.g. è é ì ò ù etc.). All of these combinations are encoded by the “Italiano - Pro” keyboard layout as precomposed characters.

I’m tempted to use combining characters throughout (and to convert the encoding of the combinations inherited from the “Italiano - Pro” keyboard accordingly). But I hesitate, because I know that only a few word processors (e.g. Nisus, which I'm using) are able to recognize the two different encodings (precomposed and combining characters) as equivalent for Finding/Replacing and Sorting purposes, while the most widespread ones (Word for Mac, NeoOffice, OpenOffice) are not; and this would create problems if one adds text typed with my keyboard layout to an old file typed with the “Italiano - Pro” keyboard layout.
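The mixed-file problem can be sketched as follows, assuming Python 3 and its standard unicodedata module (not something used in this thread; the function names nfc and find_all are made up for illustration). A plain search finds only the spelling that matches byte for byte, while a search that first brings both the text and the search term to the same normalization form finds both.

```python
import unicodedata

def nfc(text):
    """Return text in Normalization Form C."""
    return unicodedata.normalize("NFC", text)

def find_all(haystack, needle):
    """Return the start offsets of needle in the NFC-normalized haystack."""
    h, n = nfc(haystack), nfc(needle)
    hits = []
    start = h.find(n)
    while start != -1:
        hits.append(start)
        start = h.find(n, start + 1)
    return hits

old_text = "perch\u00e9"    # precomposed e-acute, as from "Italiano - Pro"
new_text = "perche\u0301"   # e + COMBINING ACUTE ACCENT
mixed = old_text + " " + new_text

# A plain search sees only the precomposed occurrence.
print(mixed.count("perch\u00e9"))        # 1

# Searching after normalization finds both occurrences.
print(find_all(mixed, "perche\u0301"))   # [0, 7]
```

This is only a sketch of the equivalence test that a word processor would need to apply internally; it does not change how any of the programs named above behave.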

Precomposed characters or combining characters? This is the dilemma. Has any of you already faced such a quandary?


Marco Franceschini


INDOLOGY mailing list