Although the Unicode standard describes both forms as canonically equivalent,* I would recommend precomposed (that is, NFC, Normalization Form C, Canonical Composition). At the top of my XeLaTeX files, for example, I routinely say "\XeTeXinputnormalization=1", which means the output PDFs contain precomposed characters, whatever I do in my input file. I think you should not pay too much attention to the current (dis)abilities of various word-processors. In the Big World, and in the future, precomposed is what makes sense. The task of intelligent printing, searching and sorting -- of searching for ss and also finding ß, for Mueller and finding Müller -- is most appropriately located in the rendering and search/sort routines, not the encoding of the text. In fact, properly Unicode-compliant text processing utilities are required to handle all NF(K)C and NF(K)D forms without blinking. Also, W3C normalization requires NFC. So if a text is going to be rendered on a website, it should be in NFC (or written as character references, which look nice but are normally horrible to work with).
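To make the two forms concrete, here is a minimal sketch in Python, using only the standard-library unicodedata module. The sample word and the variable names are just illustrative choices. It shows that the precomposed and decomposed spellings differ as raw strings but convert into each other under NFC and NFD.

```python
import unicodedata

# Two spellings of the same word (illustrative sample, not from any real file).
precomposed = "M\u00fcller"   # ü as the single code point U+00FC
decomposed = "Mu\u0308ller"   # u followed by U+0308 COMBINING DIAERESIS

# NFC composes wherever a precomposed character exists; NFD decomposes fully.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

print(precomposed == decomposed)  # False: a raw comparison sees different code points
print(nfc == precomposed)         # True
print(nfd == decomposed)          # True
print(len(nfc), len(nfd))         # 6 7
```

Software that normalizes both sides before comparing will therefore treat the two spellings as the same text, whichever form was typed.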
See also question two, in the Unicode normalization FAQ: "NFC is the best form for general text, since it is more compatible with strings converted from legacy encodings."
Best,
On 14 January 2014 01:01, Marco Franceschini <firstname.lastname@example.org> wrote:
I’m devising a keyboard layout (on OS X) for the Italian "physical" keyboard, that allows the user to type all the combinations of a base character with one or more diacritics that are used for the transliteration of many Indian scripts as well as Arabic and Perso-Arabic scripts, in conformity with the main standards and transliteration schemes used in scholarly publications. I’m using Ukelele for this purpose.
My keyboard layout makes extensive use of dead keys: it allows the user to combine up to three diacritics with one base character, in order to let her/him add Vedic tone signs (represented by grave/acute or vertical stroke above/underbar) to the transliterated text. Diacritics can be typed in any order, and the base character must be typed after them. The complete list of the allowed combinations is available here:
My question is: should I encode the output as precomposed characters (or as combinations of a precomposed character plus added diacritics, as far as precomposed characters are available, of course), or should I use combining characters throughout (that is: sequences of the codes of all the glyphs that constitute the final character)?
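As a rough illustration of what "precomposed as far as available" amounts to in practice, the following Python sketch (standard-library unicodedata only) applies NFC to two stacked-diacritic sequences. The two sample sequences and the helper code_points are my own illustrative choices, not taken from the keyboard layout's list.

```python
import unicodedata

def code_points(text):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

# r + COMBINING MACRON + COMBINING DOT BELOW (marks given above-first)
r_seq = "r\u0304\u0323"
# a + COMBINING MACRON + COMBINING ACUTE ACCENT
a_seq = "a\u0304\u0301"

r_nfc = unicodedata.normalize("NFC", r_seq)
a_nfc = unicodedata.normalize("NFC", a_seq)

print(code_points(r_nfc))  # U+1E5D: a fully precomposed character exists
print(code_points(a_nfc))  # U+0101 U+0301: precomposed ā plus a combining acute
```

In the first case normalization reorders the marks (the dot below sorts before the macron) and finds a single precomposed character. In the second no single character exists, so NFC itself produces a precomposed base followed by the remaining combining mark. Note that two marks of the same combining class, such as macron and acute, are not reordered, so for those the typing order still matters.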
My keyboard is based on the “Italiano - Pro” keyboard layout that comes with OS X, in which just a few combinations of a base character+diacritic are provided. With a few exceptions, they are not used in the transliteration of Indian/Arabic scripts, but they are widely used in the Italian language (e.g.: è é ì ò ù etc.). All of these combinations are encoded by the “Italiano - Pro” keyboard layout as precomposed characters.
I’m tempted to use combining characters throughout (and to convert the encoding of the combinations inherited from the “Italiano - Pro” keyboard accordingly). But I hesitate, because I know that only a few word processors (e.g. Nisus, which I'm using) are able to recognize the two different encodings (precomposed and combining characters) as equivalent for Finding/Replacing and Sorting purposes, while the most widespread programs are not (Word for Mac, Neo Office, Open Office); and this fact would create problems if one adds/mixes text typed with my keyboard layout to an old file typed with the “Italiano - Pro” keyboard layout.
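The mixed-file problem described here can be sketched in a few lines of Python. This is only an illustration under assumed sample text. The helper names nfc and find_normalized are hypothetical, and this is not how any of the word processors named above actually work.

```python
import unicodedata

def nfc(text):
    """Normalize text to NFC (precomposed where possible)."""
    return unicodedata.normalize("NFC", text)

def find_normalized(haystack, needle):
    """Index of needle in the NFC-normalized haystack, or -1 if absent."""
    return nfc(haystack).find(nfc(needle))

old_text = "perch\u00e9 s\u00ec"     # precomposed é and ì, as an existing file might have
new_text = "perche\u0301 si\u0300"   # the same words typed with combining marks
mixed = old_text + " / " + new_text

# A naive search for the precomposed spelling misses the decomposed copy.
print(mixed.count("perch\u00e9"))             # 1
# Normalizing both the text and the search term finds both copies.
print(nfc(mixed).count(nfc("perche\u0301")))  # 2
print(find_normalized(new_text, "s\u00ec"))   # 7
```

A program that does not normalize behaves like the first search, which is why mixing the two encodings in one file causes trouble for Find/Replace.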
Precomposed characters or combining characters? This is the dilemma. Have any of you already faced such a quandary?
INDOLOGY mailing list