[INDOLOGY] Precomposed characters vs combining characters

Marco Franceschini franceschini.marco at fastwebnet.it
Wed Jan 15 16:59:58 UTC 2014


My heartfelt thanks to all the colleagues who replied to my query.

Best wishes,

Marco Franceschini
---

Il giorno 14/gen/2014, alle ore 14.10, Dominik Wujastyk ha scritto:

> Although the Unicode standard describes both forms as canonically normalized,* I would recommend precomposed (or NFC, Normalization Form Canonical Composition).  At the top of my XeLaTeX files, for example, I routinely say "\XeTeXinputnormalization=1" which means the output PDFs contain precomposed characters, whatever I do in my input file.  I think you should not pay too much attention to the current (dis)abilities of various word-processors.  In the Big World, and in the future, precomposed is what makes sense.  The task of intelligent printing, searching and sorting -- of searching for ss and also finding ß, for Mueller and finding Müller --  is most appropriately located in the rendering and search/sort routines, not the encoding of the text.  Actually, properly Unicode-compliant text processing utilities are required to handle all NF(K)C and NF(K)D forms without blinking.  Also, W3C normalization requires NFC. So if a text is going to be rendered on a website, it should be in NFC (or in a character reference entity, which looks nice but is normally horrible to work with).
> 
> See also question two, in the Unicode normalization FAQ: "NFC is the best form for general text, since it is more compatible with strings converted from legacy encodings."
> 
> ​Best,
> Dominik​
> 
> --
> ​See also.
>> 
> --
> Dr Dominik Wujastyk
> Department of South Asia, Tibetan and Buddhist Studies,
> University of Vienna,
> Spitalgasse 2-4, Courtyard 2, Entrance 2.1
> 1090 Vienna, Austria
> and 
> Adjunct Professor, 
> Division of Health and Humanities,
> St. John's Research Institute, Bangalore, India.
> Project | home page | HSSA | PGP
> 
> 
> 
> 
> 
> On 14 January 2014 01:01, Marco Franceschini <franceschini.marco at fastwebnet.it> wrote:
> Dear friends,
> 
> I’m devising a keyboard layout (on OS X) for the Italian "physical" keyboard, that allows the user to type all the combinations of a base character with one or more diacritics that are used for the transliteration of many Indian scripts as well as Arabic and Perso-Arabic scripts, in conformity with the main standards and transliteration schemes used in scholarly publications. I’m using Ukelele for this purpose.
> 
> My keyboard layout makes extensive use of dead keys: it allows the user to combine up to three diacritics to one base character, in order to let her/him to add Vedic tone signs (represented by grave/acute or vertical stroke above/underbar) to the transliterated text. Diacritics can be typed in any order, and the base character must be typed after them. The complete list of the allowed combinations is available here:
> https://www.dropbox.com/s/6k057ksula49zqf/TABELLA.pdf
> 
> My question is: should I encode the output as precomposed characters (or as combinations of a precomposed character plus added diacritics –as far as precomposed characters are available, of course) or should I use combining characters throughout (that is: sequences of the codes of all the glyphs that constitute the final character)?
> 
> My keyboard is based on the “Italiano - Pro” keyboard layout that comes with OS X, in which just a few combinations of a base character+diacritic are provided. With a few exceptions, they are not used in the transliteration of Indian/Arabic scripts, but they are widely used in Italian language (e.g.: è é ì ò ù etc.). All of these combinations are encoded by the “Italiano - Pro” keyboard layout as precomposed characters.
> 
> I’m tempted to use combining characters throughout (and to convert the encoding of the combinations inherited from the “Italiano - Pro” keyboard accordingly). But I hesitate, because I know that only a few word processors (e.g. Nisus, which I'm using) are able to recognize the two different encodings (precomposed and combining characters) as equivalent for Finding/Replacing and Sorting purposes, while the most widespread softwares are not (Word for Mac, Neo Office, Open Office); and this fact would create problems if one adds/mixes text typed with my keyboard layout to an old file typed with the “Italiano - Pro” keyboard layout.
> 
> Precomposed characters or combining characters? This is the dilemma. Has any of you already faced such a quandary?
> 
> Best,
> 
> Marco Franceschini
> ---
> 
> _______________________________________________
> INDOLOGY mailing list
> INDOLOGY at list.indology.info
> http://listinfo.indology.info
> 



-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://list.indology.info/pipermail/indology/attachments/20140115/14a275e1/attachment.htm>


More information about the INDOLOGY mailing list