[INDOLOGY] Precomposed characters vs combining characters

Dominik Wujastyk wujastyk at gmail.com
Tue Jan 14 13:10:27 UTC 2014


Although the Unicode standard treats both forms as canonically equivalent
(see <https://en.wikipedia.org/wiki/Unicode_equivalence>), I would
recommend precomposed characters, i.e. NFC (*Normalization Form Canonical
Composition*). At the top of my XeLaTeX files, for example, I routinely say
"\XeTeXinputnormalization=1", which normalizes the input to NFC, so the
output PDFs contain precomposed characters whatever I do in my input file.
I think you should not pay too much attention to the current (dis)abilities
of various word processors. In the Big World, and in the future,
precomposed is what makes sense. The task of intelligent printing,
searching and sorting -- of searching for ss and also finding ß, for
Mueller and finding Müller -- is most appropriately located in the
rendering and search/sort routines, not in the encoding of the text.
Actually, properly Unicode-compliant text-processing utilities are required
to handle all NF(K)C and NF(K)D forms without blinking. Also, the W3C
character model calls for NFC, so if a text is going to be rendered on a
website it should be in NFC (or in character entity references, which look
nice but are normally horrible to work with).

See also question two in the Unicode normalization FAQ
<http://www.unicode.org/faq/normalization.html>:
"NFC is the best form for general text, since it is more compatible with
strings converted from legacy encodings."
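
As a small illustration of that point (again Python; the byte string is a
made-up example of my own): text decoded from a legacy single-byte encoding
such as Latin-1 arrives with precomposed characters, i.e. already in NFC:

    import unicodedata

    legacy = b"M\xfcller"              # "Müller" in ISO 8859-1 (Latin-1)
    text = legacy.decode("latin-1")    # yields the precomposed ü (U+00FC)
    assert unicodedata.normalize("NFC", text) == text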

Best,
Dominik

--
See also <http://www.unicode.org/reports/tr15/#Implementation_Notes>

--
Dr Dominik Wujastyk
Department of South Asia, Tibetan and Buddhist Studies <http://stb.univie.ac.at>,
University of Vienna,
Spitalgasse 2-4, Courtyard 2, Entrance 2.1
1090 Vienna, Austria
and
Adjunct Professor,
Division of Health and Humanities,
St. John's Research Institute, <http://www.sjri.res.in/> Bangalore, India.
Project <http://www.istb.univie.ac.at/caraka/> | home page
<http://www.academia.edu/DominikWujastyk> | HSSA <http://hssa.sayahna.org> |
PGP <http://wujastyk.net/pgp.html>





On 14 January 2014 01:01, Marco Franceschini <franceschini.marco at fastwebnet.it> wrote:

> Dear friends,
>
>
> I’m devising a keyboard layout (on OS X) for the Italian "physical"
> keyboard that allows the user to type all the combinations of a base
> character with one or more diacritics that are used for the transliteration
> of many Indian scripts, as well as of Arabic and Perso-Arabic scripts, in
> conformity with the main standards and transliteration schemes used in
> scholarly publications. I’m using Ukelele for this purpose.
>
> My keyboard layout makes extensive use of dead keys: it allows the user to
> combine up to three diacritics with one base character, in order to let
> her/him add Vedic tone signs (represented by grave/acute or vertical
> stroke above/underbar) to the transliterated text. Diacritics can be typed
> in any order, and the base character must be typed after them. The complete
> list of the allowed combinations is available here:
>
> https://www.dropbox.com/s/6k057ksula49zqf/TABELLA.pdf
>
> My question is: should I encode the output as precomposed characters (or
> as combinations of a precomposed character plus added diacritics, as far
> as precomposed characters are available, of course), or should I use
> combining characters throughout (that is, sequences of the code points of
> the base character and each combining mark)?
>
> My keyboard is based on the “Italiano - Pro” keyboard layout that comes
> with OS X, in which just a few combinations of base character + diacritic
> are provided. With a few exceptions, these are not used in the
> transliteration of Indian/Arabic scripts, but they are widely used in
> Italian (e.g. è é ì ò ù, etc.). All of these combinations are
> encoded by the “Italiano - Pro” keyboard layout as precomposed characters.
>
> I’m tempted to use combining characters throughout (and to convert the
> encoding of the combinations inherited from the “Italiano - Pro” keyboard
> layout accordingly). But I hesitate, because I know that only a few word
> processors (e.g. Nisus, which I'm using) recognize the two different
> encodings (precomposed and combining characters) as equivalent for
> find/replace and sorting purposes, while the most widespread programs
> (Word for Mac, NeoOffice, OpenOffice) do not; and this would create
> problems if one adds or mixes text typed with my keyboard layout into an
> old file typed with the “Italiano - Pro” keyboard layout.
>
>
> Precomposed characters or combining characters? This is the dilemma. Has
> any of you already faced such a quandary?
>
>
> Best,
>
>
> Marco Franceschini
> ---
>



