Report on Unicode Proposed Encoding for Tibetan and Sinhalese
EVERSON at IRLEARN.UCD.IE
Fri Sep 17 18:57:05 UTC 1993
Everson Gunn Teoranta
15 Port Chaeimhghein I/ochtarach
Baile A/tha Cliath 2, E/ire
+353 1 478-2597
everson at irlearn.ucd.ie
The Unicode Consortium
1965 Charleston Avenue
Mountain View, CA 94043
Upon review of Unicode Technical Report #2, the Preliminary Draft
Proposals for Tibetan and Sinhalese encoding, I offer below the
following recommendations. The chief recommendation is that both
Tibetan and Sinhalese be encoded according to the standard ISCII
encoding for Brahmi-derived scripts in order to facilitate transfer of
texts written in the Sanskrit and Pali languages to the other
Brahmi-derived scripts in which these two languages are commonly
I trust that these comments will be of use. Thank you for giving me the
opportunity to review this material.
cc: Rick McGowan
cc: Tibetan Language Committee, Lhasa
REPORT ON UNICODE PROPOSED ENCODING FOR TIBETAN AND SINHALESE
1.0 Tibetan and Sinhalese scripts should have ISCII-parallel encodng to
support the Sanskrit and Pali languages.
1.1 The corpus of Sanskrit and Pali literature is enormous (estimated
to be larger than all of Classical Greek and Latin combined). Sanskrit
in particular occupies a unique position among world languages in that
it can, and has been, legitimately be written in many scripts:
Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil (Grantha),
Telugu, Kannada, Malayalam, Thai, Burmese, Javanese, Tibetan, and
Sinhalese, as well as the Latin alphabet (and other scripts), have each
been used to a greater or lesser extant in manuscripts and printed
books in the Sanskrit and Pali languages. Pali is most often written in
Burmese, Sinhalese, Thai, Lao, and Khmer scripts (as well as Latin in
the present day). The Unicode proposal for Burmese conforms to ISCII
encoding (at least with regard to the characters relevant to Sanskrit
and Pali), and the Unicode Draft Proposal for Khmer suggests that
either Thai-parallel encoding or ISCII-parallel encoding might be
appropriate "for political reasons". (I favour ISCII for Khmer because
its script is so Brahmi-conformant.)
1.2 ISCII-parallel encoding facilitates text transfer. Because Sanskrit
stands in the unique position of being the _only_ language in the world
which is/has been commonly written in _more than ten_ scripts, the
question of script transfer is particularly acute for Sanskrit texts.
Works written by scholars who read one script could, with parallel
encoding, be transliterated exactly via simple constant offset
algorithms, so long as the Brahmi-derived scripts are encoded in a
parallel fashion. This is the whole idea behind parallel encoding, and
I assume, in the absence of any particular knowledge of the history of
the ISCII encoding, that the idea was to permit simple script transfer
for the modern languages of India by providing a structure for such
constant offset transliteration. One advantage of this, for instance,
might be to facilitate the quick transliteration of computerized
corpora such as telephone directories or other databases for the
benefit of users who prefer one script over another. (This might be
considered an _a priori_ advantage for Sinhalese, as both Sinhalese and
Tamil are used in Sri Lanka.) For Sanskrit and Pali, the whole body of
traditional literature could (and should) benefit from the same
1.3 ISCII-parallel encoding does not impede processing in Tibetan or
Sinhalese. Obviously the most important use made of the Tibetan and
Sinhalese scripts is to write Tibetan and Sinhalese. The arguments
against ISCII-parallel encoding chiefly reside in the proposition that
"natural" sorting order is adversely affected by such encoding.
However, not only do the Unicode guidelines state specifically (Unicode
1.0 Vol. 1, p. 9, pp. 626-27) that sorting routines are to be dealt
with outside the codepage, but neither the ISCII-conformant scripts at
present, nor either the Tibetan or Sinhalese proposals currently under
review provide perfect sort order (characters U+0958 --> 095F in
Devanagari are encoded "out of order", for instance). Thus encoding via
"natural" sorting order 1) has hardly been implemented for _any_
Unicode script (Georgian comes close but for Khutsuri one would have to
interperse upper- and lower-case; "natural" sorting order would have
them alternate on the codepage) and 2) is of illusory value as sorting
algorithms take more complex information than codepage order into
account and, certainly for Tibetan and Sinhalese, will have to be
explicitly written whether they conform to ISCII or "natural" sorting
order. Dictionary sorting in Tibetan is hardly a straightforward linear
process, but rather a hierarchy of successively applied rules on a
syllable-by-syllable basis. "Natural" sorting order is of no particular
utility for Tibetan; I give below in an appendix Hannah's summary of
Tibetan sorting rules and an example of the sort of the letter KA in
1.4 Sorting Sanskrit and Pali is simpler with ISCII-parallel encoding.
The significant advantage of ISCII-parallel encoding for Tibetan and
Sinhalese is that only one sorting algorithm will need to be written
for the Sanskrit and Pali languages, with constant offset being taken
into account to rewrite the algorithms for each script. Each of the
ISCII-compatible scripts has some of its own features encoded, as
Tibetan and Sinhalese would require; but they are unified as to the
characters they all have in common, which as it turns out are the core
Brahmi characters used in Sanskrit and Pali. It would certainly be a
mistake not to provide codepoints in Tibetan for the Sanskrit voiced
aspirants; although they are stacked ligatures in Tibetan, they are
separate characters in Sanskrit and the same logic which provides LATIN
LETTER N J to match CYRILLIC LETTER NJE in support of Serbocroatian
(one language written in two scripts) must apply for Tibetan in support
of the Sanskrit language and its literature. ISCII compatibility serves
Sanskrit and Pali perfectly, and creates no particular burden for
either Sinhalese or Tibetan. The reverse is not the case. Stacked
ligatures GHA, JHA, DDHA, DHA, and BHA are not used except for Sanskrit
words; LHA is an exception in that it is used in Tibetan but not
Sanskrit (Hannah 1912:18). Precomposed initial vowels (A, AA, I, II, U,
UU etc.) might not be used for the Tibetan language itself, but are
necessary for painless reversible script transfer
1.5 The relation of Tibetan TSEG to Sanskrit requires more
investigation. Reversible transfer of text between ISCII-conformant
Sinhalese and the other ISCII-conformant scripts presents no problem.
However, the nature of Tibetan TSEG must be fully investigated in order
to determine whether it would present a problem for Tibetan-encoded
Sanskrit. Two possible solutions come to mind. One is not to have a
separate TSEG at all, but to make use of the virama and/or zero-width
joiners and zero-width-nonjoiners to determine syllable boundary, and
have the TSEG appear as a presentation form. In order to see if this is
feasible, I am going through my Tibetan dictionary taking down Sanskrit
words so that I can compare the two. At first glance it looks as though
a virama solution could work for Sanskrit in Tibetan script.
Another possible solution involves adding an invisible TSEG character
to the rest of the ISCII-conformant scripts to make them compatible
with Tibetan-script encoded Sanskrit texts (the unused first codepoint
in each script would be an ideal place to put this, e.g. U+0900,
U+0980, U+0A00, U+0A80, etc). Obviously input of this character would
not be required by other scripts, but it would be a _legal_ character
and would not have to be deleted from texts encoded first in Tibetan.
Texts transferred from e.g. Burmese script to Tibetan might need some
(possibly non-trivial) post-transfer processing to insert all the
TSEGs. ISCII-conformancy for Tibetan would ensure that at least all the
vowels and consonants would be represented accurately in parallel. The
implications for multi-script publishing, glossaries, and databases
which could be transparently accessed with a script filter, and so
forth, makes it well worth investigating this matter further.
1.6 Multi-script publishing of Sanskrit and Pali is a significant
concern. Scripts do not exist in a vacuum, and have no value in
themselves except insofar as they are used. It can be easily seen that
Tibetan and Sinhalese are used for Sanskrit and Pali and that
convenient, transparent, reversible interchange between them and
Devanagari and Burmese (two other scripts which make great use of
Sanskrit and Pali) should be supported by the encoding schemes in the
same way as convenient, transparent, reversible interchange between
Latin and Cyrillic is in support of Croatian and Serbian. It is easy to
envision a simple script filter utility that could that transliterate
and display text on CD-ROM _without reencoding_ the text for languages
like Sanskrit or Serbocroation which are commonly written in multiple
On 1 July 1993, Ashok Aklujkar (aklujkar at unixg.ubc.ca) said on the
"While the Sanskrit and Prakrit scholars at Indian institutions of
higher learning have hardly caught up with the computer age (and some
are unfortunately not even willing to consider how computers will
assist their and their institutions' work), the Jain monk-scholars, in
my experience, have shown a very progressive attitude. They are
collaborating with their lay computer experts and producing many
important tools of research. (I should perhaps be able to write more on
this in a report I am thinking of publishing on my tours for
manuscripts in India.) The Sharadaben Chimanlal Educational Research
Centre, "Darshan", Opposite Raakpur Society, Shahibag, Ahmedabad
380004, has about 100,000 Prakrit gaathaas computerised, besides large
sections of Jain aagama works and bibliographies of writings on
Jainism. It recently computer-published 3 vols. of Jambuvijayaji's
catalogue of Patan mss. and is said to have an IBM-compatible program
for Naagari sorting. Muni Jinendra-vijaya, whose work is guided out of
Jamnagar, Saurashtra, Gujarat, is in the process of compiling a
cumulative list of manuscripts in known or catalogued Jain collections.
The person assisting him on the computer side is: Mr. Mahendra Modi,
Galaxy Printers, Alankar Chambers, Dhebar Chowk, Rajkot 60 001. Mr.
Modi was looking for a Naagari sorting program for the Macintosh and
may have prevailed upon some computer programmer by now to develop one
for him and for the Muni's work."
Prakrit dialects are closely related to Sanskrit and Pali and Jain
literature would benefit from implementation of ISCII-encoding of the
Tibetan and Sinhalese scripts as well.
On 15 September 1993 Michael Witzel (witzel at husc3.harvard.edu) said on
the Indology list:
GOOD NEWS FROM THAILAND for all interested in Buddhism and Indian
Studies. The Dhammakaya Foundation has completed the input of the whole
Pali Tipitaka and is readying it for publication and distribution. The
Foundation has decided to distribute it on CD-ROM f r e e of charge
in the spirit of the teaching and propagation of Buddhism. All those
interested in receiving a copy of the text should write to: Nicolas C.
Wood, c/o Dattajivo Bhikkhu, The Dhammakaya Foundation, Pathumthani
12120 Thailand; or send an e-mail message c/o witzel at husc3.harvard.edu.
I have not as yet received an answer as to what format these texts are
encoded in,, but it will certainly be available in ISCII at some stage.
It seems to me that it would be contrary to Unicode principles to make
it less available to the Tibetan and Sinhalese scripts by not
implementing standard Brahmi-encoding for them. Constant-offset
transliteration, which implies ISCII-parallel encoding, is the best
solution to the fact that Sanskrit, Prakrit, and Pali are written in
many scripts. Parallel encoding will ensure that important texts such
as the Pali Tipit.aka (the Theravada Buddhist Canon, comparable to the
Judeo-Christian Bible but closer to the size of the Encyclopaedia
Britannica) could be made available, eventually, in a format which
could be read by anyone with an appropriate script filter.
I urge the members of the Unicode Consortium to discuss the question of
Sanskrit and Pali encoding with their colleagues in Tibet and Sri
Lanka. I am forwarding a copy of this report to the Indology list
(indology at liverpool.ac.uk), to the ISO 10646 list (iso10646 at jhuvm), to
the Asian Classics Input Project (acip at weell.sf.ca.us), to both
Buddhism discussions (buddha-l at ulkyvm and buddhist at jpntohok), the
Tamil list (tamil-l at dhdurz1), and the Tibetan Language Committee
at Tibet University in Lhasa for comment, as well as to the Unicode
Scripts Subcommittee (scripts-sc at unicode.org). I would appreciate it
if anyone could forward it on to a relevant body in Sri Lanka.
In a few days I will upload ISCII-parallel proposals for
encoding Tibetan and Sinhalese. I am sure that such proposals have been
submitted previously, but as I have not seen any I thought it best to
do so. I will also have looked a little more thoroughly at TSEG and
hope to offer some examples for discussion.
I look forward to further discussions of these points.
Das, Sarat Chandra. 1902, 1976. _A Tibetan-English dictionary, with
Sanskrit synonyms._ Rev. and edited under the orders of the Government
of Bengal by Graham Sandberg and A. William Heyde. Delhi : Motilal
Hannah, Herbert Bruce. 1912, 1985. _A grammar of the Tibetan language,
literary and colloquial._ With copious illustrations, and treating
fully of spelling, pronunciation and the construction of the verb, and
including appendices of the various forms of the verb. Delhi: Motilal
Appendix on Tibetan sorting (Hannah 1912:55-57):
[I have tried to follow the traditional transliteration of Tibetan for
Hannah's Tibetan examples and have modernized the format somewhat. For
7-bit transmission, I use _ for MACRON BELOW, * for DOT BELOW, / for
MACRON ABOVE, and @ for DOT ABOVE. Syllable boundary is represented by
a hyphen. The hardcopy is set in Tibetan as well as transliteration]
"#23. Use of the Tibetan Dictionary
The following appears to be the way in which the words in a Tibetan
dictionary (TSHIG-M_DSOD_) are arranged.
1. According to the order of the KA/-LI, or Consonantal Series of the
KA-KHA, regarded as Initials, or as they are sometimes called, Root
letters, with the inherent vowel-sound of A. The first thing,
therefore, that the student has to do, when he wants to look up a word,
is to ascertain what its initial letter is.
Then the words under each consonant, beginning for instance with KA,
are arranged thus:
2. The simple consonant, e.g. KA.
3. The simple consonant with subjuncts like H_A, WA-ZUR, or
S at A-LOG-KHA, e.g. LWA-BA 'woollen blanket'
4. The simple consonant with affixes, single and double, for the order
of which is as amongst themselves, see #16. [Order is GA, GA-SA, N at A,
N at A-SA, DA, NA, BA, BA-SA, MA, MA-SA, H_A, RA, LA, SA)]
5. Next, according to the foregoing order as regards their consonants,
words qualified by the vowel-signs GI-GU [I], SHABS_-KYU [U],
H_KREN at -BU [E], S_NA-RO [O], in that order.
6. Simple consonant qualified by YA-B_TAGS_ alone.
7. YA-B_TAGS_ words in all orders down to 5, inclusive.
8. Simple consonant qualified by RA-B_TAGS_ alone.
9. RA-B at TAGS@ words in all orders down to 5, inclusive.
10. Simple consonant qualified by HA-B_TAGS_ alone.
11. HA-B_TAGS_ words in all orders down to 5, inclusive.
12. Simple consonant qualified by LA-B_TAGS_ alone.
13. LA-B_TAGS_ words in all orders down to 5, inclusive.
14. Foreign or other special words formed with the Reversed letters.
15. Words with the Prefixes GA, DA, BA, MA, and H_A in that sequence,
and each sequence arranged according to the foregoing orders.
16. Consonant qualified by RA-M_GO.
17. RA-M_GO words according to foregoing orders.
18. Consonant qualified by LA-M_GO.
19. LA-M_GO words according to foregoing orders.
20. Consonant qualified by SA-M_GO.
21. SA-M_GO words according to foregoing orders.
22. No words with LA, as an Initial, and having any Superposed letter
like RA or SA, need be looked for under LA. They will only be found
under the head of the Superposed letter. Words in LA, however, are
found with qualifying vowel-signs, and such words may be looked up
N.B. Csoma de Ko"ro"s's Dictionary is differently arranged." [This is
the arrangement of Das' Dictionary, however.]
First syllables of the letter KA in Das' dictionary to illustrate sort
order. In traditional transliteration. I have tried to be complete but
cannot guarantee that this list is free of errors, either mine or Das'.
k_lun at s_
k_lon at s_
b_kron at s_
b_r_kyan at s_
s_kun at s_
s_kyan at s_
s_kyin at s_
s_kran at s_
b_s_kan at s_
b_s_kun at s_
b_s_kyan at s_
School of Architecture, UCD; Richview, Clonskeagh; Dublin 14; E/ire
Phone: +353 1 706-2745 Fax: +353 1 283-8908 Home: +353 1 478-2597
More information about the INDOLOGY