Report on Unicode Proposed Encoding for Tibetan and Sinhalese

Fri Sep 17 18:57:05 UTC 1993

Everson Gunn Teoranta
15 Port Chaeimhghein I/ochtarach
Baile A/tha Cliath 2, E/ire
+353 1 478-2597
everson at irlearn.ucd.ie

17.ix.93

The Unicode Consortium
1965 Charleston Avenue
Mountain View, CA 94043
USA

Upon review of Unicode Technical Report #2, the Preliminary Draft
Proposals for Tibetan and Sinhalese encoding, I offer the following
recommendations. The chief recommendation is that both
Tibetan and Sinhalese be encoded according to the standard ISCII
encoding for Brahmi-derived scripts in order to facilitate transfer of
texts written in the Sanskrit and Pali languages to the other
Brahmi-derived scripts in which these two languages are commonly
written.

I trust that these comments will be of use. Thank you for giving me the
opportunity to review this material.

     Michael Everson
     Director

cc: Rick McGowan
cc: Tibetan Language Committee, Lhasa

REPORT ON UNICODE PROPOSED ENCODING FOR TIBETAN AND SINHALESE

1.0 Tibetan and Sinhalese scripts should have ISCII-parallel encoding to
support the Sanskrit and Pali languages.

1.1 The corpus of Sanskrit and Pali literature is enormous (estimated
to be larger than all of Classical Greek and Latin combined). Sanskrit
in particular occupies a unique position among world languages in that
it can be, and has been, legitimately written in many scripts:
Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil (Grantha),
Telugu, Kannada, Malayalam, Thai, Burmese, Javanese, Tibetan, and
Sinhalese, as well as the Latin alphabet (and other scripts), have each
been used to a greater or lesser extent in manuscripts and printed
books in the Sanskrit and Pali languages. Pali is most often written in
Burmese, Sinhalese, Thai, Lao, and Khmer scripts (as well as Latin in
the present day). The Unicode proposal for Burmese conforms to ISCII
encoding (at least with regard to the characters relevant to Sanskrit
and Pali), and the Unicode Draft Proposal for Khmer suggests that
either Thai-parallel encoding or ISCII-parallel encoding might be
appropriate "for political reasons". (I favour ISCII for Khmer because
its script is so Brahmi-conformant.)

1.2 ISCII-parallel encoding facilitates text transfer. Because Sanskrit
stands in the unique position of being the _only_ language in the world
which is/has been commonly written in _more than ten_ scripts, the
question of script transfer is particularly acute for Sanskrit texts.
Works written by scholars who read one script could, with parallel
encoding, be transliterated exactly via simple constant offset
algorithms, so long as the Brahmi-derived scripts are encoded in a
parallel fashion. This is the whole idea behind parallel encoding, and
I assume, in the absence of any particular knowledge of the history of
the ISCII encoding, that the idea was to permit simple script transfer
for the modern languages of India by providing a structure for such
constant offset transliteration. One advantage of this, for instance,
might be to facilitate the quick transliteration of computerized
corpora such as telephone directories or other databases for the
benefit of users who prefer one script over another. (This might be
considered an _a priori_ advantage for Sinhalese, as both Sinhalese and
Tamil are used in Sri Lanka.) For Sanskrit and Pali, the whole body of
traditional literature could (and should) benefit from the same
facility.
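
As a rough illustration of constant-offset transfer, the following
Python sketch shifts code points between two blocks that are already
laid out in parallel, Devanagari at U+0900 and Bengali at U+0980. The
function name `transfer` and the 128-position block size are
assumptions made for this example, and positions that happen to be
unassigned in the target block are not checked.

```python
DEVANAGARI_BASE = 0x0900  # first code point of the Devanagari block
BENGALI_BASE = 0x0980     # first code point of the Bengali block
BLOCK_SIZE = 0x80         # assumed size of each parallel block


def transfer(text, src_base, dst_base):
    """Move every character of the source block to the same offset
    in the destination block; leave all other characters alone."""
    out = []
    for ch in text:
        cp = ord(ch)
        if src_base <= cp < src_base + BLOCK_SIZE:
            out.append(chr(cp - src_base + dst_base))
        else:
            out.append(ch)
    return "".join(out)


# KA, MA, LA in Devanagari, transferred to Bengali and back again.
devanagari = "\u0915\u092e\u0932"
bengali = transfer(devanagari, DEVANAGARI_BASE, BENGALI_BASE)
round_trip = transfer(bengali, BENGALI_BASE, DEVANAGARI_BASE)
```

Because the mapping is a plain addition, the reverse transfer is the
same function with the two bases swapped, which is what makes the
transfer reversible.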

1.3 ISCII-parallel encoding does not impede processing in Tibetan or
Sinhalese. Obviously the most important use made of the Tibetan and
Sinhalese scripts is to write Tibetan and Sinhalese. The arguments
against ISCII-parallel encoding chiefly reside in the proposition that
"natural" sorting order is adversely affected by such encoding.
However, not only do the Unicode guidelines state specifically (Unicode
1.0 Vol. 1, p. 9, pp. 626-27) that sorting routines are to be dealt
with outside the codepage, but neither the existing ISCII-conformant
scripts nor the Tibetan and Sinhalese proposals currently under review
provide perfect sort order (characters U+0958 to U+095F in Devanagari
are encoded "out of order", for instance). Thus encoding via
"natural" sorting order 1) has hardly been implemented for _any_
Unicode script (Georgian comes close but for Khutsuri one would have to
intersperse upper- and lower-case; "natural" sorting order would have
them alternate on the codepage) and 2) is of illusory value as sorting
algorithms take more complex information than codepage order into
account and, certainly for Tibetan and Sinhalese, will have to be
explicitly written whether they conform to ISCII or "natural" sorting
order. Dictionary sorting in Tibetan is hardly a straightforward linear
process, but rather a hierarchy of successively applied rules on a
syllable-by-syllable basis. "Natural" sorting order is of no particular
utility for Tibetan; I give below in an appendix Hannah's summary of
Tibetan sorting rules and an example of the sort of the letter KA in
Das' dictionary.
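
The Devanagari case mentioned above can be shown in a few lines of
Python. This is only a sketch: it uses the standard `unicodedata`
module to decompose U+0958 into KA plus NUKTA, and the resulting order
(the nukta letter placed directly after its base letter) is an
assumption chosen for illustration, not a claim about any particular
dictionary.

```python
import unicodedata

KA, KHA, HA, QA = "\u0915", "\u0916", "\u0939", "\u0958"


def sort_key(word):
    """Sort on the canonically decomposed form, so that U+0958..U+095F
    are keyed as base consonant followed by NUKTA."""
    decomposed = unicodedata.normalize("NFD", word)
    return tuple(ord(ch) for ch in decomposed)


letters = [HA, QA, KHA, KA]

# Plain codepage order leaves QA (U+0958) stranded after HA (U+0939).
by_codepoint = sorted(letters)

# An explicit key moves QA next to KA, independent of its codepoint.
by_key = sorted(letters, key=sort_key)
```

Either way a key function has to be written; the position of the
character on the codepage does not do the work.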

1.4 Sorting Sanskrit and Pali is simpler with ISCII-parallel encoding.
The significant advantage of ISCII-parallel encoding for Tibetan and
Sinhalese is that only one sorting algorithm will need to be written
for the Sanskrit and Pali languages, with constant offset being taken
into account to rewrite the algorithms for each script. Each of the
ISCII-compatible scripts has some of its own features encoded, as
Tibetan and Sinhalese would require; but they are unified as to the
characters they all have in common, which as it turns out are the core
Brahmi characters used in Sanskrit and Pali. It would certainly be a
mistake not to provide codepoints in Tibetan for the Sanskrit voiced
aspirates; although they are stacked ligatures in Tibetan, they are
separate characters in Sanskrit and the same logic which provides LATIN
LETTER N J to match CYRILLIC LETTER NJE in support of Serbocroatian
(one language written in two scripts) must apply for Tibetan in support
of the Sanskrit language and its literature. ISCII compatibility serves
Sanskrit and Pali perfectly, and creates no particular burden for
either Sinhalese or Tibetan. The reverse is not the case. Stacked
ligatures GHA, JHA, DDHA, DHA, and BHA are not used except for Sanskrit
words; LHA is an exception in that it is used in Tibetan but not
Sanskrit (Hannah 1912:18). Precomposed initial vowels (A, AA, I, II, U,
UU etc.) might not be used for the Tibetan language itself, but are
necessary for painless reversible script transfer.
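
A minimal sketch of the "one algorithm, many scripts" idea follows,
in Python. The helper `make_sanskrit_key` and its optional `rank`
table are hypothetical names introduced here; the only per-script
input is the block base, and the demonstration again uses Devanagari
and Bengali because those blocks are already parallel.

```python
DEVANAGARI_BASE = 0x0900
BENGALI_BASE = 0x0980


def make_sanskrit_key(block_base, rank=None):
    """Build a sort key that works on offsets within a block.
    `rank` may map an offset to a different weight; by default the
    offset itself is the weight."""
    rank = rank or {}

    def key(word):
        offsets = (ord(ch) - block_base for ch in word)
        return tuple(rank.get(off, off) for off in offsets)

    return key


# GA-JA, KA-MA-LA, KHA-GA written in Devanagari.
devanagari_words = ["\u0917\u091c", "\u0915\u092e\u0932", "\u0916\u0917"]
# The same words at the same offsets in the Bengali block.
bengali_words = ["".join(chr(ord(ch) + 0x80) for ch in w)
                 for w in devanagari_words]

dev_sorted = sorted(devanagari_words, key=make_sanskrit_key(DEVANAGARI_BASE))
ben_sorted = sorted(bengali_words, key=make_sanskrit_key(BENGALI_BASE))
```

Both lists come out in the same order because the key never looks at
the script, only at the position within the block.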

1.5 The relation of Tibetan TSEG to Sanskrit requires more
investigation. Reversible transfer of text between ISCII-conformant
Sinhalese and the other ISCII-conformant scripts presents no problem.
However, the nature of Tibetan TSEG must be fully investigated in order
to determine whether it would present a problem for Tibetan-encoded
Sanskrit. Two possible solutions come to mind. One is not to have a
separate TSEG at all, but to make use of the virama and/or zero-width
joiners and zero-width non-joiners to determine syllable boundaries, and
have the TSEG appear as a presentation form. In order to see if this is
feasible, I am going through my Tibetan dictionary taking down Sanskrit
words so that I can compare the two. At first glance it looks as though
a virama solution could work for Sanskrit in Tibetan script.

Another possible solution involves adding an invisible TSEG character
to the rest of the ISCII-conformant scripts to make them compatible
with Tibetan-script encoded Sanskrit texts (the unused first codepoint
in each script would be an ideal place to put this, e.g. U+0900,
U+0980, U+0A00, U+0A80, etc). Obviously input of this character would
not be required by other scripts, but it would be a _legal_ character
and would not have to be deleted from texts encoded first in Tibetan.
Texts transferred from e.g. Burmese script to Tibetan might need some
(possibly non-trivial) post-transfer processing to insert all the
TSEGs. ISCII-conformancy for Tibetan would ensure that at least all the
vowels and consonants would be represented accurately in parallel. The
implications for multi-script publishing, glossaries, and databases
which could be transparently accessed with a script filter, and so
forth, make it well worth investigating this matter further.

1.6 Multi-script publishing of Sanskrit and Pali is a significant
concern. Scripts do not exist in a vacuum, and have no value in
themselves except insofar as they are used. It can be easily seen that
Tibetan and Sinhalese are used for Sanskrit and Pali and that
convenient, transparent, reversible interchange between them and
Devanagari and Burmese (two other scripts which make great use of
Sanskrit and Pali) should be supported by the encoding schemes in the
same way as convenient, transparent, reversible interchange between
Latin and Cyrillic is in support of Croatian and Serbian. It is easy to
envision a simple script filter utility that could transliterate
and display text on CD-ROM _without reencoding_ the text for languages
like Sanskrit or Serbocroatian which are commonly written in multiple
scripts.

On 1 July 1993, Ashok Aklujkar (aklujkar at unixg.ubc.ca) said on the
Indology list:
"While the Sanskrit and Prakrit scholars at Indian institutions of
higher learning have hardly caught up with the computer age (and some
are unfortunately not even willing to consider how computers will
assist their and their institutions' work), the Jain monk-scholars, in
my experience, have shown a very progressive attitude. They are
collaborating with their lay computer experts and producing many
important tools of research. (I should perhaps be able to write more on
this in a report I am thinking of publishing on my tours for
manuscripts in India.) The Sharadaben Chimanlal Educational Research
Centre, "Darshan", Opposite Raakpur Society, Shahibag, Ahmedabad
380004, has about 100,000 Prakrit gaathaas computerised, besides large
sections of Jain aagama works and bibliographies of writings on
Jainism. It recently computer-published 3 vols. of Jambuvijayaji's
catalogue of Patan mss. and is said to have an IBM-compatible program
for Naagari sorting. Muni Jinendra-vijaya, whose work is guided out of
Jamnagar, Saurashtra, Gujarat, is in the process of compiling a
cumulative list of manuscripts in known or catalogued Jain collections.
The person assisting him on the computer side is: Mr. Mahendra Modi,
Galaxy Printers, Alankar Chambers, Dhebar Chowk, Rajkot 60 001. Mr.
Modi was looking for a Naagari sorting program for the Macintosh and
may have prevailed upon some computer programmer by now to develop one
for him and for the Muni's work."

Prakrit dialects are closely related to Sanskrit and Pali and Jain
literature would benefit from implementation of ISCII-encoding of the
Tibetan and Sinhalese scripts as well.

On 15 September 1993 Michael Witzel (witzel at husc3.harvard.edu) said on
the Indology list:
"GOOD NEWS FROM THAILAND for all interested in Buddhism and Indian
Studies. The Dhammakaya Foundation has completed the input of the whole
Pali Tipitaka and is readying it for publication and distribution. The
Foundation has decided to distribute it on CD-ROM  f r e e  of charge
in the spirit of the teaching and propagation of Buddhism. All those
interested in receiving a copy of the text should write to: Nicolas C.
Wood, c/o Dattajivo Bhikkhu, The Dhammakaya Foundation, Pathumthani
12120 Thailand; or send an e-mail message c/o witzel at husc3.harvard.edu."

I have not as yet received an answer as to what format these texts are
encoded in, but it will certainly be available in ISCII at some stage.
It seems to me that it would be contrary to Unicode principles to make
it less available to the Tibetan and Sinhalese scripts by not
implementing standard Brahmi-encoding for them. Constant-offset
transliteration, which implies ISCII-parallel encoding, is the best
solution to the fact that Sanskrit, Prakrit, and Pali are written in
many scripts. Parallel encoding will ensure that important texts such
as the Pali Tipit.aka (the Theravada Buddhist Canon, comparable to the
Judeo-Christian Bible but closer to the size of the Encyclopaedia
Britannica) could be made available, eventually, in a format which
could be read by anyone with an appropriate script filter.

I urge the members of the Unicode Consortium to discuss the question of
Sanskrit and Pali encoding with their colleagues in Tibet and Sri
Lanka. I am forwarding a copy of this report to the Indology list
(indology at liverpool.ac.uk), to the ISO 10646 list (iso10646 at jhuvm), to
the Asian Classics Input Project (acip at weell.sf.ca.us), to both
Buddhism discussions (buddha-l at ulkyvm and buddhist at jpntohok), the
Tamil list (tamil-l at dhdurz1), and the Tibetan Language Committee
at Tibet University in Lhasa for comment, as well as to the Unicode
Scripts Subcommittee (scripts-sc at unicode.org). I would appreciate it
if anyone could forward it on to a relevant body in Sri Lanka.
In a few days I will upload ISCII-parallel proposals for
encoding Tibetan and Sinhalese. I am sure that such proposals have been
submitted previously, but as I have not seen any I thought it best to
do so. I will also have looked a little more thoroughly at TSEG and
hope to offer some examples for discussion.

I look forward to further discussions of these points.

  Michael Everson

References:
Das, Sarat Chandra. 1902, 1976. _A Tibetan-English dictionary, with
Sanskrit synonyms._ Rev. and edited under the orders of the Government
of Bengal by Graham Sandberg and A. William Heyde. Delhi : Motilal
Banarsidass.

Hannah, Herbert Bruce. 1912, 1985. _A grammar of the Tibetan language,
literary and colloquial._ With copious illustrations, and treating
fully of spelling, pronunciation and the construction of the verb, and
including appendices of the various forms of the verb. Delhi: Motilal
Banarsidass.

Appendix on Tibetan sorting (Hannah 1912:55-57):

[I have tried to follow the traditional transliteration of Tibetan for
Hannah's Tibetan examples and have modernized the format somewhat. For
7-bit transmission, I use _ for MACRON BELOW, * for DOT BELOW, / for
MACRON ABOVE, and @ for DOT ABOVE. Syllable boundary is represented by
a hyphen. The hardcopy is set in Tibetan as well as transliteration]

"#23. Use of the Tibetan Dictionary

The following appears to be the way in which the words in a Tibetan
dictionary (TSHIG-M_DSOD_) are arranged.

1. According to the order of the KA/-LI, or Consonantal Series of the
KA-KHA, regarded as Initials, or as they are sometimes called, Root
letters, with the inherent vowel-sound of A. The first thing,
therefore, that the student has to do, when he wants to look up a word,
is to ascertain what its initial letter is.
  Then the words under each consonant, beginning for instance with KA,
are arranged thus:

2. The simple consonant, e.g. KA.

3. The simple consonant with subjuncts like H_A, WA-ZUR, or
S@A-LOG-KHA, e.g. LWA-BA 'woollen blanket'

4. The simple consonant with affixes, single and double, for the order
of which is as amongst themselves, see #16. [Order is GA, GA-SA, N@A,
N@A-SA, DA, NA, BA, BA-SA, MA, MA-SA, H_A, RA, LA, SA]

5. Next, according to the foregoing order as regards their consonants,
words qualified by the vowel-signs GI-GU [I], SHABS_-KYU [U],
H_KREN@-BU [E], S_NA-RO [O], in that order.

6. Simple consonant qualified by YA-B_TAGS_ alone.

7. YA-B_TAGS_ words in all orders down to 5, inclusive.

8. Simple consonant qualified by RA-B_TAGS_ alone.

9. RA-B_TAGS_ words in all orders down to 5, inclusive.

10. Simple consonant qualified by HA-B_TAGS_ alone.

11. HA-B_TAGS_ words in all orders down to 5, inclusive.

12. Simple consonant qualified by LA-B_TAGS_ alone.

13. LA-B_TAGS_ words in all orders down to 5, inclusive.

14. Foreign or other special words formed with the Reversed letters.

15. Words with the Prefixes GA, DA, BA, MA, and H_A in that sequence,
and each sequence arranged according to the foregoing orders.

16. Consonant qualified by RA-M_GO.

17. RA-M_GO words according to foregoing orders.

18. Consonant qualified by LA-M_GO.

19. LA-M_GO words according to foregoing orders.

20. Consonant qualified by SA-M_GO.

21. SA-M_GO words according to foregoing orders.

22. No words with LA, as an Initial, and having any Superposed letter
like RA or SA, need be looked for under LA. They will only be found
under the head of the Superposed letter. Words in LA, however, are
found with qualifying vowel-signs, and such words may be looked up
under LA.

N.B. Csoma de Ko"ro"s's Dictionary is differently arranged." [This is
the arrangement of Das' Dictionary, however.]
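
To make the hierarchy in Hannah's rules concrete, here is a simplified
Python sketch of a sort key for syllables under a single root letter.
Everything about its shape is an assumption for illustration: the
`Syllable` record, the pre-parsed components, and the reduction of the
rules to five ranked fields. It ignores rules 3, 14, and 22, takes the
analysis of each syllable as given rather than parsing it, and does not
claim to reproduce Das' order in every detail. Spellings follow the
7-bit transliteration used in this appendix.

```python
from typing import NamedTuple

# Rank tables taken from the order given in Hannah's rules.
SUPERSCRIPTS = ["", "r", "l", "s"]          # rules 16-21
PREFIXES = ["", "g", "d", "b", "m", "h_"]   # rule 15
SUBJOINED = ["", "y", "r", "h", "l"]        # rules 6-13
VOWELS = ["a", "i", "u", "e", "o"]          # rule 5
SUFFIXES = ["", "g", "gs", "n@", "n@s", "d", "n", "b", "bs",
            "m", "ms", "h_", "r", "l", "s"]  # rule 4


class Syllable(NamedTuple):
    label: str             # transliterated form, for display only
    superscript: str = ""
    prefix: str = ""
    subjoined: str = ""
    vowel: str = "a"
    suffix: str = ""


def sort_key(s):
    """Compare the most significant component first, then the next."""
    return (SUPERSCRIPTS.index(s.superscript),
            PREFIXES.index(s.prefix),
            SUBJOINED.index(s.subjoined),
            VOWELS.index(s.vowel),
            SUFFIXES.index(s.suffix))


syllables = [
    Syllable("s_ka", superscript="s"),
    Syllable("kra", subjoined="r"),
    Syllable("d_kar", prefix="d", suffix="r"),
    Syllable("ki", vowel="i"),
    Syllable("kag", suffix="g"),
    Syllable("r_ka", superscript="r"),
    Syllable("ka"),
    Syllable("kya", subjoined="y"),
    Syllable("b_s_kur", superscript="s", prefix="b", vowel="u", suffix="r"),
]

ordered = [s.label for s in sorted(syllables, key=sort_key)]
```

The point of the sketch is only that the comparison works on the
analysed syllable, field by field, and not on the order in which the
letters are typed or encoded.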

First syllables of the letter KA in Das' dictionary to illustrate sort
order, in traditional transliteration. I have tried to be complete but
cannot guarantee that this list is free of errors, either mine or Das'.

ka
kak
kako
kag
kan@
kad_
kan
kan*
kab
kam
kah_u
kar
kal
ka/
kwa
ks*a
ks*e
ki
kin@
kim
kih_u
kil
ki/
ki/n@
ku
kug
kun@
kun%
kun
kur
kul
ke
keh_u
keg
ken@
ker
kel
kai
ko
kog
kon@
kod_
kon
kob
kom
kor
kol_
kos_
kya
kyag
kyan@
kyar
kyal_
kyi
kyig
kyin@
kyin
kyir
kyis_
kyu
kyur
kye
kyo
kyog
kyon@
kyom
kyor
kyol
kra
krag
kran@
krad_
kran
krab
kram
kri
krig
krin@
krin
kris*
kru
krun@
krum
krums_
kre
kro
krog
kron@
kron
k_la
k_lag
k_lags_
k_lad_
k_lan
k_lam
k_lal_
k_las_
k_lin@
k_lu
k_lun@
k_lun@s_
k_lub
k_lus_
k_log
k_lon@
k_lon@s_
ks\a
d_kag
d_kan
d_kah_
d_kar
d_ku
d_kon
d_kor
d_kol_
d_kos_
d_kyar
d_kyil
d_kyu
d_kyud_
d_kyus_
d_kyel_
d_kyor
d_kram
d_kri
d_krig
d_krigs_
d_kris_
d_kru
d_krug
d_krugs_
d_krum
d_kre
d_krog
d_krogs_
d_kron@
d_krol
b_kag
b_kan@
b_kad_
b_kan
b_kab
b_kam
b_kah_
b_kah_i
b_kar
b_kal_
b_kas_
b_ku
b_kug
b_kum
b_kur
b_kog
b_kon@
b_kod_
b_kon
b_kor
b_kol
b_kyal_
b_kyig
b_kye
b_kyed_
b_kyon
b_kra
b_krag
b_krab
b_kram
b_kral_
b_kras_
b_kri
b_krid_
b_kris_
b_kru
b_krug
b_krus_
b_kre
b_kren
b_kres_
b_kron@s_
b_krol
b_kros_
b_k_lags_
r_ka
r_kan@
r_kan
r_kam
r_ku
r_kur
r_kun
r_kub
r_ke
r_ked_
r_ko
r_kog
r_kon@
r_kod_
r_kon
b_r_kam
b_r_kus_
b_r_ko
b_r_kos_
r_kyag
r_kyan@
r_kyan
r_kyal_
r_kyen
r_kyon@
b_r_kyan@
b_r_kyan@s_
l_kug
l_kugs_
l_kog
l_kob
l_kol
s_ka
s_kag
s_kan@
s_kad_
s_kan
s_kab
s_kabs_
s_kam
s_kams_
s_kar
s_kal_
s_kas_
s_ku
s_kugs_
s_kun@
s_kun@s_
s_kud_
s_kun
s_kub
s_kum
s_kur
s_kul_
s_ke
s_keg
s_ken@
s_ked_
s_kem
s_kems_
s_ker
s_ko
s_kogs_
s_kon@
s_kon
s_kobs_
s_kom
s_koms_
s_kor
s_kol
s_kos_
s_kya
s_kyag
s_kyan@
s_kyan@s_
s_kyabs_
s_kyar
s_kyal_
s_kyas_
s_kyi
s_kyig
s_kyin@
s_kyin@s_
s_kyid_
s_kyin
s_kyibs_
s_kyim
s_kyil
s_kyu
s_kyug
s_kyus_
s_kyun@
s_kyud_
s_kyur
s_kyus_
s_kye
s_kyeg
s_kyegs_
s_kyen@
s_kyed_
s_kyen
s_kyem
s_kyems_
s_kyer
s_kyel_
s_kyes_
s_kyo
s_kyog
s_kyogs_
s_kyon@
s_kyod_
s_kyon
s_kyob
s_kyobs_
s_kyom
s_kyor
s_kyol
s_kyos_
s_kra
s_krah_i
s_krag
s_kran@
s_kran@s_
s_kran
s_krab
s_kras_
s_kri
s_kru
s_krud_
s_krun
s_krum
s_kro
s_krog
s_krod_
b_s_ka
b_s_kan@
b_s_kan@s_
b_s_kam
b_s_kams_
b_s_kal
b_s_ku
b_s_kun@s_
b_s_kum
b_s_kur
b_s_kul
b_s_kus_
b_s_kon
b_s_kor
b_s_kos_
b_s_kyan@
b_s_kyan@s_
b_s_kyabs_
b_s_kyams_
b_s_kyar
b_s_kyur
b_s_kyed_
b_s_kyod_
b_s_krad_
b_s_krun

==========

Michael Everson
School of Architecture, UCD; Richview, Clonskeagh; Dublin 14; E/ire
Phone: +353 1 706-2745  Fax: +353 1 283-8908  Home: +353 1 478-2597