Announcing CSX+

John Smith jds10 at CUS.CAM.AC.UK
Thu Jun 11 11:26:01 UTC 1998


Subscribers to the Conv-dev mailing list on transliteration of
Indian languages <conv-dev at elot.gr> may already be aware that there
is a proposal to implement a new computer encoding (character set)
for use with material in such languages. The new encoding, dubbed
"CSX+", is to be based on the existing CSX character set, which is
probably the most widely used such encoding. The intention is that it
should be as nearly as possible compatible with CSX, so
that most CSX users could start using CSX+ fonts and notice no
difference; however, it will extend the CSX set with extra
characters, in particular those required by the draft ISO standard
which the Conv-dev discussion has produced. This amounts to an
attempt to provide the most useful possible set of "Indian" accented
characters, and I should welcome input from others to ensure that
the best choice is made.

Documents presenting the draft standard for transliteration can be
seen at:

  http://ourworld.compuserve.com/homepages/stone_catend/trdcd1a.htm

Those interested in contributing to a debate on what CSX+ ought to
be like should read on; others can simply await future announcements
with interest or apathy, as appropriate.

--------------------------------------------------------------------

                               CSX+

1.  The two attachments to this email message present two different
"views" of a proposed CSX+ standard, one ordered by character
sequence, the other alphabetically by character name. The remainder
of this document consists of an explanation of how the proposal has
been arrived at. Earlier drafts were seen by and discussed with
Dominik Wujastyk, Anshuman Pandey, Anthony Stone and John Clews; I
am grateful to them for their comments.

2.  The underlying philosophy is that CSX+ ought to be a strict
superset of CSX, so that "upgrading" is painless; CSX text should
appear unchanged in CSX+ fonts. Specifically, (A) there should be no
changes to the position of existing CSX characters; (B) no existing
CSX character should be deleted; (C) no attempt should be made to
map draft ISO standard usage on to the encoding: "r-underdot-macron"
will appear in the same place in CSX+ as it does in CSX, even though
the draft standard states that the preferred usage for Sanskrit long
vocalic "r" is "r-underring-macron". (The form with underring will
also be made available, of course.)

3.  A single compromise has proved to be necessary, in the area of
aim (A) above. CSX, which came into being in the MS-DOS era, uses
position 160 (decimal) for a-acute. That slot on modern PCs is
sacred to the non-breaking space; the character has thus become
inaccessible to many people, and will have to be moved to a new
position. Fortunately, this one incompatible change will not
inconvenience the majority of users: documents containing Vedic are
likely to be the biggest problem. (The Macintosh uses slot 202 for
its non-breaking space, and this too will be held vacant.)
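
As an illustration only, the one incompatible change can be sketched
as a byte re-mapping in Python. The target position 158 for a-acute
and the list of other reassigned slots are taken from the attached
CSX+ table; the function names are hypothetical, and the sketch does
not check whatever previously occupied slot 158, since that is not
stated here.

```python
# Sketch: convert CSX-encoded bytes to CSX+ by moving a-acute.
# Positions come from the attached table; names are illustrative.
CSX_A_ACUTE = 160      # a-acute in CSX (non-breaking space on PCs)
CSXPLUS_A_ACUTE = 158  # a-acute in the proposed CSX+ table

# Slots marked "Was ... in CSX" in the attached table: their old
# characters (y dieresis, cent, yen, guillemots) have no CSX+ home.
REASSIGNED = {152, 155, 157, 174, 175}

# 256-entry translation table: identity except for a-acute.
_TABLE = bytes(
    CSXPLUS_A_ACUTE if b == CSX_A_ACUTE else b for b in range(256)
)


def csx_to_csxplus(data: bytes) -> bytes:
    """Return data with CSX a-acute moved to its CSX+ position."""
    return data.translate(_TABLE)


def reassigned_bytes_used(data: bytes) -> set:
    """Report reassigned slots present in CSX data, for manual review."""
    return set(data) & REASSIGNED
```

A caller would run reassigned_bytes_used() on the CSX text first and
convert only if the result is empty or has been dealt with by hand.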

4.  Strictly speaking, a-acute is not part of the CSX character set,
but belongs to the PC's code page 437, along with other "European"
accented characters that CSX borrows. If we add all these characters
to those officially defined by CSX, and also add in the three extra
"Indian" characters that have become widely available in recognised
character positions (171 a-macron-tilde, 172 i-macron-tilde, 216
u-macron-tilde), we find that 91 positions in the upper half of the
encoding are already occupied, leaving 37 vacant for new characters.
However, to keep both Mac and PC non-breaking space free, this has
to be reduced to 35.

5.  The draft ISO standard requires the following 24 characters that
do not form part of CSX. Only lower-case versions are shown, and
where the standard proposes a "productive" usage (e.g. "tilde over a
vowel means nasalised vowel"), only forms known to occur are
included (so no "e-macron-tilde"):

   ae-macron
   u-breve
   r-underring
   r-underring-acute
   r-underring-grave
   r-underring-macron
   r-underring-macron-acute
   l-underring
   l-underring-acute
   l-underring-macron
   e-macron
   o-macron
   r-underdieresis
   y-overdot
   r-breve
   m-candrabindu
   n-breve
   t-underbar
   k-underbar
   kh-underbar
   g-overdot
   c-circumflex
   h-underbar
   h-underbreve

Note that the precise *form* of some of these may still change; for
example, there has been a recent proposal to replace r-underdieresis
by z-underdot. But such changes will not affect the number of
necessary new characters, and the final form of the draft submission
to ISO should be available very soon.

6.  I would wish to add M-overdot, since the draft standard now
recommends m-overdot for anusvara, and it would otherwise become the
only "core" character without a capital equivalent. I would also
wish to add the following, as being centrally useful characters in
any text font: sterling (which was normally available in CSX fonts),
quotedblleft, quotedblright, endash, emdash. This exercise uses up
all but five of the "spare" character positions. ("Smart quotes"
cannot, alas, be made to work in Word, since the program assumes
that the characters in question are located at specific positions,
and none of the positions in question is free for this use.)

7.  There is a single "European" accented character, y-dieresis,
that CSX borrows from code page 437 but that is not used in any
Indian or European language. (It was probably a mistaken version of
"Dutch y", written "ij".) I have eliminated it to save one more
slot: thus six slots are available for further new characters.

8.  It seems to me that the best use for these and any other spare
slots that we can manufacture is to assign them to capitalised
versions of those new characters with the most need for capital
forms. I have tentatively given these first six to AE-macron,
E-macron, O-macron, Kh-underbar, G-overdot and C-circumflex.
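
The slot arithmetic of paragraphs 4-8 can be checked with a short
Python sketch. The figures are those quoted above; the variable names
are purely illustrative.

```python
# Slot budget for the upper half of the encoding (positions 128-255),
# using the figures quoted in paragraphs 4-8.
UPPER_HALF = 128
occupied = 91                      # CSX plus borrowed and extra characters
vacant = UPPER_HALF - occupied     # 37
reserved = 2                       # non-breaking space: 160 (PC), 202 (Mac)
available = vacant - reserved      # 35

iso_required = 24                  # new characters from the draft standard
extras = ["M-overdot", "sterling", "quotedblleft",
          "quotedblright", "endash", "emdash"]
spare = available - iso_required - len(extras)   # 5
spare += 1                         # y-dieresis eliminated: 6

capitals = ["AE-macron", "E-macron", "O-macron",
            "Kh-underbar", "G-overdot", "C-circumflex"]
assert len(capitals) == spare

# Cross-check: everything assigned plus the reserved slots fills the half.
filled = (occupied - 1) + iso_required + len(extras) + len(capitals) + reserved
print(vacant, available, spare, filled)   # prints: 37 35 6 128
```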

9.  A case could perhaps be made to eliminate some of the more
outré Sanskrit characters, if there are other genuinely worthy
claims to their slots. But it is clear that the room for manoeuvre
is very slight, so good cases would have to be made.

10. I am aware of one further problem, which affects character 183
in some of Gates's operating systems. There appears to be no
documentation of this feature, so I describe it as well as I can: in
Windows 3.1 and Windows 95 the character at position 183 of a font
is made inaccessible to the user if the font uses a non-Windows
encoding, apparently in order to simplify slightly the task of
displaying "soft spaces" in Word. In CSX+ this makes the character
i-macron-acute unavailable to users of these systems. It should be
said that exactly the same applies to existing CSX fonts, and that
as far as I know nobody has ever even noticed the problem; in
addition, I gather that in Windows NT the difficulty does not arise,
suggesting that perhaps newer Microsoft systems will act more
politely in this area. I shall try to find out how Windows 98 will
behave. I do not favour a change to the CSX(+) character set to get
round the problem, but it might be necessary to think up some way of
making i-macron-acute available to those who need it and cannot
currently use it.
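
Anyone wishing to find out whether existing documents are affected
could scan them for the byte in question. The following Python sketch
is a suggestion only; the function name is hypothetical, and it
assumes the text is available as raw CSX or CSX+ bytes.

```python
# Sketch: locate occurrences of position 183 (i-macron-acute) in
# CSX or CSX+ encoded data. The function name is illustrative.
I_MACRON_ACUTE = 183


def find_positions(data: bytes, code: int = I_MACRON_ACUTE) -> list:
    """Return the byte offsets at which the given code occurs."""
    return [offset for offset, b in enumerate(data) if b == code]
```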

11. I would welcome comments on these proposals. However, there are
real constraints on the time that can be spent on a discussion, so I
hope it can be brisk and focused. I should say that it would be very
hard to persuade me to make major changes in the areas covered by
paragraphs 2-5 above; most useful would be advice on which
characters ought to win the scramble for the last few places. I
shall place a copy of this message on Conv-dev also, but I suggest
that discussion takes place on Indology <indology at liverpool.ac.uk>.
I shall be happy to do a reasonable amount of message-forwarding for
Conv-dev members who do not subscribe to Indology.

12. After, say, two weeks of discussion, I shall finalise the CSX+
standard and build a set of fonts to implement it: virtual fonts for
TeX, and Type 1 PostScript and TrueType fonts for PCs and
Macintoshes. (I do not have access to good Mac font software, and
would only be willing to make a Mac translation of one of the five
typefaces that I shall build for the PC: would anyone else like to
volunteer to do the job?) The fonts should be available within a
matter of a few days once the standard is agreed. As time permits I
shall also try to produce programs to handle conversion of text in
other encodings to CSX+.

John Smith

--
Dr J. D. Smith                *  jds10 at cam.ac.uk
Faculty of Oriental Studies   *  Tel. 01223 335140 (Switchboard 01223 335106)
Sidgwick Avenue               *  Fax  01223 335110
Cambridge CB3 9DA             *  http://bombay.oriental.cam.ac.uk/index.html


# CSX+ encoding for mkt1font and vpl2vpl
#
# Extended version of CSX (Classical Sanskrit eXtended encoding)
# for the representation of Indian languages in Roman script
#
# CSX+ aims to be downward compatible with CSX, save for moving aacute
# away from the slot (decimal 160) used as non-breaking space on PCs.
# It also seeks to implement the (draft) ISO/TC46/SC2 standard, while
# retaining a useful set of European accented characters and adding
# dashes and directional double quotes.


128	C cedilla
129	u dieresis
130	e acute
131	a circumflex
132	a dieresis
133	a grave
134	a ring
135	c cedilla
136	e circumflex
137	e dieresis
138	e grave
139	i dieresis
140	i circumflex
141	i grave
142	A dieresis
143	A ring
144	E acute
145	ae
146	AE
147	o circumflex
148	o dieresis
149	o grave
150	u circumflex
151	u grave
152	ae macron		# Was y dieresis in CSX
153	O dieresis
154	U dieresis
155	u breve			# Was cent in CSX
156	sterling
157	r underring		# Was yen in CSX
158	a acute
159	r underbar
160	space			# Non-breaking space on PC: was a acute in CSX
161	i acute
162	o acute
163	u acute
164	n tilde
165	N tilde
166	l tilde
167	m overdot
168	amacron breve
169	imacron breve
170	umacron breve
171	amacron tilde
172	imacron tilde
173	n underbar
174	runderring macron	# Was guillemotleft in CSX
175	l underring		# Was guillemotright in CSX
176	lunderring macron
177	runderring acute
178	runderring grave
179	runderringmacron acute
180	lunderring acute
181	amacron acute
182	amacron grave
183	imacron acute
184	imacron grave
185	e macron
186	o macron
187	r underdieresis
188	y overdot
189	umacron acute
190	umacron grave
191	r breve
192	M overdot
193	m candrabindu
194	t underbar
195	E macron
196	O macron
197	n breve
198	runderdot acute
199	runderdot grave
200	K h			# Overwritten by next definition
200	Kh underbar
201	k underbar
202	space			# Non-breaking space on Macintosh
203	AE macron
204	k h			# Overwritten by next definition
204	kh underbar
205	g overdot
206	c circumflex
207	runderdotmacron acute
208	a tilde
209	i tilde
210	u tilde
211	e tilde
212	o tilde
213	e breve
214	o breve
215	l underbar
216	umacron tilde
217	G overdot
218	C circumflex
219	h underbar
220	h underbreve
221	endash
222	emdash
223	quotedblleft
224	a macron
225	germandbls
226	A macron
227	i macron
228	I macron
229	u macron
230	U macron
231	r underdot
232	R underdot
233	runderdot macron
234	Runderdot macron
235	l underdot
236	L underdot
237	lunderdot macron
238	Lunderdot macron
239	n overdot
240	N overdot
241	t underdot
242	T underdot
243	d underdot
244	D underdot
245	n underdot
246	N underdot
247	s acute
248	S acute
249	s underdot
250	S underdot
251	quotedblright
252	m underdot
253	M underdot
254	h underdot
255	H underdot
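
For readers who want to process the table above themselves, here is a
minimal Python sketch of a reader for it. It assumes only what can be
seen in the listing: a decimal code, whitespace, a character name, an
optional "#" comment, and later definitions of a code replacing
earlier ones. It is not the parser used by mkt1font or vpl2vpl, and
the function name is hypothetical.

```python
# Sketch: read a "code <whitespace> name  # comment" encoding table
# into a dict. Assumed format only; not the actual mkt1font parser.
def parse_encoding(text: str) -> dict:
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue                          # blank or comment-only line
        code_str, name = line.split(None, 1)
        # A repeated code overwrites the earlier definition.
        table[int(code_str)] = " ".join(name.split())
    return table


SAMPLE = "\n".join([
    "# CSX+ excerpt",
    "200\tK h\t\t\t# Overwritten by next definition",
    "200\tKh underbar",
    "202\tspace\t\t\t# Non-breaking space on Macintosh",
])
print(parse_encoding(SAMPLE))
```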

# Alphabetical index of characters contained in proposed CSX+
# character set

A dieresis		142
A macron		226
A ring			143
AE 			146
AE macron		203
C cedilla		128
C circumflex		218
D underdot		244
E acute			144
E macron		195
G overdot		217
H underdot		255
I macron		228
Kh underbar		200
L underdot		236
Lunderdot macron	238
M overdot		192
M underdot		253
N overdot		240
N tilde			165
N underdot		246
O dieresis		153
O macron		196
R underdot		232
Runderdot macron	234
S acute			248
S underdot		250
T underdot		242
U dieresis		154
U macron		230
a acute			158
a circumflex		131
a dieresis		132
a grave			133
a macron		224
a ring			134
a tilde			208
ae 			145
ae macron		152
amacron acute		181
amacron breve		168
amacron grave		182
amacron tilde		171
c cedilla		135
c circumflex		206
d underdot		243
e acute			130
e breve			213
e circumflex		136
e dieresis		137
e grave			138
e macron		185
e tilde			211
emdash 			222
endash 			221
g overdot		205
germandbls 		225
h underbar		219
h underbreve		220
h underdot		254
i acute			161
i circumflex		140
i dieresis		139
i grave			141
i macron		227
i tilde			209
imacron acute		183
imacron breve		169
imacron grave		184
imacron tilde		172
k underbar		201
kh underbar		204
l tilde			166
l underbar		215
l underdot		235
l underring		175
lunderdot macron	237
lunderring acute	180
lunderring macron	176
m candrabindu		193
m overdot		167
m underdot		252
n breve			197
n overdot		239
n tilde			164
n underbar		173
n underdot		245
o acute			162
o breve			214
o circumflex		147
o dieresis		148
o grave			149
o macron		186
o tilde			212
quotedblleft 		223
quotedblright 		251
r breve			191
r underbar		159
r underdieresis		187
r underdot		231
r underring		157
runderdot acute		198
runderdot grave		199
runderdot macron	233
runderdotmacron acute	207
runderring acute	177
runderring grave	178
runderring macron	174
runderringmacron acute	179
s acute			247
s underdot		249
space 			160		# Non-breaking space on PC
space 			202		# Non-breaking space on Macintosh
sterling 		156
t underbar		194
t underdot		241
u acute			163
u breve			155
u circumflex		150
u dieresis		129
u grave			151
u macron		229
u tilde			210
umacron acute		189
umacron breve		170
umacron grave		190
umacron tilde		216
y overdot		188
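
The ordering of the index above appears to be a plain character-code
sort of the names (capitals before lower case, a space before any
letter). Under that assumption, an index of this kind could be
regenerated from a code-to-name table with a few lines of Python; the
function name and the small sample table are illustrative.

```python
# Sketch: derive an alphabetical index from a code -> name table,
# assuming plain code-point ordering of the names.
def alphabetical_index(table: dict) -> list:
    """Return (name, code) pairs sorted by name, then by code."""
    return sorted((name, code) for code, name in table.items())


sample_table = {
    142: "A dieresis", 146: "AE", 226: "A macron",
    222: "emdash", 211: "e tilde", 160: "space", 202: "space",
}
for name, code in alphabetical_index(sample_table):
    print(f"{name}\t{code}")
```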




