Announcing CSX+
John Smith
jds10 at CUS.CAM.AC.UK
Thu Jun 11 11:26:01 UTC 1998
Subscribers to the Conv-dev mailing list on transliteration of
Indian languages <conv-dev at elot.gr> may already be aware that there
is a proposal to implement a new computer encoding (character set)
for use with material in such languages. The new encoding is to be
based on the existing CSX character set, which is probably the most
widely-used such encoding, and has been dubbed "CSX+". The intention
is that it should be as nearly as possible compatible with CSX, so
that most CSX users could start using CSX+ fonts and notice no
difference; however, it will extend the CSX set with extra
characters, in particular those required by the draft ISO standard
which the Conv-dev discussion has produced. This amounts to an
attempt to provide the most useful possible set of "Indian" accented
characters, and I should welcome input from others to ensure that
the best choice is made.
Documents presenting the draft standard for transliteration can be
seen at:
http://ourworld.compuserve.com/homepages/stone_catend/trdcd1a.htm
Those interested in contributing to a debate on what CSX+ ought to
be like should read on; others can simply await future announcements
with interest or apathy, as appropriate.
--------------------------------------------------------------------
CSX+
1. The two attachments to this email message present two different
"views" of a proposed CSX+ standard, one ordered by character
sequence, the other alphabetically by character name. The remainder
of this document consists of an explanation of how the proposal has
been arrived at. Earlier drafts were seen by and discussed with
Dominik Wujastyk, Anshuman Pandey, Anthony Stone and John Clews; I
am grateful to them for their comments.
2. The underlying philosophy is that CSX+ ought to be a strict
superset of CSX, so that "upgrading" is painless; CSX text should
appear unchanged in CSX+ fonts. Specifically, (A) there should be no
changes to the position of existing CSX characters; (B) no existing
CSX character should be deleted; (C) no attempt should be made to
map draft ISO standard usage on to the encoding: "r-underdot-macron"
will appear in the same place in CSX+ as it does in CSX, even though
the draft standard states that the preferred usage for Sanskrit long
vocalic "r" is "r-underring-macron". (The form with underring will
also be made available, of course.)
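As a rough illustration of aims (A) and (B), the Python sketch below
compares an old and a new code table and reports characters that
have moved or disappeared. The function name compatibility_report
and the three-entry excerpt are assumptions made for this example;
the positions are taken from this message and the attached table,
and the two exceptions it flags are the ones dealt with in
paragraphs 3 and 7.

```python
def compatibility_report(old, new):
    """Compare two {position: name} tables (hypothetical helper).

    Returns (moved, dropped): moved maps a character name to its
    (old position, new position); dropped lists names absent from new.
    """
    new_position = {name: code for code, name in new.items()}
    moved, dropped = {}, []
    for code, name in sorted(old.items()):
        if new.get(code) == name:
            continue  # same character in the same slot: aim (A) holds
        if name in new_position:
            moved[name] = (code, new_position[name])
        else:
            dropped.append(name)
    return moved, dropped


# Excerpt only: three CSX slots mentioned in this message.
csx_excerpt = {152: "y dieresis", 160: "a acute", 233: "runderdot macron"}
csxplus_excerpt = {
    152: "ae macron",
    158: "a acute",
    160: "space",
    233: "runderdot macron",
}

moved, dropped = compatibility_report(csx_excerpt, csxplus_excerpt)
print(moved)
print(dropped)
```

A full check would need the complete CSX table, which is not
reproduced in this message.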
3. A single compromise has proved to be necessary, in the area of
aim (A) above. CSX, which came into being in the MS-DOS era, uses
position 160 (decimal) for a-acute. That slot on modern PCs is
sacred to the non-breaking space; the character has thus become
inaccessible to many people, and will have to be moved to a new
position. Fortunately, this one incompatible change will not
inconvenience the majority of users: documents containing Vedic are
likely to be the biggest problem. (The Macintosh uses slot 202 for
its non-breaking space, and this too will be held vacant.)
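The relocation described here can be sketched as a one-byte
remapping. This is only an illustration: the helper name
csx_to_csxplus is hypothetical, the new position 158 is read from
the attached CSX+ table, and the sketch assumes the input is CSX
text in which byte 160 always means a-acute and byte 158 does not
occur.

```python
CSX_A_ACUTE = 160       # a-acute in CSX (clashes with the PC non-breaking space)
CSXPLUS_A_ACUTE = 158   # a-acute in the attached CSX+ table

# bytes.maketrans builds a 256-entry table; every other byte maps to itself.
CSX_TO_CSXPLUS = bytes.maketrans(
    bytes([CSX_A_ACUTE]), bytes([CSXPLUS_A_ACUTE])
)


def csx_to_csxplus(data: bytes) -> bytes:
    """Move a-acute to its CSX+ slot; leave all other bytes untouched."""
    return data.translate(CSX_TO_CSXPLUS)


converted = csx_to_csxplus(bytes([65, 160, 224]))
print(list(converted))
```

The sketch deliberately ignores the other slots whose contents
differ between the two tables (for example 152, formerly
y-dieresis), since no replacement position exists for those.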
4. Strictly speaking, a-acute is not part of the CSX character set,
but belongs to the PC's code page 437, along with other "European"
accented characters that CSX borrows. If we add all these characters
to those officially defined by CSX, and also add in the three extra
"Indian" characters that have become widely available in recognised
character positions (171 a-macron-tilde, 172 i-macron-tilde, 216
u-macron-tilde), we find that 91 positions in the upper half of the
encoding are already occupied, leaving 37 vacant for new characters.
However, to keep both Mac and PC non-breaking space free, this has
to be reduced to 35.
5. The draft ISO standard requires the following 24 characters that
do not form part of CSX. Only lower-case versions are shown, and
where the standard proposes a "productive" usage (e.g. "tilde over a
vowel means nasalised vowel"), only forms known to occur are
included (so no "e-macron-tilde"):
ae-macron
u-breve
r-underring
r-underring-acute
r-underring-grave
r-underring-macron
r-underring-macron-acute
l-underring
l-underring-acute
l-underring-macron
e-macron
o-macron
r-underdieresis
y-overdot
r-breve
m-candrabindu
n-breve
t-underbar
k-underbar
kh-underbar
g-overdot
c-circumflex
h-underbar
h-underbreve
Note that the precise *form* of some of these may still change; for
example, there has been a recent proposal to replace r-underdieresis
by z-underdot. But such changes will not affect the number of
necessary new characters, and the final form of the draft submission
to ISO should be available very soon.
6. I would wish to add M-overdot, since the draft standard now
recommends m-overdot for anusvara, and it would otherwise become the
only "core" character without a capital equivalent. I would also
wish to add the following, as being centrally useful characters in
any text font: sterling (which was normally available in CSX fonts),
quotedblleft, quotedblright, endash, emdash. This exercise uses up
all but five of the "spare" character positions. ("Smart quotes"
cannot, alas, be made to work in Word, since the program assumes
that the characters in question are located at specific positions,
and none of the positions in question is free for this use.)
7. There is a single "European" accented character, y-dieresis,
that CSX borrows from code page 437 but that is not used in any
Indian or European language. (It was probably a mistaken version of
"Dutch y", written "ij".) I have eliminated it to save one more
slot: thus six slots are available for further new characters.
8. It seems to me that the best use for these and any other spare
slots that we can manufacture is to assign them to capitalised
versions of those new characters with the most need for capital
forms. I have tentatively given these first six to AE-macron,
E-macron, O-macron, Kh-underbar, G-overdot and C-circumflex.
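The slot arithmetic of paragraphs 4 to 8 can be checked in a few
lines of Python. The variable names are invented for this example,
and the way the two reserved slots are accounted for (one for
keeping 202 empty, one for rehousing a-acute once 160 is vacated) is
an assumed reading of paragraph 4, not something the text spells
out.

```python
UPPER_HALF = 128                 # positions 128-255
occupied = 91                    # paragraph 4
vacant = UPPER_HALF - occupied   # 37

# Assumed reading: slot 202 stays empty, and a-acute needs a new
# slot because 160 is given up to the PC non-breaking space.
available = vacant - 2           # 35

iso_additions = 24               # paragraph 5
other_additions = 1 + 5          # M-overdot; sterling, two quotes, two dashes
spare_after_additions = available - iso_additions - other_additions

spare_final = spare_after_additions + 1   # y-dieresis dropped (paragraph 7)

capitals = ["AE-macron", "E-macron", "O-macron",
            "Kh-underbar", "G-overdot", "C-circumflex"]

print(vacant, available, spare_after_additions, spare_final)
assert spare_final == len(capitals)
```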
9. A case could perhaps be made to eliminate some of the more
outré Sanskrit characters, if there are other genuinely worthy
claims to their slots. But it is clear that the room for manoeuvre
is very slight, so good cases would have to be made.
10. I am aware of one further problem, which affects character 183
in some of Gates's operating systems. There appears to be no
documentation of this feature, so I describe it as well as I can: in
Windows 3.1 and Windows 95 the character at position 183 of a font
is made inaccessible to the user if the font uses a non-Windows
encoding, apparently in order to simplify slightly the task of
displaying "soft spaces" in Word. In CSX+ this makes the character
i-macron-acute unavailable to users of these systems. It should be
said that exactly the same applies to existing CSX fonts, and that
as far as I know nobody has ever even noticed the problem; in
addition, I gather that in Windows NT the difficulty does not arise,
suggesting that perhaps newer Microsoft systems will act more
politely in this area. I shall try to find out how Windows 98 will
behave. I do not favour a change to the CSX(+) character set to get
round the problem, but it might be necessary to think up some way of
making i-macron-acute available to those who need it and cannot
currently use it.
11. I would welcome comments on these proposals. However, there are
real constraints on the time that can be spent on a discussion, so I
hope it can be brisk and focused. I should say that it would be very
hard to persuade me to make major changes in the areas covered by
paragraphs 2-5 above; most useful would be advice on which
characters ought to win the scramble for the last few places. I
shall place a copy of this message on Conv-dev also, but I suggest
that discussion takes place on Indology <indology at liverpool.ac.uk>.
I shall be happy to do a reasonable amount of message-forwarding for
Conv-dev members who do not subscribe to Indology.
12. After, say, two weeks of discussion, I shall finalise the CSX+
standard and build a set of fonts to implement it: virtual fonts for
TeX, and Type 1 PostScript and TrueType fonts for PCs and
Macintoshes. (I do not have access to good Mac font software, and
would only be willing to make a Mac translation of one of the five
typefaces that I shall build for the PC: would anyone else like to
volunteer to do the job?) The fonts should be available within a
matter of a few days once the standard is agreed. As time permits I
shall also try to produce programs to handle conversion of text in
other encodings to CSX+.
John Smith
--
Dr J. D. Smith * jds10 at cam.ac.uk
Faculty of Oriental Studies * Tel. 01223 335140 (Switchboard 01223 335106)
Sidgwick Avenue * Fax 01223 335110
Cambridge CB3 9DA * http://bombay.oriental.cam.ac.uk/index.html
# CSX+ encoding for mkt1font and vpl2vpl
#
# Extended version of CSX (Classical Sanskrit eXtended encoding)
# for the representation of Indian languages in Roman script
#
# CSX+ aims to be downward compatible with CSX, save for moving aacute
# away from the slot (decimal 160) used as non-breaking space on PCs.
# It also seeks to implement the (draft) ISO/TC46/SC2 standard, while
# retaining a useful set of European accented characters and adding
# dashes and directional double quotes.
128 C cedilla
129 u dieresis
130 e acute
131 a circumflex
132 a dieresis
133 a grave
134 a ring
135 c cedilla
136 e circumflex
137 e dieresis
138 e grave
139 i dieresis
140 i circumflex
141 i grave
142 A dieresis
143 A ring
144 E acute
145 ae
146 AE
147 o circumflex
148 o dieresis
149 o grave
150 u circumflex
151 u grave
152 ae macron # Was y dieresis in CSX
153 O dieresis
154 U dieresis
155 u breve # Was cent in CSX
156 sterling
157 r underring # Was yen in CSX
158 a acute
159 r underbar
160 space # Non-breaking space on PC: was a acute in CSX
161 i acute
162 o acute
163 u acute
164 n tilde
165 N tilde
166 l tilde
167 m overdot
168 amacron breve
169 imacron breve
170 umacron breve
171 amacron tilde
172 imacron tilde
173 n underbar
174 runderring macron # Was guillemotleft in CSX
175 l underring # Was guillemotright in CSX
176 lunderring macron
177 runderring acute
178 runderring grave
179 runderringmacron acute
180 lunderring acute
181 amacron acute
182 amacron grave
183 imacron acute
184 imacron grave
185 e macron
186 o macron
187 r underdieresis
188 y overdot
189 umacron acute
190 umacron grave
191 r breve
192 M overdot
193 m candrabindu
194 t underbar
195 E macron
196 O macron
197 n breve
198 runderdot acute
199 runderdot grave
200 K h # Overwritten by next definition
200 Kh underbar
201 k underbar
202 space # Non-breaking space on Macintosh
203 AE macron
204 k h # Overwritten by next definition
204 kh underbar
205 g overdot
206 c circumflex
207 runderdotmacron acute
208 a tilde
209 i tilde
210 u tilde
211 e tilde
212 o tilde
213 e breve
214 o breve
215 l underbar
216 umacron tilde
217 G overdot
218 C circumflex
219 h underbar
220 h underbreve
221 endash
222 emdash
223 quotedblleft
224 a macron
225 germandbls
226 A macron
227 i macron
228 I macron
229 u macron
230 U macron
231 r underdot
232 R underdot
233 runderdot macron
234 Runderdot macron
235 l underdot
236 L underdot
237 lunderdot macron
238 Lunderdot macron
239 n overdot
240 N overdot
241 t underdot
242 T underdot
243 d underdot
244 D underdot
245 n underdot
246 N underdot
247 s acute
248 S acute
249 s underdot
250 S underdot
251 quotedblright
252 m underdot
253 M underdot
254 h underdot
255 H underdot
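A table in the layout above can be read with a short Python sketch
such as the following. The function name parse_encoding is
hypothetical, and the parsing rules (text after "#" is a comment,
the first field is a decimal position, a later definition of the
same position replaces an earlier one) are assumptions inferred
from the file itself; the actual input rules of mkt1font and
vpl2vpl are not described here.

```python
def parse_encoding(text):
    """Parse 'position name  # comment' lines into {position: name}."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and padding
        if not line:
            continue                          # blank or comment-only line
        code, name = line.split(None, 1)
        # Later definitions overwrite earlier ones, as the comments
        # on positions 200 and 204 indicate.
        table[int(code)] = " ".join(name.split())
    return table


sample = """\
# excerpt from the table above
158 a acute
160 space # Non-breaking space on PC: was a acute in CSX
200 K h # Overwritten by next definition
200 Kh underbar
"""

table = parse_encoding(sample)
print(table)
```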
# Alphabetical index of characters contained in proposed CSX+
# character set
A dieresis 142
A macron 226
A ring 143
AE 146
AE macron 203
C cedilla 128
C circumflex 218
D underdot 244
E acute 144
E macron 195
G overdot 217
H underdot 255
I macron 228
Kh underbar 200
L underdot 236
Lunderdot macron 238
M overdot 192
M underdot 253
N overdot 240
N tilde 165
N underdot 246
O dieresis 153
O macron 196
R underdot 232
Runderdot macron 234
S acute 248
S underdot 250
T underdot 242
U dieresis 154
U macron 230
a acute 158
a circumflex 131
a dieresis 132
a grave 133
a macron 224
a ring 134
a tilde 208
ae 145
ae macron 152
amacron acute 181
amacron breve 168
amacron grave 182
amacron tilde 171
c cedilla 135
c circumflex 206
d underdot 243
e acute 130
e breve 213
e circumflex 136
e dieresis 137
e grave 138
e macron 185
e tilde 211
emdash 222
endash 221
g overdot 205
germandbls 225
h underbar 219
h underbreve 220
h underdot 254
i acute 161
i circumflex 140
i dieresis 139
i grave 141
i macron 227
i tilde 209
imacron acute 183
imacron breve 169
imacron grave 184
imacron tilde 172
k underbar 201
kh underbar 204
l tilde 166
l underbar 215
l underdot 235
l underring 175
lunderdot macron 237
lunderring acute 180
lunderring macron 176
m candrabindu 193
m overdot 167
m underdot 252
n breve 197
n overdot 239
n tilde 164
n underbar 173
n underdot 245
o acute 162
o breve 214
o circumflex 147
o dieresis 148
o grave 149
o macron 186
o tilde 212
quotedblleft 223
quotedblright 251
r breve 191
r underbar 159
r underdieresis 187
r underdot 231
r underring 157
runderdot acute 198
runderdot grave 199
runderdot macron 233
runderdotmacron acute 207
runderring acute 177
runderring grave 178
runderring macron 174
runderringmacron acute 179
s acute 247
s underdot 249
space 160 # Non-breaking space on PC
space 202 # Non-breaking space on Macintosh
sterling 156
t underbar 194
t underdot 241
u acute 163
u breve 155
u circumflex 150
u dieresis 129
u grave 151
u macron 229
u tilde 210
umacron acute 189
umacron breve 170
umacron grave 190
umacron tilde 216
y overdot 188
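An index of this kind can be derived mechanically from the
position-ordered table. The Python sketch below is an illustration
only: build_index is a hypothetical name, and plain ASCII ordering
of the names (capitals before lower case, space before letters) is
assumed because it reproduces the order seen above.

```python
def build_index(table):
    """Return 'name position' lines sorted by name, then by position."""
    pairs = sorted((name, code) for code, name in table.items())
    return [f"{name} {code}" for name, code in pairs]


# Excerpt only: a few entries from the CSX+ table.
excerpt = {
    145: "ae",
    146: "AE",
    160: "space",
    181: "amacron acute",
    202: "space",
    224: "a macron",
    226: "A macron",
}

index_lines = build_index(excerpt)
print("\n".join(index_lines))
```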