Representing diacritics in 7-bit ASCII

bpj at netg.se
Wed Dec 11 14:44:26 UTC 1996


(This post should be viewed in a mono-width font like Courier.)
(A .gif showing the printed characters is attached.)

REPRESENTING DIACRITICS IN 7-BIT ASCII

by B.PHILIP JONSSON


INTRODUCTION

Some time ago my attention was drawn to the severe difficulty of
representing the diacritics found in many orthographies as well as
scholarly and bibliographic transcription-schemes. Even users of 8-bit
systems must have access to special fonts to write and view such special
characters; these fonts are hard to obtain, and there have as yet emerged
no useful standards for the encoding of character-sets outside the
orthographies of European languages. The different 8-bit systems (e.g.
Macintosh and Windows) don't even agree wrt the encoding of diacriticized
letters found in "major" languages like Spanish, French and German.

The solution, then, must be to use linear strings of 7-bit ASCII characters
to represent diacriticized letters. There has been a good deal of
experimenting, e.g. with the assignment of separate values to upper-case
and lower-case letters. While such solutions may be applicable in outright
re-encoding, such as phonetic/phonemic transcription, many people don't
feel that they are very appealing even in that context, being kludgey and
error-prone. For purposes of philological and bibliographical transcription
and transliteration such solutions are hardly practicable, mainly for two
reasons:

A. It is generally desired that the computerized
transliteration/transcription should visually resemble the
usual/traditional printed transliteration/transcription.

B. It is usually desirable that the normal conventions wrt capitalization
can be preserved.

The following represents my attempt at achieving this in a systematic way,
with the help of the punctuation characters and other special characters
found within the standard 7-bit ASCII.

The scheme makes heavy use of what I call THE SMILEY-PRINCIPLE: most of the
diacritics should be viewed as if rotated 90 degrees -- mostly
counter-clockwise.


DELIMITERS

In order to distinguish diacriticized text from surrounding ordinary text,
it should be enclosed in delimiters as shown here. Note that the characters
used as delimiters are NOT used as diacritics.

NAME            CHARACTERS      FUNCTION, PRINTED EQUIVALENTS
underscore      _ _             italics, emphasis (1)
braces          { }             transliteration, bold
brackets        [ ]             transcription, bold italics
asterisk        *               unattested form/spelling/word (2)


(1) The underscore {_} cannot be used as a diacritic because it is needed
as a delimiter to indicate italicization/emphasis -- necessarily so since
{/} is used as a diacritic.  Its use e.g. for macron or underline would
unnecessarily violate the "smiley-principle", while not being very obvious
within the premises of this system. It is recommended that _ _ be used to
delimit strings that do NOT contain any diacritics, though punctuation
marks are probably best kept outside the delimiters.

(2) The asterisk {*} cannot be used to indicate bold style/strong emphasis,
since it is used to indicate unattested forms; use capitalization instead.
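
The delimiter conventions above lend themselves to mechanical extraction.
What follows is only a minimal Python sketch, not part of the scheme itself:
the function name `find_delimited` and the three labels are hypothetical, and
it assumes that delimited spans are never nested and that an italics
underscore is not directly attached to a letter or digit outside the span.
The asterisk for unattested forms is not handled.

```python
import re

# Hypothetical labels for the three paired delimiters in the table above.
# Braces are tried first, so a brace span such as {_} is read as a whole
# and its underscore is not mistaken for an italics delimiter.
DELIMITED = re.compile(
    r"\{(?P<transliteration>[^{}]+)\}"
    r"|\[(?P<transcription>[^\[\]]+)\]"
    r"|(?<![A-Za-z0-9])_(?P<italics>[^_]+)_(?![A-Za-z0-9])"
)


def find_delimited(text):
    """Return (label, content) pairs for each delimited span, in order."""
    spans = []
    for match in DELIMITED.finditer(text):
        label = match.lastgroup  # name of the one alternative that matched
        spans.append((label, match.group(label)))
    return spans
```

For instance, `find_delimited("the word _pot~hole_ and {a/}")` returns
`[("italics", "pot~hole"), ("transliteration", "a/")]`.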


PUNCTUATION MARKS

Almost all the Roman punctuation marks found in 7-bit ASCII are used as
diacritics in this system.  In order to distinguish their use as
punctuation marks from their use as diacritics, true punctuation marks
should as far as possible be kept outside the delimiter characters
described above. If this is not possible or practical, true punctuation
marks should be delimited by ONE PRECEDING and TWO SUCCEEDING spaces (ASCII
32), e.g.:
{a/ :  a-acute ,  a: :  a-umlaut}
Conversely, diacritics should always be kept adjacent to their base-letters
and not separated from them with any spaces.
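
The spacing rule can be checked mechanically. The following Python sketch is
one possible reading of it, under the assumption that a true punctuation
mark inside delimiters is a single non-alphanumeric character with exactly
one space before it and two after it; the name `split_true_punctuation` is
hypothetical. It takes the content of one delimited string, without the
delimiters themselves.

```python
import re

# One space, a single punctuation character, then two spaces (ASCII 32).
TRUE_PUNCT = re.compile(r" ([^\w\s])  ")


def split_true_punctuation(content):
    """Split delimited content into text pieces and true punctuation marks."""
    parts = TRUE_PUNCT.split(content)
    # With one capturing group, re.split alternates: text, mark, text, ...
    return [("punct" if i % 2 else "text", part) for i, part in enumerate(parts)]
```

Applied to the example above, `a/ :  a-acute ,  a: :  a-umlaut`, the colon in
`a:` stays attached to its base letter as a diacritic, while the spaced
colons and the comma come out as punctuation.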


DIACRITICS

NAME            FUNCTION                SYMBOL  EXAMPLE  USED...
macron          length                  |       a|      with vowels
breve (above)   shortness               (       a(      "    "
curve (below)   non-syllabicness        )       )i      "    "
acute           stress, rise, (length)  /       a/      "    "
grave           half-stress, fall       \       a\      "    "
circumflex      rise-fall, length       >       a>      "    "
umlaut/         fronting, centralized   :       a:      "    "
diaeresis       syllable separator (3)  :       e:      "    "
over-ring       rounding                %       a%      "    "
plus            raised                  +       e+
minus           lowered                 -       e-
tilde           nasalization            $       a$
underdot (4)    retraction, retroflex   .       .t .e
double underdot murmur, approximant     ;       ;d ;r
wedge           (post)alveolar          <       s<
underline (5)   mid                     =       =t      with consonants
curve (below)   fronted                 )       )t      "    "
cross (6)       fricative               #       t# d#   "    "
hook/cedilla    palatalization          ,       ,c      "    "
acute           apical                  /       s/      "    "
grave           laminal (7)             \       s\      "    "
circumflex      laminal, (post)alveolar >       s>      "    "
ring            voiceless               %       d%      "    "
aleph           glottal(ized) (8)       !       t!
ayin            pharyngeal(ized) (8)    ?       t?
inverted apostrophe
                unreleased              `       t`
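
As an illustration of how part of the table could be decoded by program, the
Python sketch below converts a few of the symbols to Unicode combining
characters. The choice of Unicode as target, the particular code points and
the name `to_unicode` are assumptions made for this example, not part of the
scheme. Only some rows are covered: symbols whose reading depends on the kind
of base letter or on position (such as the curve `)`, the ring `%` with
consonants, `!` and `?`) are left untouched, and true punctuation is assumed
to have been separated out beforehand.

```python
import unicodedata

# Assumed Unicode targets for marks written AFTER the base letter.
POSTFIX = {
    "|": "\u0304",   # macron
    "(": "\u0306",   # breve
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    ">": "\u0302",   # circumflex
    ":": "\u0308",   # umlaut/diaeresis
    "%": "\u030A",   # over-ring
    "$": "\u0303",   # tilde
    "<": "\u030C",   # wedge
}
# Assumed Unicode targets for marks written BEFORE the base letter.
PREFIX = {
    ".": "\u0323",   # underdot
    ";": "\u0324",   # double underdot
    "=": "\u0331",   # underline
    ",": "\u0327",   # hook/cedilla
}


def to_unicode(content):
    """Decode one delimited string into precomposed Unicode where possible."""
    out = []
    pending = []          # prefix symbols waiting for their base letter
    after_letter = False  # True while postfix marks may still attach
    for ch in content:
        if ch.isalpha():
            out.append(ch + "".join(PREFIX[p] for p in pending))
            pending = []
            after_letter = True
        elif ch in POSTFIX and after_letter:
            out.append(POSTFIX[ch])
        elif ch in PREFIX:
            pending.append(ch)
            after_letter = False
        else:
            out.extend(pending)  # no base letter followed: keep symbols as-is
            out.append(ch)
            pending = []
            after_letter = False
    out.extend(pending)
    return unicodedata.normalize("NFC", "".join(out))
```

With these assumptions, `to_unicode("sa.msk.rta")` gives `saṃskṛta`, and
`to_unicode("a|")` gives `ā`.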


JUNCTURES

ligature        ^       a^e     {a^esc}
separator       ~       t~h     _pot~hole_
syllable-break  ' (9)           [kh&/p'l]  "couple"


SPECIAL CHARACTERS

schwa (10)      neutral vowel           &       [s<&va/]  "schwa"
eng (10)        velar nasal             @       [e@]      "eng"
aleph           glottal stop (8)        !       t!
ayin            pharyngeal glide (8)    ?       t?

(3) Strictly not needed if the curve below is consistently used to indicate
non-syllabic vowels.

(4) The less common _overdot_ may be rendered by {^.}, e.g. {m^.}. In
languages where it does not contrast with underdot (like Maltese) {.c} etc
may be used.

(5) Rendered with {=} to preclude confusion with the macron {|}, which is
written over a letter.

(6) That is, to represent the crossing over of the stem or body of a letter,
like the Icelandic/Old English "edh" {d#}. It may also be used with vowels,
e.g. Danish and Norwegian {o#}; furthermore {i#} may be used as a
convention for Turkish undotted {i}.

(7) Laminals have sometimes been indicated with a circumflex.  The use of
the grave, parallel to that of the acute for apicals, has been introduced
in order to disambiguate, since the circumflex is used in Esperanto to
indicate palatalization/postalveolarization.

(8) That is, {!} and {?} are used both to indicate diacritics and for
separate sounds. If a need is felt to disambiguate the use as diacritic, the
ligature {^} may be used: {t^!}, {d^?}. If the use of {!} and {?} in these
functions is felt to be too misleading wrt their usual punctuation values,
{'} and {~} respectively might be used instead, but then the normal uses of
the latter must both be replaced by the cumbersome double-hyphen {--}.

(9) This will also serve to indicate syllabic vowels: {k'm'tom},
{sa.msk'r'ta} -- or {sa.msk.rta}, or indeed {sa.mskr.ita}.

(10) The sound called schwa (neutral vowel) is often represented by @ on
the Internet. That is an unfortunate practice, since the latter symbol is
derived from {at} -- indeed its name is of course "at", while the usual
symbol for schwa, rotated {e}, is derived from {e}. Imo it would be better
to represent schwa by the & sign, derived from {et} -- being a ligature of
the letters in the Latin word for "and".

It would then perhaps be natural to use the symbol @ to indicate the vowel
of the English word "at" -- as I indeed often do in phonetic transcriptions
--, but since there are already means to represent the {a^e} ligature I
feel that @ may instead be used to represent another phonetic symbol often
found in philological transcriptions (e.g. of Avestan and Germanic runes),
namely the eng-character. It is arguably the case, if arguments be needed,
that the @ symbol, especially as it appears in small-size bitmap

 XXXXXX
X      X
X  XX  X
X  XXXX
X
 XXXXXXX

, will, if rotated 90 degrees,

 XXXXXX
X      X
X  XX  X
X  XX  X
X   XXX

bear some resemblance to the usual eng-character:

X XXXXX
XX     X
X      X
X      X
X      X
    X  X
     XX
------------------------------------------------

December 11, 1996 B.Philip Jonsson

[END]


-------------- next part --------------
A non-text attachment was scrubbed...
Name: 7bit_diac.gif
Type: image/gif
Size: 8425 bytes
Desc: not available
URL: <https://list.indology.info/pipermail/indology/attachments/19961211/f75f2989/attachment.gif>

