Issues in the creation and dissemination of Sanskrit e-texts
ucgadkw at ucl.ac.uk
Mon May 31 15:32:23 UTC 1993
Lars makes several excellent points, with most of which I agree.
> ... the formatting program of TUSTEP (Tuebingen System von
> Textverarbeitungsprogrammen) enables you to print quite beautiful Sanskrit
> texts with proper transcription characters. So far, I have seen nothing
> better in the field of computer based transliteration.
TeX is better.
> Converting to other
> font systems is easy. I therefore suggest that we should try to agree on a
> common format, and I put the "TUSTEP" format up as my candidate.
The CSX 8-bit coding scheme has already been agreed upon for general
file exchange and terminal display. All the Sanskrit texts available
via this INDOLOGY listserv and the associated ftp site
(ftp.bcc.ac.uk:pub/users/ucgadkw/indology) are also available in
standard CSX coding at blackbox.hacc.washington.edu, courtesy of
Your other points are well taken. The Schreiner transcription is
indeed extremely valuable, but there are two points to be made:
1/ It is in no way bound to TUSTEP.
The Schreiner transcription can be done with any editor
at all, from EMACS to vi, from WordPerfect to edlin. It
doesn't matter. It is a 7-bit transcription and coding
scheme for marking all characters, and a wide range of
grammatical values such as sandhi, samaasas, etc. (See
the online version of Saundaryalahari for an example,
available by ftp from ftp.bcc.ac.uk in directory
pub/users/ucgadkw/indology.) Peter himself has written
filters in TUSTEP to convert between his and other
transcriptions, so that he can print in naagarii using
the Velthuis transcription for TeX, etc.
TUSTEP is a wonderful set of tools for textual analysis,
but there are other such tools too, for example TACT, the
Oxford Concordance Program, and the many text processing
tools which form such a prominent part of Unix (grep,
awk, sed, tr, uniq, spell, troff, etc.). These other
tools are equally able to use the Schreiner transcription
to extract lists, lemmas, and do statistics on
grammatical or lexical features. I am not against
TUSTEP: it is great. But the Schreiner transcription is
a separate issue.
2/ It requires very significant grammatical knowledge on the part
of the typist.
If the typist and the scholar are identical, as with
Peter Schreiner, then you can have large amounts of text,
already grammatically analysed. But this is not commonly
the case. Usually, I believe, a scholar gets a grant to
pay someone (a student) to type a text. In very big
transcription projects, the typists may not even know the
language they are typing (this was the case with the
Greek TLG project, where Greek texts were typed by
Filipino typists who had just learned the Greek alphabet).
In that situation, it would slow the project unacceptably
to require grammatical analysis as well as transcription.
It is still very important to have texts transcribed
verbatim, without the dissolution of sandhi, compounds,
cases and tenses. I hope that in time it will be
possible to semi-automate these tasks. As I mentioned in
my earlier note, Peter already has a substantial list of
analysed lemmata, and this list can be used to analyse
"samhita" texts. In classical/Puranic literature, Peter
has found that up to 60% of words are common to all
texts. So a semi-automatic analysis by reference to a
list (i.e., dictionary-based, as opposed to algorithmic)
should have a very substantial impact on the task.
Secondly, at the Leiden world Sanskrit conference, Aad
Verboom demonstrated an algorithmic sandhi analysis
program, and a grammatical analysis program. I don't
know what has happened to this effort since then. But
either it can be completed, or someone can do it again.
Aad's demonstration at Leiden provided a fully
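
To make the dictionary-based approach in point 2 concrete, here is a minimal sketch in Python. It is not Peter Schreiner's method or format: the lexicon entries, the `analyse` function name, and the 7-bit spellings (written Velthuis-style) are all assumptions made up for illustration. It looks up each whitespace-separated word of a verbatim text in a list of already analysed forms and reports what remains unresolved.

```python
from collections import Counter

# Hypothetical list of already analysed forms (surface form -> analysis).
# A real list would be harvested from previously analysed texts; these
# two entries are illustrative only.
ANALYSED = {
    "dharmak.setre": "dharma-k.setre, loc. sg. (compound)",
    "kuruk.setre": "kuru-k.setre, loc. sg. (compound)",
}

def analyse(text, lexicon):
    """Look up each whitespace-separated word of `text` in `lexicon`.

    Returns (resolved, unresolved, coverage):
      resolved   - list of (word, analysis) pairs found in the lexicon
      unresolved - Counter of words that still need a human or an
                   algorithmic sandhi/grammar analyser
      coverage   - fraction of words resolved by lookup alone
    """
    resolved = []
    unresolved = Counter()
    for token in text.split():
        word = token.strip("|")  # drop danda marks, keep the word itself
        if not word:
            continue
        if word in lexicon:
            resolved.append((word, lexicon[word]))
        else:
            unresolved[word] += 1
    total = len(resolved) + sum(unresolved.values())
    coverage = len(resolved) / total if total else 0.0
    return resolved, unresolved, coverage

if __name__ == "__main__":
    line = "dharmak.setre kuruk.setre samavetaa yuyutsava.h ||"
    found, missing, cov = analyse(line, ANALYSED)
    print(found)
    print(missing)
    print(cov)
```

A plain lookup like this does nothing about sandhi at word junctions, which is where an algorithmic analyser would have to take over. The unresolved list is simply the work left for that program or for a human reader.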
On the question of sharing texts, something we could do *right
now* is to share titles. I would like to urge all members of
Indology to submit the names of texts they have transcribed into
digital form, or of any they know of that have been done by
others. I volunteer to gather the details together into a list
which I will make publicly available by LISTSERV and ftp.