Issues in the creation and dissemination of Sanskrit e-texts

Mon May 31 09:09:48 UTC 1993

Regarding Dominik's email on etexts:

I fully agree with what Dominik says, and I would like to add a few
comments of my own. Collecting existing Sanskrit (and/or Middle Indic)
etexts would be extremely useful to us all. In other fields of philology
large corpora have existed for a long time, and software has been developed
for more or less automatic tagging of such texts. Some corpora, I believe,
are fully tagged. I would like, however, to stress the value of a common
format for the entry of Indic texts. As far as I can see, several formats
exist, some of which are not very useful from a linguist's or
statistician's point of view. Personally I find the "TUSTEP" format used by
Peter Schreiner et al. very useful. This format analyses all compounds, but
at the same time the formatting program of TUSTEP (Tuebinger System von
Textverarbeitungsprogrammen) enables you to print quite beautiful Sanskrit
texts with proper transcription characters. So far, I have seen nothing
better in the field of computer-based transliteration. Converting to other
font systems is easy. I therefore suggest that we should try to agree on a
common format, and I put the "TUSTEP" format up as my candidate.

As regards the construction of corpora, diversity is as important as the
number of words. Large corpora are usually divided into genres, and text
samples may be limited to 2,000 words. The Brown Corpus contains one
million words of written, edited American English published in 1961; the
corpus comprises 500 text samples, each 2,000 words long, taken from
fifteen text categories (e.g. press reportage, editorials, academic prose,
general fiction). The American linguist-cum-statistician Douglas Biber has
shown that even samples as short as 1,000 words give a fairly consistent
representation of a number of linguistic parameters, so that even short
texts of 1,000 to 2,000 words may be of value. (See Douglas Biber (1990).
"Methodological Issues Regarding Corpus-based Analyses of Linguistic
Variation." Literary and Linguistic Computing, 5(4): 258-269. One does
have to demonstrate, however, that the same thing applies to Sanskrit!)
Therefore, all you Indologists out there who have entered a few thousand
words of Sanskrit on your computer, don't hesitate to share your work with
the rest of us. Some of us may even have entered other parts of the same
text, so that we could put together a complete electronic edition!
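
To make the sampling arithmetic above concrete, here is a minimal sketch in
Python of how a contributed text might be cut into fixed-size samples. It is
only an illustration under stated assumptions: the function name
split_into_samples and its parameters are hypothetical, and "words" are taken
to be whitespace-separated tokens, which is a crude approximation for
Sanskrit, where sandhi and compounds would first have to be analysed.

```python
def split_into_samples(text, sample_size=2000, min_size=1000):
    """Cut a text into consecutive samples of `sample_size` words.

    Assumption: words are whitespace-separated tokens. A trailing
    fragment shorter than `min_size` words is dropped, following the
    idea that samples of about 1,000 words may still be usable.
    """
    words = text.split()
    samples = [
        words[start:start + sample_size]
        for start in range(0, len(words), sample_size)
    ]
    # Discard a final fragment that is too short to serve as a sample.
    if samples and len(samples[-1]) < min_size:
        samples.pop()
    return samples


# The Brown Corpus figures quoted above are mutually consistent:
# 500 samples of 2,000 words each give one million words.
brown_total_words = 500 * 2000
```

With the default settings, a text of 5,200 words would yield three samples
(2,000, 2,000 and 1,200 words), whereas a text of 4,500 words would yield
two, since the remaining 500 words fall below the 1,000-word minimum.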

Best regards,

Lars Martin Fosse

Lars Martin Fosse
Department of East European
and Oriental Studies
P. O. Box 1030, Blindern
N-0315 OSLO Norway

Tel: +47 22 85 68 48
Fax: +47 22 85 41 40

E-mail: l.m.fosse at easteur-orient.uio.no