Issues in the creation and dissemination of Sanskrit e-texts

Mon May 31 15:32:23 UTC 1993

Lars makes several excellent points, and I agree with most of them.
But
> ... the formatting program of TUSTEP (Tuebingen System von
> Textverarbeitungsprogrammen) enables you to print quite beautiful Sanskrit
> texts with proper transcription characters. So far, I have seen nothing
> better in the field of computer based transliteration. 

TeX is better.

And
> Converting to other
> font systems is easy. I therefore suggest that we should try to agree on a
> common format, and I put the "TUSTEP" format up as my candidate.

The CSX 8-bit coding scheme has already been agreed upon for general
file exchange and terminal display.  All the Sanskrit texts available
via this INDOLOGY listserv and the associated ftp site
(ftp.bcc.ac.uk:pub/users/ucgadkw/indology) are also available in 
standard CSX coding at blackbox.hacc.washington.edu, courtesy of 
Tom Ridgeway.

Your other points are well taken.  The Schreiner transcription is
indeed extremely valuable, but there are two points to be made
about it.

1/      It is in no way bound to TUSTEP.

        The Schreiner transcription can be done with any editor
        at all, from EMACS to vi, from WordPerfect to edlin.  It
        doesn't matter.  It is a 7-bit transcription and coding
        scheme for marking all characters, and a wide range of
        grammatical values such as sandhi, samaasas, etc.  (See
        the online version of Saundaryalahari for an example; it
        is available by ftp from ftp.bcc.ac.uk in directory
        pub/users/ucgadkw/indology.)  Peter himself has written
        filters in TUSTEP to convert between his and other
        transcriptions, so that he can print in naagarii using
        the Velthuis transcription for TeX, etc.

        TUSTEP is a wonderful set of tools for textual analysis,
        but there are other such tools too, for example TACT, the
        Oxford Concordance Program, and the many text processing
        tools which form such a prominent part of Unix (grep,
        awk, sed, tr, uniq, spell, troff, etc.).  These other
        tools are equally able to use the Schreiner transcription
        to extract lists, lemmas, and do statistics on
        grammatical or lexical features.  I am not against
        TUSTEP: it is great.  But the Schreiner transcription is
        a separate issue.
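
To make the point about general-purpose tools concrete, here is a
minimal sketch in Python of the kind of word list that a pipeline
of tr, sort and uniq would produce.  It assumes only that the text is
7-bit and that words are separated by white space; it does not
interpret any of the Schreiner markup, and the sample line is
invented for illustration.

```python
from collections import Counter


def word_frequencies(text):
    """Count whitespace-separated tokens in a 7-bit transliterated text.

    Assumption: tokens are separated by white space and no markup is
    interpreted; a real filter would first handle the transcription's
    own conventions.
    """
    return Counter(text.split())


# Invented sample line, not taken from any real e-text.
sample = "dharma artha kaama dharma mok.sa dharma artha"
freq = word_frequencies(sample)

# Print the list with the most frequent words first.
for word, count in freq.most_common():
    print(count, word)
```

Nothing here depends on the editor or the system used to type the
text, which is the point: any tool that can read plain 7-bit files
can do the same job.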

2/      It requires very significant grammatical knowledge on the
        part of the typist.

        If the typist and the scholar are identical, as with
        Peter Schreiner, then you can have large amounts of text,
        already grammatically analysed.  But this is not commonly
        the case.  Usually, I believe, a scholar gets a grant to
        pay someone (a student) to type a text.  In very big
        transcription projects, the typists may not even know the
        language they are typing.  (This was the case with the
        Greek TLG project, where Greek texts were typed by
        Filipino typists who had learned only the Greek alphabet.)
        In that situation, it would slow the project unacceptably
        to require grammatical analysis as well as transcription.

        It is still very important to have texts transcribed
        verbatim, without the dissolution of sandhi, compounds,
        cases and tenses.  I hope that in time it will be
        possible to semi-automate these tasks.  As I mentioned in
        my earlier note, Peter already has a substantial list of
        analysed lemmata, and this list can be used to analyse
        "samhita" texts.  In classical/Puranic literature, Peter
        has found that up to 60% of words are common to all
        texts.  So a semi-automatic analysis by reference to a
        list (i.e., dictionary-based, as opposed to algorithmic)
        should have a very substantial impact on the task.

        Secondly, at the World Sanskrit Conference in Leiden, Aad
        Verboom demonstrated an algorithmic sandhi analysis
        program, and a grammatical analysis program.  I don't
        know what has happened to this effort since then.  But
        either it can be completed, or someone can do it again.
        Aad's demonstration at Leiden provided a fully
        satisfactory proof-of-concept.
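
The list-based (dictionary) analysis suggested above can be sketched
very simply.  The Python below is only an illustration under stated
assumptions: the lemma list is a tiny invented table, not Peter's
actual list, and the text is assumed to be already split into words,
which a real "samhita" text is not until sandhi has been resolved.
The function names are likewise hypothetical.

```python
def analyse(tokens, lemma_list):
    """Look each token up in a list of already analysed forms.

    Returns two lists: (token, analysis) pairs for tokens found in
    the list, and the tokens that still need a human (or an
    algorithmic analyser).
    """
    analysed = []
    unresolved = []
    for token in tokens:
        if token in lemma_list:
            analysed.append((token, lemma_list[token]))
        else:
            unresolved.append(token)
    return analysed, unresolved


def coverage(tokens, lemma_list):
    """Fraction of tokens that the list can account for."""
    if not tokens:
        return 0.0
    found = sum(1 for token in tokens if token in lemma_list)
    return found / len(tokens)


# Hypothetical lemma list: a few indeclinables, for illustration only.
lemma_list = {
    "na": "na (indeclinable)",
    "ca": "ca (indeclinable)",
    "eva": "eva (indeclinable)",
}

# Invented, already word-separated input.
tokens = ["na", "ca", "raajaa", "eva"]

analysed, unresolved = analyse(tokens, lemma_list)
print("analysed:  ", analysed)
print("unresolved:", unresolved)
print("coverage:  ", coverage(tokens, lemma_list))
```

The unresolved words are exactly the ones left for the scholar, so
the larger the shared vocabulary between texts, the smaller the
manual remainder.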

On the question of sharing texts, something we could do *right
now* is to share titles.  I would like to urge all members of
Indology to submit the names of texts they have transcribed into
digital form, or of any they know of that have been done by
others.  I volunteer to gather the details together into a list
which I will make publicly available by LISTSERV and ftp.

Dominik