Input of e-texts - some suggestions

Tue Jun 11 09:39:28 UTC 2002

Dear list members,

Roughly half a year after the introduction of the Goettingen Register
of Electronic Texts in Indian Languages (GRETIL) I would like to
express my thanks to all contributors of e-texts and, at the same
time, invite further contributions. It should be noted that these
contributions are in no way expected to comply with the suggestions
made below.

Anyway, here are some points that I have come to find useful in my own
work as well as in preparing files from various sources for GRETIL.
Perhaps they can serve as a starting point for a discussion.

-   FORMAT: Assuming that the aim of the text input is to provide a
    scholarly reference aid for a given text, rather than an exercise
    in piety, I consider transliteration in a PLAIN TEXT FILE
    preferable to any other format such as PDF, RTF, HTML etc.,
    which may turn out practically useless for the said purpose,
    especially when combined with non-Latin scripts.

-   ENCODING: No matter which encoding is used in transliteration,
    it should be
    -  FREE FROM ANY AMBIGUITY (that may, e.g., arise from employing
       "n" for different class nasals)
    -  and FULLY DOCUMENTED at the beginning of every e-text,
       preferably in a chart giving the equivalent of each diacritic in
       ASCII or an established reference encoding such as CSX. Casual
       references to "ITRANS", "Unicode", "UTF8" or whatever are not
       very helpful to those using other encodings - and, odd as it may
       seem, "other" encodings are not likely to vanish into thin air,
       nor will "global" marketing strategies for long prevent the rise
       of new encoding systems, making today's one-size-fits-all
       solution just another item of electronic mythology.

-    REFERENCE SYSTEM: This is perhaps the most neglected aspect in
     the majority of e-texts one comes across. And yet, with the
     computer's well-known limitation to one screenful of text at a
     time, it is crucial to provide readers with adequate orientation,
     citing, as it were, book, chapter and verse in each and every
     screenful of text.
     -     REFERENCES SHOULD BE PLACED AT THE END of the respective
           text unit (such as a verse or line) to allow for later
           SORTING of lines (or padas) in alphabetical order
           (cf. below).
     -     REFERENCES SHOULD BE GIVEN IN FULL, e.g. "3,13.120",
           instead of restricting them to the smallest unit, say, the
           verse number (just "120" instead of "3,13.120"). Having
           browsed two or three screens up or down from a chapter
           heading, one may easily have forgotten where exactly one
           happens to be. Orientation can be even more difficult if an
           ordinary word search takes you from the beginning of the
           file right to a verse with the enigmatic reference "120":
           for a start, you will have to scroll 119 verses up to find
           out that you're in chapter 13, and it is all too plain that
           your expedition through the text - and away from the
           passage you were looking for - doesn't end there.
     -     With next to no additional effort, references can be made
           SUITABLE FOR CLASSIFIED SEARCH simply by using distinctive
           punctuation, such as COMMA between book and chapter, and
           DOT between chapter and verse. This allows you to
           distinguish the search for "3,13" (=book 3, chapter 13)
           from "3.13" (chapter 3, verse 13).
     -     Especially when a file contains more than one e-text, the
           reference should include an ABBREVIATION FOR THE TEXT in
           question, preferably with a connecting underscore to
           prevent accidental separation due to line break, e.g.
           "MBh_3,13.120". Such an abbreviation is essential in pada /
           verse indices that you may later want to merge with indices
           of other texts to search for parallels.
     -     In a file combining a root text and interspersed
           commentary, say, the Mahabharata and Nilakantha's
           Bharatabhavadipa, distinct abbreviations, e.g.
           "MBh_3,13.120" for the root text and "MBhN_3,13.120" for
           the commentary, will make orientation much easier.
     -     MARKERS FOR METRICAL UNITS (padas) AND SECTIONS OF PROSE
           (sentences) are indispensable for generating pada indices.
           E.g., the Anustubh pattern could look like this:
           For a four-pada verse:
           ........  $ ........  &
           ........  £ ........  // XY_n,n.n //
           For a six-pada verse:
           ........  $ ........  &
           ........  £ ........  ₧
           ........  ƒ ........  // XY_n,n.n //
           Here again, everything is fine as long as it is
           UNAMBIGUOUS.
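
To make the points on references and pada markers a little more
concrete, here is a minimal Python sketch of how a file prepared along
these lines might be turned into a pada index. It is an illustration
under stated assumptions, not a description of how GRETIL files are
actually processed: the function names, the sample "verses" (plain
placeholder words, not real text) and the marker set are all
hypothetical. In particular, "%" is only an ASCII stand-in for the third
pada marker, and the sort uses plain code-point order rather than the
alphabetical order of any Indian script.

```python
import re

# Hypothetical single-character pada markers. "$" and "&" follow the
# pattern shown above; "%" is an assumed ASCII stand-in for the third one.
PADA_MARKERS = "$&%"

# A reference given in full and placed at the end of the verse:
# abbreviation, underscore, book, COMMA, chapter, DOT, verse.
REF_RE = re.compile(
    r"//\s*(?P<ref>[A-Za-z]+_\d+,\d+\.\d+)\s*//"
)


def split_verse(verse):
    """Return (list of padas, full reference) for one marked-up verse."""
    match = REF_RE.search(verse)
    if match is None:
        raise ValueError("no full reference found in verse")
    body = verse[:match.start()]
    parts = re.split("[" + re.escape(PADA_MARKERS) + "]", body)
    # Collapse line breaks and runs of spaces inside each pada.
    padas = [" ".join(part.split()) for part in parts]
    return [pada for pada in padas if pada], match.group("ref")


def pada_index(verses):
    """Build a sorted list of (pada, reference) pairs."""
    entries = []
    for verse in verses:
        padas, ref = split_verse(verse)
        entries.extend((pada, ref) for pada in padas)
    # Simplification: code-point order, not a proper alphabetical order.
    return sorted(entries)


def find_chapter(verses, text, book, chapter):
    """Classified search: comma after the book, dot after the chapter."""
    prefix = f"{text}_{book},{chapter}."
    return [verse for verse in verses if prefix in verse]


# Placeholder data only; "XY" is the dummy abbreviation used above.
SAMPLE = [
    "gamma $ alpha &\ndelta % beta // XY_3,13.120 //",
    "epsilon $ alpha &\nzeta % eta // XY_13,3.13 //",
]

if __name__ == "__main__":
    for pada, ref in pada_index(SAMPLE):
        print(f"{pada}  // {ref} //")
```

Because every pada carries its full reference, the two occurrences of
"alpha" end up next to each other in the sorted index, each still
pointing to its own verse, and a search for book 3, chapter 13 does not
pick up chapter 3, verse 13 of book 13.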

*******************************************************************

These suggestions have gradually emerged from my own practice. I
would be interested to hear what others have to say about this.

Finally, let me again point out that contributions to GRETIL are in
no way expected to comply with these suggestions!

Best regards

Reinhold Gruenendahl

********************************************************************

Dr. Reinhold Gruenendahl
Niedersaechsische Staats- und Universitaetsbibliothek
Fachreferat sued- und suedostasiatische Philologien
(Dept. of Indology)

37070 Goettingen, Germany
Tel (+49) (0)5 51 / 39 52 83
Fax (+49) (0)5 51 / 39 23 61
gruenen at mail.sub.uni-goettingen.de

FACH-INFORMATIONEN INDOLOGIE, GOETTINGEN:
http://www.sub.uni-goettingen.de/ebene_1/fiindolo/fiindolo.htm
In English:
http://www.sub.uni-goettingen.de/ebene_1/fiindolo/fiindole.htm

GRETIL - Goettingen Register of Electronic Texts in Indian Languages
http://www.sub.uni-goettingen.de/ebene_1/fiindolo/gretil.htm