Sanskrit lexicon

Fri Feb 17 16:53:34 UTC 1995

> In response to Thomas Malten's sample page of Monier-Williams I have been
> reflecting about the types of information I would like to be able to find
> (electronically); a simple reproduction of the printed page might not be
> sufficient. And IF the conversion into electronic format is done manually
> (and Malten's sample looks like it), and IF it is done by trained people, it
> would seem highly recommendable to tag the various types of information.

The conversion of my sample page of MW was indeed done by a student of
Sanskrit but I think all the general structural information (which basically
amounts to typeface changes and diacritics as contained in the tag list) can
be identified by anyone.

A simple reproduction of MW at first may answer most of the needs a general
user may have and does not preclude further tagging operations. What I
wanted to convey with the example page is that with comparatively little
effort one could have at least all the information electronically that the
printed MW gives and more (e.g. English-Skt). I think that most of the
tagging requirements you have detailed can be achieved automatically once
the "text" is there (that is what I meant by "programmable"). I shall try
to comment on each point you have raised.

The MW entries can then be used either in PC/Mac editor systems or can
be converted to the simple Internet/Unix GOPHER based WAIS full
text retrieval system (you can check our OTL and Tamil concordances). But
basically there is no hindrance I can see to converting the MW text for use
under any system, OCP, Tact, 4th Dimension, Dbase, TUSTEP or whatever.

> A main entry may contain information concerning:
> -- source texts (individual or groups, with textual reference or without,
> mostly tied to specific meaning)

the sources which MW gives can be unambiguously accessed through his list
of abbreviations. That is, if you search for the string "MBh.*" you are not
likely to get anything but Mahabharata citations.
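
As a rough illustration of such a search, here is a minimal Python sketch.
The sample string and the name find_citations are made up for the example
and are not taken from the MW data; the sketch also assumes that a citation
runs up to the next semicolon, which may not hold throughout MW.

```python
import re

# Invented sample text for illustration only; not actual MW content.
SAMPLE = "gloss one, MBh. iii, 12; R. ii; gloss two, MBh. xii"


def find_citations(text, abbrev):
    """Return each occurrence of abbrev plus the text up to the next ';'."""
    # re.escape makes the dot in "MBh." match a literal dot.
    pattern = re.compile(re.escape(abbrev) + r"[^;]*")
    return [m.group(0).strip() for m in pattern.finditer(text)]


if __name__ == "__main__":
    for hit in find_citations(SAMPLE, "MBh."):
        print(hit)
```

The abbreviation is escaped on purpose: read as a regular expression, an
unescaped "MBh.*" would also accept strings such as "MBhx", since the dot
then stands for any character.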

> -- sublemmas (distinguished according to part of speech, e.g. nouns as
> sublemmas to adjectives, marked by italic endings in parentheses)

can be generated by searching for string "(%i1%), f." or the like which
can then be automatically tagged to each subsequent Skt. word of the
subentry.
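
A minimal sketch of that search in Python, assuming that a sublemma marker
always has the shape "(%ending%), m." / "f." / "n." as in the string quoted
above. The function name split_sublemmas and the sample entry are
hypothetical; the real entries may need a wider pattern.

```python
import re

# Assumed marker shape: "(%i1%), f." = italic ending in parentheses + gender.
SUBLEMMA = re.compile(r"\(%([^%]+)%\),\s*([mfn])\.")


def split_sublemmas(entry):
    """Return (ending, gender, following text) for each sublemma marker."""
    markers = list(SUBLEMMA.finditer(entry))
    parts = []
    for i, m in enumerate(markers):
        # A subentry is taken to run up to the next marker or the entry end.
        end = markers[i + 1].start() if i + 1 < len(markers) else len(entry)
        parts.append((m.group(1), m.group(2), entry[m.end():end].strip()))
    return parts


if __name__ == "__main__":
    # Invented sample entry, for illustration only.
    sample = "gloss a; (%i1%), f. gloss b; (%am%), n. gloss c"
    for ending, gender, body in split_sublemmas(sample):
        print(ending, gender, body)
```

Each returned tuple carries the ending and gender, so they could be attached
as a tag to every Skt. word found in the corresponding subentry text.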

> -- compounds with identical first member (indicated by leading hyphen if
> sandhi allows that)

can be generated automatically by putting the headword into a variable XYZ
and then replacing each occurrence of string "{#-" with the headword. Let's
take for example the entry *{%kun5jara, as$}:
1. XYZ = {%kun5jara
2. change all "{#-" into XYZ [until next occurrence of "*"]
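
The two steps above can be sketched in Python as follows. This assumes that
every entry starts with "*", that "*" occurs nowhere else in the text, and
that the headword ends at the first comma, space, "$" or "}". The function
name expand_compounds and the sample text (apart from the headword
{%kun5jara) are placeholders invented for the example.

```python
import re

# Headword = "{%" plus everything up to the first comma, "$", "}" or space.
HEADWORD = re.compile(r"\{%[^,$}\s]+")


def expand_compounds(text):
    """Replace each "{#-" with the headword of the entry it occurs in."""
    chunks = text.split("*")      # chunks[0] is whatever precedes the first entry
    expanded = [chunks[0]]
    for chunk in chunks[1:]:
        m = HEADWORD.match(chunk)  # step 1: XYZ = headword
        if m:
            # step 2: change all "{#-" into XYZ, within this entry only
            chunk = chunk.replace("{#-", m.group(0))
        expanded.append(chunk)
    return "*".join(expanded)


if __name__ == "__main__":
    # Placeholder entries; "xxx", "yyy" and "zzz" are not real MW data.
    sample = "*{%kun5jara, as$} m. gloss {#-xxx$} gloss *{%zzz, as$} m. {#-yyy$}"
    print(expand_compounds(sample))
```

Splitting on "*" is what limits each replacement to the current entry, which
corresponds to the "until next occurrence" condition in step 2.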

> -- compounds with the lemma-word as second member

basically the same as above, e.g.
1. XYZ = kun5jara
2. change all "+$}" into "+" XYZ "$}" [{%ra1jak+$} becomes {%ra1jak+kun5jara$}]
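
A short Python sketch of this second case, following the bracketed example
above ({%ra1jak+$} becomes {%ra1jak+kun5jara$}): the headword is inserted
without its leading "{%", and the "+" and the closing "$}" are kept. The
function name fill_second_member and the sample entry are hypothetical.

```python
import re

# Bare headword: the text after "{%" up to the first comma, "$", "}" or space.
BARE_HEADWORD = re.compile(r"\{%([^,$}\s]+)")


def fill_second_member(entry):
    """Insert the entry's headword after each "+" that precedes "$}"."""
    m = BARE_HEADWORD.match(entry.lstrip("*"))
    if m is None:
        return entry  # no recognisable headword: leave the entry untouched
    return entry.replace("+$}", "+" + m.group(1) + "$}")


if __name__ == "__main__":
    # Invented entry built around the two strings quoted in the message.
    sample = "*{%kun5jara, as$} m. gloss; {%ra1jak+$} m. gloss"
    print(fill_second_member(sample))
```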

> -- parallel lemmas

basically the same, if I know the notation used in MW

> -- pointers to other lemmata (entries without indication of meanings)

that should be taken care of in a retrieval system by querying for the
string after the crossreference but otherwise it would be equally easy -
if somewhat wasteful (digital pollution!) - to repeat the entry referred to.

> -- "homonyms" (different meanings) dependant on the grammatical tag (e.g.
> kut2 as "cl. 6.P." or as "cl. 4.P.")

the identification is there, as the classes (e.g. "cl. 6.P.") can serve as
tags [or have I missed something?].
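
To show how the class strings could serve as tags, here is a small Python
sketch. It assumes that the class is always written as "cl.", a number, a
dot and a voice abbreviation followed by a dot, as in the two examples
quoted; the name class_tags and the sample entries are made up.

```python
import re

# Assumed shape: "cl. 6.P." = class number followed by a voice abbreviation.
CLASS_TAG = re.compile(r"cl\.\s*(\d+)\.\s*([A-Z][A-Za-z0-9]*)\.")


def class_tags(entry):
    """Return (class number, voice) pairs found in an entry."""
    return [(int(number), voice) for number, voice in CLASS_TAG.findall(entry)]


if __name__ == "__main__":
    # Placeholder entries; only the class strings come from the message.
    for entry in ("kut2 cl. 6.P. gloss", "kut2 cl. 4.P. gloss"):
        print(entry.split()[0], class_tags(entry))
```

Pairing the lemma with the extracted class gives a key such as
("kut2", 6, "P") that keeps the two homonyms apart.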

> -- explanations (e.g. "there being eight elephants of the cardinal points")

explanations are always in Roman typeface and preceded and followed by a
bracket "(" ")". To take up your example:
1. locate first occurrence of "$" [=Roman typeface]
2. locate first occurrence of "("
3. match closing bracket
4. [done]
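
The four steps above, sketched in Python. The bracket matching counts
nesting depth, which is an addition of mine in case an explanation itself
contains brackets; the function name first_explanation and the sample entry
(apart from the explanation quoted above) are invented for the example.

```python
def first_explanation(entry):
    """Return the first bracketed text after the first "$", or None."""
    roman = entry.find("$")            # step 1: first "$" (Roman typeface)
    if roman == -1:
        return None
    start = entry.find("(", roman)     # step 2: first "(" after it
    if start == -1:
        return None
    depth = 0
    for i in range(start, len(entry)):  # step 3: match the closing bracket
        if entry[i] == "(":
            depth += 1
        elif entry[i] == ")":
            depth -= 1
            if depth == 0:
                return entry[start + 1:i]  # step 4: done
    return None  # unbalanced brackets


if __name__ == "__main__":
    sample = ("*{%kun5jara, as$} m. gloss (there being eight elephants "
              "of the cardinal points), gloss")
    print(first_explanation(sample))
```

As written, the sketch would also pick up a bracketed ending such as
"(%i1%)" if one follows the first "$", so such markers would have to be
tagged or skipped beforehand.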

> I do not see that the typography in Monier-Williams would allow to
> distinguish these (and other) types of information automatically.

unless there are ambiguities/difficulties in MW not met with on page 288
(for example the occurrence of words written in Greek or other exotic
scripts) I think everything can be done. If you are interested I can send
you the REXX macros which will do this for page 288.

> In my project of computerizing Mylius' Woerterbuch Sanskrit-Deutsch I have
> no "manual labour" at my disposal and must restrict the tagging to what can
> be achieved by interpreting the typography of the printed book:
> 1. Counter for homonyms
> 2. lemma
> 3. grammatical tags
> 4. meaning or meanings (including specifications concerning semantic
> context, syntax, etc.)

how do you do it? it may be useful to have a detailed
description and examples of your work process.

> I suppose some kind of agreement as to what is recommandable and/or
> necessary should be reached before each of us begin to encode his/her 55K of
> Monier-Williams. Does the Text Encoding Initiative provide us with a model?!

Your words in God's ear!

> Peter Schreiner
> (PS: I am NOT concerned with the details of transliteration or tagging;
> Malten's system is beautifully unambiguous. But I would like to understand
> the requirements for the "logic" of an electronic dictionary.)

Unambiguity is indeed all that counts. On the whole I don't think it would
take more than two days' work to convert a (correctly) typed MW ASCII text
into a fully expanded and tagged lexical database.
-Thomas Malten
------------------------------------------------------------------------------
Institute of Indology and Tamil Studies, Pohligstr.1, 50969 Koeln, Germany
Tel 0221/4705340 Fax 0221/4705151 email ami01 at rrz.uni-koeln.de
-------------------------------------------------------------------------------