compound analysis in e-texts

Wed Sep 4 17:35:36 UTC 1996

Dear Birgit,

I am sorry you had to wait so long for an answer. I have had a very busy week!

>One problem, which Jakub Cejka mentioned before, are ambiguous compounds
>which are read in different ways by the tradition itself, and I would like
>to second his question whether you have a policy on such cases. 

When we talk about ambiguous compounds, we are dealing with two
possibilities: Compounds that can be analysed as different types of
compounds (tatpurusha, bahuvrihi etc), and compounds consisting of words that
can be analysed in different ways, like the following:

a/sva.m 

cakravartinarav-ahanocitam = fit to be a vehicle for a ruler who is sovereign
or        
        cakravarti+narav-ahana+ucitam = fit for Narav-ahana, who is a sovereign
        cakra+varti+nara+v-ahana+ucitam = fit to be a vehicle for a ruler who
                                          displays the wheel
        cakra+varti+narav-ahana+ucitam = fit to be a vehicle for Narav-ahana,
                                         who displays the wheel

(Examples courtesy of Gwendolyn Lane, translator of the Kadambari.)

This should be a good example of the problems that Birgit is discussing. In
my system of transliteration, narav-ahana would be kept as one word provided
it is actually a name (I expect the context to disambiguate in most cases).
Otherwise, the analysis would be cakra+varti+nara+v-ahana+ucita. If we
happen to know that cakravartin here means ruler, then that would also be
kept as one word. If we have no pragmatic knowledge as to how we should
interpret the compound, we should use computational codes that allow us to
give all possible interpretations. Thus, any automatic analysis of the text
would be able to tell the scholar that there is a problem which has to be
solved "manually". Ambiguity is a technical problem, nothing else. 
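
A minimal sketch of what such codes could look like, in Python. The notation
is an assumption made for illustration only, not the coding actually used:
alternative analyses of one compound are wrapped in braces and separated by
"|". The function name find_ambiguous is likewise hypothetical. The point is
simply that a program can list every compound that still awaits a decision.

```python
import re

# Hypothetical notation (an assumption, not from the message): alternative
# analyses of one compound are wrapped in braces and separated by "|".
ALTERNATIVES = re.compile(r"\{([^{}]+)\}")


def find_ambiguous(text):
    """Return (offset, [analyses]) for every braced group that offers
    more than one analysis."""
    hits = []
    for match in ALTERNATIVES.finditer(text):
        analyses = [a.strip() for a in match.group(1).split("|")]
        if len(analyses) > 1:
            hits.append((match.start(), analyses))
    return hits


sample = ("a/sva.m {cakravarti+narav-ahana+ucitam"
          "|cakra+varti+nara+v-ahana+ucitam"
          "|cakra+varti+narav-ahana+ucitam}")

# Report each unresolved compound so that it can be settled "manually".
for offset, analyses in find_ambiguous(sample):
    print(f"unresolved compound at offset {offset}:")
    for analysis in analyses:
        print("   ", analysis)
```

Once the context settles the matter, the braced group would be replaced by
the single analysis chosen.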

The fact that occasional analytical errors are made, or that possibilities
are overlooked, is no argument against the analysis of compounds. Other
scholars should always be aware of the fact that there may be more work to do. 

>Add to which, my current experience with preparing an e-text version of the
>complete works of J~naana'sriimitra (short: JNA) tells me that "competence"
>is a very, very relative concept. I have typed in quite a lot of his texts
>by now, and I am virtually "living" with two of his treatises, but his style
>is so intricately difficult that, more often than not, I have to give up on
>compound analysis. Another problem is my lack of competence outside the very
>narrow field of pramaan.a-studies. An author like J~naana'srii, who
>frequently uses vocabulary/illustrations taken from poetics or at least not
>conforming to the "standards" of the poor man's pramaan.a-terminology in
>general, requires constant lexicographical investigation, and a lot of
>reading experience in other subject areas. I don't have this experience, and
>if I had to gain it simply to TYPE in the text, it would take at least ten
>more years for me to come up with the preliminary electronic version of JNA,
>which is not really in anybody's interest. 

In other words, at your present level of competence you can only do part of
the job. I see nothing wrong with that. Other scholars can do the part that
you are not yet fit to do. Again, this is not an argument against the
analysis of compounds, only a description of a practical problem. The main
thing is that as long as we are able to return to the "original" shape of
the text by means of filters or macros that reestablish the text as it was
before analysis, you can have your text both ways. 
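
A minimal sketch of such a filter, in Python. It rests on several
assumptions: "+" separates compound members, "=" marks a separated negation,
"-a" stands for long a (inferred from "v-ahana"), and only a handful of vowel
sandhi rules are covered. The names SANDHI, join_members and restore are
hypothetical, and a usable filter would need a complete sandhi table.

```python
# Assumed markers: "+" between compound members, "=" after a negating prefix.
# Assumed transliteration: "-a" is long a. Only final a / -a meeting an
# initial vowel is handled; everything else is plain concatenation.
SANDHI = {"-a": "-a", "ai": "ai", "au": "au", "a": "-a", "i": "e", "u": "o"}


def join_members(left, right):
    """Join two members, merging a final a / -a with an initial vowel."""
    if left.endswith("-a"):
        stem = left[:-2]
    elif left.endswith("a"):
        stem = left[:-1]
    else:
        return left + right
    # Longer initials are tested first so that "-a", "ai", "au" win over "a".
    for initial in ("-a", "ai", "au", "a", "i", "u"):
        if right.startswith(initial):
            return stem + SANDHI[initial] + right[len(initial):]
    return left + right


def restore(analysed):
    """Turn an analysed compound back into its running-text form."""
    members = analysed.replace("=", "").split("+")
    result = members[0]
    for member in members[1:]:
        result = join_members(result, member)
    return result


print(restore("cakra+varti+nara+v-ahana+ucitam"))
# cakravartinarav-ahanocitam
```

All three analyses of the example compound given earlier restore to the same
string, which is what makes it possible to have the text both ways.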

>Hence, I have formed the opinion that (a) we can never be sure about the
>competence required for the analysis,

Maybe not initially, but certainly after some time!

>and (b) if I personally have to choose
>between probably flawed compound-analysis and no compound-analysis at all, I
>would prefer the latter, as far as texts published for the general audience
>are concerned.

I disagree. All texts have to be read with a critical eye. This is also the
case with analysed texts. I would definitely prefer a flawed
compound-analysis to no compound-analysis, but I would of course try to
correct the errors and fill out the missing parts. An electronic text, just
like a medieval manuscript, is a living text.

>This, of course, does not prevent one from preparing
>compound-analyzed texts for the tasks you mentioned (indexing, collocations
>etc.). Maybe one should differentiate different target-audiences for
>different types of e-texts in the first place. 

I don't see any need to differentiate target-audiences, only a need for some
elementary programming. If a text is typed and analysed, furnished with the
relevant bibliographical data concerning the person who typed the text
originally, along with the necessary macros or filters needed to create a
"proper" Sanskrit text, the recipient can continue work on the text as s/he
pleases. The text can then be passed on in an improved state with new
bibliographic data concerning changes, who made them, when and why they were
made etc. At the bottom of this is good old-fashioned philology, which we
should all cherish. What is new is the way the work on the text is
communicated. 
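
The kind of record meant here can be sketched in a few lines of Python. The
field names (typed_by, who, when, what, why), the class names and the sample
entries are all hypothetical and serve only to illustrate the idea of a
change history that travels with the text.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Revision:
    who: str    # person who made the change
    when: date  # date of the change
    what: str   # what was changed
    why: str    # reason for the change


@dataclass
class ETextHeader:
    title: str
    typed_by: str  # person who typed the text originally
    revisions: list = field(default_factory=list)

    def record(self, who, when, what, why):
        """Append one change to the history."""
        self.revisions.append(Revision(who, when, what, why))

    def render(self):
        """Return the header as plain text, one change per line."""
        lines = [f"title: {self.title}", f"typed by: {self.typed_by}"]
        for r in self.revisions:
            lines.append(f"{r.when.isoformat()} {r.who}: {r.what} ({r.why})")
        return "\n".join(lines)


# Placeholder data, invented for the example.
header = ETextHeader(title="Sample text", typed_by="First Typist")
header.record("Second Scholar", date(1996, 9, 4),
              "analysed compounds in chapter 1", "needed for indexing")
print(header.render())
```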

>Another question I would like to ask is what principles people apply when
>carrying out compound-analysis. Motoi Ono, Jun'ichi Oda and Jun Takashima,
>for example, separated compounds with hyphens in their recently published
>KWIC-Index to Dharmakiirti's works. They adopted the policy not to separate
>(1) words with the prefixes a-, dur- and nih.-; (2) possessive adjectives
>with -vat/-mat are separated, while adverbs with -vat meaning "such as" are
>not; (3) a numeral with -dha/-vidha/-prakaara remains unseparated; (4)
>compounds with -taa/-tva or with the elements -bhaava/-bhuuta are not
>separated; (5) compounds starting with evam-, tat-, tathaa-, para-, yathaa-,
>su-, sva- are not separated; (6) some compounds which are considered as
>technical terms are not separated, e.g. padaartha, agnihotra,
>ayogavyavaccheda, prasajyapratis.edha, svabhaavapratibandha. 
>I would be very interested in getting opinions on this policy. 

This is an interesting question. Personally, I have separated a, an, nir/nih
etc (negations) by means of an equal sign (=). This enables me to manipulate
them in a slightly different manner than other compounds if I want to. As
for terms that are technical terms (concepts) or personal names, I have not
separated them.
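
A minimal Python sketch of why two different separators are convenient. The
function name members, the split_negation switch and the sample word
"a=dharma+phala" are invented for illustration. By default the negation stays
attached to its word, and it is split off only when asked for.

```python
import re


def members(word, split_negation=False):
    """Split an analysed word into its members.

    "+" always separates members; "=" (after a negating prefix)
    separates only when split_negation is true.
    """
    if split_negation:
        return re.split(r"[+=]", word)
    # Keep the negation attached: drop "=" and split on "+" only.
    return word.replace("=", "").split("+")


print(members("a=dharma+phala"))
# ['adharma', 'phala']
print(members("a=dharma+phala", split_negation=True))
# ['a', 'dharma', 'phala']
```

An index could thus list the word under its negated form, while a count of
negated forms could still be made from the same file.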

>As to Lars' argument that compound-analyzed texts facilitate students'
>efforts - this leads on to another discussion, that whether facilitating
>reading Sanskrit for students should be made into a general policy for
>e-texts, and whether it is such a good thing to facilitate too many things
>for students in the first place. 

All sound pedagogy starts with simple things and then proceeds to the
difficult stuff. As it is, Sanskrit teaching has been very much on the
sadistic side. My idea is to lead the students on, through simple narrative
texts, to a reasonably good grasp of the Sanskrit linguistic system, and
then slowly enable them to make their own compound analyses. As for more
seasoned scholars, they may very well be able to analyse their compounds,
but computers need help.

>I personally don't like romanization at
>all, and I think romanized textual editions should die out as soon as
>possible. This opinion is not based on a somewhat sadistic dislike of
>students as such, but on the assumption that Sanskrit is a foreign language
>with its own distinct writing style, and that it should be taught as such.

I think this has already been commented upon, but let me repeat that there
is no particular Sanskrit writing system. The choice of devanagari for S. is
arbitrary. I see nothing wrong in using romanized text. What's more, when we
analyse S. computationally, we are definitely better off with romanized text. 

Once again, sorry about the late answer!

Best regards,

Lars Martin