order of letters

Wed May 19 22:48:57 UTC 2010

Dear Prof. Tull,
This matter is one of those small problems which create lots of  
headaches for the proper design of computer-processing tools
for Sanskrit. One wants to avoid encoding redundancies, in order to  
keep overgeneration low, while preserving completeness.
Also, clear and consistent rules must be applied to lexicographic
ordering in computerized lexicons.

Here is what I do at present. I normalize non-genuine anusvAra to the
nasal homorganic with the following consonant, both in the parsed
input and in the databanks of inflected forms. Note that the inverse
normalization would be incorrect, since we want to keep
forms of the root sa~nj and not turn them into *sa.mj.
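
As a rough illustration of this anusvAra normalization, here is a minimal
Python sketch. It is not the code of the actual tools: the function name,
the table, and the restriction to stops are assumptions made for the
example, and the input is assumed to be in Velthuis-style transliteration
(.m for anusvAra, "n ~n .n n m for the five nasals).

```python
import re

# Hypothetical table: first letter(s) of a following stop -> homorganic nasal.
# Aspirates (kh, gh, ...) are covered because only the first letter is tested.
NASAL_FOR_STOP = {
    "k": '"n', "g": '"n',    # velar
    "c": "~n", "j": "~n",    # palatal
    ".t": ".n", ".d": ".n",  # retroflex
    "t": "n", "d": "n",      # dental
    "p": "m", "b": "m",      # labial
}

# Anusvara (.m) immediately followed by a stop; the lookahead captures the
# stop without consuming it, so the stop itself is left in place.
_ANUSVARA_BEFORE_STOP = re.compile(r"\.m(?=(\.[td]|[kgcjtdpb]))")


def normalize_anusvara(text):
    """Replace non-genuine anusvara by the nasal homorganic with the next stop.

    Anusvara before sibilants, h, semivowels, or at the end of the string
    is treated as genuine and left untouched.
    """
    return _ANUSVARA_BEFORE_STOP.sub(
        lambda match: NASAL_FOR_STOP[match.group(1)], text
    )


print(normalize_anusvara("sa.mj"))       # sa~nj
print(normalize_anusvara("sa.msk.rta"))  # sa.msk.rta (unchanged)
```

The mapping goes in one direction only, which matches the remark above:
nothing here ever turns sa~nj back into *sa.mj.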

Until recently I applied a similar normalization to visarga. When it
was followed by a sibilant, I would replace it by the same sibilant.
Thus e.g. a.hsu would normalize to assu. As in the anusvAra
treatment, such a visarga would appear in the alphabetical order
of the corresponding sibilant, and only visarga preceding "k" and "p"
would be preserved and listed just before the consonants.
But recently I noticed an improper side-effect of this normalization,  
in an example such as:
     api tvam adya prAtaH svasuH g.rham agacchaH
since normalizing input prAtaHsvasuH to prAtassvasuH would prevent its  
parsing. Here prAtaH must be preserved, in order for
sandhiviccheda to find the proper solution with prAtar. Because of  
this, I now treat all visargas in the same way, and I do not
replace them before sibilants.
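
The side-effect can be reproduced with a small Python sketch. This is only
a toy under stated assumptions: the function names and the two-word list of
stored forms are hypothetical, visarga is written H as in the example above,
and the sibilant letters s, z, S are assumed. Real segmentation is far more
involved than a prefix test.

```python
import re


def assimilate_visarga(text):
    """The former rule: visarga (H) before a sibilant becomes that sibilant."""
    return re.sub(r"H([szS])", r"\1\1", text)


def stored_prefixes(chunk, forms):
    """Return the stored forms with which the chunk begins (toy lookup)."""
    return sorted(form for form in forms if chunk.startswith(form))


# Hypothetical miniature databank of stored forms.
forms = {"prAtaH", "svasuH"}

raw = "prAtaHsvasuH"
normalized = assimilate_visarga(raw)

print(normalized)                          # prAtassvasuH
print(stored_prefixes(raw, forms))         # ['prAtaH']
print(stored_prefixes(normalized, forms))  # []
```

Once the visarga has been assimilated, the chunk no longer begins with
prAtaH, so a segmenter that relies on that form has nothing to work with.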

A similar problem occurs with words ending in
"r" such as prAtar, antar, punar, catur, and also with all the
verbal forms in -ur: they must be stored intact in the forms databanks,
and terminal sandhi should not be applied to them.
This is needed to parse chunks such as punarapi, since puna.h+api  
would yield *puno'pi.
This difficulty is discussed nowhere that I know of, and pandits protest at
my listing punar rather than puna.h in tables.
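
To make the punar/puna.h point concrete, here is a deliberately tiny Python
sketch of external sandhi. It is an illustration only: the function name is
hypothetical, the notation is Velthuis-style (.h for visarga), and the only
rule implemented is -a.h + a- giving -o'-; every other junction is plain
concatenation, which is not real sandhi.

```python
def join_sandhi(left, right):
    """Toy external sandhi covering only the case discussed in the text."""
    starts_with_short_a = right.startswith("a") and not right.startswith(
        ("aa", "ai", "au")
    )
    if left.endswith("a.h") and starts_with_short_a:
        # -a.h + a-  ->  -o + avagraha, with the initial a elided
        return left[:-3] + "o'" + right[1:]
    # Fallback: plain concatenation (a placeholder, not a sandhi rule).
    return left + right


print(join_sandhi("puna.h", "api"))  # puno'pi  (the unwanted form)
print(join_sandhi("punar", "api"))   # punarapi (the attested chunk)
```

With puna.h as the stored form, the generated junction is *puno'pi and the
chunk punarapi is never recognized; with punar stored intact, it is.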

Many other mundane problems arise with avagraha, with marking hiatus  
in words such as pra-ucya or pra-uga, etc.
Another problem which seems to be below the threshold of interest for  
most linguists is the optional gemination/degemination: forms
such as karmma, kiirttyate, vAggmI which are indeed Paninian (the last  
one using the very ad-hoc 5.2.124 sUtra), but also forms such as
chatra, chAtra, patrikA, vArtA, satra, satva, which may not be  
Paninian, but which occur in corpus nonetheless.

Regards
G. Huet

On 19 May 2010, at 15:38, Herman Tull wrote:

> I am putting together some simple rules for first year students.  In  
> dealing with the order of the letters, there is the ever present  
> confusion over the placement of anusvAra and visarga.  I notice that  
> Monier Williams and Macdonell are consistent in their placement of  
> the anusvara (placing it after the vowels when it precedes the semi- 
> vowels or the sibilants).  But, they seem not to agree on the  
> placement of the visarga.  Macdonell follows the rule that he states  
> in his student grammar (p. 3), that the visarga follows the vowels  
> when it precedes "k" and "p" and that when it precedes a sibilant  
> it is placed in the consonantal order of the sibilants.  E.g., on p.  
> 17 of his dictionary he has an article for antaH-ka... and then on  
> p. 18 he has the article for antaH-sa...
>
> Monier-Williams, on the other hand, combines this all into one  
> article (p. 43, "antaH"), and so does not distinguish between  
> visarga before "k" and "p" on the one hand, and before the sibilants  
> on the other, treating all of it as if it precedes "antar" (which  
> seems incorrect, since antaH-sa would not precede antar according to  
> the rule).
>
> A small matter (especially with the advent of the electronic  
> dictionary), perhaps, but can anyone shed light on this?
>
> Herman Tull