Release of Sanskrit platform

Wed Sep 8 16:57:36 UTC 2004

This is to announce the Sept 8th release of my Sanskrit computational
linguistics platform,
available at http://pauillac.inria.fr/~huet/SKT/
The verbal morphology is now complete, for the present system (present,
imperfect,
optative, imperative), passive, perfect, future and aorist.

This may be tested at
http://pauillac.inria.fr/~huet/SKT/DICO/grammar.html
where you submit the root (in Velthuis transliteration form) and the
present class.
E.g.: bhuu 1, as 2, m.rj 2, han 2, haa 3, hu 3, daa 4, su 5, p.r 6, yuj
7, k.r 8, j~naa 9,
namas 10, etc.

Conversely, a lemmatiser (at
http://pauillac.inria.fr/~huet/SKT/DICO/index.html#stemmer)
attempts to tag inflected words.
Try for instance apibat, akaar.siit, dudoha, vaahyate, etc (clicking on
Verb).

An experimental segmenter/tagger is also available. It will segment
simple sentences, verb final, such as maarjaarodugdha.mpibati,
analysing the required sandhi.
In tagger mode, it gives a shallow parsing of the sentence or phrase,
linking to the
dictionary.
It may be used for instance to tag verb forms with preverbs, such as
utti.s.tha,
and to decompose compounds. It is able to correctly analyse forms such
as
ihehi (iha-aa-ihi = come here), and adhiiye (I learn, with sandhi
duplicating the i).

The current syntax of recognized phrases is N*.(1+V) with V=(P+1).R
that is a list of noun forms followed optionally with a verb, where a
verb is optionally a
preverb followed with a root form; the current set P of prefixes is:
ati, adhi, adhyava, anu, anuparaa, anupra, anuvi, anta.h, apa, apaa,
api,
abhi, abhini, abhipra, abhivi, abhisam, abhyava, abhyaa, abhyut,
abhyupa,
ava, aa, ut, udaa, upa, upani, upasam, upaa, upaadhi, tira.h, ni, ni.h,
nirava, niraa, paraa, pari, parini, parisam, paryupa, pi, pura.h, pra,
prati, pratini, prativi, pratisam, pratyaa, pratyut, prani, pravi,
pravyaa, praa, vi, vini, vini.h, viparaa, vipari, vipra, vyati, vyapa,
vyava, vyaa, vyut, sa, sa.mni, sa.mpra, sa.mprati, sa.mpravi, sa"mvi,
sam, samava, samaa, samut, samudaa, samudvi, samupa.

Full lists of declined forms are available as downloadable free
linguistic resources.
The first one comprises 135,000 declined nouns (it includes pronouns,
numbers, participles,
particles and undeclinable forms such as absolutive and infinitive
forms). The second one
comprises 74,000 conjugated root forms. These data bases are available
in pdf format and
XML (given with a DTD).

This work is still in a very preliminary form, since it has not been
tested yet on real
corpus. I beg the indulgence of the reader, since many mistakes and
omissions remain.
He is kindly requested to report them to me.
Secondary conjugations are not systematically generated, they are
included only if
explicitly reported in the dictionary. Aorist forms are also included
on need, I did
not attempt to generate all forms given in Whitney roots. Also
participles, infinitives,
periphrastic future and perfect, are not systematically generated.
Conditional and precative
are missing. On the other hand, I generally generate both active
(parasmaipada) and middle
(aatmanepada) forms, so report non attested forms only when they do not
make sense
semantically.

All these tools are available one click away from the Sanskrit
dictionary index, at
http://pauillac.inria.fr/~huet/SKT/DICO/
Enjoy

Gerard Huet