On 31 January 2013 16:57, Jan E.M. Houben <jemhouben@gmail.com> wrote:
When the level of perfection is so high:
- at one place I saw "Rao" instead of "Rau" in connection with VP.
Thanks for spotting this. It's fixed. The raw XML file in "downloads" is updated, but the copy in the SARIT/Philologic system won't be updated for a couple of weeks.
Incidentally, while I'm glad to help, in the long run the work of updating files and making corrections is open to everyone. You may do this yourself at the SARIT Github home, as described on the SARIT front page. It takes some computer-savvy, but SARIT is potentially a community project, at least in some key regards. If you feel like taking the Mahabharata and tagging all the geographical names, for example, feel free. You can then feed the updated file back into Github, and the new tagging will be there for all to benefit from.
- I regret that words remain improperly joined following devanagari consonant-vowel mergers as in uktaH and evamuktaH which need to be searched separately (wildcards possible but leads to other problems: cp. evamuktaH and compounds with -muktaH).
Yes, this is a real question. In SARIT, we mostly host files that are in Devanagari-script style spacing. At Gaveṣikā, Amba Kulkarni and her team demonstrate that such files can be algorithmically parsed and word-separations can be inserted automatically, and rather successfully. A future release of SARIT may incorporate this technology, which is Open-Access. We also want to run a Romanized and a Devanagari service side-by-side, so that we serve our audience in India in an appropriate manner. There are technologies for all these things, and they work. But for the present, we have concentrated on building up the size of the corpus.
In defence of the current situation: in my experience, if one really masters the syntax of the Grep searching that SARIT supports, surprisingly sophisticated and precise searches can be achieved, even with Devanagari-style files.
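To give a rough idea of the kind of pattern I mean, here is a minimal sketch in Python's regular-expression dialect. It is not SARIT's own search syntax, which may not support every construct used here, and the word list and the function name find_ukta are made up for illustration. The pattern accepts uktaH standing alone or joined to a preceding word, and rejects forms ending in -muktaH unless the joined word is exactly evamuktaH.

```python
import re

# Hypothetical sample words in a plain romanization (not taken from SARIT).
SAMPLE_WORDS = ["uktaH", "evamuktaH", "vimuktaH", "muktaH", "ityuktaH"]

# Two alternatives before the final "uktaH":
#   .*(?<!m)  any prefix (possibly empty) that does not end in "m",
#             so that vimuktaH, muktaH and the like are rejected
#   evam      the one m-final prefix we explicitly allow (evam + uktaH)
UKTA_PATTERN = re.compile(r"(?:.*(?<!m)|evam)uktaH")


def find_ukta(words):
    """Return the words that, as a whole, match UKTA_PATTERN."""
    return [w for w in words if UKTA_PATTERN.fullmatch(w)]


if __name__ == "__main__":
    print(find_ukta(SAMPLE_WORDS))
```

This is only a string-level heuristic. It cannot tell evam + uktaH from eva + muktaH, so real ambiguities of that kind still have to be settled by reading the passage.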
Best,
Dominik