[INDOLOGY] New resources

Oliver Hellwig hellwig7 at gmx.de
Thu Aug 28 17:04:41 UTC 2025


Dear all,

A few Sanskrit-related resources have been published today:

* The initial version of the Vedic Prose Corpus (VPC), built in 
collaboration with K. Amano and S. Sellmer. The data are available for 
download from GitHub; more here:
https://github.com/OliverHellwig/sanskrit/tree/master/corpus/VPC

Note that this resource contains English machine translations for all 
texts (produced with Sebastian Nehrdich's MT model; files named 
qqq-mt.txt in the translation subfolders), and human translations 
aligned at sentence level for selected prose works.
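
For those who clone the repository, the machine-translation files can be 
collected with a few lines of Python. This is only a minimal sketch: it 
assumes a local clone and relies solely on the "-mt.txt" naming mentioned 
above. The function name find_mt_files and the example path are 
hypothetical and not part of the repository.

```python
from pathlib import Path


def find_mt_files(vpc_root):
    """Return all files ending in '-mt.txt' below vpc_root, sorted by path."""
    root = Path(vpc_root)
    # rglob walks all subfolders, so no folder names need to be hard-coded.
    return sorted(p for p in root.rglob("*-mt.txt") if p.is_file())


if __name__ == "__main__":
    # Hypothetical location of a local clone; adjust to your setup.
    for path in find_mt_files("sanskrit/corpus/VPC"):
        print(path)
```

Because the search is recursive, the sketch does not depend on how the 
translation subfolders are actually named.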

* A slightly reformatted and lemmatized version of almost half of the 
last GRETIL release:
https://github.com/OliverHellwig/sanskrit/tree/master/corpus/GRETIL

General background information about both resources is given in their 
parent git directory ("Sanskrit text repository"):
https://github.com/OliverHellwig/sanskrit/tree/master/corpus

* Those working with Vedic prose might have a look at a new English 
search interface built on top of the VPC:
https://huggingface.co/spaces/OliverHellwig/vpcsearch

This tool lets you enter an English sentence and then searches for 
semantically related statements in Vedic prose - no Sanskrit, no 
questions, nothing generative! See the "How to use this search tool" 
section at the top of the tool.

* Finally, the lexicographic information from VPC and GRETIL has been 
integrated into the DCS, almost doubling the size of its corpus. Right 
now, the accessibility of this new data is somewhat limited. However, 
when searching for a word (Query page), results from VPC and GRETIL are 
retrieved simultaneously and displayed under the header "Results for qqq 
from unsupervised lemmatization". Clicking this link gives you access to 
the relevant contexts and additional information about the occurrences.

While integrating this new data, I realized that the VPC search 
interface is almost unusable on smartphones and tablets. I hope this 
has been fixed to some extent, e.g. by introducing a new character 
encoding system (aa for long a etc.; follow the link "Diacritics" on the 
query page).
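
To illustrate the idea behind such an ASCII-friendly input scheme, here is 
a small sketch. Only the pair "aa" for long a is taken from this message; 
the table and the function name ascii_to_iast are hypothetical, and the 
actual conventions are those documented under "Diacritics" on the query 
page.

```python
# Only "aa" -> "ā" is stated in the announcement. Any further pairs
# should be taken from the "Diacritics" page, not guessed.
ASCII_TO_IAST = {
    "aa": "ā",
}


def ascii_to_iast(text):
    """Replace ASCII digraphs with their diacritic equivalents."""
    for ascii_form, iast_form in ASCII_TO_IAST.items():
        text = text.replace(ascii_form, iast_form)
    return text


if __name__ == "__main__":
    print(ascii_to_iast("aatman"))  # prints "ātman"
```

A plain lookup table keeps the sketch easy to extend once the full scheme 
has been checked against the documentation.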


Best, Oliver

---

Oliver Hellwig
Institute for the Interdisciplinary Study of Language Evolution, 
University of Zurich

