[INDOLOGY] GRETIL quotation network

Sebastian Nehrdich nehrdbsd at googlemail.com
Sat Feb 2 23:22:29 UTC 2019


Hello everybody,

Based on the Sandhi-splitter that Oliver Hellwig and me developed last year
(
https://github.com/OliverHellwig/sanskrit/blob/master/papers/2018emnlp/sandhi-rnn-hellwig-nehrdich.pdf),
I have calculated approximate quotations and parallel passages within the
GRETIL corpus. The HTML-tables of the quotations can be accessed here:
http://buddhist-db.de/sanskrit-html/index.html
The tables are in a very simple format, but since links to the quoted
passages are provided, it can be quite entertaining to navigate through the
files and jump to the quoted passages.

The code and a small description of the used tools are on github:
https://github.com/sebastian-nehrdich/gretil-quotations

The algorithm is based on fasttext vector representations of sequences with
a fixed length of six tokens. This is short enough to match sets of two
pādas of an anuṣṭubh stanza and long enough to avoid yielding too much
unwanted results. I set the cutoff for the similarity intentionally low, so
even matches which might be just very loosely related are included in the
results. I think it is better to have some unrelated results from time to
time than to miss something that might be of importance. Also the
formatting of the GRETIL files is quite different from file to file, and it
might have happened that during the process of extracting and splitting the
Sanskrit sentences something did not go perfectly well. It is therefore
always a good idea to check back with the original etext files.
With best wishes,

Sebastian Nehrdich


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://list.indology.info/pipermail/indology/attachments/20190203/abd9a0bb/attachment.htm>


More information about the INDOLOGY mailing list