Dear Madhav, 

Thank you very much for the feedback and for pointing out the missing etext of the Atharvaveda-Saṃhitā. I do remember that there was a problem with the encoding in a few cases, so the splitter failed to process a couple of etexts, but when I spot-checked I couldn't locate the problem. I think it should be easy to fix, and I will look into it in the next few days.
Regarding the Vākyapadīya, this should be the file: http://buddhist-db.de/sanskrit-html/vakyp1au.html
When I scraped the data from GRETIL, all I could use to provide the filenames was the information included in the headers of the etexts, which is usually without diacritics and sometimes strangely formatted, so using the list I provided has its own challenges. Unfortunately, there is not yet a good, well-formatted list of the (about 1300) etexts in this collection; that would be very helpful.
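Because the header-derived names usually lack diacritics, one way to search such a list is to strip diacritics from both the query and the entries before comparing. The following is only a minimal sketch of that idea: the titles in `titles` are hypothetical examples, not actual entries from the list, and `strip_diacritics` and `find_titles` are illustrative names.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Lowercase the text and drop combining marks (macrons, underdots, etc.)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


def find_titles(query: str, titles: list[str]) -> list[str]:
    """Return every title that contains the query, ignoring diacritics and case."""
    key = strip_diacritics(query)
    return [title for title in titles if key in strip_diacritics(title)]


# Hypothetical header-style titles without diacritics (assumed for illustration).
titles = [
    "Bhartrhari: Vakyapadiya",
    "Atharvaveda-Samhita",
    "Bhartrhari: Satakatraya",
]

print(find_titles("Vākyapadīya", titles))
```

This only handles missing diacritics; irregular formatting in the headers would still need to be checked by hand.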
With best wishes, 

Sebastian 

On Sun, Feb 3, 2019 at 8:47 AM Madhav Deshpande <mmdesh@umich.edu> wrote:
Dear Sebastian,

     Congratulations on an amazing achievement.  I tested a few texts and they worked well.  The one that did not open was Atharvaveda-Saṃhitā.  So there may be a few glitches here and there.  I saw Bhartr̥hari's Śatakas.  Is the Vākyapadīya included?  I may have missed it in the long list.
     In any case, thank you so much for providing this wonderful resource.  With best wishes,

Madhav

Madhav M. Deshpande
Professor Emeritus
Sanskrit and Linguistics
University of Michigan
[Residence: Campbell, California]


On Sat, Feb 2, 2019 at 3:23 PM Sebastian Nehrdich via INDOLOGY <indology@list.indology.info> wrote:
Hello everybody, 

Based on the Sandhi-splitter that Oliver Hellwig and I developed last year (https://github.com/OliverHellwig/sanskrit/blob/master/papers/2018emnlp/sandhi-rnn-hellwig-nehrdich.pdf), I have calculated approximate quotations and parallel passages within the GRETIL corpus. The HTML tables of the quotations can be accessed here: http://buddhist-db.de/sanskrit-html/index.html
The tables are in a very simple format, but since links to the quoted passages are provided, it can be quite entertaining to navigate through the files and jump to the quoted passages.

The code and a short description of the tools used are on GitHub: https://github.com/sebastian-nehrdich/gretil-quotations

The algorithm is based on fastText vector representations of sequences with a fixed length of six tokens. This is short enough to match sets of two pādas of an anuṣṭubh stanza and long enough to avoid yielding too many unwanted results. I intentionally set the similarity cutoff low, so even matches that might be only very loosely related are included in the results. I think it is better to have some unrelated results from time to time than to miss something that might be of importance. Also, the formatting of the GRETIL files differs considerably from file to file, and something may have gone wrong while extracting and splitting the Sanskrit sentences. It is therefore always a good idea to check back with the original etext files.
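To make the general idea concrete, here is a rough sketch of matching six-token windows by vector similarity with a low cutoff. It is not the code from the repository. `toy_vector` is a hypothetical stand-in for a trained fastText model, averaging the token vectors is an assumption about how a sequence vector could be built, and the cutoff value is arbitrary.

```python
import math

WINDOW = 6  # fixed sequence length in tokens, as described above


def windows(tokens, size=WINDOW):
    """Return every run of `size` consecutive tokens."""
    return [tokens[i:i + size] for i in range(len(tokens) - size + 1)]


def toy_vector(token, dim=16):
    """Hypothetical stand-in for a fastText word vector: hashed character trigrams."""
    vec = [0.0] * dim
    padded = f"<{token}>"
    for i in range(len(padded) - 2):
        gram = padded[i:i + 3]
        vec[sum(ord(c) for c in gram) % dim] += 1.0
    return vec


def sequence_vector(tokens):
    """Average the token vectors (an assumed way to represent a window)."""
    vectors = [toy_vector(t) for t in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_matches(tokens_a, tokens_b, cutoff=0.5):
    """Return (i, j, similarity) for all window pairs at or above the cutoff."""
    vecs_a = [sequence_vector(w) for w in windows(tokens_a)]
    vecs_b = [sequence_vector(w) for w in windows(tokens_b)]
    matches = []
    for i, u in enumerate(vecs_a):
        for j, v in enumerate(vecs_b):
            similarity = cosine(u, v)
            if similarity >= cutoff:
                matches.append((i, j, similarity))
    return matches


text_a = "dharmakṣetre kurukṣetre samavetā yuyutsavaḥ māmakāḥ pāṇḍavāś".split()
text_b = ["atha"] + text_a  # the same six tokens preceded by one extra word

for i, j, similarity in find_matches(text_a, text_b):
    print(i, j, round(similarity, 3))
```

With a low cutoff, this kind of comparison also returns loosely related windows, which is the trade-off described above. Comparing every pair of windows like this would be far too slow for a whole corpus; the loop is kept only for clarity.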
With best wishes, 

Sebastian Nehrdich 
 