[INDOLOGY] Search interface for the GRETIL Corpus

Dan Lusthaus yogacara at gmail.com
Wed May 7 16:55:08 UTC 2025


Dear Claudius,

While splitting words is worthwhile, there is value in indexing “aggregated” words as well. For instance, I recently used your search engine to find hits for viśeṣalakṣaṇa. While I was typing viśeṣa, many suggested hits and “aggregated” forms appeared, but it was not until I typed the “l” beginning “lakṣaṇa” that that particular compound appeared as one of the suggestions, which I could then click. An index that deals only with isolated words would therefore make the search for a term like viśeṣalakṣaṇa much more complicated and time-consuming. There would need to be an option for searching for such aggregated terms without having to trawl through multiple examples (as, e.g., the Digital Corpus of Sanskrit provides).
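
To illustrate the point, here is a minimal Python sketch of an index that keeps each compound as written alongside its split components, so that prefix suggestions can surface both. Everything in it is an assumption for illustration: the names build_index, suggest and toy_split are hypothetical, the lookup table merely stands in for a real word-splitting module, and none of this reflects how the GRETIL search engine is actually implemented.

```python
import unicodedata
from bisect import bisect_left


def nfc(text):
    # Normalise so that precomposed and decomposed diacritics compare equal.
    return unicodedata.normalize("NFC", text)


# Hypothetical lookup table standing in for a real compound splitter.
TOY_SPLITS = {nfc("viśeṣalakṣaṇa"): [nfc("viśeṣa"), nfc("lakṣaṇa")]}


def toy_split(token):
    # Return the known components, or the token itself if no split is known.
    return TOY_SPLITS.get(token, [token])


def build_index(tokens, split=toy_split):
    # Keep every token as written AND its components, sorted for prefix lookup.
    terms = set()
    for token in map(nfc, tokens):
        terms.add(token)
        terms.update(split(token))
    return sorted(terms)


def suggest(index, prefix, limit=10):
    # All terms sharing a prefix are contiguous in a sorted list,
    # so a binary search finds the first candidate.
    prefix = nfc(prefix)
    hits = []
    for term in index[bisect_left(index, prefix):]:
        if len(hits) >= limit or not term.startswith(prefix):
            break
        hits.append(term)
    return hits


index = build_index(["viśeṣalakṣaṇa", "viśeṣaṇa", "lakṣaṇa"])
print(suggest(index, "viśeṣa"))   # the component and both longer forms
print(suggest(index, "viśeṣal"))  # only the compound remains
```

In this toy index the compound stays reachable by typing it out, while its components viśeṣa and lakṣaṇa are also indexed on their own. The cost is a larger index than one holding split words only.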

Just my two cents.

Thanks for the wonderful and very useful tool.

Best wishes,
Dan Lusthaus

> On May 7, 2025, at 12:32 PM, Claudius Teodorescu via INDOLOGY <indology at list.indology.info> wrote:
> 
> Dear Arlo,
> 
> Thank you for your message.
> 
> Yes, it is possible to add new texts to the corpus. I would also like to find a reliable module that, during indexing, splits words written without spaces between them. As of now, the full-text index contains 6.3 million words, because it also includes such aggregated words. With splitting, the index would be smaller.
> 
> The current approach to indexing texts within GRETIL also allows indexing texts that are in plain text (TXT) or even HTML format, if there are no resources for converting them to XML (TEI).
> 
> So, let us start adding new texts! :)
> 
> Best wishes,
> Claudius
> 
> On Wed, 7 May 2025 at 17:52, Arlo Griffiths <arlogriffiths at hotmail.com> wrote:
>> Dear Claudius,
>> 
>> Thanks a lot for this initiative. Allow me to ask if it is also possible to resume absorbing texts into the same corpus?
>> 
>> Now that its Göttingen host no longer seems to be interested in curating it, why not store all files on github or gitlab and initiate a collective INDOLOGY endeavor toward curating (txt > xml conversion) and expanding the corpus?
>> 
>> I write these words without having a full understanding of everything that would be required, but I'd certainly be interested in contributing.
>> 
>> Best wishes,
>> 
>> Arlo Griffiths
>> EFEO
>> 
>> From: INDOLOGY <indology-bounces at list.indology.info> on behalf of Claudius Teodorescu via INDOLOGY <indology at list.indology.info>
>> Sent: Tuesday, April 22, 2025 9:00 AM
>> To: Indology <indology at list.indology.info>
>> Subject: [INDOLOGY] Search interface for the GRETIL Corpus
>>  
>> Dear all,
>> 
>> Over the last few months, I have managed to set up a search interface for the texts of the GRETIL Corpus, located at [1]. The interface is published as a static website, with a static full-text index and a static search engine, which executes the search in the browser, without the need for a server.
>> 
>> In order to convert the files to HTML format, which is used to display them in the search interface, I had to make some small updates to the XML files of the corpus. These changes are documented in [2]. As one would expect, there is still work to be done on the XML files of the corpus.
>> 
>> Please let me know if you find any bugs with the search interface.
>> 
>> Best regards,
>> Claudius Teodorescu
>> 
>> [1] https://claudius-teodorescu.gitlab.io/gretil-corpus-site/
>> [2] https://gitlab.com/claudius-teodorescu/gretil-corpus-data
> 
> 
> 
> -- 
> With respect,
> Claudius Teodorescu
> 
> _______________________________________________
> INDOLOGY mailing list
> INDOLOGY at list.indology.info
> https://list.indology.info/mailman/listinfo/indology
