[INDOLOGY] Search interface for the GRETIL Corpus
Claudius Teodorescu
claudius.teodorescu at gmail.com
Wed May 7 17:45:03 UTC 2025
I see. This index will stay as it is.
If ever needed, one can add another index, for simple words. With your
example of viśeṣalakṣaṇa, the existing index contains this compound, and the
index with simple words would contain viśeṣa separated from lakṣaṇa (again,
if ever needed).
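
To make the two-index idea concrete, here is a minimal sketch in Python. It is not the code behind the search interface: the sample tokens, the hand-written SPLITS table (standing in for a word-splitting module that does not exist yet), and the names build_simple_index and suggest are all assumptions made for illustration.

```python
from bisect import bisect_left
from unicodedata import normalize


def nfc(text):
    """Normalize to NFC so composed and decomposed diacritics compare equal."""
    return normalize("NFC", text)


# Hypothetical sample of the existing index, which keeps compounds whole.
COMPOUND_INDEX = sorted(
    nfc(t) for t in ["lakṣaṇa", "viśeṣa", "viśeṣalakṣaṇa", "viśeṣaṇa"]
)

# Hand-written split table; an assumption standing in for a real splitter.
SPLITS = {nfc("viśeṣalakṣaṇa"): [nfc("viśeṣa"), nfc("lakṣaṇa")]}


def build_simple_index(tokens, splits):
    """Return the sorted set of simple words, splitting known compounds."""
    words = set()
    for token in tokens:
        words.update(splits.get(token, [token]))
    return sorted(words)


def suggest(index, prefix):
    """Return every entry of a sorted index that starts with the prefix."""
    prefix = nfc(prefix)
    hits = []
    # In a sorted list, all entries sharing a prefix are contiguous and
    # begin at the insertion point of the prefix itself.
    for entry in index[bisect_left(index, prefix):]:
        if not entry.startswith(prefix):
            break
        hits.append(entry)
    return hits


SIMPLE_INDEX = build_simple_index(COMPOUND_INDEX, SPLITS)
```

With this sample, suggest(COMPOUND_INDEX, "viśeṣal") finds the compound, while the same prefix returns nothing from SIMPLE_INDEX, which is why the existing index stays and a simple-word index would only be an addition.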
Claudius
On Wed, 7 May 2025 at 19:55, Dan Lusthaus <yogacara at gmail.com> wrote:
> Dear Claudius,
>
> While splitting words is worthwhile, there is value in indexing
> “aggregated” words as well. For instance, I recently used your search
> engine to find hits for viśeṣalakṣaṇa. While typing viśeṣa lots of
> suggested hits and “aggregated” forms appeared, but it was not until I
> typed the “l” beginning “lakṣaṇa” that that particular compound appeared as
> one of the suggestions, which I could then click. So an index that only
> deals with isolated words would make the search for a term like
> viśeṣalakṣaṇa much more complicated and time-consuming. There would need to
> be an option for searching for such aggregated terms without having to
> trawl through multiple examples (as, e.g., the Digital Corpus of Sanskrit
> provides).
>
> Just my two cents.
>
> Thanks for the wonderful and very useful tool.
>
> Best wishes,
> Dan Lusthaus
>
> On May 7, 2025, at 12:32 PM, Claudius Teodorescu via INDOLOGY <
> indology at list.indology.info> wrote:
>
> Dear Arlo,
>
> Thank you for your message.
>
> Yes, it is possible to add new texts to the corpus. I would also like to
> find a reliable module for splitting, during indexing, the words that are
> written without spaces between them. As of now, the full-text index
> contains 6.3 million words, as it also includes such aggregated words.
> With such splitting, the index would be smaller.
>
> The current approach for indexing texts within GRETIL also allows for
> indexing texts that are in plain text (TXT) format or even HTML format, if
> there are no resources for converting them to XML (TEI).
>
> So, let us start adding new texts! :)
>
> Best wishes,
> Claudius
>
> On Wed, 7 May 2025 at 17:52, Arlo Griffiths <arlogriffiths at hotmail.com>
> wrote:
>
>> Dear Claudius,
>>
>> Thanks a lot for this initiative. Allow me to ask if it is also possible
>> to resume absorbing texts into the same corpus?
>>
>> Now that its Göttingen host no longer seems to be interested in curating
>> it, why not store all files on github or gitlab and initiate a collective
>> INDOLOGY endeavor toward curating (txt > xml conversion) and expanding the
>> corpus?
>>
>> I write these words without having a full understanding of everything
>> that would be required, but I'd certainly be interested in contributing.
>>
>> Best wishes,
>>
>> Arlo Griffiths
>> EFEO
>>
>> ------------------------------
>> *From:* INDOLOGY <indology-bounces at list.indology.info> on behalf of
>> Claudius Teodorescu via INDOLOGY <indology at list.indology.info>
>> *Sent:* Tuesday, April 22, 2025 9:00 AM
>> *To:* Indology <indology at list.indology.info>
>> *Subject:* [INDOLOGY] Search interface for the GRETIL Corpus
>>
>> Dear all,
>>
>> Over the last few months, I managed to set up a search interface for the
>> texts of the GRETIL Corpus, located at [1]. The interface is published as
>> a static website, with a static full-text index and a static search
>> engine, which executes the search in the browser, without the need for a
>> server.
>>
>> In order to convert the files to HTML format, which is used to display
>> them in the search interface, I had to make some small updates to the XML
>> files of the corpus. These changes are documented in [2]. As one expects,
>> there is still work to be done with the XML files of the corpus.
>>
>> Please let me know if you find any bugs with the search interface.
>>
>> Best regards,
>> Claudius Teodorescu
>>
>> [1] https://claudius-teodorescu.gitlab.io/gretil-corpus-site/
>> [2] https://gitlab.com/claudius-teodorescu/gretil-corpus-data
>>
>
>
> --
> Cu stimă,
> Claudius Teodorescu
>
--
Cu stimă,
Claudius Teodorescu