[INDOLOGY] Search interface for the GRETIL Corpus
Claudius Teodorescu
claudius.teodorescu at gmail.com
Wed May 7 16:32:57 UTC 2025
Dear Arlo,
Thank you for your message.
Yes, it is possible to add new texts to the corpus. I would also like to
find a reliable module for splitting, during indexing, words that are
written without spaces between them. At present the full-text index
contains 6.3 million words, a count that includes such aggregated words;
splitting them would make the index smaller.
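To illustrate the kind of splitting meant here, below is a minimal sketch in Python, not the module being sought. It assumes a word list (`lexicon`, a hypothetical name) is available and looks for the split with the fewest pieces by dynamic programming. It ignores sandhi entirely, so it only handles words that are simply written together unchanged.

```python
from typing import List, Optional, Set


def split_compound(token: str, lexicon: Set[str]) -> Optional[List[str]]:
    """Split `token` into the fewest lexicon words; return None if impossible.

    Assumption: `lexicon` is a hypothetical set of known word forms, and
    both `token` and the lexicon use the same Unicode normalisation.
    """
    n = len(token)
    # best[i] holds the shortest known split of token[:i], or None.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prefix = best[start]
            piece = token[start:end]
            if prefix is None or piece not in lexicon:
                continue
            candidate = prefix + [piece]
            current = best[end]
            if current is None or len(candidate) < len(current):
                best[end] = candidate
    return best[n]


if __name__ == "__main__":
    words = {"dharma", "kṣetre", "kuru", "kurukṣetre"}
    print(split_compound("dharmakṣetre", words))
    print(split_compound("kurukṣetre", words))
```

With such a step, the index would store the pieces rather than one entry per aggregated word, which is why it could become smaller. A token that cannot be split is returned as None and could simply be indexed as it is.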
The current indexing approach for GRETIL can also handle texts in plain
text (TXT) or even HTML format, when there are no resources for
converting them to XML (TEI).
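As a rough illustration of indexing TXT and HTML sources side by side, here is a minimal Python sketch using only the standard library. It is not the actual GRETIL indexing code: the function names, the `documents` layout and the simple word-to-documents index are all assumptions made for the example.

```python
import re
import unicodedata
from collections import defaultdict
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def extract_text(content: str, fmt: str) -> str:
    """Return the indexable text of a document in 'txt' or 'html' format."""
    if fmt == "txt":
        return content
    if fmt == "html":
        parser = _TextExtractor()
        parser.feed(content)
        parser.close()
        # Joining with a space keeps adjacent elements apart, at the cost
        # of splitting a word that is interrupted by inline markup.
        return " ".join(parser.chunks)
    raise ValueError(f"unsupported format: {fmt}")


def build_index(documents):
    """Map each lower-cased word to the set of document ids containing it.

    Assumption: `documents` maps a document id to a (content, format) pair.
    """
    index = defaultdict(set)
    for doc_id, (content, fmt) in documents.items():
        # NFC keeps letters with diacritics as single characters for \w.
        text = unicodedata.normalize("NFC", extract_text(content, fmt))
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index
```

For example, `build_index({"a.txt": ("dharma artha", "txt"), "b.html": ("<p>Dharma <b>kāma</b></p>", "html")})` would list both files under "dharma" and only the HTML file under "kāma".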
So, let us start adding new texts! :)
Best wishes,
Claudius
On Wed, 7 May 2025 at 17:52, Arlo Griffiths <arlogriffiths at hotmail.com>
wrote:
> Dear Claudius,
>
> Thanks a lot for this initiative. Allow me to ask if it is also possible
> to resume absorbing texts into the same corpus?
>
> Now that its Göttingen host no longer seems to be interested in curating
> it, why not store all files on github or gitlab and initiate a collective
> INDOLOGY endeavor toward curating (txt > xml conversion) and expanding the
> corpus?
>
> I write these words without having a full understanding of everything that
> would be required, but I'd certainly be interested in contributing.
>
> Best wishes,
>
> Arlo Griffiths
> EFEO
>
> ------------------------------
> *From:* INDOLOGY <indology-bounces at list.indology.info> on behalf of
> Claudius Teodorescu via INDOLOGY <indology at list.indology.info>
> *Sent:* Tuesday, April 22, 2025 9:00 AM
> *To:* Indology <indology at list.indology.info>
> *Subject:* [INDOLOGY] Search interface for the GRETIL Corpus
>
> Dear all,
>
> Over the last few months, I have managed to set up a search interface for
> the texts of the GRETIL Corpus, located at [1]. The interface is published
> as a static website, with a static full-text index and a static search
> engine, which executes the search in the browser, without the need for a
> server.
>
> In order to convert the files to HTML format, which is used to display
> them in the search interface, I had to make some small updates to the XML
> files of the corpus. These changes are documented in [2]. As one would
> expect, there is still work to be done on the XML files of the corpus.
>
> Please let me know if you find any bugs with the search interface.
>
> Best regards,
> Claudius Teodorescu
>
> [1] https://claudius-teodorescu.gitlab.io/gretil-corpus-site/
> [2] https://gitlab.com/claudius-teodorescu/gretil-corpus-data
>
--
With regards,
Claudius Teodorescu