[INDOLOGY] DLI (Digital Library of India)
michaelnm.meyer at gmail.com
Mon Dec 19 11:33:43 UTC 2022
Dear all, Dear M. Spier,
A few notes about your second point.
As evidenced by the menu on the top right, sanskritdictionary.com uses
Google's OCR engine, namely Tesseract (https://tesseract-ocr.github.io).
The software can be downloaded freely and is relatively straightforward to
use through tools like ocrmypdf (https://ocrmypdf.readthedocs.io). As far
as I know, no models are yet available for recognizing transliterated
Sanskrit, but there is one for Sanskrit written in devanāgarī.
Now, concerning the contents-based search feature, there is a caveat to
keep in mind: a large proportion of the PDF files originating from DLI
(both those available at http://dli.sanskritdictionary.com and those
available at https://archive.org/details/digitallibraryindia?tab=collection)
were originally passed to OCR with incorrect language or script parameters,
when they were passed to OCR at all. The extracted text is thus of low
quality, which leads to poor search results.
I have found it absolutely worth it to systematically download the books
from DLI I might be interested in, to pass them to OCR with the correct
language parameters (determined with a classification model), and to use a
local full-text search tool. By contrast, books from other major
collections on archive.org (e.g. those from the library of Toronto) are
generally well indexed and can be searched directly through archive.org's
Le dim. 18 déc. 2022 à 21:17, Harry Spier via INDOLOGY <
indology at list.indology.info> a écrit :
> 1) Does anyone happen to know the different old DLI different websites
> addresses . I'd like to check if Archive.org's wayback machine happened to
> back the DLI up from these websites. Some of these sites had alphabetical
> title lists if I recall correctly.
> 2)I've just seen this interesting note about an OCR search engine for the
> DLI project of Martin Gluckman
> The site http://dli.sanskritdictionary.com has complete mirror of Digital
> Library of India. It allows ̎deep OCR based searches̎ within documents
> activated by its own OCR engine <http://ocr.sanskritdictionary.com/>.
> Harry Spier
> INDOLOGY mailing list
> INDOLOGY at list.indology.info
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the INDOLOGY