Re: [INDOLOGY] OCR for Romanized Sanskrit with Diacritics

Kellner, Birgit kellner at ASIA-EUROPE.UNI-HEIDELBERG.DE
Tue May 18 16:32:50 UTC 2010


ABBYY FineReader, though not cheap, is the best product I know (http://www.abbyy.com/). I use it regularly to produce searchable PDFs from scanned secondary literature, with the recognized text placed beneath the page image (this can also be done with Acrobat, but ABBYY is more accurate).

It needs to be trained, though, to recognize romanized Sanskrit, and one probably has to define different training patterns depending on the typeface of the original (older books with "typewriter diacriticals" are a nightmare). But ABBYY places no limit on how much you can train it, whereas other products that come bundled with scanners sometimes let you store only a certain number of custom characters in a training file (at least when I last checked).

I am wondering whether Acrobat recognizes diacritics like ṇ, ṭ, ś or ṣ and properly selects the fitting Unicode letters. I've never tried - does it, Dominik? 
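One way to answer this for any OCR package is to copy a few words out of the resulting PDF and look at the code points that actually arrive. Below is a minimal Python sketch using only the standard `unicodedata` module. The function names `describe` and `normalize_ocr` are hypothetical helpers made up for illustration, and the sketch assumes the OCR output has already been pasted or exported as a plain text string; it says nothing about what Acrobat or ABBYY really emit.

```python
import unicodedata

def describe(text):
    """List every non-ASCII character with its code point and Unicode name."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]

def normalize_ocr(text):
    """Compose 'base letter + combining mark' pairs into single code points (NFC)."""
    return unicodedata.normalize("NFC", text)

if __name__ == "__main__":
    # Hypothetical sample: 'n' followed by COMBINING DOT BELOW (U+0323),
    # one form in which an OCR text layer might store the letter.
    sample = "kr\u0325s\u0323n\u0323a"
    for row in describe(normalize_ocr(sample)):
        print(row)
```

If the listing shows names such as LATIN SMALL LETTER N WITH DOT BELOW, the package has chosen proper Unicode letters. Names belonging to unrelated characters, or plain ASCII letters where a diacritic was expected, would suggest it has not.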

Best, 

Birgit
________________________________________
From: Indology [INDOLOGY at liverpool.ac.uk] on behalf of Alexander von Rospatt [rospatt at BERKELEY.EDU]
Sent: Tuesday, 18 May 2010 18:14
To: INDOLOGY at liverpool.ac.uk
Subject: [INDOLOGY] OCR for Romanized Sanskrit with Diacritics

Dear Computer-Literati,

I have been in contact with Dominik Wujastyk regarding the application of OCR to romanized Sanskrit.

Dominik responded:

Several software packages will do that quite well, even Acrobat 9.  It's critical that the exemplar is good and that the scan is not at too low a resolution.  300dpi minimum, 400dpi+ better.  ...
If you choose one of the better contemporary OCR packages, and really learn how to use it, I believe you can get good results even for romanized Sanskrit.  The advent of Unicode has changed everything, and many software packages are now more or less obliged to be strongly multilingual and recognise a wide range of diacritical marks...
Acrobat is the only one with ClearScan font technology, I believe, which is very good if you can use it.

I wonder about others' experiences in using OCR for this purpose. Which programs are most user-friendly, and which programs did you have the best results with?


Many thanks,

Alex Rospatt




----------------
Alexander von Rospatt, Professor and Chair
Department of South and Southeast Asian Studies
Head Graduate Adviser of the Group in Buddhist Studies
University of California
7233 Dwinelle Hall # 2540
Berkeley, CA 94720-2540
USA

Phone: +1-510-6421610
Fax: +1-510-6432959
Email:  rospatt at berkeley.edu




