AW: [INDOLOGY] OCR for Romanized Sanskrit with Diacritics

Paul G. Hackett ph2046 at COLUMBIA.EDU
Wed May 19 11:32:31 UTC 2010


At 6:32 PM +0200 5/18/10, Kellner, Birgit wrote:
>
>ABBYY finereader, though not cheap, is the best product I know 
>(http://www.abbyy.com/). I use it regularly to produce searchable 
>PDFs from scanned secondary literature, with the text underlying the 
>image (this can also be done with Acrobat, but ABBYY is more 
>accurate).

I would agree with Birgit on this point.  I have had great success 
with ABBYY working with diacritic characters and, most recently, with 
Devanagari -- see:

http://www.columbia.edu/~ph2046/RnD/Hackett/SktComp.html

Eventually, it is my hope to make both my diacritic & Devanagari 
recognition files for ABBYY freely available for others to use.

Also, at 10:59 AM +0200 5/19/10, Dominik Wujastyk wrote:
>I did some simple tests this morning, and I was startled at how bad the
>results were.  I scanned a page of a Brill book on indology at 300dpi.
...
>After selecting and copying all the text from the resulting PDFs, and
>examining them in a plain-text editor (UTF8-aware), the results were
>dreadful.  Many, many errors, and certainly no diacritical marks.

Part of the problem with the poor results that you experienced is 
certainly due to the fact that you were working with 300 dpi scans. 
This is too low a resolution for OCR.  You need a minimum of 400 
dpi for decent OCR accuracy.
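
For anyone who wants to check their own scans and results, here is a 
rough Python sketch of two quick tests: the effective resolution of a 
scan, and whether any diacritics survived in the extracted text.  The 
function names and the example page size are my own illustrations, 
and the 400 dpi threshold is only the rule of thumb given above, not 
a published standard.

```python
import unicodedata

# Rule-of-thumb minimum from the message above (an assumption, not a standard).
MIN_OCR_DPI = 400


def effective_dpi(pixel_width: int, page_width_inches: float) -> float:
    """Return the scan resolution implied by image width and physical page width."""
    if page_width_inches <= 0:
        raise ValueError("page width must be positive")
    return pixel_width / page_width_inches


def count_combining_marks(text: str) -> int:
    """Count diacritical marks in text after canonical decomposition (NFD).

    A result of 0 for romanized Sanskrit suggests the diacritics were lost.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return sum(1 for ch in decomposed if unicodedata.combining(ch))


if __name__ == "__main__":
    # Hypothetical example: a page 8 inches wide scanned to 2400 pixels across.
    dpi = effective_dpi(2400, 8.0)
    print(f"effective resolution: {dpi:.0f} dpi, sufficient: {dpi >= MIN_OCR_DPI}")

    # Text copied out of the PDF would go here.
    print("diacritical marks found:", count_combining_marks("k\u1e5b\u1e63\u1e47a"))
```

The second check decomposes the text first so that precomposed and 
decomposed encodings of the same letter are counted alike.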

Best,

Paul Hackett
Columbia University




