AW: [INDOLOGY] OCR for Romanized Sanskrit with Diacritics

Wed May 19 08:59:55 UTC 2010

On 18 May 2010 18:32, Kellner, Birgit <kellner at asia-europe.uni-heidelberg.de
> wrote:

>
> I am wondering whether Acrobat recognizes diacritics like ṇ, ṭ, ś or ṣ and
> properly selects the fitting Unicode letters. I've never tried - does it,
> Dominik?
>
> Best,
>
> Birgit
>

I did some simple tests this morning, and I was startled at how bad the
results were.  I scanned a page of a Brill book on indology at 300dpi.  I
then ran the resulting jpeg through the ORC of Acrobat 9, using both "exact
image" and "Clearscan".  (The latter creates vector fonts on the fly,
matching the look of the fonts in the document. Very clever.)

After selecting and copying all the text from the resulting PDFs, and
examining them in a plain-text editor (UTF8-aware), the results were
dreadful.  Many, many errors, and certainly no diacritcal marks.

So my earlier impression that current off-the-shelf OCR was mature in
recognising accented characters was completely wrong.  And I was rather
shocked at how bad the OCR was even for non-accented text.

Acrobat 9 is quite a complicated product, and it is possible that there are
settings I am not aware of that could improve the OCR.  I had a quick search
through the Preferences to see if one could set character sets for the OCR,
but I couldn't find anything relevant.  The basic OCR menu contains a single
language setting, which I had set appropriately.

I think my "residual memory" of OCR being good on diacritics was a
mis-remembering based on  the scenario that when I copy and paste text from
the PDFs produced by my TeX system, as I occasionally do, all the diacritics
are correct, and properly coded in UTF8.  I'm using XeLaTeX (with the
xunicode package).

Best,
Dominik