A query on OCR for Sanskrit or Tamil

Jürgen Neuss juergen.neuss at FU-BERLIN.DE
Thu Sep 29 09:04:40 UTC 2011

Dear Palaniappan,

I have some experience with OCR for Nagari. For a long time, the only way
to read Nagari texts was to use FineReader (www.abbyy.com) and train it on
Nagari characters, which was a tedious procedure. As FineReader was
developed for Latin characters (or, at least, for characters which are
separated by spaces [not sharing a rekhA]), the result needed thorough
further correction. The superscript mAtrAs e, ai, o, au were also
problematic, as they are ambiguous for OCR engines. Still, FineReader
yielded the best results compared to other OCR software, at least in my
experience. But it can be used for only one script at a time and not for
mixed script, i.e. you first have to train and recognize Nagari, then
Tamil and so on, and finally merge the results manually.

Now, however, there is a far better specialized program for Devanagari,
which was developed by a colleague, Dr. Oliver Hellwig (www.indsenz.com).
There are two versions, one for Sanskrit and one for Hindi, which include
an automatic proof-reading function.
I have worked a lot with both versions, and the program reads texts with
amazing accuracy. If the pictures you are processing are of good quality,
the accuracy can be more than 99%. Text output is Nagari in Unicode
UTF-8, and the document comes in .rtf format, which can be further
processed with the word processor of your choice.

I hope this helps.

Best regards

Dr. Jürgen Neuß
Freie Universität Berlin
Institut für die Sprachen und Kulturen Südasiens
Königin-Luise-Str. 34 a
D-14195 Berlin
juergen.neuss at fu-berlin.de
