A query on OCR for Sanskrit or Tamil
Jürgen Neuss
juergen.neuss at FU-BERLIN.DE
Thu Sep 29 09:04:40 UTC 2011
dear palaniappan,
i have some experience in ocr for nagari. for a long time the only way to
read nagari texts was to use finereader (www.abbyy.com) and train it
on nagari characters, which was a tedious procedure. as finereader was
developed for latin characters (or, at least, for characters which are
separated by spaces [and not sharing a rekhA]), the result needed thorough
further correction. the superscript mAtrAs e, ai, o, au were also problematic,
as they are ambiguous for ocr engines. still, finereader yielded the best
results compared to other ocr software, at least in my
experience. but it can be used only for one script at a time and not for
mixed scripts, i.e. you first have to train and recognize nagari, then
tamil, and so on, and finally merge the results manually.
now, however, there is a far better specialized program for devanagari,
which was developed by a colleague, Dr. Oliver Hellwig (www.indsenz.com).
There are two versions, one for Sanskrit and one for Hindi, which include
an automatic proof-reading function.
I have worked a lot with both versions and the program reads texts with
amazing accuracy. If the images you are processing are of good quality,
the accuracy can be more than 99%. Text output is nagari in Unicode (UTF-8),
and the document comes in .rtf format, which can be further
processed with the word processor of your choice.
I hope this helps.
best regards
Dr. Jürgen Neuß
Freie Universität Berlin
Institut für die Sprachen und Kulturen Südasiens
Königin-Luise-Str. 34 a
D-14195 Berlin
juergen.neuss at fu-berlin.de