Sanskrit OCR
Dominik Wujastyk
ucgadkw at ucl.ac.uk
Mon Jul 7 14:14:06 UTC 1997
OCR for manuscript Devanagari is many, many decades away, in my view, and
may never be economical.
For printed Devanagari:
Romanized text
==============
In about 1978 I used a Kurzweil Data Entry Machine 2000 (affectionately
known as "Kdem") to scan the romanized text of the Bhagavadgita
(Edgerton's edition). The results were excellent. I can't give you
actual error rates, but I remember the text was pretty acceptable. In
spite of the early date, the Kdem was a very sophisticated machine,
implementing powerful pattern-recognition algorithms. Basically, an
operator would sit at a VDU while the machine scanned the page; each time
an unknown char came up it would be flashed on the screen. The operator
would type the encoding on a keyboard, and with several iterations of this
process the machine would learn the font.
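For anyone curious, that training loop amounts to something like the
following: a minimal sketch in modern Python, purely by way of
illustration. The exact-match lookup is a trivial stand-in for the
Kdem's real pattern-recognition, and every name in it is my own, not
anything taken from the machine.

    def recognise_page(glyphs, ask_operator):
        """Read a page of glyph images, asking the operator about any
        glyph pattern that has not been seen before."""
        learned = {}                      # glyph pattern -> text encoding
        output = []
        for glyph in glyphs:
            if glyph not in learned:
                # Unknown character: flash it to the operator, who keys
                # in the encoding; it is remembered from then on.
                learned[glyph] = ask_operator(glyph)
            output.append(learned[glyph])
        return "".join(output)

    # A pretend operator who keys "ka", "ta", "ra" for three dummy glyphs.
    keyed = {"g1": "ka", "g2": "ta", "g3": "ra"}
    page = ["g1", "g2", "g1", "g3", "g2", "g1"]
    print(recognise_page(page, ask_operator=lambda g: keyed[g]))
    # -> "katakarataka"; the operator is consulted once per new glyph,
    #    which is how the machine "learns the font".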
I would imagine that the better OCR programs around today could manage
romanized Sanskrit pretty well, with training, as Lars confirms.
Devanagari
==========
The only published information I know on Dan Ingalls Jr.'s program is
section 4 of
Daniel H. H. Ingalls and Daniel H. H. Ingalls, Jr., "The Mahaabhaarata:
stylistic study, computer analysis, and concordance", Journal of South
Asian Literature, vol. 20, part 1, pp. 17-46.
DIJr gives a good account there of his algorithm. He implemented the
program in the Smalltalk language (of which he was one of the creators).
Until recently, few people had access to cross-platform compilers for
this language, so DIJr's work remained of mainly theoretical interest
for some years.
Then, about a decade ago, the Maharishi's organization got hold of the
Smalltalk code and reimplemented the algorithm in the language Object
Pascal, on the Mac platform. I saw this demonstrated at a meeting held by
Richard Lariviere at Austin in 1988 (I think it was). The program was fed
some Ramayana text, and did work its way through a page of Devanagari.
As I recall, the error rate was two or more errors per page, with only
about one error in two flagged as such. That is not a good enough rate
for large-scale work. The OCR was also very slow.
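To put that flagging rate in perspective, a back-of-the-envelope
calculation (the corpus size is an assumed figure of mine, not anything
from the demonstration):

    # Rough arithmetic only; all the inputs are assumed figures.
    errors_per_page = 2         # roughly what I recall from the demo
    flagged_fraction = 0.5      # only about one error in two was flagged
    pages = 1000                # an assumed, modest-sized corpus

    silent_errors = errors_per_page * (1 - flagged_fraction) * pages
    print(silent_errors)        # ~1000 unflagged errors, so every page
                                # still has to be proofread in full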
I'm interested to hear that Optopus could be taught Devanagari.
However, it still seems far more economical to enter printed Devanagari
text by hand than to scan it. One may expect this situation to change,
but perhaps not as fast as one might hope.
The Thesaurus Linguae Graecae project regularly re-evaluates the state of
OCR technology in the context of humanities texts, especially ancient
Greek. They still use double-keying in the Philippines, as far as I know,
finding this to be cheaper and more accurate than OCR. (See the links via
the Perseus Project on the INDOLOGY pages.)
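For those unfamiliar with double-keying: the same text is keyed twice by
independent operators, and the two transcripts are compared so that any
disagreement is flagged for human review. A rough sketch of the
comparison step, with made-up sample strings (the transliteration slip
in the second string is deliberate):

    import difflib

    def flag_disagreements(keyed_a, keyed_b):
        """Lines where two independently keyed transcripts disagree."""
        pairs = zip(keyed_a.splitlines(), keyed_b.splitlines())
        return [(n, a, b) for n, (a, b) in enumerate(pairs, 1) if a != b]

    first  = "dharmakSetre kurukSetre\nsamavetaa yuyutsavaH"
    second = "dharmakSetre kurukshetre\nsamavetaa yuyutsavaH"
    for n, a, b in flag_disagreements(first, second):
        print(f"line {n} needs review:")
        print("\n".join(difflib.ndiff([a], [b])))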
All the best,
Dominik
--
Dominik Wujastyk Wellcome Institute for the History of Medicine
email: d.wujastyk at ucl.ac.uk 183 Euston Road, London NW1 2BE, England
<URL: http://www.ucl.ac.uk/~ucgadkw/> FAX: 44 171 611 8545