(Xe)(La)TeX, (L)Edmac, and Velthuis query

Lubin, Tim lubint at WLU.EDU
Sat Jul 3 11:24:38 UTC 2010

Dear Dominik (and anyone else who might be able to help):

Your message to Indology (below) made me wonder about a couple of things.  I am still in the process (!) of trying to set a short Sanskrit critical edition using plain TeX with Edmac and the Velthuis font package.  It is the publisher's desire to stick with Velthuis, which (last time I checked) works only with plain TeX.  But everywhere I look, when I run into a difficulty, TeX-savvy people are scoffing at the very notion of using plain TeX, yet also still saying that various portings of Edmac to LaTeX are iffy and imperfect (though I myself succeeded in using Ledmac for an edition set in transliteration).

In your message, you say you are using XeLaTeX with UTF-8, but also Edmac and Velthuis -- something I would like to do as well, but am behind in latest developments.  

Are you using some newer form of Edmac or Ledmac for this?  

Is there a new, improved Velthuis, or Velthuis-look-alike that can work with Ledmac?  

And for godsake has Velthuis ever "learned" how to print a conjunct for /chra/ or /.dra/ ??  Even Word can do that now!

From: Indology [INDOLOGY at liverpool.ac.uk] On Behalf Of Dominik Wujastyk [wujastyk at GMAIL.COM]
Sent: Wednesday, May 19, 2010 4:59 AM
To: INDOLOGY at liverpool.ac.uk
Subject: Re: [INDOLOGY] AW: [INDOLOGY] OCR for Romanized Sanskrit with Diacritics

On 18 May 2010 18:32, Kellner, Birgit <kellner at asia-europe.uni-heidelberg.de
> wrote:

> I am wondering whether Acrobat recognizes diacritics like ṇ, ṭ, ś or ṣ and
> properly selects the fitting Unicode letters. I've never tried - does it,
> Dominik?
> Best,
> Birgit

I did some simple tests this morning, and I was startled at how bad the
results were.  I scanned a page of a Brill book on indology at 300dpi.  I
then ran the resulting jpeg through the ORC of Acrobat 9, using both "exact
image" and "Clearscan".  (The latter creates vector fonts on the fly,
matching the look of the fonts in the document. Very clever.)

After selecting and copying all the text from the resulting PDFs, and
examining them in a plain-text editor (UTF8-aware), the results were
dreadful.  Many, many errors, and certainly no diacritcal marks.

So my earlier impression that current off-the-shelf OCR was mature in
recognising accented characters was completely wrong.  And I was rather
shocked at how bad the OCR was even for non-accented text.

Acrobat 9 is quite a complicated product, and it is possible that there are
settings I am not aware of that could improve the OCR.  I had a quick search
through the Preferences to see if one could set character sets for the OCR,
but I couldn't find anything relevant.  The basic OCR menu contains a single
language setting, which I had set appropriately.

I think my "residual memory" of OCR being good on diacritics was a
mis-remembering based on  the scenario that when I copy and paste text from
the PDFs produced by my TeX system, as I occasionally do, all the diacritics
are correct, and properly coded in UTF8.  I'm using XeLaTeX (with the
xunicode package).



More information about the INDOLOGY mailing list