A misconception regarding the PDF format (Re: Text processing in Unicode
jean-luc.chevillard at UNIV-PARIS-DIDEROT.FR
Fri Mar 26 03:42:57 EDT 2010
Dear S. Palaniappan,
I don't think it is really appropriate
for you to begin to export to this Indological forum
the violent infighting which exists inside the Tamil community,
in which a minority strongly opposes the use of Unicode,
and wants to promote an ad hoc encoding,
in the Private User Area,
which will bring the Tamil community back to the Stone Age of
Why do you not limit your search to HTML encoded pages?
The PDF format was not primarily designed as a text storage format,
and it has never been guaranteed that round-trip conversion is always possible
between PDF files and text files.
PDF files are "E-paper".
You might as well ask for the possibility of having round-trip conversions
between .DVI files and text files.
The reason why many pages on Tamil government sites
clumsily use PDF rather than HTML
is that a powerful lobby has been preventing rational behaviour,
sometimes claiming that the Unicode consortium
does not recognize the linguistic specificities of "Dravidian"
and sometimes accusing the Unicode Consortium
of being the new "East India Company".
The fact that pasting from a PDF file to a plain text file does not work
can certainly NOT be described as a "major" defect of Unicode,
as some uninformed people have kept repeating
(and as a reason for the Tamil government not to use Unicode).
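
To make the point concrete, here is a minimal Python sketch. It assumes (I have not inspected the PDF in question) that the PDF text layer stores the glyphs in visual order, with the left-hand vowel sign placed before the consonant, which is what the pasted sample quoted below suggests. The names TAN, TON, TON_VISUAL and found_in_plain_text are my own, chosen only for illustration.

```python
import unicodedata

# Logical-order Unicode, as stored in a plain text or HTML file.
TAN = "\u0BA4\u0BBE\u0BA9\u0BCD"  # தான்: TA, vowel sign AA, NNNA, virama
TON = "\u0BA4\u0BCB\u0BA9\u0BCD"  # தோன்: TA, vowel sign OO, NNNA, virama

def found_in_plain_text(needle: str, haystack: str) -> bool:
    """Substring search after NFC normalization of both strings."""
    n = unicodedata.normalize("NFC", needle)
    h = unicodedata.normalize("NFC", haystack)
    return n in h

# Assumption: a PDF extractor that emits glyphs in visual order turns
# vowel sign OO into "sign EE, consonant, sign AA" around the consonant.
TON_VISUAL = "\u0BC7\u0BA4\u0BBE\u0BA9\u0BCD"

# In properly encoded Unicode text the two words do not match each other.
print(found_in_plain_text(TAN, TON))

# In the visual-order glyph stream, the code points of TAN appear
# verbatim inside the scrambled form of TON, so a naive search hits it.
print(TAN in TON_VISUAL)
```

If that assumption holds, the spurious hit comes from the way the PDF stores glyphs, not from the Unicode encoding of Tamil: the same search on a plain text or HTML version of the text would not confuse the two words.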
Sorry for being so blunt in my statements
but I have seen for several months
hundreds of misleading paranoid statements
in several Tamil mailing lists
repeated AD NAUSEAM
and the idea that this is all going to start here in this academic list
is very unpleasant to say the least.
That Unicode has been invented is certainly a MIRACLE,
which could not be predicted 30 years ago.
If the Tamil community wanted to use their political clout in a useful way,
they could lobby for the Grantha script to be quickly implemented inside
the Unicode standard,
in order to make the miracle even more of a miracle.
That would allow for the easy reprint of all the Vaishnava Manipravalam
That would be useful indeed.
Have a nice day!
-- Jean-Luc Chevillard
Le 3/26/2010 4:11 AM, Sudalaimuthu Palaniappan a écrit :
> Dear Indologists,
> I am seeing some problems in text processing in Tamil texts created using Unicode fonts.
> Consider the following text in Project Madurai.
> According to the cover page, "This pdf file is based on Unicode with corresponding Latha font embedded in the file. Hence this file can be viewed and printed on all computer platforms: Windows, Macintosh and Unix without the need to have the font installed in your computer."
> When I searched the text for the string தான் (tAn2), I hit not only தான் but also தோன் (tOn2) !
> Has anyone processing (searching, sorting) Unicode texts in Sanskrit or other Indian languages encountered any problems like the above?
> (Needless to say, when one copies the text from PDF and pastes in email, one gets messed up text like this. நாககம் இல்லாத மிகப் பழங் காலத்தில் மனிதர்கள் வடுீ கட்டத் ெதயாமல் குைககளில் வாழ்ந்தார்களாம். அந்தப் பழஙகாலத்ைதக் கற்காலம் என்று ெசால்லுகிேறாம்.)
> (However, a draft report by an Expert Committee on Technology Standards for Indian Languages
> (http://egovstandards.gov.in/apex-review/egscontent.2009-06-10.5999916108/at_download/file) claims:
> All major operating systems, browsers, editors, word processors and other applications & tools are supporting Unicode.
> It is possible to use Indian languages and scripts in the Unicode environment, which will resolve the compatibility issue.
> The documents created using Unicode may be searched very easily on the web.
> As Unicode is widely recognized all over the world and also supporting Indian languages, it will ease Localization applications including e-Governance application
> for all the constitutionally recognized Indian languages.
> Since Indian languages are also used in the other part of the world, it is possible to have Global data exchange.)
> Thanks in advance.