Text processing in Unicode

Dipak Bhattacharya dbhattacharya2004 at YAHOO.CO.IN
Fri Mar 26 00:19:26 EDT 2010


I can speak of two problems: the characters/signs are sometimes dispersed when a file is saved in MS Word, and there is a systematic replacement of the signs on the keyboard. The second problem arises from unintended bad programming during processing and can be removed by restarting the computer. The first problem arises only when some editing is attempted in the MS Word format, i.e. not in the Indian script processor. I avoid editing such files in MS Word format.
Best
DB

--- On Fri, 26/3/10, Sudalaimuthu Palaniappan <palaniappa at AOL.COM> wrote:


From: Sudalaimuthu Palaniappan <palaniappa at AOL.COM>
Subject: Text processing in Unicode
To: INDOLOGY at liverpool.ac.uk
Date: Friday, 26 March, 2010, 8:41 AM


Dear Indologists,

I am seeing some problems in text processing in Tamil texts created using Unicode fonts.
Consider the following text in Project Madurai.
http://www.projectmadurai.org/pm_etexts/pdf/pm0323.pdf

According to the cover page, "This pdf file is based on Unicode with corresponding Latha font embedded in the file. Hence this file can be viewed and printed on all computer platforms: Windows, Macintosh and Unix without the need to have the font installed in your computer."

When I searched the text for the string தான் (tAn2), I hit not only தான் but also தோன் (tOn2)!

Has anyone processing (searching, sorting) Unicode texts in Sanskrit or other Indian languages encountered any problems like the above? 

(Needless to say, when one copies the text from PDF and pastes in email, one gets messed up text like this. நாக􏰀கம் இல்லாத மிகப் பழங் காலத்தில் மனிதர்கள் வடுீ    கட்டத் ெத􏰀யாமல் குைககளில் வாழ்ந்தார்களாம். அந்தப் பழஙகாலத்ைதக் கற்காலம் என்று ெசால்லுகிேறாம்.)
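
One possible explanation, offered as an assumption and not as a diagnosis of this particular PDF: the pasted text above shows left-side vowel signs (ெ, ே, ை) appearing before their consonants, which suggests the PDF text layer stores glyphs in visual order instead of Unicode logical order. If so, தோன் would be stored as ே + த + ா + ன + ், and that sequence contains தான் as a plain substring. The Python sketch below illustrates the effect with hand-built strings; the helper to_logical is hypothetical and only handles this simple reordering case.

```python
import re
import unicodedata

TA, NA = "\u0BA4", "\u0BA9"                   # letters த and ன
AA, EE, OO = "\u0BBE", "\u0BC7", "\u0BCB"     # vowel signs ா, ே, ோ
PULLI = "\u0BCD"                              # pulli (virama) ்

query = TA + AA + NA + PULLI                  # தான்
logical = TA + OO + NA + PULLI                # தோன் in Unicode logical order
visual = EE + TA + AA + NA + PULLI            # assumed PDF glyph order: ே த ா ன ்


def to_logical(text):
    """Hypothetical repair: move a left-side vowel sign (U+0BC6..U+0BC8)
    behind the consonant that follows it, then recompose with NFC so that
    ே + ா becomes the single two-part vowel sign ோ."""
    swapped = re.sub(r"([\u0BC6-\u0BC8])([\u0B95-\u0BB9])", r"\2\1", text)
    return unicodedata.normalize("NFC", swapped)


print(query in logical)             # False: a proper Unicode string does not match
print(query in visual)              # True: the visual-order string gives a false hit
print(to_logical(visual) == logical)  # True: reordering restores தோன்
```

If this assumption holds, the false hit comes from how the PDF text layer was produced, not from Unicode itself; searching the original Unicode text (for example an HTML or plain-text version) should not match தோன்.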

(However, a draft report by an Expert Committee on Technology Standards for Indian Languages
(http://egovstandards.gov.in/apex-review/egscontent.2009-06-10.5999916108/at_download/file) claims:
- All major operating systems, browsers, editors, word processors and other applications & tools are supporting Unicode.
- It is possible to use Indian languages and scripts in the Unicode environment, which will resolve the compatibility issue.
- The documents created using Unicode may be searched very easily on the web.
- As Unicode is widely recognized all over the world and also supporting Indian languages, it will ease Localization applications including e-Governance application for all the constitutionally recognized Indian languages.
- Since Indian languages are also used in the other part of the world, it is possible to have Global data exchange.)

Thanks in advance.

Regards,
Palaniappan






More information about the INDOLOGY mailing list