Text processing in Unicode

Fri Mar 26 04:40:51 UTC 2010

This may also happen based on the specific operating system. I use an  
old OS X (10.4)-based Mac PowerBook G4.

I copied a small paragraph and have pasted it below. It seems to have  
turned out fine (except for a couple of minor glitches -- with  
respect to the letters r and ai). The latest OS may have resolved  
this problem:

===============
நன்றியுைர
ெசன்ைன வாெனாலி  
நிைலயத்தான்  
ஏற்பாட்டின்படி  
"ெசால்வன்ைம"
என்னும் ெபாருள்பற்றிப்  
பள்ளி மாணவரக்காக, ௧௯௫௨  
ஜூைல௧௪, ௨௮,
ஆகஸ்ட் 11,25, ெசப்ெடம்பர் 15  
ஆகிய ஐந்து நாட்களில்  
ஐந்து தைலப்பில்
வாெனாலியில் ேபச ே 
நர்ந்தது. ேபசியவற்ைற  
நூல்வடிவில் ெவளியிட
வாெனாலி நிைலயத்தார்  
அனுமதி தந்தனர்.
அவர்கட்கு நன்றி  
கூறுகின்ேறன்.

==================
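
The glitches in the pasted paragraph above follow a pattern: the vowel
signs ெ, ே and ை appear before the consonant they belong to (for example
ெசன்ைன instead of சென்னை). That is the order in which the glyphs are
drawn, not the order Unicode expects. A minimal Python sketch of putting
such text back into logical order follows. It assumes the input is
entirely in visual order; the function name visual_to_logical is
hypothetical, and running it on text that is already correct would
scramble it.

```python
import re

# Vowel signs drawn to the LEFT of the consonant, which must FOLLOW the
# consonant in Unicode logical order: U+0BC6 (e), U+0BC7 (ee), U+0BC8 (ai).
PREFIX_SIGNS = "\u0BC6\u0BC7\u0BC8"

# Tamil consonants occupy U+0B95 (KA) through U+0BB9 (HA).
CONSONANT = "[\u0B95-\u0BB9]"

# Two-part vowel signs: a left part plus a right part around the consonant
# correspond to a single code point (Unicode canonical decompositions).
TWO_PART = {
    ("\u0BC6", "\u0BBE"): "\u0BCA",  # e  + aa             -> o
    ("\u0BC7", "\u0BBE"): "\u0BCB",  # ee + aa             -> oo
    ("\u0BC6", "\u0BD7"): "\u0BCC",  # e  + au length mark -> au
}

# prefix sign, consonant, optional right-hand part (aa or au length mark)
_PATTERN = re.compile(f"([{PREFIX_SIGNS}])({CONSONANT})([\u0BBE\u0BD7]?)")


def visual_to_logical(text):
    """Move left-side vowel signs behind their consonant.

    Assumption: `text` is wholly in visual (glyph) order. Text already in
    logical order must not be passed in, because it would be reordered too.
    """
    def fix(match):
        sign, consonant, tail = match.groups()
        if tail:
            combined = TWO_PART.get((sign, tail))
            if combined:
                return consonant + combined
            # Unknown combination: keep both signs, after the consonant.
            return consonant + sign + tail
        return consonant + sign

    return _PATTERN.sub(fix, text)


# Two words from the pasted paragraph, written as escapes so the visual
# order of the code points is explicit.
print(visual_to_logical("\u0BC6\u0B9A\u0BA9\u0BCD\u0BC8\u0BA9"))
print(visual_to_logical("\u0BB5\u0BBE\u0BC6\u0BA9\u0BBE\u0BB2\u0BBF"))
```

The sketch does not handle the case where the right-hand part of au is
extracted as the letter ள (U+0BB3) instead of the length mark U+0BD7,
since that cannot be told apart from a real ள without a dictionary.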

As for searching, typing into the search field may pose a problem,
since the keyboard layout may differ and the character codes it
produces may not match those stored in the file. The most reliable
way is to copy an instance of the desired item from the text, paste
it into the search field, and press the Enter/Return key.
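
One possible explanation for search results that look wrong, such as
the hit on தோன் when searching for தான் reported in the quoted message
below, is that the PDF stores its text in glyph order. This is an
assumption, not something verified against the file. Under that
assumption தோன் is stored as ே + த + ா + ன், which contains தான் as a
plain substring. The Python sketch below shows the effect and lists
code points, which is also one way to compare a typed query with a
pasted one. The helper name codepoints is hypothetical.

```python
import unicodedata


def codepoints(text):
    """Return one 'U+XXXX NAME' string per code point in `text`."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '?')}" for ch in text]


# The search string (tAn2): TA, VOWEL SIGN AA, NNNA, VIRAMA.
query = "\u0BA4\u0BBE\u0BA9\u0BCD"

# tOn2 in Unicode logical order: TA, VOWEL SIGN OO, NNNA, VIRAMA.
logical_ton = "\u0BA4\u0BCB\u0BA9\u0BCD"

# tOn2 as it might be stored in glyph order (assumed):
# VOWEL SIGN EE, TA, VOWEL SIGN AA, NNNA, VIRAMA.
visual_ton = "\u0BC7\u0BA4\u0BBE\u0BA9\u0BCD"

print(query in logical_ton)  # no false hit on well-formed text
print(query in visual_ton)   # false hit on glyph-ordered text

for line in codepoints(visual_ton):
    print(line)
```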

Best,
--vsr

On Mar 25, 2010, at 9:19 PM, Dipak Bhattacharya wrote:

> I can speak of two problems: the characters/signs are sometimes
> dispersed when saved in MS Word, and the signs on the keyboard are
> systematically replaced. The second problem arises from unintended
> bad programming during processing and can be removed by restarting
> the computer. The first problem arises only when some editing is
> attempted in the MS Word format, i.e. not in the Indian script
> processor. I avoid editing such files in MS Word format.
> Best
> DB
>
> --- On Fri, 26/3/10, Sudalaimuthu Palaniappan <palaniappa at AOL.COM>  
> wrote:
>
>
> From: Sudalaimuthu Palaniappan <palaniappa at AOL.COM>
> Subject: Text processing in Unicode
> To: INDOLOGY at liverpool.ac.uk
> Date: Friday, 26 March, 2010, 8:41 AM
>
>
> Dear Indologists,
>
> I am seeing some problems in text processing in Tamil texts created  
> using Unicode fonts.
> Consider the following text in Project Madurai.
> http://www.projectmadurai.org/pm_etexts/pdf/pm0323.pdf
>
> According to the cover page, "This pdf file is based on Unicode  
> with corresponding Latha font embedded in the file. Hence this file  
> can be viewed and printed on all computer platforms: Windows,  
> Macintosh and Unix without the need to have the font installed in  
> your computer."
>
> When I searched the text for the string தான் (tAn2), I hit  
> not only தான் but also தோன் (tOn2) !
>
> Has anyone processing (searching, sorting) Unicode texts in  
> Sanskrit or other Indian languages encountered any problems like  
> the above?
>
> (Needless to say, when one copies the text from PDF and pastes in  
> email, one gets messed up text like this. நாக􏰀கம்  
> இல்லாத மிகப் பழங்  
> காலத்தில் மனிதர்கள்  
> வடுீ    கட்டத் ெத􏰀யாமல்  
> குைககளில்  
> வாழ்ந்தார்களாம். அந்தப்  
> பழஙகாலத்ைதக் கற்காலம்  
> என்று ெசால்லுகிேறாம்.)
>
> (However, a draft report by an Expert Committee on Technology  
> Standards for Indian Languages
> (http://egovstandards.gov.in/apex-review/egscontent.2009-06-10.5999916108/at_download/file)
> claims:
> - All major operating systems, browsers, editors, word processors
>   and other applications & tools are supporting Unicode.
> - It is possible to use Indian languages and scripts in the
>   Unicode environment, which will resolve the compatibility issue.
> - The documents created using Unicode may be searched very easily
>   on the web.
> - As Unicode is widely recognized all over the world and also
>   supporting Indian languages, it will ease Localization applications
>   including e-Governance application
>   for all the constitutionally recognized Indian languages.
> - Since Indian languages are also used in the other part of the
>   world, it is possible to have Global data exchange.)
>
> Thanks in advance.
>
> Regards,
> Palaniappan
>