Text processing in Unicode

Fri Mar 26 03:11:28 UTC 2010

Dear Indologists,

I am seeing some problems when processing Tamil texts created with Unicode fonts.
Consider the following text from Project Madurai:
http://www.projectmadurai.org/pm_etexts/pdf/pm0323.pdf

According to the cover page, "This pdf file is based on Unicode with corresponding Latha font embedded in the file. Hence this file can be viewed and printed on all computer platforms: Windows, Macintosh and Unix without the need to have the font installed in your computer."

When I searched the text for the string தான் (tAn2), I got hits not only for தான் but also for தோன் (tOn2)!
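
One possible explanation, which I have not verified against this particular file, is that the PDF stores its glyphs in visual (display) order rather than in Unicode logical order. The two-part vowel sign in tOn2 is drawn as one piece before the consonant and another piece after it, so a visual-order stream would contain the code points of tAn2 as a contiguous run. The Python sketch below illustrates this under that assumption; the string VISUAL is my guess at what such a PDF text layer would hold, not something extracted from the file.

```python
import unicodedata

# Search string tAn2: TA + VOWEL SIGN AA + NNNA + PULLI
QUERY = "\u0ba4\u0bbe\u0ba9\u0bcd"

# tOn2 in Unicode logical order: TA + VOWEL SIGN OO + NNNA + PULLI
LOGICAL = "\u0ba4\u0bcb\u0ba9\u0bcd"

# tOn2 as an assumed visual-order glyph stream:
# VOWEL SIGN EE + TA + VOWEL SIGN AA + NNNA + PULLI
VISUAL = "\u0bc7\u0ba4\u0bbe\u0ba9\u0bcd"


def found(needle, haystack):
    """Plain code-point substring search, as a naive search box might do."""
    return needle in haystack


# Correctly encoded text does not produce the false hit,
# whether it is stored precomposed or canonically decomposed (NFD).
print(found(QUERY, LOGICAL))                                # False
print(found(QUERY, unicodedata.normalize("NFD", LOGICAL)))  # False

# A visual-order stream does: TA + AA + NNNA + PULLI appears inside it.
print(found(QUERY, VISUAL))                                 # True
```

If this is what is happening, the false hit would come from how the PDF was produced and not from the Unicode encoding of Tamil itself.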

Has anyone processing (searching, sorting) Unicode texts in Sanskrit or other Indian languages encountered any problems like the above? 

(Needless to say, when one copies the text from PDF and pastes in email, one gets messed up text like this. நாக􏰀கம் இல்லாத மிகப் பழங் காலத்தில் மனிதர்கள் வடுீ	கட்டத் ெத􏰀யாமல் குைககளில் வாழ்ந்தார்களாம். அந்தப் பழஙகாலத்ைதக் கற்காலம் என்று ெசால்லுகிேறாம்.)
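
In the pasted sample the vowel signs that are displayed to the left of a consonant appear before that consonant, which suggests the text was extracted in visual order. A minimal sketch of how such text might be put back into logical order follows. The function name visual_to_logical and the consonant range test are my own illustrative choices, and the sketch handles only the three pre-base vowel signs and the three two-part vowel signs; it does nothing about characters that were lost or replaced during extraction.

```python
# Vowel signs drawn to the left of the consonant: E, EE, AI
PRE_BASE = {"\u0bc6", "\u0bc7", "\u0bc8"}

# Two-part vowel signs: (left piece, right piece) -> single logical sign
COMPOSE = {
    ("\u0bc6", "\u0bbe"): "\u0bca",  # E  + AA             -> O
    ("\u0bc7", "\u0bbe"): "\u0bcb",  # EE + AA             -> OO
    ("\u0bc6", "\u0bd7"): "\u0bcc",  # E  + AU LENGTH MARK -> AU
}


def is_tamil_consonant(ch):
    # Rough test: the block range KA..HA (it also spans some unassigned points).
    return "\u0b95" <= ch <= "\u0bb9"


def visual_to_logical(text):
    """Move pre-base vowel signs after their consonant (hypothetical helper)."""
    out = []
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if ch in PRE_BASE and i + 1 < n and is_tamil_consonant(text[i + 1]):
            consonant = text[i + 1]
            following = text[i + 2] if i + 2 < n else ""
            combined = COMPOSE.get((ch, following))
            if combined:
                # left piece + consonant + right piece -> consonant + sign
                out.append(consonant + combined)
                i += 3
            else:
                # left sign + consonant -> consonant + sign
                out.append(consonant + ch)
                i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# Visual-order form of tOn2 becomes TA + VOWEL SIGN OO + NNNA + PULLI.
print(visual_to_logical("\u0bc7\u0ba4\u0bbe\u0ba9\u0bcd") == "\u0ba4\u0bcb\u0ba9\u0bcd")  # True
```

After such a repair, a search for tAn2 would no longer match inside tOn2, since the consonant is then followed by a single vowel sign OO.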

(However, a draft report by an Expert Committee on Technology Standards for Indian Languages
(http://egovstandards.gov.in/apex-review/egscontent.2009-06-10.5999916108/at_download/file) claims:
- All major operating systems, browsers, editors, word processors and other applications & tools are supporting Unicode.
- It is possible to use Indian languages and scripts in the Unicode environment, which will resolve the compatibility issue.
- The documents created using Unicode may be searched very easily on the web.
- As Unicode is widely recognized all over the world and also supporting Indian languages, it will ease Localization applications including e-Governance application for all the constitutionally recognized Indian languages.
- Since Indian languages are also used in the other part of the world, it is possible to have Global data exchange.)

Thanks in advance.

Regards,
Palaniappan