Digital Indological Dream? 17 oct. 1995

Sun May 12 12:17:01 UTC 1996

                 DIGITAL INDOLOGICAL DREAM

A reaction on INDOLOGY at LIVERPOOL.AC.UK, 17-OCT-1995

               GATEWAY MULTIMEDIA INDIA LTD.
               =============================

Gateway Multimedia India offers a wide range of services in
the field of Electronic Publishing.

The staff of the company is multilingual and
multidisciplinary; the company is managed by a Dutch national.

Gateway Multimedia India is known for its linguistic
publications in book format and on CD-ROM (Transliterated
Hindi-Hindi-English dictionary).

The company is based in Ahmedabad, 580 km north of Bombay.

                    MONIER-WILLIAMS

We have indeed considered publishing the Monier-Williams
Sanskrit dictionary on CD-ROM, but it is not as easy as it may
look. It will require not only double entry and file
comparison, but also further work to enable a hypertext-based
search engine and retrieval. A detailed study is available
for those keenly interested.

We have calculated the cost of the project at Rs. 30 lakh
(approximately US$ 86,000) and its duration at one year. The
publication will then be edited in its entirety, with every
entry written out in full (no tilde), etc.

The Monier-Williams is frequently reprinted and its users are
worldwide. At an end-user price of $249.95 the production
will certainly pay off and will serve scholars in every remote
corner of the globe.
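
As a rough check on the figures above, the number of copies needed
to recover the project cost can be estimated as follows. This is
only a sketch: it assumes the full end-user price reaches the
publisher and ignores replication, distribution and taxes, none of
which are stated here. The function name is hypothetical.

```python
import math

# Figures quoted in the text above.
PROJECT_COST_USD = 86_000
END_USER_PRICE_USD = 249.95


def break_even_copies(cost, price):
    """Smallest whole number of copies whose revenue covers the cost.

    Assumption: every sale returns the full end-user price.
    """
    return math.ceil(cost / price)


copies = break_even_copies(PROJECT_COST_USD, END_USER_PRICE_USD)
print(copies)
```

Any per-copy cost or dealer margin would raise this number
accordingly.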

Our company has the required experience to tackle this
project. See above and below. Our Hindi-English CD-ROM has
received favourable reactions worldwide.

We do, however, lack the finance.

Which university or scientific institute will push the
project?

For further details, please contact

                 Hein W. Wagenaar

e-mail         : parikh at rullet.leidenuniv.nl
phone & fax    : 00 31 20 626 7479
address        : Hoogte Kadijk 109
                 1018 BH - Amsterdam, The Netherlands

               GATEWAY MULTIMEDIA INDIA LTD.
                 MULTILINGUAL PRODUCTIONS
                 ========================

Gateway Multimedia India is working on the creation of a single
database containing the Constitutional languages of India (16)
linked to the major European ones for the production of
several types of dictionaries.

Types of dictionaries to be derived from the database will
include comprehensive monolingual and bilingual dictionaries,
pocket dictionaries, multilingual dictionaries, specialised
dictionaries, for example for technical terms, etc. 

These dictionaries will be published both as traditional paper
dictionaries and on CD-ROM. Further applications of the
database may be found in creating thesauri, rhyming
dictionaries, spell-checking applications and natural language
processing.

General outline

The aim of the multilingual dictionary database project is to
build a database covering a multitude of languages, with a
special focus on Indian languages.

This database will be the basic source for a whole range of
derived end-user products, including paper dictionaries for
various languages, uses, and specialisations, and multilingual
dictionaries on CD-ROM. 

To give an example: if, say, a market analysis shows that
there is a demand for a Hindi-Marathi dictionary, it will be a
matter of turning some knobs to extract the required
information from the database and run it through an automatic
lay-out program to create camera-ready copy, which can be sent
off to a printer in a matter of days.

A Kannada-Bengali dictionary, or any other combination of
languages can be generated with the same ease.

India has 18 officially recognised languages:

     Hindi, English, Sanskrit, Bengali, Urdu, Gujarati,
     Marathi, Punjabi, Oriya, Assamese, Nepali, Kashmiri,
     Konkani, Manipuri, Telugu, Kannada, Tamil, and Malayalam,

     of which Sanskrit can be omitted, while Sindhi needs to
     be included. 

The inclusion of major foreign languages (German, French,
Russian, Arabic, Chinese, Japanese, Spanish, Portuguese) will
immensely increase the scope of the database.

To enable automated dictionary production, special care has to
be taken with respect to the way information is stored and
linked. In every language, words have a different scope of
meanings, which only partly overlap with those in other
languages. 

The task of a bilingual dictionary is to demarcate these
borders of meaning as closely as possible. For this reason,
one cannot simply work with word lists, but one has to work
with a kind of "link language" which has a word for every
conceivable meaning in all languages.

Such a link language can be provided by means of
definitions, which need to be created with great care.
However, when this task is completed, the savings in work will
be considerable, in fact so large that the database can truly
be called revolutionary.
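
The link-language idea can be illustrated with a minimal sketch.
It is not the project's actual database design: the concept
identifiers, the record layout, the function name and the sample
romanised words are all assumptions made for illustration. Each
record holds one definition and the words that express it in each
language, so a bilingual dictionary for any pair is derived by
joining two languages on the shared definition.

```python
from collections import defaultdict

# Hypothetical link-language store: one record per meaning.
# The English word "bank" appears under two separate meanings.
CONCEPTS = {
    "water": {
        "definition": "the clear liquid that falls as rain",
        "words": {"en": ["water"], "hi": ["pānī"], "mr": ["pāṇī"]},
    },
    "house": {
        "definition": "a building for people to live in",
        "words": {"en": ["house"], "hi": ["ghar"], "mr": ["ghar"]},
    },
    "river_bank": {
        "definition": "the land along the edge of a river",
        "words": {"en": ["bank"], "hi": ["kinārā"]},
    },
    "financial_bank": {
        "definition": "an institution that keeps and lends money",
        "words": {"en": ["bank"], "hi": ["baiṅk"]},
    },
}


def bilingual_dictionary(concepts, source, target):
    """Derive source-to-target entries by joining on shared meanings."""
    entries = defaultdict(list)
    for concept in concepts.values():
        words = concept["words"]
        for headword in words.get(source, []):
            for translation in words.get(target, []):
                if translation not in entries[headword]:
                    entries[headword].append(translation)
    # Plain code-point sorting of headwords; a real product would
    # need a language-specific alphabetical order here.
    return dict(sorted(entries.items()))


print(bilingual_dictionary(CONCEPTS, "en", "hi"))
print(bilingual_dictionary(CONCEPTS, "hi", "mr"))
```

Because "bank" is attached to two different definitions, its two
Hindi equivalents stay distinguishable, which is the demarcation
of meaning described above. Meanings with no word recorded in the
target language simply produce no entry.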

Traditionally, each bilingual dictionary was created
separately, so to create a full set of bilingual dictionaries
for 26 languages, the traditional method requires going
through the editing process 676 times (26 x 26 source-target
pairings, of which 650 pair two different languages), whereas
with our system, we only have to edit the sets of definitions
and vocabulary for 26 languages, after which we can simply
generate the 676 dictionaries automatically.
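
The arithmetic behind this comparison can be written out as a
short sketch. The function names are hypothetical; the only input
taken from the text is the figure of 26 languages.

```python
def pairwise_editing_passes(n_languages):
    """Every source language paired with every target language,
    including a language paired with itself (monolingual)."""
    return n_languages * n_languages


def distinct_pair_editing_passes(n_languages):
    """Source-target pairings of two different languages only."""
    return n_languages * (n_languages - 1)


def link_language_editing_passes(n_languages):
    """One vocabulary and definition set per language."""
    return n_languages


n = 26
print(pairwise_editing_passes(n))       # 676
print(distinct_pair_editing_passes(n))  # 650
print(link_language_editing_passes(n))  # 26
```

The traditional effort grows with the square of the number of
languages, while the link-language effort grows linearly, so each
added language widens the gap.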

Further, the same database can be used to derive thesauri for
each of the languages covered, spelling checkers, and further
products like rhyming dictionaries and puzzle dictionaries.

The special focus on Indian languages introduces problems
unheard of in existing dictionary databases: the many
different complex scripts in use in India. These include
not only at least ten different Brahmi-derived scripts, each
with its own peculiarities, but also the Nastaliq style of
Arabic script used for Urdu, which is probably the most
difficult script to automate in use today.

The multilingual CD-ROM will require display and editing
facilities in all Indian languages integrated into its
user-interface. 

Although solutions exist for individual Indian languages,
no-one so far has taken up the task of integrating all of them
into a single product in a satisfactory way. So tackling this
problem will be an important part of the project. 

Next to developing the required fonts, this will also include
handling multiple keyboard lay-outs, and composing the complex
conjunct characters and contextually dependent shapes used in
Indian scripts.

The software to do this, however, when completed, will also
prove to be a product in itself, with a much wider field of
applications, ranging from editors and DTP products to Indian
language multi-media titles and video subtitling.

Further, the multi-lingual database CD-ROM application will
require fast, culturally correct sorting tools, which can sort
according to, for example, the Arabic or Devanagari
alphabetical order, and indexing and searching tools to
navigate through the huge amount of data. Such tools will also
be useful in a much wider range of Multimedia and text
retrieval applications, and thus can also be marketed as a
separate product.
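
What culturally correct sorting means can be shown with a small
sketch. It is an illustration only, not the project's tool: it
sorts romanised words in a simplified Devanagari letter order,
where the letter inventory, the position of anusvara and visarga,
and the function names are all assumptions. Note that "kh", "ai"
and similar pairs count as single letters.

```python
import unicodedata

# Simplified Devanagari order in Roman transliteration (assumed).
ALPHABET = [
    "a", "ā", "i", "ī", "u", "ū", "ṛ", "e", "ai", "o", "au", "ṃ", "ḥ",
    "k", "kh", "g", "gh", "ṅ", "c", "ch", "j", "jh", "ñ",
    "ṭ", "ṭh", "ḍ", "ḍh", "ṇ", "t", "th", "d", "dh", "n",
    "p", "ph", "b", "bh", "m", "y", "r", "l", "v", "ś", "ṣ", "s", "h",
]
RANK = {letter: position for position, letter in enumerate(ALPHABET)}
MAX_LEN = max(len(letter) for letter in ALPHABET)


def sort_key(word):
    """Turn a word into a tuple of letter ranks, longest match first."""
    text = unicodedata.normalize("NFC", word.lower())
    key = []
    i = 0
    while i < len(text):
        for size in range(MAX_LEN, 0, -1):
            chunk = text[i:i + size]
            if chunk in RANK:
                key.append(RANK[chunk])
                i += len(chunk)
                break
        else:
            # Unknown characters sort after every known letter.
            key.append(len(ALPHABET) + ord(text[i]))
            i += 1
    return tuple(key)


def sort_words(words):
    return sorted(words, key=sort_key)


words = ["ga", "kha", "ka", "āka", "aka"]
print(sort_words(words))  # ['aka', 'āka', 'ka', 'kha', 'ga']
print(sorted(words))      # ['aka', 'ga', 'ka', 'kha', 'āka']
```

The second line of output shows the default character-code order,
which puts "ga" before "ka" and pushes "āka" to the end; the first
follows the order a reader of a Sanskrit or Hindi dictionary
would expect.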

Salient features of the multi-lingual database project

* coverage of all major Indian and foreign languages in their
  native scripts
* automatic production of monolingual and bilingual dictionaries
* automatic production of specialised dictionaries
* production of further products, like thesauri, puzzle and
  rhyming dictionaries
* separately marketable products, such as

      * Indian language processing software, including

             . text presentation
             . text editing
             . spell-checking
             . culturally correct sorting

      * hyper-text indexing and retrieval software

* a huge database accessible for various kinds of linguistic   
  research.