[INDOLOGY] Google Translate for Sanskrit
Sebastian Nehrdich
nehrdbsd at gmail.com
Fri May 13 13:53:46 UTC 2022
Looking at the characteristics on these translations, their rather
good performance on what could be considered 'modern Sanskrit' and the
not-so-convincing performance on older material, I have my doubts
whether this is really a zery shot approach; I rather assume that they
used a (zero-shot?) language model to bootstrap a bitext mining system
that then generated Sanskrit<>English translation pairs based on the
Sanskrit WIkipedia and other related resources. With a small amount of
such data a T5-like seq2seq architecture that was pretrained on a lot
of (English) data could then be used to output nicely formed (but not
always correct) translations.
Even without much specific vocabulary, this system doesn't seem to be
very robust I am afraid.
For example a famous verse from Maitreya's Madhyāntavibhāga:
na śūnyaṃ nāpi cāśūnyaṃ tasmāt sarvaṃ vidhīyate |
sattvādasattvāt sattvācca madhyamā pratipacca sā || 1.3
Google translate:
"Neither empty nor empty, therefore everything is prescribed
That is the middle counterpart from Sattva to Sattva"
There is no very specific vocabulary here but the translation is still
quite meh.
Now when we run the Chinese translation (玄奘, 7th century) of this
verse through a machine translation model that we are currently
developing for Buddhist Chinese, we get a very different result:
故說一切法 非空非不空
有無及有故 是則契中道
therefore it is said that all dharmas are neither empty nor nonempty
because of both existence and nonexistence, this is the middle way
For Chinese, we also don't have much parallel data for the training of
the models, but we can take advantage of the large bilingual corpora
for non-Classical/non-Buddhist material.
The different results of these two models show rather clearly to me
that zero shot transfer learning for low resource languages such as
Sanskrit sounds nice in theory but is extremely difficult to do well
on target-domains in practice. For Chinese, on the other hand, the
sheer mass of bilingual training data that we have enables transfer
learning from a non-domain-specific corpus into a very narrow domain
and produce surprisingly good results. Who knows what our models can
do in the near future, but I guess it will take a while until we get
universally good Sanskrit machine translations...
Best,
Sebastian Nehrdich
On Fri, May 13, 2022 at 1:05 PM Satyanad Kichenassamy
<satyanad.kichenassamy at univ-reims.fr> wrote:
>
>
> That sounds right. In addition, if I understand them correctly, they claim they have made a significant improvement. In their announcement (https://blog.google/products/translate/24-new-languages/), they say:
> "This is also a technical milestone for Google Translate. These are the first languages we’ve added using Zero-Shot Machine Translation, where a machine learning model only sees monolingual text — meaning, it learns to translate into another language without ever seeing an example. While this technology is impressive, it isn't perfect. And we’ll keep improving these models to deliver the same experience you’re used to with a Spanish or German translation, for example. If you want to dig into the technical details, check out our Google AI blog post and research paper."
>
> Blog : https://ai.googleblog.com/2022/05/24-new-languages-google-translate.html
> Paper : https://arxiv.org/pdf/2205.03983.pdf
>
> They say: "There are two key bottlenecks... The first arises from data scarcity; digitized data for many languages is limited and can be difficult to find on the web... The second challenge arises from modeling limitations....models must learn to translate from limited amounts of monolingual text, which is a novel area of research. "
>
> Indologists would add that the program aims at reproducing a consensus about a one-to-one mapping from text to meaning. But we are usually interested in situations where such a consensus either does not exist, or is problematic.
>
> Best regards,
>
> Satyanad Kichenassamy
>
> On Fri, 13 May 2022 12:34:37 +0200
> Oliver Hellwig via INDOLOGY <indology at list.indology.info> wrote:
>
> > Most probably they have built their MT system on top of so called deep
> > contextualized embeddings such as BERT
> > (https://towardsdatascience.com/nlp-extract-contextualized-word-embeddings-from-bert-keras-tf-67ef29f60a7b)
> > or Roberta (https://huggingface.co/docs/transformers/model_doc/roberta).
> > We have analyzed such multilingual embeddings for a Sanskrit project,
> > and it turned out that the Sanskrit data were mainly taken from a dump
> > of the Sanskrit Wikipedia, which explains the preference for the modern
> > version. Very useful for MT, less so for a close reading of Vedic texts.
> >
> > Best, Oliver
> >
> > On 13/05/2022 11:51, Antonia Ruppel via INDOLOGY wrote:
> > > I think it's also worth asking what the programmers who made this meant
> > > when they said 'Sanskrit'. The classical language, or the modern spoken
> > > version taught and stratified by organisations like e.g. Samskrita
> > > Bharati? I tried a few simple sentences (I went into town, I saw the
> > > man, Where is the cat? etc) and found that
> > > -- the past tense is expressed by means of the ta- and tavant-
> > > participles (the default is the masculine participle, by the way, even
> > > when you try things like 'I, Sītā, went into town'), as favoured e.g. by
> > > modern spoken Sanskrit (not only by it, of course)
> > > -- 'Where is the cat?' resulted in the word order बिडालः कुत्र अस्ति
> > > favoured by modern Sanskrit (and mirrored by e.g. Hindi).
> > > - my, her etc in sentences like 'she sees her sisters' are usually
> > > expressed, e.g. by means of sva- or through the actual genitive pronoun,
> > > unlike the Classical Sanskrit tendency of only expressing this when
> > > omission causes confusion
> > > - with at least some expressions we get the noun in accusative + karoti
> > > expression (e.g. smitam karoti rather than smayati), that, I think, also
> > > becomes more prevalent as time passes
> > > - external sandhi is not applied, again following the prevalent modern
> > > spoken convention
> > >
> > > Entering 'I have seen him' (rather than 'I saw him') gives me मया तं
> > > दृष्टम्, which I don't quite understand because I'd have expected 'him' to
> > > be the subject and thus nominative. (The same results with other
> > > transitive verbs.)
> > >
> > > When you create a translation program, you need to decide what the
> > > 'right' translation of something is. With literary languages, like
> > > Sanskrit, whose features usually include variety of expression, that is
> > > difficult. So it seems natural that the programmers would use the
> > > standards of the modern spoken language, for whose creation those
> > > decisions were at some point made.
> > >
> > > That google translate now includes Sanskrit is a fascinating social
> > > phenomenon. I'm looking forward to seeing how they are going to develop
> > > it, and hope someone might at some point talk about their methodology in
> > > creating this function. (Let's find out and invite them to a conference?
> > > It would surely make for a fascinating talk!)
> > >
> > > All best,
> > > Antonia
> > >
> > > On Fri, 13 May 2022 at 10:01, Satyanad Kichenassamy
> > > <satyanad.kichenassamy at univ-reims.fr
> > > <mailto:satyanad.kichenassamy at univ-reims.fr>> wrote:
> > >
> > >
> > > Dear All,
> > >
> > > Here are a few further experiments that illustrate other issues :
> > >
> > > Input: सत्यमेव जयते
> > > Output: Truth always triumphs
> > >
> > > Input: Truth always triumphs
> > > Output: सत्यं सदा विजयते
> > >
> > > Input: सत्यं सदा विजयते
> > > Output: Truth always triumphs
> > >
> > > Input: C'est la réalité qui triomphe
> > > Output: Reality wins
> > >
> > > Input: C'est la réalité qui triomphe.
> > > Output: It is reality that triumphs.
> > >
> > > (The only difference between the last two inputs is the final period.)
> > >
> > > Input: Reality prevails.
> > > Input: La réalité l'emporte.
> > >
> > > Input: Reality alone prevails.
> > > Output: Seule la réalité prévaut.
> > >
> > > Input: Seule la réalité prévaut.
> > > Output: Only reality prevails.
> > >
> > > And, for fun, Prop. 12.21 from Brahmagupta's Braahmasphu.tasiddhaanta.
> > >
> > > Input: स्थूलफलं त्रिचतुर्भुजबाहुप्रतिबाहुयोगदलघातः।
> > > भुजयोगार्धचतुष्टयभुजोनघातात्पदं सूक्ष्मम् ॥
> > >
> > > Output: The gross fruit is the three-four-arm arm-counter-arm
> > > combination team attack.
> > > The subtle step is from the impact of the four and a half arms of
> > > the Yoga of the arms.
> > >
> > > A correct translation is as follows (the four lines correspond to
> > > the four parts of this Arya verse):
> > > A crude value [indeed] of the area of a triquadrilateral
> > > Is the product of the half-sums of opposite sides ;
> > > Of a group consisting of four half-sums of the sides, from which
> > > The sides have been subtracted [in turn], the root of the
> > > product is the refined [value].
> > >
> > > NB: There are quite a few technical terms here; taking some of them
> > > in their ordinary meaning leads to gibberish. "Pada" here is the
> > > square root (because the foot of a tree is its root). Yoga is here
> > > the sum. "Dala" is the half (literally, "broken (in half)"). A
> > > triquadrilateral is the figure obtained from a trilateral by adding
> > > a fourth vertex on its circumcircle. Tricaturbhuja is a neologism
> > > introduced by Brahmagupta that we translated by a neologism because
> > > there is no corresponding notion in English.
> > >
> > > Thus, Google Translate seems adequate at the स्थूल level, but may miss
> > > the सूक्ष्म.
> > >
> > > Reverting to general issues from an Indological or mathematical (or
> > > computer science) viewpoint, I would suggest offhand the following
> > > for discussion:
> > >
> > > (i) is the algorithm public or not? (Probably not, but who knows?)
> > >
> > > (ii) is there a public algorithm with comparable performance?
> > >
> > > (iii) what is the knowledge base (or training set in the sense of
> > > neural networks) of known algorithms?
> > >
> > > (iv) a possibly related issue is that there does not seem to be any
> > > equivalent for Indian languages of Chinese databases such as
> > > ctext.org <http://ctext.org> for instance, that include many tools
> > > in addition to searching. For Sanskrit and Tamil, we are grateful to
> > > have what you can find on
> > > https://indology.info/external-resources/
> > > <https://indology.info/external-resources/>
> > > including
> > > https://www.projectmadurai.org/ <https://www.projectmadurai.org/>
> > > http://gretil.sub.uni-goettingen.de/gretil.html
> > > <http://gretil.sub.uni-goettingen.de/gretil.html>
> > > https://titus.uni-frankfurt.de/indexf.htm
> > > <https://titus.uni-frankfurt.de/indexf.htm>
> > >
> > > etc.
> > >
> > > For Sanskrit morphology and, to some extent, parsing, the situation
> > > is much better : https://sanskrit.inria.fr/DICO/
> > > <https://sanskrit.inria.fr/DICO/>
> > > But such tools do not seem to have been integrated into other
> > > databases (so that, for instance, hovering the mouse over a word
> > > would suggest its grammatical nature, or suggest meanings -- such
> > > things exist in Chinese). This may require the text input into the
> > > database to integrate a modicum of grammatical analysis and
> > > therefore, what amounts to an implicit commentary. This may
> > > nonetheless be appropriate for research journals that could provide
> > > enriched versions of papers. Automated translation always requires
> > > some form of semantic input anyway, except for the crudest examples.
> > >
> > > Best regards,
> > >
> > > Satyanad Kichenassamy
> > >
> > > On Thu, 12 May 2022 16:48:48 -0400
> > > Elliot Stern via INDOLOGY <indology at list.indology.info
> > > <mailto:indology at list.indology.info>> wrote:
> > >
> > > > Aleksandar’s comment is spot on:
> > > >
> > > >
> > > >
> > > > Elliot M. Stern
> > > > 552 South 48th Street
> > > > Philadelphia, PA 19143-2029
> > > > emstern1948 at gmail.com <mailto:emstern1948 at gmail.com>
> > > > 267-240-8418
> > > >
> > > > > On May 12, 2022, at 1:45 PM, Uskokov, Aleksandar via INDOLOGY
> > > <indology at list.indology.info <mailto:indology at list.indology.info>>
> > > wrote:
> > > > >
> > > > > It will be a while before it becomes a philosopher --
> > > > >
> > > > > Aleksandar Uskokov
> > > > > Lector in Sanskrit
> > > > > South Asian Studies Council, Yale University
> > > > > 203-432-1972 | aleksandar.uskokov at yale.edu
> > > <mailto:aleksandar.uskokov at yale.edu>
> > > <mailto:aleksandar.uskokov at yale.edu
> > > <mailto:aleksandar.uskokov at yale.edu>>
> > > > >
> > > > > Office Hours Sign-up: https://calendly.com/aleksandar-uskokov
> > > <https://calendly.com/aleksandar-uskokov>
> > > <https://calendly.com/aleksandar-uskokov
> > > <https://calendly.com/aleksandar-uskokov>>
> > > > > From: INDOLOGY <indology-bounces at list.indology.info
> > > <mailto:indology-bounces at list.indology.info>
> > > <mailto:indology-bounces at list.indology.info
> > > <mailto:indology-bounces at list.indology.info>>> on behalf of Madhav
> > > Deshpande via INDOLOGY <indology at list.indology.info
> > > <mailto:indology at list.indology.info>
> > > <mailto:indology at list.indology.info
> > > <mailto:indology at list.indology.info>>>
> > > > > Sent: Thursday, May 12, 2022 1:31 PM
> > > > > To: Dominik Wujastyk <wujastyk at gmail.com
> > > <mailto:wujastyk at gmail.com> <mailto:wujastyk at gmail.com
> > > <mailto:wujastyk at gmail.com>>>
> > > > > Cc: Indology <indology at list.indology.info
> > > <mailto:indology at list.indology.info>
> > > <mailto:indology at list.indology.info
> > > <mailto:indology at list.indology.info>>>
> > > > > Subject: Re: [INDOLOGY] Google Translate for Sanskrit
> > > > >
> > > > > This is Google Translator for the first verse of Meghadūta:
> > > > >
> > > > > "Someone is neglected by the teacher of separation from his lover:
> > > > > Shapenastangmitamahima varshabhogyaena bhartu:
> > > > > The yaksha bathed Janaka's daughter in the holy waters
> > > > > I lived in the hermitages of Ramagiri among the lush shady trees."
> > > > >
> > > > > GT could not figure out the long compounds, and "guru" got
> > > translated as "teacher." The syntax of the verse is also missed.
> > > > >
> > > > > Madhav M. Deshpande
> > > > > Professor Emeritus, Sanskrit and Linguistics
> > > > > University of Michigan, Ann Arbor, Michigan, USA
> > > > > Senior Fellow, Oxford Center for Hindu Studies
> > > > > Adjunct Professor, National Institute of Advanced Studies,
> > > Bangalore, India
> > > > >
> > > > > [Residence: Campbell, California, USA]
> > > > >
> > > > >
> > > > > On Thu, May 12, 2022 at 10:17 AM Madhav Deshpande
> > > <mmdesh at umich.edu <mailto:mmdesh at umich.edu> <mailto:mmdesh at umich.edu
> > > <mailto:mmdesh at umich.edu>>> wrote:
> > > > > <image.png>
> > > > > Madhav M. Deshpande
> > > > > Professor Emeritus, Sanskrit and Linguistics
> > > > > University of Michigan, Ann Arbor, Michigan, USA
> > > > > Senior Fellow, Oxford Center for Hindu Studies
> > > > > Adjunct Professor, National Institute of Advanced Studies,
> > > Bangalore, India
> > > > >
> > > > > [Residence: Campbell, California, USA]
> > > > >
> > > > >
> > > > > On Thu, May 12, 2022 at 10:16 AM Dominik Wujastyk via INDOLOGY
> > > <indology at list.indology.info <mailto:indology at list.indology.info>
> > > <mailto:indology at list.indology.info
> > > <mailto:indology at list.indology.info>>> wrote:
> > > > > It's quite remarkable:
> > > > > <image.png>
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > INDOLOGY mailing list
> > > > > INDOLOGY at list.indology.info
> > > <mailto:INDOLOGY at list.indology.info>
> > > <mailto:INDOLOGY at list.indology.info
> > > <mailto:INDOLOGY at list.indology.info>>
> > > > > https://list.indology.info/mailman/listinfo/indology
> > > <https://list.indology.info/mailman/listinfo/indology>
> > > <https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Flist.indology.info%2Fmailman%2Flistinfo%2Findology&data=05%7C01%7Caleksandar.uskokov%40yale.edu%7C68ffc9acdc8241aa59d608da343d6b2e%7Cdd8cbebb21394df8b4114e3e87abeb5c%7C0%7C0%7C637879735643840753%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=XN7lAO3%2B4CZr1hjpYDZY4y0AcEs0HCIrhj1vCDTcw9k%3D&reserved=0
> > > <https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Flist.indology.info%2Fmailman%2Flistinfo%2Findology&data=05%7C01%7Caleksandar.uskokov%40yale.edu%7C68ffc9acdc8241aa59d608da343d6b2e%7Cdd8cbebb21394df8b4114e3e87abeb5c%7C0%7C0%7C637879735643840753%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=XN7lAO3%2B4CZr1hjpYDZY4y0AcEs0HCIrhj1vCDTcw9k%3D&reserved=0>>
> > > > > <Screenshot 2022-05-12 134349.png>
> > > > > _______________________________________________
> > > > > INDOLOGY mailing list
> > > > > INDOLOGY at list.indology.info
> > > <mailto:INDOLOGY at list.indology.info>
> > > <mailto:INDOLOGY at list.indology.info
> > > <mailto:INDOLOGY at list.indology.info>>
> > > > > https://list.indology.info/mailman/listinfo/indology
> > > <https://list.indology.info/mailman/listinfo/indology>
> > > <https://list.indology.info/mailman/listinfo/indology
> > > <https://list.indology.info/mailman/listinfo/indology>>
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > **********************************************
> > > Satyanad KICHENASSAMY
> > > Professor of Mathematics
> > > Laboratoire de Mathématiques de Reims (CNRS, UMR9008)
> > > Université de Reims Champagne-Ardenne
> > > F-51687 Reims Cedex 2
> > > France
> > > Web: https://www.normalesup.org/~kichenassamy
> > > <https://www.normalesup.org/~kichenassamy>
> > > **********************************************
> > >
> > > _______________________________________________
> > > INDOLOGY mailing list
> > > INDOLOGY at list.indology.info <mailto:INDOLOGY at list.indology.info>
> > > https://list.indology.info/mailman/listinfo/indology
> > > <https://list.indology.info/mailman/listinfo/indology>
> > >
> > >
> > >
> > > _______________________________________________
> > > INDOLOGY mailing list
> > > INDOLOGY at list.indology.info
> > > https://list.indology.info/mailman/listinfo/indology
> >
> > _______________________________________________
> > INDOLOGY mailing list
> > INDOLOGY at list.indology.info
> > https://list.indology.info/mailman/listinfo/indology
>
>
> --
> **********************************************
> Satyanad KICHENASSAMY
> Professor of Mathematics
> Laboratoire de Mathématiques de Reims (CNRS, UMR9008)
> Université de Reims Champagne-Ardenne
> F-51687 Reims Cedex 2
> France
> Web: https://www.normalesup.org/~kichenassamy
> **********************************************
>
> _______________________________________________
> INDOLOGY mailing list
> INDOLOGY at list.indology.info
> https://list.indology.info/mailman/listinfo/indology
More information about the INDOLOGY
mailing list