About
My primary…
Experience & Education
Publications
-
Learning Multilingual Word Embeddings in Latent Metric Space: A Geometric Approach
Preprint (arXiv)
We propose a novel geometric approach for learning bilingual mappings given monolingual embeddings and a bilingual dictionary. Our approach decouples learning the transformation from the source language to the target language into (a) learning rotations for language-specific embeddings to align them to a common space, and (b) learning a similarity metric in the common space to model similarities between the embeddings. We model the bilingual mapping problem as an optimization problem on smooth Riemannian manifolds. We show that our approach outperforms previous approaches on the bilingual lexicon induction and cross-lingual word similarity tasks. We also generalize our framework to represent multiple languages in a common latent space. In particular, the latent space representations for several languages are learned jointly, given bilingual dictionaries for multiple language pairs. We illustrate the effectiveness of joint learning for multiple languages in the zero-shot word translation setting.
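The rotation-learning step in (a) can be illustrated with the classical orthogonal Procrustes solution on toy data. This is only a minimal sketch of the rotation sub-problem; the paper's full method additionally learns a similarity metric and optimizes jointly over Riemannian manifolds, which this sketch omits.

```python
import numpy as np

# Toy illustration of the rotation-alignment step: given dictionary-paired
# embeddings X (source) and Y (target), the orthogonal Procrustes solution
# W = U V^T, where U S V^T = SVD(X^T Y), is the best rotation mapping X to Y.
rng = np.random.default_rng(0)
Y = rng.normal(size=(50, 4))                        # "target" embeddings
R_true, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # hidden orthogonal map
X = Y @ R_true.T                                    # source = rotated target

U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt                                          # learned rotation
assert np.allclose(X @ W, Y, atol=1e-6)             # alignment recovered
```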
-
Judicious Selection of Training Data in Assisting Language for Multilingual Neural NER
Annual Meeting of the Association for Computational Linguistics (ACL)
Multilingual learning for Neural Named Entity Recognition (NNER) involves jointly training a neural network for multiple languages. Typically, the goal is improving the NER performance of one of the languages (the primary language) using the other assisting languages. We show that the divergence in the tag distributions of the common named entities between the primary and assisting language can reduce the effectiveness of multilingual learning. To alleviate this problem, we propose a metric based on symmetric KL divergence to filter out the highly divergent training instances in the assisting language. We empirically show that our data selection strategy improves NER performance on many languages, including those with very limited training data.
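The filtering metric can be sketched as follows. The tag inventory, example distributions, and the idea of a filtering threshold here are hypothetical illustrations, not the paper's experimental setup.

```python
import math

def sym_kl(p, q, eps=1e-12):
    """Symmetric KL divergence KL(p||q) + KL(q||p) between two discrete
    distributions (lists of probabilities), with floor smoothing."""
    p = [max(x, eps) for x in p]
    q = [max(x, eps) for x in q]
    kl = lambda a, b: sum(x * math.log(x / y) for x, y in zip(a, b))
    return kl(p, q) + kl(q, p)

# Hypothetical tag distributions (P(PER), P(LOC), P(ORG)) of one shared
# named entity in the primary vs. an assisting language: a large symmetric
# KL flags the assisting-language instances as candidates for filtering.
primary      = [0.70, 0.20, 0.10]
assist_close = [0.65, 0.25, 0.10]   # similar usage: keep
assist_far   = [0.10, 0.10, 0.80]   # divergent usage: filter out
assert sym_kl(primary, assist_close) < sym_kl(primary, assist_far)
```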
-
Leveraging Orthographic Similarity for Multilingual Neural Transliteration
Transactions of the Association for Computational Linguistics (TACL)
We address the task of joint training of transliteration models for multiple language pairs (multilingual transliteration). This is an instance of multitask learning, where individual tasks (language pairs) benefit from sharing knowledge with related tasks. We focus on transliteration involving related tasks, i.e., languages sharing writing systems and phonetic properties (orthographically similar languages). We propose a modified neural encoder-decoder model that maximizes parameter sharing across language pairs in order to effectively leverage orthographic similarity. We show that multilingual transliteration significantly outperforms bilingual transliteration in different scenarios (average increase of 58% across a variety of languages we experimented with). We also show that multilingual transliteration models can generalize well to languages/language pairs not encountered during training and hence perform well on the zero-shot transliteration task. We show that further improvements can be achieved by using phonetic feature input.
-
The IIT Bombay English-Hindi Parallel Corpus
Language Resources and Evaluation Conference
We present the IIT Bombay English-Hindi Parallel Corpus. The corpus is a compilation of parallel corpora previously available in the public domain as well as new parallel corpora we collected. The corpus contains 1.49 million parallel segments, of which 694k segments were not previously available in the public domain. The corpus has been pre-processed for machine translation, and we report baseline phrase-based SMT and NMT translation results on this corpus. This corpus has been used in two editions of shared tasks at the Workshop on Asian Language Translation (2016 and 2017). The corpus is freely available for non-commercial research. To the best of our knowledge, this is the largest publicly available English-Hindi parallel corpus.
-
Utilizing Lexical Similarity between Related, Low-resource Languages for Pivot-based SMT
International Joint Conference on Natural Language Processing
We investigate pivot-based translation between related languages in a low-resource, phrase-based SMT setting. We show that a subword-level pivot-based SMT model using a related pivot language is substantially better than word and morpheme-level pivot models. It is also highly competitive with the best direct translation model, which is encouraging as no direct source-target training corpus is used. We also show that combining multiple related language pivot models can rival a direct translation model. Thus, the use of subwords as translation units coupled with multiple related pivot languages can compensate for the lack of a direct parallel corpus.
-
Learning variable length units for SMT between related languages via Byte Pair Encoding
Workshop on Subword and Character level models in NLP (SCLeM 2017, co-located with EMNLP 2017)
We explore the use of segments learnt using Byte Pair Encoding (referred to as BPE units) as basic units for statistical machine translation between related languages and compare it with orthographic syllables, which are currently the best performing basic units for this translation task. BPE identifies the most frequent character sequences as basic units, while orthographic syllables are linguistically motivated pseudo-syllables. We show that BPE units modestly outperform orthographic syllables as units of translation, showing up to 11% increase in BLEU score. While orthographic syllables can be used only for languages whose writing systems use vowel representations, BPE is writing system independent and we show that BPE outperforms other units for non-vowel writing systems too. Our results are supported by extensive experimentation spanning multiple language families and writing systems.
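As background, the core BPE learning loop (greedily merging the most frequent adjacent symbol pair) can be sketched as below. This is a simplified version for illustration: real implementations add end-of-word markers and frequency-weighted corpora, and the paper applies such learned units as translation units in SMT.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge operations: repeatedly merge the most frequent
    adjacent symbol pair across the vocabulary of words."""
    vocab = Counter(tuple(w) for w in words)   # words as symbol tuples
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():    # apply the merge everywhere
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

# "lo" and then "low" emerge as frequent multi-character units.
merges = bpe_merges(["lower", "lowest", "low", "slow"], 3)
```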
-
Orthographic Syllable as basic unit for SMT between Related Languages
Conference on Empirical Methods in Natural Language Processing (EMNLP)
We explore the use of the orthographic syllable, a variable-length consonant-vowel sequence, as a basic unit of translation between related languages which use abugida or alphabetic scripts. We show that orthographic syllable level translation significantly outperforms models trained over other basic units (word, morpheme and character) when training over small parallel corpora.
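For intuition, a rough Latin-script analogue of orthographic syllabification (maximal consonant*-vowel+ sequences) can be written as a short regex. This is only an illustrative approximation: the actual unit is defined over abugida and alphabetic scripts with their own vowel inventories, and the fixed ASCII vowel set here is an assumption.

```python
import re

def orthographic_syllables(word, vowels="aeiou"):
    """Split a Latin-script word into pseudo orthographic syllables:
    maximal consonant*-vowel+ runs, keeping any trailing consonants
    as a final segment."""
    pattern = re.compile(f"[^{vowels}]*[{vowels}]+|[^{vowels}]+$")
    return pattern.findall(word)

assert orthographic_syllables("translation") == ["tra", "nsla", "tio", "n"]
```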
-
Substring-based unsupervised transliteration with phonetic and contextual knowledge
SIGNLL Conference on Computational Natural Language Learning (CoNLL)
We propose an unsupervised approach for substring-based transliteration which incorporates two new sources of knowledge in the learning process: (i) context, by learning substring mappings as opposed to single character mappings, and (ii) phonetic features which capture cross-lingual character similarity via prior distributions.
Our approach is a two-stage iterative, bootstrapping solution, which vastly outperforms Ravi & Knight's (2009) state-of-the-art unsupervised transliteration method and outperforms a rule-based baseline by up to 50% for top-1 accuracy on multiple language pairs. We show that substring-based models are superior to character-based models, and observe that their top-10 accuracy is comparable to the top-1 accuracy of supervised systems.
Our method only requires a phonemic representation of the words. This is possible for many language-script combinations which have a high grapheme-to-phoneme correspondence, e.g. scripts of Indian languages derived from the Brahmi script. Hence, Indian languages were the focus of our experiments. For other languages, a grapheme-to-phoneme converter would be required.
-
A System for Compound Noun Multiword Expression Extraction for Hindi
6th International Conference on Natural Language Processing (ICON 2008)
Identifying compound noun multiword expressions is important for applications like machine translation and information retrieval. We describe a system for extracting Hindi compound noun multiword expressions (MWE) from a given corpus. We identify major categories of compound noun MWEs, based on linguistic and psycholinguistic principles. Our extraction methods use various statistical co-occurrence measures to exploit the statistical idiosyncrasy of MWEs. We make use of various lexical cues from the corpus to enhance our methods. We also address the extraction of reduplicative expressions using lexical, semantic and phonetic knowledge. We have also built an evaluation resource of compound noun MWEs for Hindi. Our methods give a recall of 80% and precision of 23% at rank 1000.
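One of the simplest such statistical co-occurrence measures, pointwise mutual information (PMI), can be sketched as below. The toy corpus is hypothetical, and the described system combines several measures with lexical cues rather than relying on PMI alone.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Score adjacent word pairs (candidate compound-noun MWEs) by
    pointwise mutual information: log p(x, y) / (p(x) * p(y))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n, nb = len(tokens), len(tokens) - 1
    return {
        (x, y): math.log((c / nb) / ((unigrams[x] / n) * (unigrams[y] / n)))
        for (x, y), c in bigrams.items()
    }

# Hypothetical toy corpus: the recurring pair "railway station" scores
# higher than an incidental pair like "the railway".
toks = "railway station is near the railway station of the town".split()
scores = pmi_scores(toks)
assert scores[("railway", "station")] > scores[("the", "railway")]
```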
-
Projects
-
Brahmi-Net
Brahmi-Net is an online system for transliteration and script conversion for all major Indian language pairs (306 pairs). The system covers 13 Indo-Aryan languages, 4 Dravidian languages and English.
Languages supported include:
- Indo-Aryan languages: Hindi, Urdu, Bengali, Gujarati, Punjabi, Marathi, Konkani, Assamese, Odia, Sindhi, Sinhala, Nepali, Sanskrit
- Dravidian languages: Tamil, Telugu, Malayalam, Kannada
- English
-
Śata-Anuvādak: Indic Translator
- Present
Śata-Anuvādak (100 Translators) is a broad-coverage Statistical Machine Translation system for Indian languages. It is a Phrase-Based MT system with pre-processing and post-processing extensions. The pre-processing includes source-side reordering for English to Indian language translation. The post-processing includes transliteration between Indian languages for OOV words. It currently supports translation between 11 Indian languages:
- Indo-Aryan languages: Hindi, Urdu, Bengali, Gujarati, Punjabi, Marathi, Konkani
- Dravidian languages: Tamil, Telugu, Malayalam
- English
-
Indic NLP Library
- Present
The goal of this project is to build Python-based libraries for common text processing and Natural Language Processing in Indian languages. Indian languages share a lot of similarity in terms of script, phonology, language syntax, etc., and this library is an attempt to provide a general solution to very commonly required toolsets for Indian language text.
The library provides the following functionalities:
- Text Normalization
- Indic Script Conversion
- Romanization of Indic Scripts (ITRANS) and vice-versa
- Indian Language Transliteration
- Tokenization
- Word Segmentation
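Script conversion between Indian languages exploits the parallel layout of the Brahmi-derived Unicode blocks: the major scripts occupy aligned 128-codepoint ranges, so a character can largely be mapped by shifting its codepoint offset. The sketch below illustrates only this offset rule; the actual library additionally handles script-specific gaps and exceptions that this simplification ignores.

```python
# Start codepoints of some Brahmi-derived Unicode blocks (parallel layout):
# Devanagari U+0900, Bengali U+0980, Tamil U+0B80, Malayalam U+0D00.
BLOCK_START = {"hi": 0x0900, "bn": 0x0980, "ta": 0x0B80, "ml": 0x0D00}

def convert_script(text, src, tgt):
    """Naive offset-based script conversion between Brahmi-derived scripts."""
    delta = BLOCK_START[tgt] - BLOCK_START[src]
    start = BLOCK_START[src]
    out = []
    for ch in text:
        cp = ord(ch)
        if start <= cp < start + 0x80:    # inside the source script block
            out.append(chr(cp + delta))
        else:                             # spaces, punctuation, digits, ...
            out.append(ch)
    return "".join(out)

# Devanagari KA "क" (U+0915) maps to Malayalam KA "ക" (U+0D15).
assert convert_script("\u0915", "hi", "ml") == "\u0d15"
```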
Honors & Awards
-
Outstanding Paper at SCLeM workshop 2017
Workshop on Subword and Character level models in NLP 2017 (co-located with EMNLP)
Our paper titled:
Learning variable length units for SMT between related languages via Byte Pair Encoding
co-authored with Prof. Pushpak Bhattacharyya, was awarded an Outstanding Paper award at the 1st Workshop on Subword and Character level models in NLP (SCLeM 2017), co-located with EMNLP 2017. The workshop was held on 7th September 2017.
Here is the paper:
https://arxiv.org/abs/1610.06510
-
Best Thesis Talk at Research and Innovation Symposium in Computing
Department of Computer Science and Engineering, IIT Bombay
This talk was given at the Department of Computer Science and Engineering, IIT Bombay's annual research symposium. The abstract of the talk is given below:
Related languages are those that exhibit lexical and structural similarities on account of sharing a common ancestry or being in contact for a long period of time. Machine Translation between related languages is a major requirement since there is substantial government, commercial and cultural communication among people speaking related languages. However, most of these languages have few parallel corpora resources, an important requirement for building good quality statistical machine translation (SMT) systems.
A key property of related languages is lexical similarity, which means the languages share many words with similar form (spelling/pronunciation) and meaning. These words could be cognates, lateral borrowings or loan words from other languages. Modelling lexical similarity among related languages is the key to building good-quality SMT systems with limited parallel corpora. We propose the use of two subword units of translation for modelling lexical similarity: (i) orthographic syllables, motivated by the design of Indic scripts, and (ii) byte pair encoded units, inspired by compression theory. We show that the proposed units significantly outperform other units of representation (word, morpheme and character) over multiple language pairs, spanning different language families, with varying degrees of lexical similarity, and are robust to domain changes too.
-
Invited Talk at Inter-Research-Institute Student Seminar in Computer Science (ACM India Annual Meet)
ACM India
Title of talk: Orthographic Syllable as basic unit for SMT between Related Languages
Abstract:
We explore the use of the orthographic syllable, a variable-length consonant-vowel sequence, as a basic unit of translation between related languages which use abugida or alphabetic scripts. We show that orthographic syllable level translation significantly outperforms models trained over other basic units (word, morpheme and character) when training over small parallel corpora.
-
Invited Tutorial on Statistical Machine Translation between related languages
North American Chapter of the Association for Computational Linguistics - Human Language Technologies: System Demonstrations
With Pushpak Bhattacharyya and Mitesh Khapra
Abstract:
Language-independent Statistical Machine Translation (SMT) has proven to be very challenging. The diversity of languages makes high accuracy difficult and requires substantial parallel corpus as well as linguistic resources (parsers, morph analyzers, etc.). An interesting observation is that a large chunk of machine translation (MT) requirements involve related languages. They are either: (i) between related languages, or (ii) between a lingua franca (like English) and a set of related languages. For instance, India, the European Union and South-East Asia have such translation requirements due to government, business and socio-cultural communication needs.
Related languages share a lot of linguistic features and the divergences among them are at a lower level of the NLP pipeline. The objective of the tutorial is to discuss how the relatedness among languages can be leveraged to bridge this language divergence, thereby achieving some/all of these goals: (i) improving translation quality, (ii) achieving better generalization, (iii) sharing linguistic resources, and (iv) reducing resource requirements.
We will look at the existing research in SMT from the perspective of related languages, with the goal to build a toolbox of methods that are useful for translation between related languages. This tutorial would be relevant to Machine Translation researchers and developers, especially those interested in translation between low-resource languages which have resource-rich related languages. It will also be relevant for researchers interested in multilingual computation.
Languages
-
English
Full professional proficiency
-
Hindi
Native or bilingual proficiency
-
Marathi
Native or bilingual proficiency
-
Malayalam
Native or bilingual proficiency