Anoop Kunchukuttan

Hyderabad, Telangana, India
4K followers · 500+ connections

About

I am interested in Natural Language Processing and Machine Learning.

My primary…

Experience & Education

  • Microsoft

Publications

  • Learning Multilingual Word Embeddings in Latent Metric Space: A Geometric Approach

    Preprint (arXiv)

    We propose a novel geometric approach for learning bilingual mappings given monolingual embeddings and a bilingual dictionary. Our approach decouples learning the transformation from the source language to the target language into (a) learning rotations for language-specific embeddings to align them to a common space, and (b) learning a similarity metric in the common space to model similarities between the embeddings. We model the bilingual mapping problem as an optimization problem on smooth Riemannian manifolds. We show that our approach outperforms previous approaches on the bilingual lexicon induction and cross-lingual word similarity tasks. We also generalize our framework to represent multiple languages in a common latent space. In particular, the latent space representations for several languages are learned jointly, given bilingual dictionaries for multiple language pairs. We illustrate the effectiveness of joint learning for multiple languages in a zero-shot word translation setting.
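
    The rotation step (a) is closely related to the classical orthogonal Procrustes problem, which has a closed-form solution via SVD. The sketch below illustrates that sub-problem only (not the paper's full Riemannian optimization); X and Y are hypothetical dictionary-aligned source and target embedding matrices.

        import numpy as np

        def procrustes_rotation(X, Y):
            """Return the orthogonal W minimizing ||XW - Y||_F.

            Closed form: W = U V^T, where U S V^T is the SVD of X^T Y.
            """
            U, _, Vt = np.linalg.svd(X.T @ Y)
            return U @ Vt

        # Toy check: recover a known 2-D rotation from aligned embeddings.
        rng = np.random.default_rng(0)
        X = rng.normal(size=(100, 2))
        theta = np.pi / 6
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        W = procrustes_rotation(X, X @ R)
        print(np.allclose(W, R))   # True: the learned map recovers R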

  • Judicious Selection of Training Data in Assisting Language for Multilingual Neural NER

    Conference of the Association for Computational Linguistics (ACL)

    Multilingual learning for Neural Named Entity Recognition (NNER) involves jointly training a neural network for multiple languages. Typically, the goal is improving the NER performance of one of the languages (the primary language) using the other assisting languages. We show that the divergence in the tag distributions of the common named entities between the primary and assisting language can reduce the effectiveness of multilingual learning. To alleviate this problem, we propose a metric based on symmetric KL divergence to filter out the highly divergent training instances in the assisting language. We empirically show that our data selection strategy improves NER performance on many languages, including those with very limited training data.
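
    As a rough illustration of the selection criterion, the sketch below computes the symmetric KL divergence between an entity's tag distributions in the primary and assisting language; instances containing highly divergent entities would be filtered out. The tag counts and add-one smoothing here are hypothetical, not the paper's exact estimator.

        import math
        from collections import Counter

        TAGSET = ["PER", "LOC", "ORG", "MISC"]

        def tag_distribution(counts, alpha=1.0):
            """Add-alpha smoothed distribution over the fixed tagset."""
            total = sum(counts.values()) + alpha * len(TAGSET)
            return {t: (counts.get(t, 0) + alpha) / total for t in TAGSET}

        def symmetric_kl(p, q):
            """KL(p||q) + KL(q||p) over a common support."""
            return sum(p[t] * math.log(p[t] / q[t]) +
                       q[t] * math.log(q[t] / p[t]) for t in p)

        # Hypothetical tag counts for the same entity in two languages.
        primary = Counter({"LOC": 40, "ORG": 2})
        assisting = Counter({"ORG": 30, "LOC": 5})
        d = symmetric_kl(tag_distribution(primary), tag_distribution(assisting))
        print(f"symmetric KL = {d:.3f}")  # large value -> filter the instance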

  • Leveraging Orthographic Similarity for Multilingual Neural Transliteration

    Transactions of the Association for Computational Linguistics (TACL)

    We address the task of joint training of transliteration models for multiple language pairs (multilingual transliteration). This is an instance of multitask learning, where individual tasks (language pairs) benefit from sharing knowledge with related tasks. We focus on transliteration involving related tasks, i.e., languages sharing writing systems and phonetic properties (orthographically similar languages). We propose a modified neural encoder-decoder model that maximizes parameter sharing across language pairs in order to effectively leverage orthographic similarity. We show that multilingual transliteration significantly outperforms bilingual transliteration in different scenarios (an average increase of 58% across the variety of languages we experimented with). We also show that multilingual transliteration models can generalize well to languages/language pairs not encountered during training and hence perform well on the zero-shot transliteration task. We show that further improvements can be achieved by using phonetic feature input.

  • The IIT Bombay English-Hindi Parallel Corpus

    Language Resources and Evaluation Conference

    We present the IIT Bombay English-Hindi Parallel Corpus. The corpus is a compilation of parallel corpora previously available in the public domain as well as new parallel corpora we collected. The corpus contains 1.49 million parallel segments, of which 694k segments were not previously available in the public domain. The corpus has been pre-processed for machine translation, and we report baseline phrase-based SMT and NMT translation results on this corpus. This corpus has been used in two editions of shared tasks at the Workshop on Asian Language Translation (2016 and 2017). The corpus is freely available for non-commercial research. To the best of our knowledge, this is the largest publicly available English-Hindi parallel corpus.

  • Utilizing Lexical Similarity between Related, Low-resource Languages for Pivot-based SMT

    International Joint Conference on Natural Language Processing

    We investigate pivot-based translation between related languages in a low-resource, phrase-based SMT setting. We show that a subword-level pivot-based SMT model using a related pivot language is substantially better than word and morpheme-level pivot models. It is also highly competitive with the best direct translation model, which is encouraging as no direct source-target training corpus is used. We also show that combining multiple related language pivot models can rival a direct translation model. Thus, the use of subwords as translation units coupled with multiple related pivot languages can compensate for the lack of a direct parallel corpus.
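
    A common way to realize such a pivot model is phrase-table triangulation: marginalizing over pivot phrases shared by the source-pivot and pivot-target tables, i.e., p(t|s) = sum_p p(t|p) * p(p|s). The sketch below uses toy dictionaries as stand-ins for phrase tables; it illustrates the idea, not the Moses pipeline used in the paper.

        from collections import defaultdict

        def triangulate(src_pvt, pvt_tgt):
            """Combine two phrase tables through a shared pivot language.

            src_pvt: {src_phrase: {pivot_phrase: p(pivot|src)}}
            pvt_tgt: {pivot_phrase: {tgt_phrase: p(tgt|pivot)}}
            """
            src_tgt = defaultdict(dict)
            for s, pivots in src_pvt.items():
                for p, p_ps in pivots.items():
                    for t, p_tp in pvt_tgt.get(p, {}).items():
                        src_tgt[s][t] = src_tgt[s].get(t, 0.0) + p_tp * p_ps
            return src_tgt

        # Hypothetical subword-level entries:
        src_pvt = {"ghar": {"ghar": 0.9, "vaas": 0.1}}
        pvt_tgt = {"ghar": {"home": 0.6, "house": 0.4}, "vaas": {"stay": 1.0}}
        print(triangulate(src_pvt, pvt_tgt)["ghar"])
        # {'home': 0.54, 'house': 0.36, 'stay': 0.1} (up to float rounding)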

  • Learning variable length units for SMT between related languages via Byte Pair Encoding

    Workshop on Subword and Character Level Models in NLP (SCLeM 2017, co-located with EMNLP 2017)

    We explore the use of segments learnt using Byte Pair Encoding (referred to as BPE units) as basic units for statistical machine translation between related languages and compare them with orthographic syllables, which are currently the best performing basic units for this translation task. BPE identifies the most frequent character sequences as basic units, while orthographic syllables are linguistically motivated pseudo-syllables. We show that BPE units modestly outperform orthographic syllables as units of translation, with up to an 11% increase in BLEU score. While orthographic syllables can be used only for languages whose writing systems use vowel representations, BPE is writing-system independent, and we show that BPE outperforms other units for non-vowel writing systems too. Our results are supported by extensive experimentation spanning multiple language families and writing systems.
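
    For reference, the core of BPE learning (following Sennrich et al.'s published algorithm, in simplified form) repeatedly merges the most frequent adjacent symbol pair in a frequency-weighted vocabulary:

        import re
        from collections import Counter

        def pair_stats(vocab):
            """Count adjacent symbol pairs, weighted by word frequency."""
            pairs = Counter()
            for word, freq in vocab.items():
                symbols = word.split()
                for a, b in zip(symbols, symbols[1:]):
                    pairs[(a, b)] += freq
            return pairs

        def merge_pair(pair, vocab):
            """Replace every occurrence of the pair with its concatenation."""
            pat = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
            return {pat.sub("".join(pair), w): f for w, f in vocab.items()}

        # Words as space-separated characters with an end-of-word marker.
        vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6}
        for _ in range(5):
            stats = pair_stats(vocab)
            if not stats:
                break
            best = max(stats, key=stats.get)
            vocab = merge_pair(best, vocab)
            print(best)  # each merged pair becomes a new BPE unit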

  • Orthographic Syllable as basic unit for SMT between Related Languages

    Conference on Empirical Methods in Natural Language Processing (EMNLP)

    We explore the use of the orthographic syllable, a variable-length consonant-vowel sequence, as a basic unit of translation between related languages which use abugida or alphabetic scripts. We show that orthographic syllable level translation significantly outperforms models trained over other basic units (word, morpheme and character) when training over small parallel corpora.
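
    As a rough illustration, an orthographic syllable can be approximated for alphabetic scripts as a maximal run of consonants followed by a vowel (C*V), with any trailing consonants forming the last unit. The regex below is a hedged Latin-script sketch; for abugida scripts the segmentation is driven by vowel signs (matras) instead.

        import re

        def orthographic_syllables(word, vowels="aeiou"):
            """Split a word into C*V units; trailing consonants stay together."""
            pattern = re.compile(rf"[^{vowels}]*[{vowels}]|[^{vowels}]+$")
            return pattern.findall(word)

        print(orthographic_syllables("translation"))
        # ['tra', 'nsla', 'ti', 'o', 'n']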

  • Substring-based unsupervised transliteration with phonetic and contextual knowledge

    SIGNLL Conference on Computational Natural Language Learning (CoNLL)

    We propose an unsupervised approach for substring-based transliteration which incorporates two new sources of knowledge in the learning process: (i) context by learning substring mappings, as opposed to single character mappings, and (ii) phonetic features which capture cross-lingual character similarity via prior distributions.

    Our approach is a two-stage, iterative bootstrapping solution, which vastly outperforms Ravi & Knight's (2009) state-of-the-art unsupervised transliteration method and outperforms a rule-based baseline by up to 50% for top-1 accuracy on multiple language pairs. We show that substring-based models are superior to character-based models, and observe that their top-10 accuracy is comparable to the top-1 accuracy of supervised systems.

    Our method only requires a phonemic representation of the words. This is possible for many language-script combinations which have a high grapheme-to-phoneme correspondence, e.g., scripts of Indian languages derived from the Brahmi script. Hence, Indian languages were the focus of our experiments. For other languages, a grapheme-to-phoneme converter would be required.
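
    As a hedged sketch of how phonetic features can induce a prior over character mappings (the feature sets and Jaccard scoring below are illustrative, not the paper's exact prior):

        FEATURES = {  # hypothetical phonetic feature sets per (lang, char)
            ("hi", "क"): {"velar", "plosive", "unvoiced"},
            ("ta", "க"): {"velar", "plosive"},
            ("ta", "ச"): {"palatal", "plosive"},
        }

        def phonetic_prior(src, tgt_chars, src_lang="hi", tgt_lang="ta"):
            """Prior P(t|s) proportional to Jaccard overlap of feature sets."""
            s = FEATURES[(src_lang, src)]
            scores = {t: len(s & FEATURES[(tgt_lang, t)]) /
                         len(s | FEATURES[(tgt_lang, t)]) for t in tgt_chars}
            total = sum(scores.values()) or 1.0
            return {t: v / total for t, v in scores.items()}

        print(phonetic_prior("क", ["க", "ச"]))
        # க (velar) gets a much higher prior than ச (palatal)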

  • Brahmi-Net: A transliteration and script conversion system for languages of the Indian subcontinent

    North American Chapter of the Association for Computational Linguistics - Human Language Technologies: System Demonstrations (NAACL)

  • The IIT Bombay SMT System for ICON 2014 Tools Contest

    Proceedings of ICON, 2014

  • Shata-Anuvadak: Tackling Multiway Translation of Indian Languages

    Language Resources and Evaluation Conference (LREC)

  • Tuning a Grammar Correction System for Increased Precision

    SIGNLL Conference on Computational Natural Language Learning (CoNLL)

  • IITB System for CoNLL 2013 Shared Task: A Hybrid Approach to Grammatical Error Correction

    SIGNLL Conference on Computational Natural Language Learning (CoNLL)

  • TransDoop: A Map-Reduce based Crowdsourced Translation for Complex Domain

    Association for Computational Linguistics: System Demonstrations

  • A System for Compound Noun Multiword Expression Extraction for Hindi

    6th International Conference on Natural Language Processing (ICON 2008)

    Identifying compound noun multiword expressions is important for applications like machine translation and information retrieval. We describe a system for extracting Hindi compound noun multiword expressions (MWE) from a given corpus. We identify major categories of compound noun MWEs, based on linguistic and psycholinguistic principles. Our extraction methods use various statistical co-occurrence measures to exploit the statistical idiosyncrasy of MWEs. We make use of various lexical cues from the corpus to enhance our methods. We also address the extraction of reduplicative expressions using lexical, semantic and phonetic knowledge. We have also built an evaluation resource of compound noun MWEs for Hindi. Our methods give a recall of 80% and precision of 23% at rank 1000.

    Other authors: Prof. Om Damani

Projects

  • BrahmiNet

    Brahmi-Net is an online system for transliteration and script conversion for all major Indian language pairs (306 pairs). The system covers 13 Indo-Aryan languages, 4 Dravidian languages and English.
    Languages supported include:

    - Indo-Aryan languages: Hindi, Urdu, Bengali, Gujarati, Punjabi, Marathi, Konkani, Assamese, Odia, Sindhi, Sinhala, Nepali, Sanskrit
    - Dravidian languages: Tamil, Telugu, Malayalam, Kannada
    - English
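
    Rule-based conversion between Brahmi-derived scripts is possible because their Unicode blocks are laid out in parallel; a character can often be converted by preserving its offset within the block. A minimal sketch of this core idea (not Brahmi-Net's full statistical transliteration back-end, which also handles script-specific exceptions):

        # Start of the Unicode block for a few Brahmi-derived scripts.
        BLOCK_START = {"hi": 0x0900, "bn": 0x0980, "ta": 0x0B80,
                       "te": 0x0C00, "kn": 0x0C80, "ml": 0x0D00}

        def convert_script(text, src, tgt):
            """Offset-preserving conversion between Indic Unicode blocks."""
            out = []
            for ch in text:
                offset = ord(ch) - BLOCK_START[src]
                out.append(chr(BLOCK_START[tgt] + offset)
                           if 0 <= offset < 0x80 else ch)
            return "".join(out)

        print(convert_script("नमस्ते", "hi", "ml"))  # Devanagari -> Malayalam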

  • Śata-Anuvādak: Indic Translator

    Śata-Anuvādak (100 Translators) is a broad-coverage Statistical Machine Translation system for Indian languages. It is a Phrase-Based MT system with pre-processing and post-processing extensions. The pre-processing includes source-side reordering for English to Indian language translation. The post-processing includes transliteration between Indian languages for OOV words. It currently supports translation between 11 Indian languages:

    - Indo-Aryan languages: Hindi, Urdu, Bengali, Gujarati, Punjabi, Marathi, Konkani
    - Dravidian languages: Tamil, Telugu, Malayalam
    - English

    Other creators: Abhijit Mishra, Rajen Chatterjee, Ritesh Shah, and Prof. Pushpak Bhattacharyya
  • Indic NLP Library

    The goal of this project is to build Python-based libraries for common text processing and Natural Language Processing in Indian languages. Indian languages share a lot of similarity in terms of script, phonology, syntax, etc., and this library is an attempt to provide a general solution to the most commonly required tools for Indian language text.

    The library provides the following functionalities:
    - Text Normalization
    - Indic Script Conversion
    - Romanization of Indic Scripts (ITRANS) and vice-versa
    - Indian Language Transliteration
    - Tokenization
    - Word Segmentation
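
    A minimal usage sketch (module and function names follow the library's documented API, but may differ across versions; some components additionally need the separate Indic NLP resources package):

        from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
        from indicnlp.tokenize import indic_tokenize
        from indicnlp.transliterate.unicode_transliterate import (
            UnicodeIndicTransliterator,
        )

        text = "यह एक वाक्य है ।"  # a Hindi sentence

        # Normalization: canonicalize equivalent Unicode representations.
        normalizer = IndicNormalizerFactory().get_normalizer("hi")
        normalized = normalizer.normalize(text)

        # Script conversion: Devanagari to the Tamil script.
        print(UnicodeIndicTransliterator.transliterate(normalized, "hi", "ta"))

        # Tokenization: split on Indic punctuation and whitespace.
        print(indic_tokenize.trivial_tokenize(normalized, lang="hi"))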

Honors & Awards

  • Outstanding Paper at SCLeM workshop 2017

    Workshop on Subword and Character Level Models in NLP 2017 (co-located with EMNLP)

    Our paper, "Learning variable length units for SMT between related languages via Byte Pair Encoding", co-authored with Prof. Pushpak Bhattacharyya, received an Outstanding Paper award at the 1st Workshop on Subword and Character Level Models in NLP (SCLeM 2017), co-located with EMNLP 2017. The workshop was held on 7 September 2017.

    Here is the paper:
    https://arxiv.org/abs/1610.06510

  • Best Thesis Talk at Research and Innovation Symposium in Computing

    Department of Computer Science and Engineering, IIT Bombay

    This talk was given at the Department of Computer Science and Engineering, IIT Bombay's annual research symposium. The abstract of the talk is given below:

    Related languages are those that exhibit lexical and structural similarities on account of sharing a common ancestry or being in contact for a long period of time. Machine Translation between related languages is a major requirement since there is substantial government, commercial and cultural communication among people speaking related languages. However, most of these languages have few parallel corpora resources, an important requirement for building good quality statistical machine translation (SMT) systems.

    A key property of related languages is lexical similarity, which means the languages share many words with similar form (spelling/pronunciation) and meaning. These words could be cognates, lateral borrowings, or loan words from other languages. Modelling lexical similarity among related languages is the key to building good-quality SMT systems with limited parallel corpora. We propose the use of two subword units of translation for modelling lexical similarity: (i) orthographic syllables, motivated by the design of Indic scripts, and (ii) byte pair encoded units, inspired by compression theory. We show that the proposed units significantly outperform other units of representation (word, morpheme and character) over multiple language pairs spanning different language families, with varying degrees of lexical similarity, and are robust to domain changes too.

  • Invited Talk at Inter-Research-Institute Student Seminar in Computer Science (ACM India Annual Meet)

    ACM India

    Title of talk: Orthographic Syllable as basic unit for SMT between Related Languages

    Abstract:
    We explore the use of the orthographic syllable, a variable-length consonant-vowel sequence, as a basic unit of translation between related languages which use abugida or alphabetic scripts. We show that orthographic syllable level translation significantly outperforms models trained over other basic units (word, morpheme and character) when training over small parallel corpora.

  • Invited Tutorial on Statistical Machine Translation between related languages

    North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT)

    With Pushpak Bhattacharyya and Mitesh Khapra

    Abstract:

    Language-independent Statistical Machine Translation (SMT) has proven to be very challenging. The diversity of languages makes high accuracy difficult and requires substantial parallel corpora as well as linguistic resources (parsers, morph analyzers, etc.). An interesting observation is that a large chunk of machine translation (MT) requirements involve related languages. They are either: (i) between related languages, or (ii) between a lingua franca (like English) and a set of related languages. For instance, India, the European Union and South-East Asia have such translation requirements due to government, business and socio-cultural communication needs.

    Related languages share a lot of linguistic features, and the divergences among them are at a lower level of the NLP pipeline. The objective of the tutorial is to discuss how the relatedness among languages can be leveraged to bridge this language divergence, thereby achieving some or all of these goals: (i) improving translation quality, (ii) achieving better generalization, (iii) sharing linguistic resources, and (iv) reducing resource requirements.

    We will look at the existing research in SMT from the perspective of related languages, with the goal of building a toolbox of methods that are useful for translation between related languages. This tutorial will be relevant to Machine Translation researchers and developers, especially those interested in translation between low-resource languages which have resource-rich related languages. It will also be relevant for researchers interested in multilingual computation.

Languages

  • English

    Full professional proficiency

  • Hindi

    Native or bilingual proficiency

  • Marathi

    Native or bilingual proficiency

  • Malayalam

    Native or bilingual proficiency
