The main aim of my research is to provide a fundamental understanding of how the principles of compositionality and structural generalization enable solving different types of reasoning tasks, and how they can be used to build data- and resource-efficient AI systems. For more than a decade, I have worked on formulating language generation tasks that require compositional and structural generalization, and on developing novel models that enhance the capability of state-of-the-art systems across various inference tasks.
I am grateful that my research is supported by Microsoft Research and Google DeepMind.

Research directions

My work has three main directions that I find promising for enhancing the generalization capability of generative language models. Representative papers from each direction are described below; for a full list of publications, please refer to my Google Scholar profile.

Modeling

The primary direction of my research is the design and development of novel machine learning algorithms that improve the generalization capability of generative language models. My recent work in this direction includes statistical tokenization algorithms that enable more efficient vocabulary construction in language models, as well as novel hybrid character-based open-vocabulary language models for compositional structure learning. I have also designed a novel decoding algorithm based on Bayesian variational inference, which achieves compositional generalization to unseen word forms by generating language from a combinatorial space of discrete morphosyntactic features. In parallel, I have been working on multi-modal vision-and-language models that better align semantic information across modalities for efficient foundation model development.

Analysis

Another research direction I find highly promising is the development of novel methodologies for analyzing state-of-the-art language models and investigating the factors that support their generalization capability. A simple reasoning task I have used as an evaluation setting is morphological generalization, which requires a model to generate a word form it has never observed before, intuitively by relying on components learned from observations of the word in similar inflected forms. Recently, I also led studies analyzing how multilingual language models represent parameters across language-specific spaces in order to transfer knowledge efficiently to unseen languages. Further findings in this direction could ideally lead to a better understanding of how language models should be designed for optimal parameter sharing when solving AI tasks that require compositional generalization.

Evaluation Benchmarks and Measures

With the increasingly rapid advancement of large language models and the contamination of commonly used public evaluation benchmarks, which leads to unreliable comparisons between models, it has become crucial to create and maintain new data sets that present challenging inference tasks for evaluating state-of-the-art models from various aspects. I have therefore been actively working on developing new benchmarks for the thorough and comprehensive assessment of generative language models. Another problem I address in my research is the limited applicability of evaluation metrics for generated language, especially in languages with distinct grammatical properties. My students and I have been analyzing conventionally used evaluation metrics for language generation and developing new metrics and measures that are more reliable and applicable across languages and settings.

Modeling

Tokenization and its impact on compositional generalization

In my pioneering study on the effect of tokenization on the ability of neural machine translation models to generalize to rare and unseen words, I showed the significance of the choice of segmentation method and proposed an unsupervised, controllable segmentation algorithm for linguistically aligned vocabulary reduction.

Ataman, D., Negri, M., Turchi, M., & Federico, M. (2017). Linguistically Motivated Vocabulary Reduction for Neural Machine Translation from Turkish to English. The Prague Bulletin of Mathematical Linguistics, 108(1), 331-342.

Ataman, D., & Federico, M. (2018). An Evaluation of Two Vocabulary Reduction Methods for Neural Machine Translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Vol. 1: MT Researchers' Track), p. 97.

Ataman, D., & Federico, M. (2018, July). Compositional Representation of Morphologically-Rich Input for Neural Machine Translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 305-311).
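
To make the role of segmentation concrete, below is a toy Python illustration of a plain frequency-driven, BPE-style merge procedure (not the LMVR algorithm from the papers above), showing how statistically learned subwords can split a rare inflected word into units that need not align with morpheme boundaries. The miniature corpus and the Turkish word evlerde ("in the houses", ev+ler+de) are chosen purely for exposition.

```python
from collections import Counter

# Toy BPE-style merge learning: repeatedly merge the most frequent adjacent
# symbol pair. Purely frequency-driven, with no notion of morphemes.
def learn_merges(words, n_merges):
    # Represent each word as a tuple of symbols, weighted by corpus frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every word in the vocabulary.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1]); i += 2
                else:
                    merged.append(word[i]); i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_merges(["evler", "evlerde", "evde", "kitaplar"], 6)
print(merges)       # order in which symbol pairs are merged
print(list(vocab))  # resulting segmentations of each word
```

On this toy corpus the greedy merges can cross the ev|ler morpheme boundary (producing units like "evl"), which is exactly the kind of misalignment a linguistically motivated vocabulary reduction method is designed to avoid.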

Compositional character-level generative models

While explicitly learning sub-lexical units might aid combinatorial generalization to new word forms, practical language models tend to have huge vocabularies, which makes surface-level segmentation that preserves the original semantic information after tokenization non-trivial. I therefore designed and developed hybrid word/character-level encoder and decoder models that integrate the learning of compositional structure into the distributed representation space and achieve truly open-vocabulary generation. To reinforce the model in sharing and reusing the same set of units in the latent space, I further implemented a hybrid continuous/discrete variational auto-encoder within the sequence model, which casts word generation as a combinatorial problem: a set of features representing the semantic and syntactic properties of each word in a given sentence is sampled from a multinomial distribution. As the first practical compositionally generalizable generative model, the latent morphology model received a Spotlight Award at the International Conference on Learning Representations.

Ataman, D., Aziz, W., & Birch, A. (2020). A Latent Morphology Model for Open-Vocabulary Neural Machine Translation. In International Conference on Learning Representations.
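
For illustration, here is a minimal PyTorch sketch of the general shape of such a decoder: a continuous lemma variable plus a few discrete feature variables (relaxed with Gumbel-softmax for differentiable sampling) jointly condition a character-level generator. The architecture, dimensions, and feature inventory are simplified assumptions for exposition, not the exact model from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentMorphologyDecoder(nn.Module):
    """Sketch: generate each word from a continuous lemma vector and a small
    set of discrete 'inflectional feature' variables (relaxed samples)."""
    def __init__(self, ctx_dim=256, lemma_dim=128, n_features=4, n_classes=6,
                 vocab_size=40, char_emb=64, hidden=256):
        super().__init__()
        # Continuous lemma variable (reparameterized Gaussian).
        self.to_mu = nn.Linear(ctx_dim, lemma_dim)
        self.to_logvar = nn.Linear(ctx_dim, lemma_dim)
        # Discrete feature variables (Gumbel-softmax relaxation).
        self.to_feature_logits = nn.Linear(ctx_dim, n_features * n_classes)
        self.n_features, self.n_classes = n_features, n_classes
        # Character-level generator conditioned on [lemma; features].
        self.char_emb = nn.Embedding(vocab_size, char_emb)
        self.rnn = nn.GRU(char_emb, hidden, batch_first=True)
        self.init_h = nn.Linear(lemma_dim + n_features * n_classes, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, ctx, char_ids, tau=1.0):
        # ctx: (B, ctx_dim) word-level context from the sentence decoder.
        mu, logvar = self.to_mu(ctx), self.to_logvar(ctx)
        lemma = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        logits = self.to_feature_logits(ctx).view(-1, self.n_features, self.n_classes)
        feats = F.gumbel_softmax(logits, tau=tau, hard=False)   # (B, F, C)
        h0 = torch.tanh(self.init_h(torch.cat([lemma, feats.flatten(1)], dim=-1)))
        emb = self.char_emb(char_ids)                           # (B, T, E)
        out, _ = self.rnn(emb, h0.unsqueeze(0))
        return self.out(out)                                    # per-character logits

ctx = torch.randn(2, 256)
char_ids = torch.randint(0, 40, (2, 7))
print(LatentMorphologyDecoder()(ctx, char_ids).shape)  # torch.Size([2, 7, 40])
```

Because the feature variables are shared across all words, the model is encouraged to reuse the same discrete units when generating related inflected forms, which is the intuition behind its compositional generalization.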

Analysis

Our detailed analysis of multi-modal language models implementing different gated attention methods assesses the conditions that support joint learning of visual and textual features. It shows that for optimal multi-modal learning the information across modalities should be complementary: models tend to attend to visual information only when the text features are insufficient for solving the downstream task.
Paper

Li, J., Ataman, D., & Sennrich, R. (2021, November). Vision Matters When It Should: Sanity Checking Multimodal Machine Translation Models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 8556-8562).
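
As a concrete reference point for the kind of mechanism analyzed here, below is a minimal sketch of gated visual-textual fusion. This is an assumed, generic variant (module names and sizes are illustrative), not the exact models examined in the paper.

```python
import torch
import torch.nn as nn

class GatedVisualFusion(nn.Module):
    """Sketch: a learned gate decides how much visual context to mix into each
    text representation, so the model can fall back to text alone when the
    image is uninformative."""
    def __init__(self, d_model=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, text, image):
        # text: (B, T, d) token states; image: (B, R, d) regional visual features.
        visual_ctx, _ = self.attn(query=text, key=image, value=image)
        g = torch.sigmoid(self.gate(torch.cat([text, visual_ctx], dim=-1)))
        return text + g * visual_ctx  # g -> 0 recovers the text-only model

fused = GatedVisualFusion()(torch.randn(2, 10, 512), torch.randn(2, 36, 512))
print(fused.shape)  # torch.Size([2, 10, 512])
```

Sanity checks of the kind in the paper probe whether the gate actually opens when visual information is needed, rather than staying near zero regardless of the input.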

A recent detailed analysis of the morphological properties of words generated by Transformer-based translation models confirms that models tend to generate sentences by reconstructing frequently observed word forms, and generally lack the capability to generate complex or long words, such as those often found in morphologically rich synthetic or agglutinative languages.
Paper

Oncevay, A., Ataman, D., Van Berkel, N., Haddow, B., Birch, A., & Bjerva, J. (2022, July). Quantifying Synthesis and Fusion and their Impact on Machine Translation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1308-1321).

An extensive study conducted in collaboration with various labs in academia and industry presented a detailed analysis of the data sets used to train multilingual language models, showing that they contain harmful, noisy, or inaccurately labeled data, which leads to many common problems such as generation in the wrong target language, particularly for low-resourced languages, as well as harmful content.
Paper

Kreutzer, J., Caswell, I., Wang, L., Wahab, A., van Esch, D., Ulzii-Orshikh, N., ... & Adeyemi, M. (2022). Quality at a glance: An audit of web-crawled multilingual datasets. Transactions of the Association for Computational Linguistics, 10, 50-72.

Metrics and Measures

An important topic I am currently working on with my students at New York University is the development of reliable and efficient measures for quantifying the generalization capability of generative language models. In a recent study (Bassi et al., 2024), we explored generalization measures for zero-shot cross-lingual transfer that rely on computing the sharpness of loss minima (sketched below). In another study, my students Wang and Chen performed an extensive evaluation of automatic metrics for reference-based evaluation of generated language, testing how well they generalize to rephrasing in 73 languages. Our findings showed that although all existing metrics fail to reliably detect paraphrasing in the output sentence, metrics based on pre-trained language models are the most promising for semantic comparison. We are continuing to study how better metrics for language generation should be developed.

Bassi, S., Ataman, D., & Cho, K. (2024). Generalization Measures for Zero-Shot Cross-Lingual Transfer. arXiv preprint arXiv:2404.15928.

Wang, Y., Chen, Q., & Ataman, D. (2023, November). Delving into Evaluation Metrics for Generation: A Thorough Assessment of How Metrics Generalize to Rephrasing Across Languages. In Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems (pp. 23-31).
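
For illustration, here is a minimal sketch of a sharpness-style measure in the spirit of the first study: perturb the parameters within a small ball and record the worst observed increase in loss. This is an assumed simplification using random perturbations, not the exact estimator from the paper; model, loss_fn, and batch are placeholders.

```python
import torch

def sharpness(model, loss_fn, batch, radius=1e-3, n_samples=10):
    """Estimate sharpness as the largest loss increase observed when the
    parameters are randomly perturbed within a ball of the given radius."""
    base = loss_fn(model, batch).item()
    params = [p for p in model.parameters() if p.requires_grad]
    originals = [p.detach().clone() for p in params]
    worst = 0.0
    for _ in range(n_samples):
        with torch.no_grad():
            # Perturb each parameter tensor with unit-norm noise scaled by radius.
            for p, o in zip(params, originals):
                noise = torch.randn_like(p)
                p.copy_(o + radius * noise / (noise.norm() + 1e-12))
        worst = max(worst, loss_fn(model, batch).item() - base)
    with torch.no_grad():
        # Restore the original parameters.
        for p, o in zip(params, originals):
            p.copy_(o)
    return worst  # larger values indicate a sharper minimum
```

Intuitively, a flatter minimum (small sharpness) means the loss is robust to small parameter changes, which is the property such measures correlate with transfer performance.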

Data and resources

  • UniMorph 4.0: Universal Morphology

    UniMorph is a large-scale annotation project documenting the morphological properties of languages across the world. It has also been extended into a shared task for evaluating generative models on morphological reinflection generalization across languages.
    Paper

  • A Large-Scale Study of Machine Translation in the Turkic Languages

    This is the first study to test the applicability of neural machine translation systems across low-resourced languages from the Turkic language family. It also presents new parallel data sets for machine translation in 20 Turkic languages.
    Paper

  • Bianet: A Parallel News Corpus in Turkish, Kurdish and English

    The Bianet corpus is the first multi-way translation corpus spanning news translations in Turkish, Kurdish, and English. It allows evaluating multiple aspects of knowledge transfer in multilingual translation models, such as the effects of geographic proximity and a common lexicon.

    Paper

Shared tasks

  • SIGMORPHON 2021 Shared Task on Morphological Reinflection: Generalization Across Languages

    https://researchportal.helsinki.fi/files/172327270/shared_task.pdf

  • MRL 2022 Shared Task on Multilingual Clause-level Morphology

  • MRL 2023 Shared Task on Multi-lingual Multi-task Information Retrieval