Kata.ai
Publication year: 2020

Benchmarking Multidomain English-Indonesian Machine Translation

Written by:
Tri Wahyu Guntara, Alham Fikri Aji, Radityo Eko Prasojo

Abstract

In the context of Machine Translation (MT) to and from English, Bahasa Indonesia has been considered a low-resource language, and therefore applying Neural Machine Translation (NMT), which typically requires large training datasets, proves to be problematic. In this paper, we show otherwise by collecting large, publicly available datasets from the Web, which we split into several domains: news, religion, general, and conversation, to train and benchmark several variants of transformer-based NMT models across the domains. We show using BLEU that our models perform well across all four domains, outperform the baseline Statistical Machine Translation (SMT) models, and perform comparably with Google Translate. Our datasets (with the standard split for training, validation, and testing), code, and models are available at https://github.com/gunnxx/indonesian-mt-data.
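The abstract's comparison rests on corpus-level BLEU. As a rough illustration of what that metric computes, here is a minimal standard-library sketch of corpus BLEU (single reference per hypothesis, uniform 4-gram weights, whitespace tokenization); the paper's actual evaluation setup, including its tokenizer and BLEU implementation, may differ, so treat this only as a conceptual sketch:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU: clipped n-gram precisions (n=1..max_n),
    geometric mean, times a brevity penalty. Scores range 0-100."""
    match = [0] * max_n   # clipped n-gram matches per order
    total = [0] * max_n   # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hyp_counts = Counter(ngrams(h, n))
            ref_counts = Counter(ngrams(r, n))
            # Clip each hypothesis n-gram count by its count in the reference.
            match[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total[n - 1] += max(len(h) - n + 1, 0)
    if min(match) == 0:
        return 0.0  # any zero precision zeroes the geometric mean
    log_precision = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    # Brevity penalty discourages overly short hypotheses.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return 100 * bp * math.exp(log_precision)
```

An exact match scores 100, e.g. `corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"])` returns `100.0`, while partially overlapping translations score lower.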


