Kata.ai
Publication year: 2021

IndoCollex: A Testbed for Morphological Transformation of Indonesian Word Colloquialism

Written by:
Haryo Akbarianto Wibowo, Made Nindyatama Nityasya, Afra Feyza Akyürek, Suci Fitriany, Alham Fikri Aji, Radityo Eko Prasojo, Derry Tanti Wijaya

Abstract

The Indonesian language is heavily riddled with colloquialism, whether in written or spoken form. In this paper, we identify a class of Indonesian colloquial words that have undergone morphological transformations from their standard forms, categorize their word formations, and propose a benchmark dataset of Indonesian Colloquial Lexicons (IndoCollex) consisting of informal words on Twitter expertly annotated with their standard forms and their word formation types/tags. We evaluate several models for character-level transduction to perform morphological word normalization on this testbed to understand their failure cases and provide baselines for future work. As IndoCollex catalogues word formation phenomena that are also present in the non-standard text of other languages, it can also provide an attractive testbed for methods tailored for cross-lingual word normalization and non-standard word formation.


