Malaysian English News Decoded: A Linguistic Resource
for Named Entity and Relation Extraction
Mohan Raj Chanthran, Lay-Ki Soon, Huey Fang Ong, Bhawani Selvaretnam
arXiv (arXiv: 2402.14521v1)
Generated on April 27, 2025
Abstract
Standard English and Malaysian English exhibit notable differences, posing challenges for natural
language processing (NLP) tasks on Malaysian English. Unfortunately, most of the existing datasets
are mainly based on standard English and therefore inadequate for improving NLP tasks in Malaysian
English. An experiment using state-of-the-art Named Entity Recognition (NER) solutions on Malaysian
English news articles highlights that they cannot handle morphosyntactic variations in Malaysian
English. To the best of our knowledge, there is no annotated dataset available to improve the model.
To address these issues, we constructed a Malaysian English News (MEN) dataset, which contains
200 news articles that are manually annotated with entities and relations. We then fine-tuned the spaCy
NER tool and validated that having a dataset tailor-made for Malaysian English could improve the
performance of NER in Malaysian English significantly. This paper presents our effort in the data
acquisition, annotation methodology, and thorough analysis of the annotated dataset. To validate the
quality of the annotation, inter-annotator agreement was used, followed by adjudication of
disagreements by a subject matter expert. Upon completion of these tasks, we managed to develop a
dataset with 6,061 entities and 3,268 relation instances. Finally, we discuss the spaCy fine-tuning setup
and an analysis of the NER performance. This unique dataset will contribute significantly to the
advancement of NLP research in Malaysian English, allowing researchers to accelerate their progress,
particularly in NER and relation extraction. The dataset and annotation guideline have been published on
GitHub.
Mohan Raj¹, Lay-Ki Soon¹, Ong Huey Fang¹, and Bhawani Selvaretnam²
¹School of Information Technology, Monash University Malaysia, Jalan Lagoon Selatan, 47500 Selangor, Malaysia
²Valiantlytix Sdn Bhd, Lorong Utara C, Pjs 52, 46200 Petaling Jaya, Selangor
¹{mohan.chanthran, soon.layki, ong.hueyfang}@monash.edu
Keywords: Annotated Dataset, Malaysian English, Named Entity Recognition, Relation Extraction,
Low-Resource Language
1. Introduction
1.1. Overview
Relation Extraction (RE) is a natural language processing (NLP) task that involves identifying
relations between a pair of entities mentioned in a text. This task requires