Summary (English)

Pages: 11
Uploaded on: 27-04-2025
Written in: 2024/2025

The vast majority of the popular English named entity recognition (NER) datasets contain American or British English data, despite the existence of many global varieties of English. As such, it is unclear whether they generalize for analyzing use of English globally.

Institution: Freshman / 9th Grade
Grade: English language and composition

Content preview

Malaysian English News Decoded: A Linguistic Resource
for Named Entity and Relation Extraction

Mohan Raj Chanthran, Lay-Ki Soon, Huey Fang Ong, Bhawani Selvaretnam

arXiv:2402.14521v1

Generated on April 27, 2025



Abstract
Standard English and Malaysian English exhibit notable differences, posing challenges for natural
language processing (NLP) tasks on Malaysian English. Unfortunately, most of the existing datasets
are mainly based on standard English and therefore inadequate for improving NLP tasks in Malaysian
English. An experiment using state-of-the-art Named Entity Recognition (NER) solutions on Malaysian
English news articles highlights that they cannot handle morphosyntactic variations in Malaysian
English. To the best of our knowledge, there is no annotated dataset available to improve such models.
To address these issues, we constructed a Malaysian English News (MEN) dataset, which contains
200 news articles that are manually annotated with entities and relations. We then fine-tuned the spaCy
NER tool and validated that having a dataset tailor-made for Malaysian English could improve the
performance of NER in Malaysian English significantly. This paper presents our effort in the data
acquisition, annotation methodology, and thorough analysis of the annotated dataset. To validate the
quality of the annotation, inter-annotator agreement was used, followed by adjudication of
disagreements by a subject matter expert. Upon completion of these tasks, we managed to develop a
dataset with 6,061 entities and 3,268 relation instances. Finally, we discuss the spaCy fine-tuning setup
and analyze the NER performance. This unique dataset will contribute significantly to the
advancement of NLP research in Malaysian English, allowing researchers to accelerate their progress,
particularly in NER and relation extraction. The dataset and annotation guidelines have been published on
GitHub.
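The abstract states that annotation quality was checked via inter-annotator agreement before expert adjudication. As an illustration, here is a minimal pure-Python sketch of Cohen's kappa, one common agreement measure; the helper function and the label sequences are hypothetical, not taken from the MEN dataset:

```python
from collections import Counter

def cohen_kappa(ann_a, ann_b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    n = len(ann_a)
    # observed agreement: fraction of items both annotators labelled identically
    p_o = sum(x == y for x, y in zip(ann_a, ann_b)) / n
    # expected chance agreement, from each annotator's label marginals
    count_a, count_b = Counter(ann_a), Counter(ann_b)
    p_e = sum(count_a[l] * count_b[l] for l in set(ann_a) | set(ann_b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# hypothetical entity labels assigned by two annotators to ten mentions
ann1 = ["PERSON", "ORG", "LOC", "PERSON", "ORG", "LOC", "PERSON", "ORG", "LOC", "PERSON"]
ann2 = ["PERSON", "ORG", "LOC", "PERSON", "LOC", "LOC", "PERSON", "ORG", "ORG", "PERSON"]
print(round(cohen_kappa(ann1, ann2), 3))  # → 0.697
```

A kappa near 0.7 is conventionally read as substantial agreement; the adjudication step described in the abstract would then resolve the remaining disagreements.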

Malaysian English News Decoded: A Linguistic Resource for Named Entity and Relation Extraction

Mohan Raj¹, Lay-Ki Soon¹, Ong Huey Fang¹, and Bhawani Selvaretnam²
¹School of Information Technology, Monash University Malaysia, Jalan Lagoon Selatan, 47500 Selangor, Malaysia
²Valiantlytix Sdn Bhd, Lorong Utara C, Pjs 52, 46200 Petaling Jaya, Selangor
¹{mohan.chanthran, soon.layki, ong.hueyfang}@monash.edu

Keywords: Annotated Dataset, Malaysian English, Named Entity Recognition, Relation Extraction, Low-Resource Language

1. Introduction

1.1. Overview

Relation Extraction (RE) is a natural language processing (NLP) task that involves identifying relations between a pair of entities mentioned in a text. This task requires

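The preview above describes fine-tuning spaCy's NER component on the manually annotated articles. spaCy's training examples pair raw text with character-offset entity spans; the sketch below builds that offset format from (surface string, label) pairs. The sentence, entity labels, and helper function are illustrative assumptions, not drawn from the MEN dataset:

```python
def to_spacy_example(text, entities):
    """Turn (surface_string, label) pairs into spaCy's offset format:
    (text, {"entities": [(start_char, end_char, label), ...]})."""
    spans = []
    for surface, label in entities:
        start = text.find(surface)
        if start == -1:
            continue  # skip mentions that do not occur verbatim in the text
        spans.append((start, start + len(surface), label))
    return (text, {"entities": spans})

# hypothetical Malaysian English news snippet with two entity mentions
example = to_spacy_example(
    "Tabung Haji announced new initiatives in Kuala Lumpur.",
    [("Tabung Haji", "ORGANIZATION"), ("Kuala Lumpur", "LOCATION")],
)
print(example)
```

In an actual fine-tuning run these tuples would be converted to `spacy.training.Example` objects and serialized to a `DocBin`; that step is omitted here to keep the sketch dependency-free.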