Summary

Summary english essay

Published on

27-04-2025

Written in

2024/2025

Institution

Freshman / 9th Grade

Course

English language and composition

Content preview

Do "English" Named Entity Recognizers Work Well on
Global Englishes?

Alexander Shan, John Bauer, Riley Carlson, Christopher Manning

arXiv (arXiv: 2404.13465v1)

Generated on April 27, 2025

Abstract
The vast majority of the popular English named entity recognition (NER) datasets contain American or
British English data, despite the existence of many global varieties of English. As such, it is unclear
whether they generalize for analyzing use of English globally. To test this, we build a newswire dataset,
the Worldwide English NER Dataset, to analyze NER model performance on low-resource English
variants from around the world. We test widely used NER toolkits and transformer models, including
models using the pre-trained contextual models RoBERTa and ELECTRA, on three datasets: a
commonly used British English newswire dataset, CoNLL 2003; a more American-focused dataset,
OntoNotes; and our global dataset. All models trained on the CoNLL or OntoNotes datasets
experienced significant performance drops (over 10 F1 in some cases) when tested on the Worldwide
English dataset. Upon examination of region-specific errors, we observe the greatest performance
drops for Oceania and Africa, while Asia and the Middle East had comparatively strong performance.
Lastly, we find that a combined model trained on the Worldwide dataset and either CoNLL or
OntoNotes lost only 1-2 F1 on both test sets.
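
The F1 figures quoted above are entity-level scores. As a rough illustration only (this is not the authors' evaluation code), the sketch below computes exact-match entity F1 on two sets of predictions and reports the drop in F1 points. The `Span` type, the `entity_f1` helper, and all annotations are hypothetical and made up for the example; they are not taken from CoNLL 2003, OntoNotes, or the Worldwide English NER Dataset.

```python
from typing import Set, Tuple

# Hypothetical span encoding: (sentence index, start token, end token, label)
Span = Tuple[int, int, int, str]


def entity_f1(gold: Set[Span], pred: Set[Span]) -> float:
    """Exact-match entity-level F1: a prediction is correct only if
    both its boundaries and its label match a gold entity."""
    true_pos = len(gold & pred)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up gold annotations for two toy sentences.
gold = {(0, 0, 2, "PER"), (0, 5, 6, "ORG"), (1, 3, 5, "ORG"), (1, 7, 8, "LOC")}

# Made-up predictions: one set matches the gold exactly, the other
# misses an entity and gets one boundary wrong.
pred_a = {(0, 0, 2, "PER"), (0, 5, 6, "ORG"), (1, 3, 5, "ORG"), (1, 7, 8, "LOC")}
pred_b = {(0, 0, 2, "PER"), (1, 3, 4, "ORG"), (1, 7, 8, "LOC")}

f1_a = entity_f1(gold, pred_a)
f1_b = entity_f1(gold, pred_b)
drop_in_points = 100 * (f1_a - f1_b)
print(f"F1 A = {f1_a:.3f}, F1 B = {f1_b:.3f}, drop = {drop_in_points:.1f} points")
```

Exact matching is a deliberate simplification here: a span with the right label but a wrong boundary counts as both a false positive and a false negative, which is how a single misread entity can cost several F1 points on a small test set.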

Do “English” Named Entity Recognizers Work Well on Global Englishes?

Alexander Shan, John Bauer, Riley Carlson and Christopher D. Manning
Department of Computer Science, Stanford University
Stanford, CA 94305-9030, U.S.A.
{azshan, horatio, rileydc, manning}@stanford.edu

Abstract
The vast majority of the popular English named entity recognition (NER) datasets contain American or
British English data, despite the existence of many global varieties of English. As such, it is unclear
whether they generalize for analyzing use of English globally. To test this, we build a newswire dataset,
the Worldwide English NER Dataset, to analyze NER model performance on “low-resource” English
variants from around the world. We test widely used NER toolkits and transformer models, including
RoBERTa and ELECTRA, on three datasets: a commonly used British English newswire dataset,
CoNLL 2003; a more American-focused dataset, OntoNotes; and our global dataset. All models trained
on the CoNLL or OntoNotes datasets experienced significant performance drops (over 10% F1 in
some cases) when tested on the Worldwide English dataset. Upon examination of region-specific
errors, we observe the greatest performance drops for Oceania and Africa, while Asia and the Middle
East had comparatively strong performance. Lastly, we find that a combined model trained on the
Worldwide dataset and either CoNLL or OntoNotes lost only 1-2% F1 on both test sets.

1 Introduction
Most of English Named Entity Recognition (NER) uses American or British English data, with less
attention paid to low-resource English contexts. Multiple problems may occur in low-resource NER
settings; for example, named entities with region-specific meanings can be confused for common
words. Indeed, the Japanese Diet is a governmental body, but NER models focused on US and British
English may incorrectly interpret this entity as a medical term.

Among many NER datasets released in recent years,[1] the most widely used datasets are CoNLL
2003 (Tjong Kim Sang and De Meulder, 2003) and OntoNotes (Weischedel et al., 2013), which focus
on British and American English, with significant European Parliament coverage. Other recently
created NER datasets study the medical domain, such as the n2c2 challenges (Henry et al., 2019),
historical English (Ehrmann et al., 2022), or music recommendation terminology (Epure and Hennequin,
2023), still using American and British English. The lack of regional variety in these datasets suggests
that models trained on these datasets might not accurately recognize entities from more global
contexts. Furthermore, the lack of test data for other regions makes it difficult to even measure this
phenomenon.

In this work, we evaluate the performance of a variety of NER tools, including Flair and SpaCy, on this
dataset. We then retrain two commonly

[1] A collection of NER references is available at
https://github.com/juand-r/entity-recognition-datasets

Document information

Published on: 27 April 2025
Number of pages: 12
Written in: 2024/2025
Type: SUMMARY
