Do "English" Named Entity Recognizers Work Well on
Global Englishes?
Alexander Shan, John Bauer, Riley Carlson, Christopher D. Manning
Department of Computer Science, Stanford University, Stanford, CA 94305-9030, U.S.A.
{azshan, horatio, rileydc, manning}@stanford.edu
arXiv: 2404.13465v1
Abstract
The vast majority of the popular English named entity recognition (NER) datasets contain American or
British English data, despite the existence of many global varieties of English. As such, it is unclear
whether they generalize for analyzing use of English globally. To test this, we build a newswire dataset,
the Worldwide English NER Dataset, to analyze NER model performance on low-resource English
variants from around the world. We test widely used NER toolkits and transformer models, including
models using the pre-trained contextual models RoBERTa and ELECTRA, on three datasets: a
commonly used British English newswire dataset, CoNLL 2003, a more American-focused dataset, OntoNotes, and our global dataset. All models trained on the CoNLL or OntoNotes datasets experienced significant performance drops of over 10 F1 in some cases when tested on the Worldwide
English dataset. Upon examination of region-specific errors, we observe the greatest performance
drops for Oceania and Africa, while Asia and the Middle East had comparatively strong performance.
Lastly, we find that a combined model trained on the Worldwide dataset and either CoNLL or
OntoNotes lost only 1-2 F1 on both test sets.
1 Introduction
Most English Named Entity Recognition (NER) work uses American or British English data, with less attention paid to low-resource English contexts. Multiple problems may occur in low-resource NER settings; for example, named entities with region-specific meanings can be confused for common words. Indeed, the Japanese Diet is a governmental body, but NER models focused on US and British English may incorrectly interpret this entity as a medical term.
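To make this failure mode concrete, a minimal probe (not part of the paper's evaluation) might run off-the-shelf English taggers on such a sentence and inspect their predictions; the model names, the example sentence, and the install steps below are illustrative assumptions, and a well-behaved model would tag the Diet mention as an organization rather than skipping or mislabeling it.

# Illustrative sketch: probe two off-the-shelf English NER taggers on a
# sentence with a region-specific entity ("the Japanese Diet").
# Assumes: pip install spacy flair  and  python -m spacy download en_core_web_sm
import spacy
from flair.data import Sentence
from flair.models import SequenceTagger

text = "The Japanese Diet passed the budget bill on Friday."  # hypothetical example sentence

# spaCy's small English pipeline
nlp = spacy.load("en_core_web_sm")
for ent in nlp(text).ents:
    print("spaCy:", ent.text, ent.label_)

# Flair's default English NER tagger
tagger = SequenceTagger.load("ner")
sentence = Sentence(text)
tagger.predict(sentence)
for span in sentence.get_spans("ner"):
    print("Flair:", span)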
Among the many NER datasets released in recent years (a collection of NER references is available at https://github.com/juand-r/entity-recognition-datasets), the most widely used are CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003) and OntoNotes (Weischedel et al., 2013), which focus on British and American English, with significant European Parliament coverage. Other recently created NER datasets study the medical domain, such as the n2c2 challenges (Henry et al., 2019), historical English (Ehrmann et al., 2022), or music recommendation terminology (Epure and Hennequin, 2023), still using American and British English. The lack of regional variety in these datasets suggests that models trained on them might not accurately recognize entities from more global contexts. Furthermore, the lack of test data for other regions makes it difficult to even measure this phenomenon.

In this work, we evaluate the performance of a variety of NER tools, including Flair and SpaCy, on our new Worldwide English dataset. We then retrain two commonly