PidginUNMT: Unsupervised Neural Machine Translation
from West African Pidgin to English
Kelechi Ogueji, Orevaoghene Ahia
arXiv (arXiv: 1912.03444v1)
Abstract
Over 800 languages are spoken across West Africa. Despite the obvious diversity among people who
speak these languages, one language significantly unifies them all - West African Pidgin English. There
are at least 80 million speakers of West African Pidgin English. However, there is no known natural
language processing (NLP) work on this language. In this work, we perform the first NLP work on the
most popular variant of the language, providing three major contributions. First, the provision of a
Pidgin corpus of over 56000 sentences, which is the largest we know of. Secondly, the training of the
first ever cross-lingual embedding between Pidgin and English. This aligned embedding will be helpful
in the performance of various downstream tasks between English and Pidgin. Thirdly, the training of an
Unsupervised Neural Machine Translation model between Pidgin and English which achieves BLEU
scores of 7.93 from Pidgin to English, and 5.18 from English to Pidgin. In all, this work greatly reduces
the barrier of entry for future NLP works on West African Pidgin English.
arXiv:1912.03444v1 [cs.CL] 7 Dec 2019
Kelechi Ogueji (InstaDeep), Orevaoghene Ahia (InstaDeep)
1 Introduction
A lot of natural language processing (NLP) work has
been done on the major languages in the world. However, little to no work has been done on the over
1800 African languages. The little work that has been done is on major languages such as Afrikaans,
Zulu, Yoruba, Igbo, Hausa and Swahili. Pidgin English is one of the most widely spoken languages
in West Africa, with 75 million speakers estimated in Nigeria as at 2016 and over 5 million speakers
estimated in Ghana. The language originated from the Atlantic slave trade in the late 17th and 18th
centuries, where it was used by British slave merchants to communicate with local African traders.
It then spread across other West African regions because of its use as a trade language among
regions that spoke different languages [1]. Even though different countries have different variants of
Pidgin English, the language is fairly uniform across the continent. The variant of West African Pidgin
English used in this work is Nigerian Pidgin English (hereafter referred to as Pidgin), which has the
highest population of speakers. This research work is the first, that we know of, to tackle a West
African Pidgin English NLP problem. In summary, this paper makes the following main contributions:
•We provide the first Pidgin corpus, containing 56,695 sentences and 32,925 unique words.
•We train cross-lingual word vectors between Pidgin and English, achieving a translation retrieval
accuracy of 0.1282 compared to the random baseline of 0.009.
•We train the first ever machine translation model between Pidgin and English, an unsupervised
neural machine translation model, achieving a BLEU score of 7.93 from Pidgin to English and 5.18
from English to Pidgin.
Preprint. Accepted to NeurIPS 2019 Workshop on Machine Learning for the Developing World (ML4D).
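The translation retrieval accuracy quoted in the second contribution can be illustrated with a small sketch. This section does not say how retrieval is scored, so the sketch assumes a common setup: for each source word, the nearest target word by cosine similarity in the shared embedding space is taken as its translation, and accuracy is the fraction of words for which that choice matches a gold dictionary. The vectors, the tiny dictionary, and the function names below are invented for illustration and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieval_accuracy(src_vecs, tgt_vecs, gold):
    """Fraction of source words whose nearest target word is the gold translation."""
    correct = 0
    for src_word, gold_word in gold.items():
        # Retrieve the target word closest to the source word in the shared space.
        best = max(tgt_vecs, key=lambda w: cosine(src_vecs[src_word], tgt_vecs[w]))
        if best == gold_word:
            correct += 1
    return correct / len(gold)


# Toy, hand-made 2-d vectors (assumption: real aligned embeddings would be
# learned from the corpora and have hundreds of dimensions).
pidgin_vecs = {"wetin": [0.9, 0.1], "pikin": [0.1, 0.9], "chop": [0.9, 0.2]}
english_vecs = {"what": [1.0, 0.0], "child": [0.0, 1.0], "eat": [0.6, 0.8]}
gold = {"wetin": "what", "pikin": "child", "chop": "eat"}

accuracy = retrieval_accuracy(pidgin_vecs, english_vecs, gold)
print(f"translation retrieval accuracy: {accuracy:.4f}")
```

In this toy data the vector for "chop" is deliberately placed closer to "what" than to "eat", so one of the three retrievals misses. A random baseline under the same setup would pick a target word uniformly at random for each source word.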