Non-native English lexicon creation for bilingual speech
synthesis
Arun Baby, Pranav Jawale, Saranya Vinnaitherthan, Sumukh Badam, Nagaraj Adiga, Sharath
Adavanne
arXiv (arXiv: 2106.10870v1)
Generated on April 27, 2025
Abstract
Bilingual English speakers speak English as one of their languages. Their English is non-native, and
their conversations are code-mixed. The intelligibility of a bilingual text-to-speech (TTS) system for
such non-native English speakers depends on a lexicon that captures the phoneme sequences they
actually use. However, because non-native English lexicons are lacking, existing bilingual TTS systems
employ widely available native English lexicons in addition to the native language lexicon. The
resulting inconsistency between the non-native English pronunciation in the audio and the native
English lexicon in the text significantly reduces the intelligibility of the synthesized speech. This
paper is motivated by the knowledge that the speaker's native language strongly influences non-native
English pronunciation. We propose a generic approach to obtain rules, based on letter-to-phoneme
alignment, that map a native English lexicon to its non-native version. The effectiveness of this
mapping is studied by comparing bilingual (Indian English and Hindi) TTS systems trained with and
without the proposed rules. The subjective evaluation shows that the bilingual TTS system trained with
the proposed non-native English lexicon rules obtains a 6% absolute improvement in preference.
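As a rough illustration of the kind of mapping the abstract describes, the following Python sketch applies rewrite rules to a native pronunciation that has already been aligned letter-to-phoneme. The rule tables, the phoneme symbols (ARPAbet-style on the native side, with `T_DENTAL` and `V_W` as placeholder non-native symbols), and the function name `to_non_native` are assumptions made for illustration and are not taken from the paper. The alignment itself is assumed to be given.

```python
from typing import Dict, List, Tuple

# One aligned unit pairs a letter chunk with the native phoneme it produces.
# The alignment is assumed to come from some external letter-to-phoneme aligner.
AlignedWord = List[Tuple[str, str]]

# Hypothetical letter-conditioned rules: (letters, native phoneme) -> non-native phoneme.
LETTER_RULES: Dict[Tuple[str, str], str] = {
    ("w", "W"): "V_W",  # placeholder symbol, illustrative only
}

# Hypothetical context-free rules: native phoneme -> closest non-native phoneme.
PHONE_RULES: Dict[str, str] = {
    "TH": "T_DENTAL",  # placeholder symbol, illustrative only
}


def to_non_native(aligned: AlignedWord) -> List[str]:
    """Map an aligned native pronunciation to a non-native phoneme sequence.

    Letter-conditioned rules take priority over context-free rules;
    phonemes that match no rule are copied unchanged.
    """
    result: List[str] = []
    for letters, phone in aligned:
        if (letters, phone) in LETTER_RULES:
            result.append(LETTER_RULES[(letters, phone)])
        elif phone in PHONE_RULES:
            result.append(PHONE_RULES[phone])
        else:
            result.append(phone)
    return result


# Usage: "think" aligned as th/TH, i/IH, n/NG, k/K.
think = [("th", "TH"), ("i", "IH"), ("n", "NG"), ("k", "K")]
print(to_non_native(think))
```

Keying the first table on the letter chunk as well as the phoneme reflects the idea that spelling influences non-native pronunciation. A phoneme-only table cannot express a substitution that should apply only when a particular spelling is present.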
Zapr Media Labs (Red Brick Lane Marketing Solutions Pvt. Ltd.), India
Index Terms: Bilingual speech synthesis, non-native English, L2 English, lexicon creation, Common phones
1. Introduction
Developing a bilingual text-to-speech (TTS) system
[1] is necessary for countries like India, where the majority of the population speaks more than one
language. Generally, this population speaks their native language as the first and English as their
second language. The pronunciation of English words by a non-native speaker is strongly influenced
by their native language and is most often different from the native English pronunciation [2]. Indian
languages, which have a high grapheme-to-phoneme correlation (phonemic languages), derive
pronunciation directly from the spelling of the word. In contrast, English is an alphabetic and
highly non-phonemic language. Hence, speakers of phonemic native languages, whose pronunciation is
influenced by the spelling of the word, often pronounce English words differently from native English
speakers. This divergence is more pronounced for speakers whose native language has a phoneme
inventory different from that of English. These speakers generally replace the English
phoneme with the closest phoneme in their native language. Given these challenges, building a TTS