Improving Mispronunciation Detection with Wav2vec2-based Momentum Pseudo-Labeling
for Accentedness and Intelligibility Assessment

Mu Yang1, Kevin Hirschi2, Stephen D. Looney3, Okim Kang2, John H. L. Hansen1

1Center for Robust Speech Systems (CRSS), University of Texas at Dallas, USA

2Northern Arizona University, USA

3Pennsylvania State University, USA

Mispronunciation Detection and Diagnosis Audio Samples

Audio samples for our paper: Improving Mispronunciation Detection with Wav2vec2-based Momentum Pseudo-Labeling for Accentedness and Intelligibility Assessment. Code is available here.

We present the mispronunciation detection and diagnosis (MDD) model outputs on the L2-ARCTIC [1] test set and on Indian-accented speech from UTD-4Accents (our open test set).

UTD-4Accents:

For UTD-4Accents, human-labeled phonemes are not available, so we showcase the text transcriptions, canonical phonemes (generated by a G2P model), and model predictions for 6 Indian-accented speakers. We also include the model PERs (phoneme error rates), along with intelligibility and accentedness scores from human raters. Accentedness scores are on a scale of 1-9, where 1 means a heavy accent; intelligibility scores are on a scale of 0-100, where 0 means not intelligible at all. These 6 speakers fall into 3 different clusters in Fig. 2 (b) of our paper, with 2 speakers per cluster.
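
The PER values listed for each speaker can be read as an edit-distance ratio. The following is a minimal sketch, assuming PER is the Levenshtein distance between the canonical and predicted phoneme sequences divided by the canonical length, expressed as a percentage. The function names are hypothetical and the scoring script used for the paper may differ. The per-speaker PERs on this page presumably cover more utterances than the two shown per speaker, so a single-utterance value is not expected to reproduce them.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (unit costs)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference tokens
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(canonical, predicted):
    """PER in percent; both inputs are space-separated phoneme strings.

    Assumption: the canonical sequence is the reference, since no
    human-labeled phonemes exist for UTD-4Accents.
    """
    ref = canonical.split()
    hyp = predicted.split()
    if not ref:
        raise ValueError("canonical sequence must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# One utterance from this page (speaker IN064_F13).
canonical = "w eh r ih z dh ah k l ow s ah s t sh aa p ih ng m ao l"
predicted = "w eh r ih z dh ah k l ow z ah s t sh aa p ih ng m ao l"
print(f"PER = {phoneme_error_rate(canonical, predicted):.2f}")
```

Working on space-separated phoneme tokens rather than characters keeps multi-letter phonemes such as "dh" or "ng" as single units.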

Low intelligibility, heavy accent (cluster (iii))

speaker = IN000_M00, PER = 31.98, Intelligibility Score = 83.23, Accentedness Score = 3.41

Transcription why do people talk during a movie
Canonical w ay d uw p iy p ah l t ao k d uh r ih ng ah m uw v iy
MDD Prediction w ay d uw p iy p ah l d aa d uw r ih ng ah m uw iy

Transcription is my garage door open
Canonical ih z m ay g er aa zh d ao r ow p ah n
MDD Prediction ih z m ay g ae l ah zh d uh r ow b ah n

speaker = IN037_M030, PER = 28.34, Intelligibility Score = 86.53, Accentedness Score = 3.00

Transcription is my garage door open
Canonical ih z m ay g er aa zh d ao r ow p ah n
MDD Prediction iy s m iy g ae r ah zh d ao r ow p ah n

Transcription how's the new job working out it's been so long since we met
Canonical hh aw z dh ah n uw jh aa b w er k ih ng aw t ih t s b ih n s ow l ao ng s ih n s w iy m eh t
MDD Prediction aa w ah z dh ah n y uw jh ah b w er k ih ng aw t ih t z b ih n s ow l ao ng s ih n s v w iy m eh t

High intelligibility, middle-level accent (cluster (iv))

speaker = IN064_F13, PER = 16.29, Intelligibility Score = 95.37, Accentedness Score = 5.49

Transcription search for vegetarian pasta recipes
Canonical s er ch f ao r v eh jh ah t eh r iy ah n p aa s t ah r eh s ah p iy z
MDD Prediction s ah ch ah v eh jh ah t ae eh r iy ah n p aa s t ah r ih s ih p iy z

Transcription where is the closest shopping mall
Canonical w eh r ih z dh ah k l ow s ah s t sh aa p ih ng m ao l
MDD Prediction w eh r ih z dh ah k l ow z ah s t sh aa p ih ng m ao l

speaker = IN094_F39, PER = 15.59, Intelligibility Score = 99.47, Accentedness Score = 5.29

Transcription is my garage door open
Canonical ih z m ay g er aa zh d ao r ow p ah n
MDD Prediction ih z m ay g ah r aa jh d ao r ow p ah n

Transcription hi joe there is no band practice today mr. collins will let us know the new time
Canonical hh ay jh ow dh eh r ih z n ow b ae n d p r ae k t ah s t ah d ey m ih s t er k aa l ih n z w ih l l eh t ah s n ow dh ah n uw t ay m
MDD Prediction hh ay jh ow dh eh r ih z n ow b ae n d p r ae k t ih s t ah d ey m ih s t ah k ao l ih n z w ih l eh t ah s n ow d ah n y uw d ay m

High intelligibility, light accent (cluster (v))

speaker = IN028_F028, PER = 15.45, Intelligibility Score = 98.40, Accentedness Score = 8.41

Transcription is my garage door open
Canonical ih z m ay g er aa zh d ao r ow p ah n
MDD Prediction ih s m ay g r aa zh d ao r ow p ah n

Transcription where is the hottest place on earth
Canonical w eh r ih z dh ah hh aa t ah s t p l ey s aa n er th
MDD Prediction w eh r ih z d ah hh aa t ah s t p l ey s aa n er th

speaker = IN057_M43, PER = 16.61, Intelligibility Score = 99.43, Accentedness Score = 8.22

Transcription is my garage door open
Canonical ih z m ay g er aa zh d ao r ow p ah n
MDD Prediction ih z m ay g er r aa jh d ao r ow p ah n

Transcription search for vegetarian pasta recipes
Canonical s er ch f ao r v eh jh ah t eh r iy ah n p aa s t ah r eh s ah p iy z
MDD Prediction s er ch f ao r v eh jh ah t eh r iy ah n p aa s t ah r eh s ah p iy s


L2-ARCTIC:

For L2-ARCTIC, we showcase the text transcriptions, canonical phonemes, human-labeled phonemes, and MDD model predictions for 2 speakers: TNI (Indian-accented) and TXHC (Chinese-accented).

speaker = TNI

Transcription he had barely entered this when he saw the glow of a fire
Canonical hh iy hh ae d b eh r l iy eh n t er d dh ih s w eh n hh iy s ao dh ah g l ow ah v ah f ay r
Human Perceived hh iy ae d b eh r l iy eh n t er d dh ih s w eh n hh iy s ao dh ah g l ow ah v ah f ay r
MDD Prediction hh iy ae d b eh d l iy eh n t er d ih dh ih s w eh n hh iy s ao dh ah g l ow ah v ah f ay r

Transcription it was beating and waiting in the ambush of those black pits
Canonical ih t w ah z b iy t ih ng ah n d w ey t ih ng ih n dh ah ae m b uh sh ah v dh ow z b l ae k p ih t s
Human Perceived ih t v ah z b iy t ih ng ah n d v ey t ih ng ih n d ah ae m b uh sh ah v d ow z b l ae k b ih t s
MDD Prediction ih t v ah z b iy t ih ng ah n d ey t ih ng ih n d ah ae m b uh sh ah v d ow z b l ae k b ih t s

speaker = TXHC

Transcription he was a head shorter than his companion of almost delicate physique
Canonical hh iy w ah z ah hh eh d sh ao r t er dh ah n hh ih z k ah m p ae n y ah n ah v ao l m ow s t d eh l ah k ah t f ah z iy k
Human Perceived hh iy w ah s ah hh ae d sh ao t er dh ah ng hh ih s k ah m p ae n y ah n ah v ao m ow s t d ae l ah k ah t ah f ah z iy k
MDD Prediction hh iy w ah z ah hh ae t sh ao t er dh ah n hh ih z k ah m p ae n y ah n ah v ao m ow s t ah d eh l ah k ah t f ih z iy k

Transcription hardly were our plans made public before we were met by powerful opposition
Canonical hh aa r d l iy w er aa r p l ae n z m ey d p ah b l ih k b iy f ao r w iy w er m eh t b ay p aw er f ah l aa p ah z ih sh ah n
Human Perceived hh aa r d ah l iy w er aa r p l ae n s m ey d p aa b l ih k b iy f ao r w iy w ah m eh t b ay p aw er f uh aa p ah z ih sh ah n
MDD Prediction hh aa r d l iy w er aw er p l ae n s m ey d p ah b l ih k b iy f ao r w iy w ah m eh t b ay p aw v er f ah l aa p ah z ih sh ah n
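
To read the rows above more easily, the differences between a canonical sequence and a perceived or predicted sequence can be listed automatically. The sketch below is only an illustration and not the evaluation procedure from the paper: it uses Python's standard `difflib.SequenceMatcher`, whose alignment is a heuristic and is not guaranteed to be the minimum-edit-distance alignment. The function name `list_differences` is hypothetical.

```python
from difflib import SequenceMatcher


def list_differences(canonical, observed):
    """Return (tag, canonical_phonemes, observed_phonemes) for each mismatch.

    tag is one of 'replace', 'delete', or 'insert', as reported by
    difflib; inputs are space-separated phoneme strings.
    """
    ref = canonical.split()
    obs = observed.split()
    # autojunk=False disables difflib's popularity heuristic so that
    # frequent phonemes are never treated as junk.
    matcher = SequenceMatcher(a=ref, b=obs, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tag, ref[i1:i2], obs[j1:j2]))
    return diffs


# First TNI utterance: canonical vs. human-perceived phonemes.
canonical = ("hh iy hh ae d b eh r l iy eh n t er d dh ih s w eh n "
             "hh iy s ao dh ah g l ow ah v ah f ay r")
perceived = ("hh iy ae d b eh r l iy eh n t er d dh ih s w eh n "
             "hh iy s ao dh ah g l ow ah v ah f ay r")
for tag, ref_part, obs_part in list_differences(canonical, perceived):
    print(tag, ref_part, obs_part)
```

Running the same function on the canonical and MDD prediction rows gives the model's view of the utterance, which can then be compared with the human-perceived differences.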


References

[1] G. Zhao, S. Sonsaat, A. Silpachai, I. Lucic, E. Chukharev-Hudilainen, J. Levis, and R. Gutierrez-Osuna, “L2-ARCTIC: A Non-native English Speech Corpus,” in Proc. Interspeech 2018, 2018, pp. 2783–2787.