Mispronunciation Detection and Diagnosis Audio Samples
Audio samples for our paper: Improving Mispronunciation Detection with Wav2vec2-based Momentum Pseudo-Labeling for Accentedness and Intelligibility Assessment. Code is available here.
We present the outputs of the mispronunciation detection and diagnosis (MDD) model on the L2-ARCTIC [1] test set and on Indian-accented speech from UTD-4Accents (our open test set).
UTD-4Accents:
For UTD-4Accents, human-labeled phonemes are not available, so we showcase the text transcriptions, the canonical phonemes (generated by a G2P model), and the model predictions for 6 Indian-accent speakers. We also include the model's phoneme error rate (PER) for each speaker, along with intelligibility and accentedness scores from human raters. Accentedness scores are on a scale of 1 to 9, where 1 means a heavy accent; intelligibility scores are on a scale of 0 to 100, where 0 means not intelligible at all. These 6 speakers fall into 3 different clusters in Fig. 2 (b) of our paper, with 2 speakers per cluster.
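This page does not state how the per-speaker PER values were computed. As a minimal sketch, assuming PER is the Levenshtein distance between a reference phoneme sequence and the predicted sequence, divided by the reference length, the PER of a single utterance can be computed from the pipe-delimited rows shown here. The helper names (`parse_row`, `edit_distance`, `per`) are illustrative and are not taken from the released code. The reported values are per-speaker figures, so a single-utterance result is not expected to match them.

```python
def parse_row(row):
    """Flatten a pipe-delimited row such as 'w ay | d uw' into a phoneme list."""
    return [p for word in row.split("|") for p in word.split()]


def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme lists (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def per(ref_row, hyp_row):
    """Phoneme error rate in percent, relative to the reference length."""
    ref, hyp = parse_row(ref_row), parse_row(hyp_row)
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# "is my garage door open", speaker IN000_M00 (rows copied from this page).
canonical = "ih z | m ay | g er aa zh | d ao r | ow p ah n"
predicted = "ih z | m ay | g ae l ah zh | d uh r | ow b ah n"
print(round(per(canonical, predicted), 2))  # 5 edits over 15 phonemes -> 33.33
```

Using the canonical phonemes as the reference is itself an assumption made for this example, since no human labels exist for UTD-4Accents.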
Low intelligibility, heavy accent (cluster (iii))
speaker = IN000_M00, PER = 31.98, Intelligibility Score = 83.23, Accentedness Score = 3.41
Transcription | why | do | people | talk | during | a | movie | |
Canonical | w ay | d uw | p iy p ah l | t ao k | d uh r ih ng | ah | m uw v iy | |
MDD Prediction | w ay | d uw | p iy p ah l | d aa | d uw r ih ng | ah | m uw iy |
Transcription | is | my | garage | door | open | |
Canonical | ih z | m ay | g er aa zh | d ao r | ow p ah n | |
MDD Prediction | ih z | m ay | g ae l ah zh | d uh r | ow b ah n |
speaker = IN037_M030, PER = 28.34, Intelligibility Score = 86.53, Accentedness Score = 3.00
Transcription | is | my | garage | door | open | |
Canonical | ih z | m ay | g er aa zh | d ao r | ow p ah n | |
MDD Prediction | iy s | m iy | g ae r ah zh | d ao r | ow p ah n |
Transcription | how's | the | new | job | working | out | it's | been | so | long | since | we | met | |
Canonical | hh aw z | dh ah | n uw | jh aa b | w er k ih ng | aw t | ih t s | b ih n | s ow | l ao ng | s ih n s | w iy | m eh t | |
MDD Prediction | aa w ah z | dh ah | n y uw | jh ah b | w er k ih ng | aw t | ih t z | b ih n | s ow | l ao ng | s ih n s | v w iy | m eh t |
High intelligibility, middle-level accent (cluster (iv))
speaker = IN064_F13, PER = 16.29, Intelligibility Score = 95.37, Accentedness Score = 5.49
Transcription | search | for | vegetarian | pasta | recipes | |
Canonical | s er ch | f ao r | v eh jh ah t eh r iy ah n | p aa s t ah | r eh s ah p iy z | |
MDD Prediction | s ah ch | ah | v eh jh ah t ae eh r iy ah n | p aa s t ah | r ih s ih p iy z |
Transcription | where | is | the | closest | shopping | mall | |
Canonical | w eh r | ih z | dh ah | k l ow s ah s t | sh aa p ih ng | m ao l | |
MDD Prediction | w eh r | ih z | dh ah | k l ow z ah s t | sh aa p ih ng | m ao l |
speaker = IN094_F39, PER = 15.59, Intelligibility Score = 99.47, Accentedness Score = 5.29
Transcription | is | my | garage | door | open | |
Canonical | ih z | m ay | g er aa zh | d ao r | ow p ah n | |
MDD Prediction | ih z | m ay | g ah r aa jh | d ao r | ow p ah n |
Transcription | hi | joe | there | is | no | band | practice | today | mr. | collins | will | let | us | know | the | new | time | |
Canonical | hh ay | jh ow | dh eh r | ih z | n ow | b ae n d | p r ae k t ah s | t ah d ey | m ih s t er | k aa l ih n z | w ih l | l eh t | ah s | n ow | dh ah | n uw | t ay m | |
MDD Prediction | hh ay | jh ow | dh eh r | ih z | n ow | b ae n d | p r ae k t ih s | t ah d ey | m ih s t ah | k ao l ih n z | w ih | l eh t | ah s | n ow | d ah | n y uw | d ay m |
High intelligibility, light accent (cluster (v))
speaker = IN028_F028, PER = 15.45, Intelligibility Score = 98.40, Accentedness Score = 8.41
Transcription | is | my | garage | door | open | |
Canonical | ih z | m ay | g er aa zh | d ao r | ow p ah n | |
MDD Prediction | ih s | m ay | g r aa zh | d ao r | ow p ah n |
Transcription | where | is | the | hottest | place | on | earth | |
Canonical | w eh r | ih z | dh ah | hh aa t ah s t | p l ey s | aa n | er th | |
MDD Prediction | w eh r | ih z | d ah | hh aa t ah s t | p l ey s | aa n | er th |
speaker = IN057_M43, PER = 16.61, Intelligibility Score = 99.43, Accentedness Score = 8.22
Transcription | is | my | garage | door | open | |
Canonical | ih z | m ay | g er aa zh | d ao r | ow p ah n | |
MDD Prediction | ih z | m ay | g er r aa jh | d ao r | ow p ah n |
Transcription | search | for | vegetarian | pasta | recipes | |
Canonical | s er ch | f ao r | v eh jh ah t eh r iy ah n | p aa s t ah | r eh s ah p iy z | |
MDD Prediction | s er ch | f ao r | v eh jh ah t eh r iy ah n | p aa s t ah | r eh s ah p iy s |
L2-ARCTIC:
For L2-ARCTIC, we showcase the text transcriptions, canonical phonemes, human-labeled phonemes, and MDD model predictions for 2 speakers: TNI (Indian accent) and TXHC (Chinese accent).
speaker = TNI
Transcription | he | had | barely | entered | this | when | he | saw | the | glow | of | a | fire | |
Canonical | hh iy | hh ae d | b eh r l iy | eh n t er d | dh ih s | w eh n | hh iy | s ao | dh ah | g l ow | ah v | ah | f ay r | |
Human Perceived | hh iy | ae d | b eh r l iy | eh n t er d | dh ih s | w eh n | hh iy | s ao | dh ah | g l ow | ah v | ah | f ay r | |
MDD Prediction | hh iy | ae d | b eh d l iy | eh n t er d ih | dh ih s | w eh n | hh iy | s ao | dh ah | g l ow | ah v | ah | f ay r |
Transcription | it | was | beating | and | waiting | in | the | ambush | of | those | black | pits | |
Canonical | ih t | w ah z | b iy t ih ng | ah n d | w ey t ih ng | ih n | dh ah | ae m b uh sh | ah v | dh ow z | b l ae k | p ih t s | |
Human Perceived | ih t | v ah z | b iy t ih ng | ah n d | v ey t ih ng | ih n | d ah | ae m b uh sh | ah v | d ow z | b l ae k | b ih t s | |
MDD Prediction | ih t | v ah z | b iy t ih ng | ah n d | ey t ih ng | ih n | d ah | ae m b uh sh | ah v | d ow z | b l ae k | b ih t s |
speaker = TXHC
Transcription | he | was | a | head | shorter | than | his | companion | of | almost | delicate | physique | |
Canonical | hh iy | w ah z | ah | hh eh d | sh ao r t er | dh ah n | hh ih z | k ah m p ae n y ah n | ah v | ao l m ow s t | d eh l ah k ah t | f ah z iy k | |
Human Perceived | hh iy | w ah s | ah | hh ae d | sh ao t er | dh ah ng | hh ih s | k ah m p ae n y ah n | ah v | ao m ow s t | d ae l ah k ah t ah | f ah z iy k | |
MDD Prediction | hh iy | w ah z | ah | hh ae t | sh ao t er | dh ah n | hh ih z | k ah m p ae n y ah n | ah v | ao m ow s t ah | d eh l ah k ah t | f ih z iy k |
Transcription | hardly | were | our | plans | made | public | before | we | were | met | by | powerful | opposition | |
Canonical | hh aa r d l iy | w er | aa r | p l ae n z | m ey d | p ah b l ih k | b iy f ao r | w iy | w er | m eh t | b ay | p aw er f ah l | aa p ah z ih sh ah n | |
Human Perceived | hh aa r d ah l iy | w er | aa r | p l ae n s | m ey d | p aa b l ih k | b iy f ao r | w iy | w ah | m eh t | b ay | p aw er f uh | aa p ah z ih sh ah n | |
MDD Prediction | hh aa r d l iy | w er | aw er | p l ae n s | m ey d | p ah b l ih k | b iy f ao r | w iy | w ah | m eh t | b ay | p aw v er f ah l | aa p ah z ih sh ah n |
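Because human-perceived phonemes are available for L2-ARCTIC, each canonical phoneme in the rows above can be sorted into the four outcomes commonly used in MDD evaluation: true accept (correct, and the model agrees), false reject (correct, but the model flags it), false accept (mispronounced, but the model misses it), and true reject (mispronounced, and the model flags it). The sketch below is illustrative only: it aligns sequences with Python's `difflib.SequenceMatcher` rather than whatever alignment the paper uses, it ignores inserted phonemes, and the function names (`correct_mask`, `mdd_counts`) are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def parse_row(row):
    """Flatten a pipe-delimited row such as 'hh iy | hh ae d' into a phoneme list."""
    return [p for word in row.split("|") for p in word.split()]


def correct_mask(canonical, realized):
    """Per canonical phoneme: True if the alignment pairs it with an identical phoneme."""
    mask = [False] * len(canonical)
    matcher = SequenceMatcher(None, canonical, realized, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            for i in range(i1, i2):
                mask[i] = True
    return mask


def mdd_counts(canonical_row, human_row, predicted_row):
    """Count the four detection outcomes over the canonical phonemes."""
    canonical = parse_row(canonical_row)
    human_ok = correct_mask(canonical, parse_row(human_row))
    model_ok = correct_mask(canonical, parse_row(predicted_row))
    counts = Counter()
    for h, m in zip(human_ok, model_ok):
        if h and m:
            counts["true_accept"] += 1
        elif h:
            counts["false_reject"] += 1
        elif m:
            counts["false_accept"] += 1
        else:
            counts["true_reject"] += 1
    return counts


# First TNI utterance (rows copied from this page).
canonical = ("hh iy | hh ae d | b eh r l iy | eh n t er d | dh ih s | w eh n | "
             "hh iy | s ao | dh ah | g l ow | ah v | ah | f ay r")
human = ("hh iy | ae d | b eh r l iy | eh n t er d | dh ih s | w eh n | "
         "hh iy | s ao | dh ah | g l ow | ah v | ah | f ay r")
predicted = ("hh iy | ae d | b eh d l iy | eh n t er d ih | dh ih s | w eh n | "
             "hh iy | s ao | dh ah | g l ow | ah v | ah | f ay r")
counts = mdd_counts(canonical, human, predicted)
print(dict(counts))  # 34 true accepts, 1 false reject, 1 true reject
```

In this utterance the dropped `hh` in "had" is a true reject, while the `r` in "barely", which the annotator heard as correct but the model predicted as `d`, is a false reject. The trailing `ih` the model inserted after "entered" is not counted, since it does not correspond to any canonical phoneme.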
References
[1] G. Zhao, S. Sonsaat, A. Silpachai, I. Lucic, E. Chukharev-Hudilainen, J. Levis, and R. Gutierrez-Osuna, “L2-ARCTIC: A Non-native English Speech Corpus,” in Proc. Interspeech 2018, 2018, pp. 2783–2787.