Fre-MaskCycleGAN-VC: A method for speech articulation enhancement with original timbre retention on a non-parallel corpus of stroke dysarthria
Abstract
Aiming at the blurred pronunciation caused by dysarthria in stroke patients, we propose a non-parallel-corpus speech articulation enhancement method based on Fre-MaskCycleGAN-VC. The method consists of three core stages: 1) because dysarthric speech from stroke patients mixes blurred segments with clearly articulated ones, a dynamic speech segmentation preprocessing step based on the equivalent sound level (Leq) precisely locates the blurred segments that need enhancement; 2) feature extraction that combines a dynamic mask with statistical production features; 3) a resolution-connected generator and resolution-wise discriminator architecture that integrates a frequency-processing module. Multiple groups of experiments were carried out on a stroke dysarthria speech dataset. The results show that Fre-MaskCycleGAN-VC significantly improves naturalness (Mean Opinion Score (MOS) up 14.2%), intelligibility (word accuracy (WA) up 2.6%), and timbre fidelity (MFCC correlation coefficient 0.92, F0 error rate 4.2%). Staged evolution experiments show that the model can generate four gradual repair versions, ranging from heavily blurred speech to near-healthy speech, with the grade 2–3 repairs rated above the original healthy speech. Through multi-stage feature processing and an adversarial training mechanism, the method provides a clear-speech generation scheme that retains the original timbre for patients with dysarthria.
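The Leq-based segmentation stage can be illustrated with a minimal sketch. The equivalent sound level of a frame is 10·log10 of its mean squared amplitude relative to a reference level; frames whose Leq falls well below the utterance-level median are candidates for enhancement. The frame length (50 ms), reference level, and 10 dB margin below are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def frame_leq(signal, sr, frame_ms=50, ref=1.0):
    """Per-frame equivalent sound level (Leq), in dB, of a mono signal.

    Leq = 10 * log10(mean(p^2) / ref^2) over each non-overlapping frame.
    """
    hop = int(sr * frame_ms / 1000)
    n = len(signal) // hop
    frames = signal[: n * hop].reshape(n, hop)
    mean_sq = np.mean(frames ** 2, axis=1) + 1e-12  # guard against log(0)
    return 10.0 * np.log10(mean_sq / ref ** 2)

def mark_blurred(signal, sr, margin_db=10.0):
    """Flag frames whose Leq drops more than `margin_db` below the median,
    a simple stand-in for locating weakly articulated (blurred) segments."""
    leq = frame_leq(signal, sr)
    return leq < (np.median(leq) - margin_db)
```

In a full pipeline the flagged frame runs would be merged into segments and only those segments routed through the enhancement model, leaving clearly articulated speech untouched.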
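The dynamic-mask feature stage follows the MaskCycleGAN-VC idea of "filling in frames": a contiguous span of time frames in the mel-spectrogram is zeroed out, and the generator receives both the masked spectrogram and the binary mask so it learns to reconstruct the hidden frames. A minimal sketch, with the maximum span length chosen arbitrarily for illustration:

```python
import numpy as np

def apply_frame_mask(mel, max_mask_frames=25, rng=None):
    """Zero a random contiguous span of time frames in a (n_mels, n_frames)
    mel-spectrogram and return (masked_mel, mask) for mask-conditioned
    generator training."""
    rng = rng if rng is not None else np.random.default_rng()
    n_frames = mel.shape[1]
    span = int(rng.integers(0, max_mask_frames + 1))       # span may be 0
    start = int(rng.integers(0, n_frames - span + 1))
    mask = np.ones_like(mel)
    mask[:, start : start + span] = 0.0                    # hide the span
    return mel * mask, mask
```

At training time the mask is sampled fresh for every example; at inference a mask of all ones can be passed so the whole input is converted.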
Copyright (c) 2025 Ning Jia, Chunjun Zheng

This work is licensed under a Creative Commons Attribution 4.0 International License.