Fre-MaskCycleGAN-VC: A method for speech articulation enhancement with original timbre retention on a non-parallel corpus of stroke dysarthria
Abstract
Aiming at the blurred pronunciation caused by dysarthria in stroke patients, we propose a non-parallel-corpus speech articulation enhancement method based on Fre-MaskCycleGAN-VC. The method consists of three core stages: 1) because dysarthric speech from stroke patients mixes blurred segments with clearly articulated ones, a dynamic speech segmentation preprocessing step based on the equivalent sound level (Leq) precisely locates the blurred segments that need enhancement; 2) feature extraction that combines a dynamic mask with statistical production features; 3) a resolution-connected generator and resolution-wise discriminator architecture that integrates a frequency-processing module. Multiple groups of experiments were carried out on a stroke dysarthria speech dataset. The results show that Fre-MaskCycleGAN-VC significantly improves naturalness (Mean Opinion Score (MOS) up 14.2%), intelligibility (word accuracy (WA) up 2.6%), and timbre fidelity (MFCC correlation coefficient 0.92, F0 error rate 4.2%). Staged evolution experiments show that the model can generate four gradual repair versions, ranging from heavily blurred speech to near-healthy speech, with the grade 2–3 repairs rated above the original healthy speech. Through multi-stage feature processing and an adversarial training mechanism, the method provides a clear-speech generation scheme that retains the original timbre for patients with dysarthria.
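The Leq-based segmentation stage can be illustrated with a minimal sketch. The equivalent sound level of a frame is 10·log10 of its mean squared amplitude relative to a reference level; frames whose Leq falls well below the utterance-level median are candidates for enhancement. The frame length (50 ms), reference level, and 10 dB margin below are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def frame_leq(signal, sr, frame_ms=50, ref=1.0):
    """Per-frame equivalent sound level (Leq), in dB, of a mono signal.

    Leq = 10 * log10(mean(p^2) / ref^2) over each non-overlapping frame.
    """
    hop = int(sr * frame_ms / 1000)
    n = len(signal) // hop
    frames = signal[: n * hop].reshape(n, hop)
    mean_sq = np.mean(frames ** 2, axis=1) + 1e-12  # guard against log(0)
    return 10.0 * np.log10(mean_sq / ref ** 2)

def mark_blurred(signal, sr, margin_db=10.0):
    """Flag frames whose Leq drops more than `margin_db` below the median,
    a simple stand-in for locating weakly articulated (blurred) segments."""
    leq = frame_leq(signal, sr)
    return leq < (np.median(leq) - margin_db)
```

In a full pipeline the flagged frame runs would be merged into segments and only those segments routed through the enhancement model, leaving clearly articulated speech untouched.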
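The dynamic-mask feature stage follows the MaskCycleGAN-VC idea of "filling in frames": a contiguous span of time frames in the mel-spectrogram is zeroed out, and the generator receives both the masked spectrogram and the binary mask so it learns to reconstruct the hidden frames. A minimal sketch, with the maximum span length chosen arbitrarily for illustration:

```python
import numpy as np

def apply_frame_mask(mel, max_mask_frames=25, rng=None):
    """Zero a random contiguous span of time frames in a (n_mels, n_frames)
    mel-spectrogram and return (masked_mel, mask) for mask-conditioned
    generator training."""
    rng = rng if rng is not None else np.random.default_rng()
    n_frames = mel.shape[1]
    span = int(rng.integers(0, max_mask_frames + 1))       # span may be 0
    start = int(rng.integers(0, n_frames - span + 1))
    mask = np.ones_like(mel)
    mask[:, start : start + span] = 0.0                    # hide the span
    return mel * mask, mask
```

At training time the mask is sampled fresh for every example; at inference a mask of all ones can be passed so the whole input is converted.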
Copyright (c) 2025 Ning Jia, Chunjun Zheng

This work is licensed under a Creative Commons Attribution 4.0 International License.