A comparison of cepstral and spectral features using recurrent neural network for spoken language identification

Irshad Ahmad Thukroo; Rumaan Bashir; Kaiser Javeed Giri

doi:10.59400/cai.v2i1.440

Irshad Ahmad Thukroo Department of Computer Science, Islamic University of Science and Technology, Kashmir 192122, India
Rumaan Bashir Department of Computer Science, Islamic University of Science and Technology, Kashmir 192122, India
Kaiser Javeed Giri Department of Computer Science, Islamic University of Science and Technology, Kashmir 192122, India

Article ID: 440

Keywords: MFCC; RASTA-PLP; spectral features; RNN-LSTM; SNG

Abstract

Spoken language identification is the process of assigning a language label to an audio segment regardless of factors such as its duration, recording environment, topic or message, and the speaker's age, gender, region, or emotional state. Language identification systems are of great significance in natural language processing, particularly in multilingual machine translation, language recognition, and the automatic routing of voice calls to the nodes that handle a particular language. In this paper, we compare cepstral and spectral feature extraction techniques, namely Mel-frequency cepstral coefficients (MFCC), relative spectral-perceptual linear prediction coefficients (RASTA-PLP), and spectral features (roll-off, flatness, centroid, bandwidth, and contrast), for spoken language identification using a recurrent neural network with long short-term memory (RNN-LSTM) as the sequence learning method. The model has been evaluated on six languages: Ladakhi and the five official languages of Jammu and Kashmir (Union Territory). The dataset consists of TV audio recordings for Kashmiri, Urdu, Dogri, and Ladakhi, together with the standard corpora IIIT-H and VoxForge, which provide the English and Hindi audio data. The dataset is pre-processed by removing different types of noise with a Spectral Noise Gate (SNG) and then slicing the audio into segments of 5 seconds. Performance is evaluated using standard metrics: F1 score, recall, precision, and accuracy. The experimental results showed that spectral features, MFCC, and RASTA-PLP achieved average accuracies of 76%, 83%, and 78%, respectively. MFCC therefore proved to be the most suitable of the three feature sets for language identification with an RNN-LSTM classifier.
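As a rough illustration of two steps named in the abstract, the sketch below slices a recording into fixed 5-second segments and computes accuracy, precision, recall, and F1 score from predicted labels. It is not the authors' implementation. The function names `slice_into_segments` and `classification_metrics` are hypothetical, dropping a trailing segment shorter than 5 seconds is an assumption, and macro-averaging over languages is an assumption, since the abstract does not say how leftovers are handled or how the metrics are averaged.

```python
from collections import Counter


def slice_into_segments(samples, sample_rate, seconds=5.0):
    """Split audio samples into non-overlapping fixed-length segments.

    Assumption: a trailing remainder shorter than one segment is dropped.
    """
    seg_len = int(sample_rate * seconds)
    if seg_len <= 0:
        raise ValueError("segment length must be positive")
    n_full = len(samples) // seg_len
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(n_full)]


def classification_metrics(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall, and F1.

    Assumption: macro-averaging (unweighted mean over languages).
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted language gains a false positive
            fn[truth] += 1  # true language gains a false negative
    precisions, recalls, f1s = [], [], []
    for lab in labels:
        prec = tp[lab] / (tp[lab] + fp[lab]) if tp[lab] + fp[lab] else 0.0
        rec = tp[lab] / (tp[lab] + fn[lab]) if tp[lab] + fn[lab] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy usage: 12 "samples" at 1 Hz give two full 5-second segments.
segments = slice_into_segments(list(range(12)), sample_rate=1)
# Illustrative labels only; these are not results from the paper.
scores = classification_metrics(["ks", "ks", "ur", "ur"],
                                ["ks", "ur", "ur", "ur"])
```

In a real pipeline each segment would be passed through a feature extractor (MFCC, RASTA-PLP, or the spectral descriptors) before reaching the classifier; that stage is left out here because it depends on library choices the abstract does not specify.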

Published

2024-01-30

How to Cite

Thukroo, I. A., Bashir, R., & Giri, K. J. (2024). A comparison of cepstral and spectral features using recurrent neural network for spoken language identification. Computing and Artificial Intelligence, 2(1), 440. https://doi.org/10.59400/cai.v2i1.440

Issue

Vol. 2 No. 1 (2024)

Section

Article

This work is licensed under a Creative Commons Attribution 4.0 International License.

Editor-in-Chief

Prof. Shaohua Wan
University of Electronic Science and Technology of China, China

eISSN

3029-2786

Publication Frequency

Quarterly (since 2025)
