NDLI: Large vocabulary Uyghur continuous speech recognition based on stems and suffixes

Content Provider	IEEE Xplore Digital Library
Author	Xin Li; Shang Cai; Jielin Pan; Yonghong Yan; Yafei Yang
Copyright Year	2010
Description	Author affiliation: THINKIT Speech Laboratory, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China (Xin Li; Shang Cai; Jielin Pan; Yonghong Yan); Xinjiang Public Security Department, Wulumuqi, China (Yafei Yang)
Abstract	In this paper, we study the vocabulary design problem in Uyghur large vocabulary continuous speech recognition (LVCSR). Uyghur is an agglutinative language in which words are formed by concatenating several suffixes to a stem. As a result, the number of word types in Uyghur is effectively unbounded. If the word is used as the recognition unit, the out-of-vocabulary (OOV) rate is very large at typical vocabulary sizes of 60k–100k. To avoid this problem, we split words into stems and suffixes and use these sub-words as the recognition units. Speech recognition experiments are performed on two test sets, one consisting of sentences from books and the other of sentences from conversations. Compared to the 80k-word baseline, the use of stems and suffixes reduces the OOV rate dramatically, and the best system reduces the word error rate (WER) from 46.5% to 44.5% on the book test set and from 57.6% to 47.5% on the conversation test set.
Starting Page	220
Ending Page	223
File Size	194344
Page Count	4
File Format	PDF
ISBN	9781424462445
e-ISBN	9781424462469
DOI	10.1109/ISCSLP.2010.5684909
Language	English
Publisher	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher Date	2010-11-29
Publisher Place	Taiwan
Access Restriction	Subscribed
Rights Holder	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subject Keyword	Vocabulary; Databases; Stems and suffixes based language model; Agglutinative language; Hidden Markov models; Speech recognition; Uyghur large vocabulary continuous speech recognition; Speech; Acoustics; Books
Content Type	Text
Resource Type	Article
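
The abstract's argument (word units give a high OOV rate, stem and suffix units give a much lower one) can be illustrated with a small sketch. Everything here except the four WER figures is an assumption made for illustration: the toy word lists, the three-suffix inventory, the `+` marker on suffix units, and the greedy longest-suffix stripping in `split_word`. The abstract does not describe the segmentation method the authors actually used, so this should not be read as their algorithm.

```python
# Toy illustration only: hypothetical data and a hypothetical segmenter.
SUFFIXES = ("lar", "ler", "ni")  # assumed tiny suffix inventory


def split_word(word, suffixes=SUFFIXES):
    """Greedily strip known suffixes from the right.

    Returns [stem, "+suffix", ...] in surface order.
    """
    found = []
    stripped = True
    while stripped:
        stripped = False
        # Try longer suffixes first so a long match wins over a short one.
        for suf in sorted(suffixes, key=len, reverse=True):
            if word.endswith(suf) and len(word) > len(suf):
                found.insert(0, "+" + suf)
                word = word[: -len(suf)]
                stripped = True
                break
    return [word] + found


def oov_rate(tokens, vocabulary):
    """Fraction of tokens that are not in the vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in vocabulary) / len(tokens)


def relative_reduction(baseline, improved):
    """Relative reduction of an error rate with respect to the baseline."""
    return (baseline - improved) / baseline


# Hypothetical training and test word lists.
train_words = ["kitab", "kitablar", "mektepni", "mektepler"]
test_words = ["kitablarni", "mektep", "mekteplerni"]

# Word-level vocabulary versus stem/suffix vocabulary.
word_vocab = set(train_words)
unit_vocab = {u for w in train_words for u in split_word(w)}
test_units = [u for w in test_words for u in split_word(w)]

word_oov = oov_rate(test_words, word_vocab)
unit_oov = oov_rate(test_units, unit_vocab)

# Relative WER reductions computed from the figures in the abstract.
book_gain = relative_reduction(46.5, 44.5)
conversation_gain = relative_reduction(57.6, 47.5)

print(f"word OOV: {word_oov:.2f}, sub-word OOV: {unit_oov:.2f}")
print(f"relative WER reduction, books: {book_gain:.1%}")
print(f"relative WER reduction, conversations: {conversation_gain:.1%}")
```

In this toy setup every test word is unseen as a whole word, yet all of its stem and suffix units were observed in training. The reported absolute WER drops correspond to relative reductions of roughly 4.3% on the book set and 17.5% on the conversation set.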

Similar Documents

Large vocabulary continuous speech recognition in Uyghur: data preparation and experimental results

Turkish Large Vocabulary Continuous Speech Recognition by using limited audio corpus

Morpheme concatenation approach in language modeling for large-vocabulary Uyghur speech recognition

Advances in Large Vocabulary Continuous Speech Recognition in Greek: Modeling and nonlinear features

HMM-Based Uyghur Continuous Speech Recognition System

Linguistic stem concatenation for Malay large vocabulary continuous speech recognition

Language model adaptation for automatic call transcription

Baseform adaptation for large vocabulary hidden Markov model based speech recognition systems
