NDLI: Investigation of using different Chinese word segmentation standards and algorithms for automatic speech recognition

Content Provider	IEEE Xplore Digital Library
Author	Chongjia Ni; Cheung-Chi Leung
Copyright Year	2014
Description	Author affiliation: Inst. for Infocomm Res. (I2R), A*STAR, Singapore, Singapore (Chongjia Ni; Cheung-Chi Leung)
Abstract	Chinese word segmentation (CWS) is a necessary step in Mandarin Chinese automatic speech recognition (ASR), and it affects ASR results. However, few works have studied the relation between CWS and ASR. Building a segmenter involves choosing CWS settings, including a segmentation standard and an algorithm. In this paper, four CWS standards and three CWS algorithms (maximum matching, term frequency based, and conditional random field (CRF) based) are investigated for their effect on ASR performance. Our experiments on the second SIGHAN Bakeoff data and on Mandarin Chinese conversational telephone speech show that better segmentation performance does not necessarily lead to better ASR performance. Maximum matching and the term frequency based algorithm, both classified as lexicon-based algorithms, make it easier to update their vocabulary inventories according to the application's needs. We find that these two algorithms can provide ASR performance similar to that of the CRF-based algorithm. Motivated by the availability of huge amounts of web text data, we investigate whether such data can improve the term frequency based algorithm and thus the ASR performance. Lastly, we find that combining the two lexicon-based algorithms through language model interpolation can further improve the ASR performance.
Starting Page	44
Ending Page	48
File Size	151565
Page Count	5
File Format	PDF
ISBN	9781479942190
DOI	10.1109/ISCSLP.2014.6936684
Language	English
Publisher	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher Date	2014-09-12
Publisher Place	Singapore
Access Restriction	Subscribed
Rights Holder	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subject Keyword	Training; Computational modeling; Training data; Speech; Data models; Classification algorithms; Chinese word segmentation; automatic speech recognition; Chinese word segmentation combination; Standards
Content Type	Text
Resource Type	Article
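
The abstract names maximum matching as one of the lexicon-based segmentation algorithms but does not say which variant the paper uses. As a rough illustration only, the sketch below assumes the common forward (left-to-right, longest-match-first) variant. The function name, the toy lexicon, and the `max_len` limit are hypothetical and are not taken from the paper.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: at each position, take the
    longest lexicon entry that matches (assumed forward variant)."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted, so the scan cannot stall
            # on characters that are missing from the lexicon.
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy lexicon, for illustration only.
lexicon = {"中文", "语音", "识别", "语音识别", "分词"}

with_compound = forward_max_match("中文语音识别", lexicon)
without_compound = forward_max_match("中文语音识别", lexicon - {"语音识别"})
```

Dropping one entry from the lexicon changes the segmentation without retraining anything, which may be the kind of vocabulary flexibility the abstract attributes to lexicon-based algorithms.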

Similar Documents

Joint Decoding for Chinese Word Segmentation and POS Tagging Using Character-Based and Word-Based Discriminative Models

Joint n-gram Chinese language modeling with an application to Chinese word segmentation

Automatic training set segmentation for multi-pass speech recognition

Minimum word classification error training of HMMs for automatic speech recognition

Statistical segmentation and word modeling techniques in isolated word recognition

Traditional Chinese parser and language modeling for Mandarin ASR

Word-level rate of speech modeling using rate-specific phones and pronunciations

Iterative Bayesian word segmentation for unsupervised vocabulary discovery from phoneme lattices

Word recognition using whole word and subword models
