NDLI: A Systematic Comparison of Data Selection Criteria for SMT Domain Adaptation

Content Provider	PubMed Central
Author	Wang, Longyue; Wong, Derek F.; Chao, Lidia S.; Lu, Yi; Xing, Junwen
Copyright Year	2014
Abstract	Data selection has shown significant improvements in the effective use of training data by extracting sentences from large general-domain corpora to adapt statistical machine translation (SMT) systems to in-domain data. This paper performs an in-depth analysis of three different sentence selection techniques. The first is cosine tf-idf, which comes from the realm of information retrieval (IR). The second is a perplexity-based approach, which comes from the field of language modeling. Both of these data selection techniques have already been applied to SMT in the literature. Edit distance, however, is proposed for this task for the first time in this paper. After investigating the individual models, a combination of all three techniques is proposed at both the corpus level and the model level. Comparative experiments are conducted on a Hong Kong law Chinese-English corpus, and the results indicate the following: (i) the degree of constraint in similarity measurement is not monotonically related to domain-specific translation quality; (ii) the individual selection models fail to perform effectively and robustly; (iii) bilingual resources and combination methods help to balance out-of-vocabulary (OOV) words against irrelevant data; and (iv) the proposed method consistently boosts overall translation performance, which can help ensure optimal quality in a real-life SMT system.
Related Links	http://dx.doi.org/10.1155/2014/745485
Starting Page	745485
File Format	PDF
ISSN	1537744X
e-ISSN	1537744X
Journal	The Scientific World Journal
Volume Number	2014
Language	English
Publisher	Hindawi Publishing Corporation
Publisher Date	2014-02-11
Access Restriction	Open
Rights Holder	Hindawi Publishing Corporation
Subject Keyword	Research in Higher Education
Content Type	Text
Resource Type	Article
Subject	Medicine; Biochemistry, Genetics and Molecular Biology; Environmental Science

Similar Documents

A Systematic Comparison of Data Selection Criteria for SMT Domain Adaptation

Domain Adaptation for Pedestrian Detection Based on Prediction Consistency

Genetic Variability and Selection Criteria in Rice Mutant Lines as Revealed by Quantitative Traits

Lightweight Adaptation of Classifiers to Users and Contexts: Trends of the Emerging Domain

Self-Adaptive MOEA Feature Selection for Classification of Bankruptcy Prediction Data

Unbiased Feature Selection in Learning Random Forests for High-Dimensional Data

Antioxidants for Preventing Preeclampsia: A Systematic Review

A Domain Decomposition Method for Time Fractional Reaction-Diffusion Equation

A Systematic Method of Interconnection Optimization for Dense-Array Concentrator Photovoltaic System
