Extracting parallel sentences from comparable corpora using document level alignment (2010)
Content Provider | CiteSeerX |
---|---|
Author | Smith, Jason R.; Quirk, Chris; Toutanova, Kristina |
Description | The quality of a statistical machine translation (SMT) system is heavily dependent upon the amount of parallel sentences used in training. In recent years, there have been several approaches developed for obtaining parallel sentences from non-parallel, or comparable data, such as news articles published within the same time period (Munteanu and Marcu, 2005), or web pages with a similar structure (Resnik and Smith, 2003). One resource not yet thoroughly explored is Wikipedia, an online encyclopedia containing linked articles in many languages. We advance the state of the art in parallel sentence extraction by modeling the document level alignment, motivated by the observation that parallel sentence pairs are often found in close proximity. We also include features which make use of the additional annotation given by Wikipedia, and features using an automatically induced lexicon model. Results for both accuracy in sentence extraction and downstream improvement in an SMT system are presented. |
File Format | |
Language | English |
Publisher Date | 2010-01-01 |
Publisher Institution | In Proceedings of Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT’10) |
Access Restriction | Open |
Subject Keyword | Downstream Improvement; Web Page; Comparable Data; Online Encyclopedia; Statistical Machine Translation; Time Period; Parallel Sentence Extraction; Parallel Sentence; Similar Structure; Document Level Alignment; Parallel Sentence Pair; Sentence Extraction; SMT System; Additional Annotation; Many Language; Close Proximity; Recent Year; Several Approach; News Article; Comparable Corpus; Induced Lexicon Model |
Content Type | Text |
Resource Type | Article |