Mining Parallel Corpora from Web and Its Application in Machine Translation
| Content Provider | Semantic Scholar |
|---|---|
| Author | Xi-Rong, Ma |
| Copyright Year | 2010 |
| Abstract | Bilingual parallel corpora can be used in many NLP applications, but it is not easy to acquire large-scale corpora automatically. This paper proposes an effective solution for mining high-quality bilingual parallel corpora from web pages and analyses the key technologies for obtaining candidate mixed-language web pages and for sentence alignment. We have extracted 1.67 million parallel sentences with an average accuracy of 93.75%; the accuracy of the first 1 million sentences is 96%. This paper also proposes a sentence re-ranking method and a domain information retrieval method for applying the web data to the training of SMT models. Experiments conducted on the IWSLT tasks show gains of 2 to 5 BLEU points over the baseline. |
| File Format | PDF HTM / HTML |
| Alternate Webpage(s) | http://nlp.ict.ac.cn/Admin/kindeditor/attached/file/20130513/20130513180444_93004.pdf |
| Language | English |
| Access Restriction | Open |
| Content Type | Text |
| Resource Type | Article |