English–Welsh Cross-Lingual Embeddings
Content Provider | MDPI |
---|---|
Author | Spasić, Irena; Espinosa-Anke, Luis; Palmer, Geraint; Corcoran, Padraig; Filimonov, Maxim; Knight, Dawn |
Copyright Year | 2021 |
Description | Cross-lingual embeddings are vector space representations where word translations tend to be co-located. These representations enable learning transfer across languages, thus bridging the gap between data-rich languages such as English and others. In this paper, we present and evaluate a suite of cross-lingual embeddings for the English–Welsh language pair. To train the bilingual embeddings, a Welsh corpus of approximately 145 M words was combined with an English Wikipedia corpus. We used a bilingual dictionary to frame the problem of learning bilingual mappings as a supervised machine learning task, where a word vector space is first learned independently on each monolingual corpus, after which a linear alignment strategy is applied to map the monolingual embeddings to a common bilingual vector space. Two approaches were used to learn monolingual embeddings: word2vec and fastText. Three cross-language alignment strategies were explored: cosine similarity, inverted softmax and cross-domain similarity local scaling (CSLS). We evaluated different combinations of these approaches using two tasks: bilingual dictionary induction and cross-lingual sentiment analysis. The best results were achieved using monolingual fastText embeddings and the CSLS metric. We also demonstrated that by including a few automatically translated training documents, the performance of a cross-lingual text classifier for Welsh can increase by approximately 20 percentage points. |
Starting Page | 6541 |
e-ISSN | 2076-3417 |
DOI | 10.3390/app11146541 |
Journal | Applied Sciences |
Issue Number | 14 |
Volume Number | 11 |
Language | English |
Publisher | MDPI |
Publisher Date | 2021-07-16 |
Access Restriction | Open |
Subject Keyword | Applied Sciences; Natural Language Processing; Distributional Semantics; Machine Learning; Language Model; Word Embeddings; Machine Translation; Sentiment Analysis |
Content Type | Text |
Resource Type | Article |
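
The description mentions mapping monolingual embeddings into a shared space with a linear alignment learned from a bilingual dictionary, and retrieving translations with CSLS. The following is a minimal sketch of that idea, not the authors' code. It assumes the linear map is obtained by orthogonal Procrustes, which is one common choice; the record does not say which solver was used. The function names (`procrustes_map`, `csls_scores`) and the random toy vectors are hypothetical and stand in for real word2vec or fastText embeddings.

```python
import numpy as np


def normalize_rows(m):
    """Scale every row vector to unit length."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)


def procrustes_map(src, tgt):
    """Orthogonal W minimising ||src @ W - tgt||_F.

    Row i of `src` and row i of `tgt` are assumed to be the embeddings
    of a dictionary pair (source word, its translation).
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


def csls_scores(mapped_src, tgt, k=10):
    """CSLS(x, y) = 2*cos(x, y) - r_T(x) - r_S(y).

    r_T(x) is the mean cosine of x to its k nearest target vectors and
    r_S(y) the mean cosine of y to its k nearest mapped source vectors.
    """
    x = normalize_rows(mapped_src)
    y = normalize_rows(tgt)
    cos = x @ y.T
    k = min(k, cos.shape[0], cos.shape[1])
    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)  # one value per source word
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)  # one value per target word
    return 2 * cos - r_src[:, None] - r_tgt[None, :]


# Toy data (assumption): the "target language" space is an exact rotation
# of the "source language" space, so a perfect orthogonal map exists.
rng = np.random.default_rng(0)
n_words, dim = 30, 20
src_vecs = rng.normal(size=(n_words, dim))
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
tgt_vecs = src_vecs @ rotation

w = procrustes_map(src_vecs, tgt_vecs)          # learn the alignment
scores = csls_scores(src_vecs @ w, tgt_vecs, k=5)
predicted = scores.argmax(axis=1)               # induced translation per source word
```

With real embeddings, the map would be fitted on the dictionary pairs only and then applied to the full source vocabulary before scoring. Replacing `csls_scores` with the plain cosine matrix gives the cosine-similarity baseline named in the description.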