An approach to reduce part of speech ambiguity using semantically annotated lexicon definitions
| Content Provider | Semantic Scholar |
|---|---|
| Author | Minca, Andrei; Diaconescu, Stefan |
| Copyright Year | 2012 |
| Abstract | In computational linguistics, word-sense disambiguation (WSD) is a difficult problem, and methods that use a flat topology of the tokens are not very effective. One solution is to run a part-of-speech (POS) tagger before starting the WSD process. However, POS taggers show their limitations when high-precision tagging is required or large texts are processed. This paper presents a technique to reduce POS ambiguity using semantic information. As benchmarks we use the following standard WSD corpora: Senseval-2, Senseval-3 and SemCor. Moreover, we tested our approach on WordNet's semantically tagged glosses for English and on our own semantically tagged lexicon glosses for Romanian. |

**Introduction**

The "all-words" task for word sense disambiguation (WSD) is a complex pursuit in the field of natural language processing (NLP). WSD systems have improved over time and now achieve 65-70% accuracy on the fine-grained all-words task and 78-83% accuracy when a coarse-grained sense inventory is used [8]. Systems using knowledge-based methods are becoming the predominant research direction for WSD ([7],[9],[11]). Such knowledge-based methods show little or no variation in decision making when resolving sense ambiguity. The best-known results for this type of system reach an accuracy of 83% on the coarse-grained all-words task, using an algorithm called Structural Semantic Interconnections [3]. However, high accuracy comes at a price: it requires large amounts of space and time. For tasks using a fine-grained sense inventory, this problem is usually addressed with heuristic approaches. Even though POS tagging has a lower complexity than the WSD process, its complexity is still high enough to become problematic for most NLP applications. Moreover, comprehensive grammar analysis becomes difficult to accomplish [6].
Based on the above observations, we decided to investigate a partial WSD analysis performed before the POS tagging process. We do this in a research project at our company, codenamed SenDiS (Sense Disambiguation System). The purpose of this paper is to significantly reduce word sense ambiguity, and thus POS ambiguity, while still preserving all or most POS tagging solutions. For this purpose, we adjusted the methods and algorithms used in the SenDiS project so that, for a given text, the system provides the WSD variants whose semantic similarity scores exceed a threshold relative to the maximum score discovered.

**SenDiS WSD approach**

The SenDiS research project addresses the WSD process in a knowledge-based fashion. It relies mainly on semantic networks, especially semantic networks built from semantically annotated lexicon glosses, to establish the sense semantic similarity costs used to resolve sense ambiguity. The main WSD usage scenario in the SenDiS system is:

1. the text is tokenized into text items;
2. each text item is matched with its sense interpretations;
3. for each sense interpretation assigned to a text item, a sense semantic signature is built based on the lexicon network;
4. relevant sense pairs, with senses belonging to different text items, are identified, and their sense semantic signatures are compared;
5. the semantic similarity costs of the senses in each pair are then used to compute the best WSD variant or variants.

*2nd International Conference on Management Science and Industrial Engineering (MSIE 2013). © 2013 The authors. Published by Atlantis Press.*

Reference [10] describes the main ingredients of the WSD approach used in the SenDiS project. This approach consists of the steps detailed below.

**A. Lexicon network**

A lexicon network is obtained from semantically annotated lexicon glosses [5]. This is similar to other hierarchical networks built on lexicons [4].
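The five-step usage scenario above can be sketched end to end. Everything below is a toy stand-in, not the SenDiS implementation: the sense inventory, glosses, and overlap-based similarity are hypothetical data chosen only to make the pipeline runnable.

```python
from itertools import combinations, product

# Toy sense inventory standing in for the lexicon (hypothetical data).
SENSES = {
    "bank": ["bank#finance", "bank#river"],
    "deposit": ["deposit#money", "deposit#sediment"],
}

# Hypothetical gloss words per sense, used as set-form semantic signatures.
GLOSSES = {
    "bank#finance": {"money", "institution", "payment", "loan"},
    "bank#river": {"water", "edge", "slope"},
    "deposit#money": {"money", "payment", "institution"},
    "deposit#sediment": {"sediment", "water", "layer"},
}

def tokenize(text):
    # Step 1: split the text into text items.
    return [t.strip(".,").lower() for t in text.split()]

def similarity(s1, s2):
    # Steps 3-4: build set-form signatures and compare them;
    # overlap size serves as a toy semantic similarity cost.
    return len(GLOSSES[s1] & GLOSSES[s2])

def best_variant(text):
    # Step 2: match each known text item with its sense interpretations.
    items = [w for w in tokenize(text) if w in SENSES]
    variants = product(*(SENSES[w] for w in items))
    # Step 5: score each candidate assignment over all sense pairs.
    def score(variant):
        return sum(similarity(a, b) for a, b in combinations(variant, 2))
    return max(variants, key=score)

print(best_variant("The bank accepted the deposit."))
```

With these toy glosses, the financial senses overlap most, so the sketch selects `("bank#finance", "deposit#money")`.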
However, it most resembles the Lesk algorithm ([1],[2]) in that it extends the definition domain of a word sense from a set of words in a gloss to a spanning-tree-like structure inside the lexicon network. Significant effort was invested in achieving high-quality semantic annotation of the lexicon glosses. Semi-automated annotation is generally employed, but manual annotation remains the gold standard even though its cost is much higher.

**B. Ordering the lexicon network**

The original lexicon network can be preprocessed to better fit the different WSD methods that operate on it. This optimization task is often challenging given the large dimensions of such networks.

**C. Building sense semantic signatures**

Using this large lexicon network, sense semantic signatures can be built in one of the following forms:

- a spanning tree with node (sense) and relation information embedded;
- sets of nodes and/or relations;
- sequences of nodes and/or relations;
- combinations of the above.

**D. Comparing sense semantic signatures**

The semantic similarity cost for two senses can be obtained by comparing their semantic signatures. Various comparison algorithms can be devised, depending on the form of the sense semantic signatures.

**E. Computation of WSD variants**

The final step in this WSD approach uses the semantic similarity costs between the senses of the text items to compute the best WSD variant or variants. One method is to build, for the text, the complete sub-graph whose nodes are senses and whose edges carry semantic similarity costs as ranks, and to select the variant that maximizes the overall semantic similarity score.

**Reducing POS ambiguity using semantic information**

In analyzing a text, we propose computing several specific WSD variants, especially those with strong semantic similarity scores. These should preserve the POS solutions for the text with high precision and, at the same time, reduce the POS ambiguity faced by the POS tagger.
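Steps C and D can be illustrated with a set-form signature grown by a bounded breadth-first traversal of a toy lexicon network; the network, the depth limit, and the use of Jaccard overlap as the comparison are all hypothetical choices, not the ones used in SenDiS.

```python
from collections import deque

# Toy lexicon network: sense -> related senses (hypothetical edges).
NETWORK = {
    "car#1": ["vehicle#1", "engine#1"],
    "vehicle#1": ["machine#1"],
    "engine#1": ["machine#1", "fuel#1"],
    "machine#1": [],
    "fuel#1": [],
}

def semantic_signature(sense, max_depth=2):
    """Step C: collect the senses reachable within max_depth hops
    of the starting sense, giving a set-form semantic signature."""
    seen = {sense}
    frontier = deque([(sense, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for neighbour in NETWORK.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return seen

def similarity_cost(s1, s2):
    """Step D: compare two set-form signatures by Jaccard overlap."""
    a, b = semantic_signature(s1), semantic_signature(s2)
    return len(a & b) / len(a | b)

print(semantic_signature("car#1"))
print(similarity_cost("car#1", "engine#1"))
```

A spanning-tree signature would additionally record the relation labels along each traversal edge; the set form above keeps only the reached nodes.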
The normal output of the SenDiS system after disambiguating a text is the WSD variant, or set of WSD variants, with the highest semantic similarity score. We modified the last step in the system, the computation of WSD variants, in order to obtain additional WSD variants whose semantic similarity scores are close to the highest one determined. This last step takes as input a set of sense pairs drawn from the system's input text, where each sense pair is associated with a semantic similarity cost. These pairs can be seen as edges in a graph whose nodes are word senses. In fact, this graph is an N-partite graph, as seen in Figure 1, where N is the number of words in the text and each partition consists of the nodes representing senses of the
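The modified last step, scoring sense assignments over the N-partite graph and keeping every variant whose score is close to the maximum, can be sketched as follows. The pair costs, sense names, and the relative threshold `keep_ratio` are hypothetical illustrations, not the SenDiS parameters.

```python
from itertools import combinations, product

def wsd_variants(senses_per_word, pair_cost, keep_ratio=0.8):
    """Score every sense assignment (one sense per word, i.e. one node
    per partition of the N-partite graph) and keep the assignments whose
    score is at least keep_ratio times the maximum score found."""
    scored = []
    for variant in product(*senses_per_word):
        # Sum the similarity costs over all sense pairs in this variant.
        score = sum(pair_cost.get(frozenset(p), 0.0)
                    for p in combinations(variant, 2))
        scored.append((score, variant))
    best = max(s for s, _ in scored)
    return [v for s, v in scored if s >= keep_ratio * best]

# Hypothetical pair costs for a two-word text with two senses each.
costs = {
    frozenset({"a1", "b1"}): 3.0,
    frozenset({"a1", "b2"}): 2.5,
    frozenset({"a2", "b1"}): 1.0,
    frozenset({"a2", "b2"}): 0.5,
}
print(wsd_variants([["a1", "a2"], ["b1", "b2"]], costs))
```

Here the best variant scores 3.0, so with a 0.8 ratio both `("a1", "b1")` and `("a1", "b2")` are retained, while the two low-scoring variants are discarded; this is the sense in which near-best variants preserve most POS solutions while pruning the rest.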
| Starting Page | 323 |
| Ending Page | 327 |
| Page Count | 5 |
| File Format | PDF, HTM / HTML |
| DOI | 10.1109/ESTC.2012.6485604 |
| Alternate Webpage(s) | https://download.atlantis-press.com/article/9764.pdf |
| Alternate Webpage(s) | https://doi.org/10.1109/ESTC.2012.6485604 |
| Language | English |
| Access Restriction | Open |
| Content Type | Text |
| Resource Type | Article |