Extending A Thesaurus By Classifying Words
| Content Provider | Semantic Scholar |
|---|---|
| Author | Tokunaga, Takenobu; Fujii, Atsushi; Naoyuki, Sakurai; Tanaka, Hozumi |
| Copyright Year | 1997 |
| Abstract | This paper proposes a method for extending an existing thesaurus through classification of new words in terms of that thesaurus. New words are classified on the basis of relative probabilities of a word belonging to a given word class, with the probabilities calculated using noun-verb co-occurrence pairs. Experiments using the Japanese Bunruigoihyō thesaurus on about 420,000 co-occurrences showed that new words can be classified correctly with a maximum accuracy of more than 80%. 1 Introduction For most natural language processing (NLP) systems, thesauri comprise indispensable linguistic knowledge. Roget's International Thesaurus [Chapman, 1984] and WordNet [Miller et al., 1993] are typical English thesauri which have been widely used in past NLP research [Resnik, 1992; Yarowsky, 1992]. They are handcrafted, machine-readable and have fairly broad coverage. However, since these thesauri were originally compiled for human use, they are not always suitable for computer-based natural language processing. Limitations of handcrafted thesauri can be summarized as follows [Hatzivassiloglou and McKeown, 1993; Uramoto, 1996; Hindle, 1990]. • limited vocabulary size • unclear classification criteria • building thesauri by hand requires considerable time and effort The vocabulary size of typical handcrafted thesauri ranges from 50,000 to 100,000 words, including general words in broad domains. From the viewpoint of NLP systems dealing with a particular domain, however, these thesauri include many unnecessary (general) words and do not include necessary domain-specific words. The second problem with handcrafted thesauri is that their classification is based on the intuition of lexicographers, with their classification criteria not always being clear. For the purposes of NLP systems, their classification of words is sometimes too coarse and does not provide sufficient distinction between words, or is sometimes unnecessarily detailed. Lastly, building thesauri by hand requires significant amounts of time and effort even for restricted domains. Furthermore, this effort is repeated when a system is ported to another domain. This criticism leads us to automatic approaches for building thesauri from large corpora [Hirschman et al., 1975; Hindle, 1990; Hatzivassiloglou and McKeown, 1993; Pereira et al., 1993; Tokunaga et al., 1995; Ushioda, 1996]. Past attempts have basically taken the following steps [Charniak, 1993]. (1) extract word co-occurrences (2) define similarities (distances) between words on the basis of co-occurrences (3) cluster words on the basis of similarities The most crucial part of this approach is gathering word co-occurrence data. Co-occurrences are usually gathered on the basis of certain relations such as predicate-argument, modifier-modified, adjacency, or a mixture of these. However, it is very difficult to gather sufficient co-occurrences to calculate similarities reliably [Resnik, 1992; Basili et al., 1992]. It is sometimes impractical to build a large thesaurus from scratch based on only co-occurrence data. Based on this observation, a third approach has been proposed, namely, combining linguistic knowledge and co-occurrence data [Resnik, 1992; Uramoto, 1996]. This approach aims to compensate for the sparseness of co-occurrence data by using existing linguistic knowledge, such as WordNet. This paper follows this line of research and proposes a method to extend an existing thesaurus by classifying new words in terms of that thesaurus. In other words, the proposed method identifies appropriate |
| File Format | PDF HTM / HTML |
| Alternate Webpage(s) | http://aclweb.org/anthology/W97-0803 |
| Alternate Webpage(s) | http://aclweb.org/anthology//W/W97/W97-0803.pdf |
| Alternate Webpage(s) | http://anthology.aclweb.org/W/W97/W97-0803.pdf |
| Alternate Webpage(s) | http://acl.ldc.upenn.edu/W/W97/W97-0803.pdf |
| Alternate Webpage(s) | http://aclweb.org/anthology/W/W97/W97-0803.pdf |
| Alternate Webpage(s) | http://www.hum.uva.nl/~ewn/workshop/Takunaga.ps |
| Alternate Webpage(s) | http://www.aclweb.org/anthology/W97-0803 |
| Alternate Webpage(s) | http://ucrel.lancs.ac.uk/acl/W/W97/W97-0803.pdf |
| Alternate Webpage(s) | http://www.aclweb.org/anthology/W/W97/W97-0803.pdf |
| Alternate Webpage(s) | http://wing.comp.nus.edu.sg/~antho/W/W97/W97-0803.pdf |
| Alternate Webpage(s) | http://www.aclweb.org/anthology-new/W/W97/W97-0803.pdf |
| Language | English |
| Access Restriction | Open |
| Content Type | Text |
| Resource Type | Article |
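
The abstract describes classifying a new word by the relative probability that it belongs to each word class, estimated from noun-verb co-occurrence pairs. The record does not give the paper's actual formula, so the following is only a minimal sketch that swaps in a plain naive Bayes estimate with add-one smoothing. The function names (`train`, `classify`), the toy thesaurus, and the toy pairs are all hypothetical and serve illustration only.

```python
import math
from collections import Counter, defaultdict


def train(thesaurus, pairs):
    """Count verb co-occurrences per word class from (noun, verb) pairs.

    thesaurus maps known nouns to a class label (hypothetical toy format).
    """
    class_verbs = defaultdict(Counter)  # class -> verb -> count
    class_totals = Counter()            # class -> number of labeled pairs
    vocab = set()                       # every verb seen in the pairs
    for noun, verb in pairs:
        vocab.add(verb)
        cls = thesaurus.get(noun)
        if cls is None:
            continue  # noun is not in the thesaurus, so it has no class label
        class_verbs[cls][verb] += 1
        class_totals[cls] += 1
    return class_verbs, class_totals, vocab


def classify(new_word_verbs, model):
    """Rank classes by relative probability for a new word.

    Assumption: naive Bayes with add-one smoothing, not the paper's model.
    """
    class_verbs, class_totals, vocab = model
    n_pairs = sum(class_totals.values())
    log_scores = {}
    for cls, total in class_totals.items():
        score = math.log(total / n_pairs)  # class prior
        for verb in new_word_verbs:
            p = (class_verbs[cls][verb] + 1) / (total + len(vocab))
            score += math.log(p)
        log_scores[cls] = score
    # Normalize in a numerically stable way so the scores sum to 1.
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(weights.values())
    return sorted(((c, w / z) for c, w in weights.items()),
                  key=lambda item: -item[1])


# Toy data, invented for illustration only.
toy_thesaurus = {"beer": "beverage", "wine": "beverage",
                 "bread": "food", "rice": "food"}
toy_pairs = [("beer", "drink"), ("wine", "drink"), ("wine", "pour"),
             ("bread", "eat"), ("rice", "eat"), ("rice", "cook"),
             ("sake", "drink"), ("sake", "pour")]

model = train(toy_thesaurus, toy_pairs)
ranked = classify(["drink", "pour"], model)  # verbs observed with "sake"
```

Here `ranked` lists the candidate classes for the unseen noun, most probable first. Working in log space avoids underflow when a new word has many co-occurring verbs, and the smoothing keeps a single unseen verb from zeroing out a class.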