NDLI: Naming Clusters in Visualization Studies: Parsing and Filtering of Noun Phrases from Citation Contexts

Content Provider	Semantic Scholar
Author	Schneider, Jesper W.
Copyright Year	2005
Abstract	This study presents a semi-automatic method for parsing and filtering noun phrases from citation contexts. The purpose of the method is to extract contextual, agreed-upon, and pertinent noun phrases for use in visualization studies when naming clusters (concept groups) or concept symbols. The method is applied in a case study that forms part of a larger dissertation work on the applicability of bibliometric methods for thesaurus construction. The case study is carried out within periodontology, a specialty area of dentistry. The results indicate that the method is able to identify highly important noun phrases, and that these phrases accurately describe their parent clusters. Hence, the method can reduce the labour-intensive work of manual citation context analysis, though further refinements are still needed.

Introduction

A common challenge in literature visualization studies is how to interpret the actual mappings of document entities. Dimensionality and link reduction algorithms are typically applied to co-citation networks to reveal their salient structures (e.g., Börner, Chen, & Boyack, 2003). However, the interpretability of the resulting mappings is greatly enhanced if the co-citation networks are transformed into conceptual networks (e.g., Small, 1986). In this paper, we introduce a semi-automatic parsing method designed to transform a document co-citation network into a conceptual network of noun phrases.

Most often in document co-citation analyses, the aggregate clusters of cited references are named by single words (White & McCain, 1989; 1997; Wilson, 1999). The process of naming clusters is usually automatic. Specific entities are extracted from the documents citing the individual members of a cluster and subsequently subjected to a frequency analysis.
Consequently, the most frequently occurring entities of the citing documents in the research front are used to name the topic(s) or concept(s) of the cluster (e.g., White & McCain, 1989; Wilson, 1999). It is important to emphasize that automatic extraction of entities typically means extraction of single-word entities, not multiple-word entities such as noun phrases. Notable exceptions are the studies by Noyons and colleagues (e.g., 1999), where noun phrases are extracted from titles and abstracts of citing papers.

For practical reasons, the composition of document representations in the citation databases of ISI usually determines the entities available for naming clusters (White & McCain, 1989). The most commonly used of these entities are title words or ISI’s special subject categories. Conversely, domain-dependent databases are often used in co-word studies (He, 1999). Further, domain-dependent databases can also be used in conjunction with citation databases in document co-citation studies. In the latter case, the same bibliographic reference is identified in both the citation database and the domain-dependent database (Ingwersen & Christensen, 1997). The document representation in the citation database provides the references needed for the co-citation analysis, whereas the domain-specific indexing descriptors and classification codes for the same document can be obtained from the domain-dependent database (Ingwersen & Christensen, 1997). As a result, the latter can be used for naming or evaluating the generated co-citation clusters.

The seminal work by Small (1978) offers a more sophisticated approach to the transformation of document co-citation networks into conceptual networks. Small (1978) established that highly cited documents symbolize concepts to those who cite them.
While it has long been known that references, when turned into citations, can be construed as subject headings (e.g., Garfield, 1974), different people may construe the same cited document differently. Small (1978) showed, however, that citing authors in chemistry tend to be both specific and highly uniform in the meanings they assign to cited documents, as revealed by the contexts of the references. Scientists tend to give earlier works consensual meaning by ‘piling up’ identical or similar words and phrases in the sentences in which their citation markers are embedded (Small, 1978). Consequently, when citation contexts show that citing authors have used a cited document to stand for a given idea more or less uniformly over many papers, the document has, according to Small (1978), attained the status of a concept symbol. Accordingly, the highly cited document communicates a specific topic and resembles a subject heading or descriptor.

The focus of citation context analysis is the naming of individual cited references. As a result, the basis for naming their aggregate clusters is the common concept(s) identified among the member concept symbols. It is believed that naming references and clusters by means of citation context analysis ensures contextual and pertinent phrasal concepts (e.g., Small, 1986). Unfortunately, citation context analysis is usually labour-intensive work, in which the full text of citing documents is manually scanned to identify citation contexts in order to select key phrasal concepts that capture major aspects of the cited documents (e.g., Small, 1978; 1986; Rees-Potter, 1989). Research by O’Connor (1982; 1983) shows that citation contexts can be identified automatically within the structure of full-text documents. Nevertheless, O’Connor (1982; 1983) extracted only single words from the citation contexts.
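
The idea of locating citation contexts automatically can be illustrated with a minimal Python sketch. It is not the procedure used in the studies cited above; it simply assumes that a citation context is a sentence containing an author-year marker in parentheses. The regular expressions, the function name `citation_contexts`, and the sample text (with placeholder authors) are all assumptions made for illustration.

```python
import re

# A parenthetical that contains a plausible publication year, for example
# "(Doe, 2001)" or "(2001; 2003)". Assumption: author-year citation style only.
CITATION_MARKER = re.compile(r"\([^()]*\b(?:1[89]|20)\d{2}[a-z]?\b[^()]*\)")

# Naive sentence boundary: end punctuation, whitespace, then a capital letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def citation_contexts(text):
    """Return the sentences in `text` that contain a citation marker."""
    sentences = SENTENCE_BOUNDARY.split(text.strip())
    return [s for s in sentences if CITATION_MARKER.search(s)]


# Invented sample text; "Doe" and "Roe" are placeholder authors.
sample = (
    "Regenerative therapy was first described in detail (Doe, 2001). "
    "This sentence carries no citation marker. "
    "Later trials confirmed the finding (e.g., Roe, 2003a)."
)

for sentence in citation_contexts(sample):
    print(sentence)
```

A sentence-level window is the simplest possible choice; numeric citation styles, abbreviations that end in a full stop before a capital letter, and contexts that span several sentences would all need extra handling.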
Furthermore, Small (1979) has pointed out that it is unlikely that the identification of concept symbols can be done entirely automatically. According to Small (1979), a computer cannot recognize unforeseen synonymy; the words and phrases that show consensus on what documents symbolize must therefore be recognized by a human reader.

The purpose of the present paper is to introduce a semi-automatic method for extracting noun phrases by natural language parsing of citation contexts in citing documents. This differs from the studies by Noyons and colleagues (e.g., 1999) mentioned above, where phrases are extracted from titles and abstracts of citing papers. A subsequent frequency analysis and filtering procedure create a portfolio of important noun phrases, which is attached to each of the highly cited references. The portfolio of noun phrases constitutes the basis for characterizing the cited references and, eventually, for naming their parent clusters. Accordingly, the method transforms a document co-citation network into a conceptual network of noun phrases.

The aim of the present work is to improve on the single-word naming of clusters by using noun phrases instead. We believe that noun phrases more accurately describe the topic(s) and concept(s) of a cluster and its individual members. As demonstrated by Small (1978), such noun phrases should be extracted from citation contexts of citing documents. Further, these noun phrases are contextual and probably reflect consensus usage of terminology; accordingly, they resemble agreed-upon indexing descriptors. The aim is therefore to develop and explore a method that can reduce the labour-intensive work involved in citation context analysis. We believe that parsing of citation contexts reduces the workload and eventually improves the process of naming clusters and their individual members in mapping studies.

The paper is composed of four main sections.
Section two presents the integrated method of document co-citation analysis, visualization, and citation context analysis used to construct the conceptual network. Section three outlines the main results. Finally, in section four we discuss the main findings.

Method

The present study derives from a comprehensive dissertation work that investigates the applicability of bibliometric methods for semi-automatic thesaurus construction (Schneider, 2004; Schneider & Borlund, 2004). Only the basic methodological steps are outlined here. Readers are referred to Schneider (2004) for a more detailed description of the method and the results. In this study, we apply an integrated method that combines document co-citation analysis; visualization, including complete-link cluster analysis and Pathfinder network scaling; and citation context analysis, including semi-automatic parsing of citation contexts. The study is based on bibliographic data retrieved and downloaded from Science Citation Index® (SCI®) hosted by Dialog®. The bibliographic data contain 801 research and review papers published within periodontology in 2001. Periodontology is a specialty area within dentistry.
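
The frequency analysis and filtering of phrases from the citation contexts of one cited reference can be sketched as follows. This is only a rough illustration under stated assumptions, not the parser or the filter used in the study: the excerpt does not name the parser, so stopword-delimited word runs stand in for parsed noun phrases, and the share threshold `min_share`, the stopword list, the function names, and the sample contexts (with a placeholder author) are all hypothetical.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption). A real noun phrase parser
# would rely on part-of-speech information instead.
STOPWORDS = {
    "a", "an", "the", "of", "in", "on", "for", "to", "and", "or", "is", "are",
    "was", "were", "by", "with", "that", "this", "as", "at", "from", "has",
    "have", "been", "be", "it", "its", "can", "may", "which",
}


def candidate_phrases(context, max_len=3):
    """Return the set of 2..max_len word candidates in one citation context.

    Candidates are n-grams taken from runs of consecutive non-stopword words;
    punctuation, digits, and stopwords all end a run.
    """
    tokens = re.findall(r"[^\W\d_]+|\S", context.lower())
    phrases, run = set(), []
    for token in tokens + ["."]:  # the trailing breaker flushes the last run
        if token.isalpha() and token not in STOPWORDS:
            run.append(token)
            continue
        for size in range(2, max_len + 1):
            for start in range(len(run) - size + 1):
                phrases.add(" ".join(run[start:start + size]))
        run = []
    return phrases


def phrase_portfolio(contexts, min_share=0.5):
    """Keep phrases that occur in at least `min_share` of the contexts."""
    counts = Counter()
    for context in contexts:
        counts.update(candidate_phrases(context))  # one count per context
    threshold = min_share * len(contexts)
    return sorted(p for p, n in counts.items() if n >= threshold)


# Invented citation contexts for one cited reference; "Doe" is a placeholder.
contexts = [
    "Guided tissue regeneration improves attachment levels (Doe, 1990).",
    "Doe (1990) introduced guided tissue regeneration in furcation defects.",
    "The outcome of guided tissue regeneration was reported by Doe (1990).",
]

print(phrase_portfolio(contexts, min_share=1.0))
```

Counting each phrase once per context, rather than once per occurrence, reflects the notion of consensus among citing authors. The sketch makes no attempt to merge synonyms or nested phrases, which is where the human judgement described above would still be needed.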
File Format	PDF HTM / HTML
Alternate Webpage(s)	http://www.issi-society.org/proceedings/issi_2005/Schneider_ISSI2005.pdf
Alternate Webpage(s)	http://www.db.dk/binaries/schneider%20(2005).pdf
Language	English
Access Restriction	Open
Content Type	Text
Resource Type	Article
