Content Provider | ACM Digital Library |
---|---|
Author | Olagunju, Amos O. |
Abstract | The question of how to index documents is a central problem in document retrieval. The indexing problem can be stated as follows. There exists a large document collection, together with a population of retrieval-system customers, each of whom wants information that he thinks might be supplied by documents in the collection. How should the documents in the collection be identified ("indexed," "cataloged," etc.) so that the collection can be searched to the maximal collective benefit of the customers? The problem under investigation is that of developing a set of formal statistical rules for selecting the keywords of a document, the words likely to be useful as index terms for that document. A number of simple weighting techniques have been suggested for selecting the keywords of a document: (i) frequency of occurrence in a document, (ii) frequency/document length, (iii) frequency/frequency in document collection, and (iv) frequency/(document length x frequency in collection). These have been examined in detail by Sparck Jones, [Sp73]. The major result of her experiments is that no single technique is best, except that (i) is consistently outperformed by the others. Her experiments also show that automatic indexing sometimes, but not always, outperforms manual controlled indexing. This has led to more sophisticated procedures for selecting keywords. The first such technique was developed by Salton, [Sa75], and is known as the discrimination value model. The technique measures the effectiveness of a term by examining what happens if that term is removed from the index. The assumption is made that if all the documents seem more similar to one another after a term has been removed from the index, then that term has a descriptive power whose magnitude is represented by the change in total similarity.
Salton has found significant retrieval improvement by using the discrimination value model to select the index terms for certain collections of documents. A second, more sophisticated technique has been developed by Harter, [Ha75]. The technique is based upon the distribution characteristics of terms throughout the document collection. Harter's technique rests on the hypothesis that authors choose terms, other than those directly related to the subject under discussion, randomly from a fixed vocabulary when composing a document. If this is in fact the case, then the distribution characteristics of the non-descriptive terms should be described by a Poisson distribution. It has been further hypothesized that the descriptive terms are chosen by authors randomly in relation to a particular topic. If this is the case, the distribution of these terms within documents dealing with the topic in question should also be describable by the Poisson function f(k) = EXP(-L + k*LN(L))/k!, which gives the probability, f(k), that a document contains k occurrences of a particular term, L being the mean number of occurrences of the term in each document of the collection, where the term is randomly distributed. This gives rise to the 2-Poisson model, [Bo75], which states that the distribution of a term within a document collection should be describable by two Poisson distributions, one describing the usage of the term as a "background" term and the other its usage as a keyword. Thus the overall model is a combination of two Poisson functions and takes the form f(k) = p*EXP(-L1 + k*LN(L1))/k! + (1-p)*EXP(-L2 + k*LN(L2))/k!, where L1 and L2 represent the mean number of occurrences of the term in each of the two classes and p is the proportion of documents in which the term is a keyword.
Bookstein and Swanson, [Bo74], found that the 2-Poisson model did not successfully describe the distributions of all keywords, since the complete validity of the model rests on the rather naive assumption that there are exactly two ways in which a term is used. Harter, [Ha75], suggests (L1-L2)/SQRT(L1+L2) as an effective measure of the usefulness of an index term. In his probabilistic approach to keyword selection, Harter [Ha75] used the less efficient moment estimators for estimating the parameters of mixtures of discrete distributions. Harter emphasized that the method of maximum likelihood provides iterative rather than exact solutions for a mixture of two distributions, and that the solutions are, in general, very slow to converge. Contending that the method of moment estimators would have been acceptable back in the 1930s, when computers were unavailable to statisticians, Olagunju, [Ol80], has investigated the properties of the 2-Poisson model. In this presentation we show how a combination of the method of moments and the method of maximum likelihood can be used for estimating the parameters of the 2-Poisson distribution. The likelihood function for the 2-Poisson model is given by L = PRODUCT [f(Xi | p, L1, L2), i=1 to N], and its logarithm by Log[L] = SUM [Ni*Log(p*EXP(-L1 + i*LN(L1))/i! + (1-p)*EXP(-L2 + i*LN(L2))/i!), i=0 to ∞], where Ni is the number of documents containing i occurrences of the term. The log-likelihood Log[L] is used to estimate the parameters p, L1 and L2, since it is easier to maximize than the likelihood itself. In fact, by Taylor's series expansion, the point where the likelihood is a maximum is a solution of a system of three equations. The logarithm of the likelihood function for the Degenerate 2-Poisson model is given by Log[L] = N0*Log[p*EXP(-L1) + (1-p)] + SUM [Ni*Log(p*EXP(-L1 + i*LN(L1))/i!), i=1 to ∞].
In Olagunju's thesis, [Ol80], the 2-Poisson model and the Degenerate 2-Poisson model are examined in detail as models of keyword distribution, and formulae expressing the parameters of the models in terms of empirical frequency statistics are derived. Finally, a measure, consistent with the 2-Poisson and the Degenerate 2-Poisson models, intended to identify keywords is proposed. |
File Format | |
ISBN | 0897912187 |
DOI | 10.1145/322917.323048 |
Language | English |
Publisher | Association for Computing Machinery (ACM) |
Publisher Date | 1987-02-01 |
Publisher Place | New York |
Access Restriction | Subscribed |
Content Type | Text |
Resource Type | Article |
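The four simple weighting techniques listed in the abstract can be sketched as follows; this is an illustrative example, not code from the paper, and all function and variable names are invented here:

```python
from collections import Counter

def keyword_weights(doc_tokens, collection_freq):
    """Compute the four simple term weights from the abstract:
    (i)   raw frequency in the document,
    (ii)  frequency / document length,
    (iii) frequency / frequency in the collection,
    (iv)  frequency / (document length * collection frequency)."""
    n = len(doc_tokens)
    tf = Counter(doc_tokens)
    weights = {}
    for term, f in tf.items():
        # Fall back to the in-document frequency if the term is unseen
        # elsewhere in the collection (an assumption of this sketch).
        cf = collection_freq.get(term, f)
        weights[term] = {
            "tf": f,                          # (i)
            "tf_over_len": f / n,             # (ii)
            "tf_over_cf": f / cf,             # (iii)
            "tf_over_len_cf": f / (n * cf),   # (iv)
        }
    return weights

# Tiny hypothetical example: a rare term vs. a common function word.
doc = ["poisson", "model", "poisson", "keyword", "the", "the", "the"]
cf = {"poisson": 5, "model": 40, "keyword": 8, "the": 10000}
w = keyword_weights(doc, cf)
```

Under weight (iii) the collection-rare term "poisson" scores far above "the", even though "the" is more frequent in the document, which is the effect Sparck Jones's comparison examines.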
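The 2-Poisson mixture f(k) = p*EXP(-L1 + k*LN(L1))/k! + (1-p)*EXP(-L2 + k*LN(L2))/k! described in the abstract translates directly into code; a minimal sketch, with names of our own choosing:

```python
import math

def poisson_pmf(k, lam):
    # Probability that a term occurs exactly k times in a document,
    # given mean occurrence rate lam: exp(-lam) * lam**k / k!
    return math.exp(-lam + k * math.log(lam)) / math.factorial(k)

def two_poisson_pmf(k, p, l1, l2):
    # Mixture of two Poissons: with probability p the term is used as a
    # keyword (mean rate l1), otherwise as a background term (mean rate l2).
    return p * poisson_pmf(k, l1) + (1 - p) * poisson_pmf(k, l2)
```

Since each component is a proper Poisson distribution, the mixture probabilities over k = 0, 1, 2, ... sum to 1 for any 0 <= p <= 1.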
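The abstract notes that maximum-likelihood estimation for the mixture is iterative. One standard iterative scheme is expectation-maximization; the sketch below uses EM with fixed starting values, which is only an illustration — the thesis's combined moment/ML estimators are not given in this abstract. Harter's term-usefulness measure (L1-L2)/SQRT(L1+L2) is included as well.

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def fit_two_poisson(counts, p=0.5, l1=2.0, l2=0.5, iters=200):
    """Estimate (p, L1, L2) of a 2-Poisson mixture by EM.
    counts[i] is the number of occurrences of the term in document i.
    Starting values are arbitrary assumptions of this sketch."""
    for _ in range(iters):
        # E-step: posterior probability each count came from the keyword class.
        resp = []
        for k in counts:
            a = p * poisson_pmf(k, l1)
            b = (1 - p) * poisson_pmf(k, l2)
            resp.append(a / (a + b))
        # M-step: re-estimate the mixing proportion and the two class means.
        s = sum(resp)
        p = s / len(counts)
        l1 = sum(r * k for r, k in zip(resp, counts)) / s
        l2 = sum((1 - r) * k for r, k in zip(resp, counts)) / (len(counts) - s)
    return p, l1, l2

def harter_z(l1, l2):
    # Harter's measure of the usefulness of an index term.
    return (l1 - l2) / math.sqrt(l1 + l2)

# Hypothetical per-document counts: mostly background usage, a few
# documents where the term is clearly a keyword.
p_hat, l1_hat, l2_hat = fit_two_poisson([0, 0, 0, 0, 1, 0, 0, 0, 5, 6, 4, 5])
```

A large value of harter_z (the two class means well separated relative to their spread) marks the term as a good index-term candidate.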
National Digital Library of India (NDLI) is a virtual repository of learning resources which is not just a repository with search/browse facilities but provides a host of services for the learner community. It is sponsored and mentored by Ministry of Education, Government of India, through its National Mission on Education through Information and Communication Technology (NMEICT). Filtered and federated searching is employed to facilitate focused searching so that learners can find the right resource with least effort and in minimum time. NDLI provides user group-specific services such as Examination Preparatory for School and College students and job aspirants. Services for Researchers and general learners are also provided. NDLI is designed to hold content of any language and provides interface support for 10 most widely used Indian languages. It is built to provide support for all academic levels including researchers and life-long learners, all disciplines, all popular forms of access devices and differently-abled learners. It is designed to enable people to learn and prepare from best practices from all over the world and to facilitate researchers to perform inter-linked exploration from multiple sources. It is developed, operated and maintained from Indian Institute of Technology Kharagpur.
NDLI is a conglomeration of freely available or institutionally contributed or donated or publisher managed contents. Almost all these contents are hosted and accessed from respective sources. The responsibility for authenticity, relevance, completeness, accuracy, reliability and suitability of these contents rests with the respective organization and NDLI has no responsibility or liability for these. Every effort is made to keep the NDLI portal up and running smoothly unless there are some unavoidable technical issues.
Ministry of Education, through its National Mission on Education through Information and Communication Technology (NMEICT), has sponsored and funded the National Digital Library of India (NDLI) project.
Sl. | Authority | Responsibilities | Communication Details |
---|---|---|---|
1 | Ministry of Education (GoI), Department of Higher Education | Sanctioning Authority | https://www.education.gov.in/ict-initiatives |
2 | Indian Institute of Technology Kharagpur | Host Institute of the Project: The host institute of the project is responsible for providing infrastructure support and hosting the project | https://www.iitkgp.ac.in |
3 | National Digital Library of India Office, Indian Institute of Technology Kharagpur | The administrative and infrastructural headquarters of the project | Dr. B. Sutradhar bsutra@ndl.gov.in |
4 | Project PI / Joint PI | Principal Investigator and Joint Principal Investigators of the project | Dr. B. Sutradhar bsutra@ndl.gov.in; Prof. Saswat Chakrabarti (will be added soon) |
5 | Website/Portal (Helpdesk) | Queries regarding NDLI and its services | support@ndl.gov.in |
6 | Contents and Copyright Issues | Queries related to content curation and copyright issues | content@ndl.gov.in |
7 | National Digital Library of India Club (NDLI Club) | Queries related to NDLI Club formation, support, user awareness program, seminar/symposium, collaboration, social media, promotion, and outreach | clubsupport@ndl.gov.in |
8 | Digital Preservation Centre (DPC) | Assistance with digitizing and archiving copyright-free printed books | dpc@ndl.gov.in |
9 | IDR Setup or Support | Queries related to establishment and support of Institutional Digital Repository (IDR) and IDR workshops | idr@ndl.gov.in |