Finding influential people from a historical news repository
Content Provider | Indraprastha Institute of Information Technology, Delhi |
---|---|
Author | Gupta, Aayushee |
Abstract | Historical newspaper archives provide a wealth of information. They are of particular interest to genealogists, historians and scholars for People Search. In this thesis, we design a People Gazetteer from the noisy OCR text of historical newspapers and identify "influential" people from it. A People Gazetteer is a dictionary of personal names; each entry of the gazetteer is a tuple containing a person name and a list of articles in which his name occurs along with the corresponding topic associated with each article. To build the People Gazetteer, we first spell correct the noisy text using an edit distance based algorithm. A novel N-gram based evaluation algorithm is designed for measuring the performance of the spell corrector. Next, a Named Entity Recognizer is run on the text of each article to identify person entities and an LDA-based topic detector to assign categories to articles. To identify influential people across each category of People Gazetteer, we define the notion of an Influential Person Index (IPI) and rank based on it. Our corpus is a sample of 14020 OCR newspaper articles (roughly two months' data) obtained from "The Sun" newspaper in the Chronicling America project. We present results on the top-K influential people obtained from our algorithm by varying its parameters and verify results using Wikipedia. |
File Format | |
Language | English |
Publisher | IIIT Delhi |
Access Restriction | Open |
Subject Keyword | Gazetteer Text Mining Information Retrieval OCR Spelling Correction Historical data Influential people detection |
Content Type | Text |
Educational Degree | Master of Technology (M.Tech.) |
Resource Type | Thesis |
Subject | Data processing & computer science |
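
The abstract describes two building blocks that a short sketch can make concrete: edit-distance-based spelling correction and the shape of a People Gazetteer entry (a person name plus a list of articles, each with its topic). The code below is only an illustration under stated assumptions, not the thesis's implementation: it assumes plain Levenshtein distance, a hypothetical reference `vocabulary`, and a hypothetical `max_distance` threshold, and the names `correct`, `GazetteerEntry`, and `add_mention` are invented here. The Influential Person Index is not sketched because the abstract does not define it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def correct(token: str, vocabulary: List[str], max_distance: int = 2) -> str:
    """Return the closest vocabulary word, or the token itself if none is close.

    Assumption: both the vocabulary and the threshold are hypothetical;
    the abstract only says the corrector is edit-distance based.
    """
    if not vocabulary:
        return token
    best = min(vocabulary, key=lambda w: edit_distance(token, w))
    return best if edit_distance(token, best) <= max_distance else token


@dataclass
class GazetteerEntry:
    """One gazetteer entry: a person name and (article_id, topic) pairs."""
    name: str
    articles: List[Tuple[str, str]] = field(default_factory=list)


def add_mention(gazetteer: Dict[str, GazetteerEntry],
                name: str, article_id: str, topic: str) -> None:
    """Record that `name` occurs in `article_id`, which was assigned `topic`."""
    entry = gazetteer.setdefault(name, GazetteerEntry(name))
    entry.articles.append((article_id, topic))
```

In this sketch the gazetteer is a dictionary keyed by person name, so repeated mentions of the same name accumulate in one entry; in the pipeline the abstract outlines, the names would come from a Named Entity Recognizer and the topics from an LDA-based detector.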