Utilizing the Subjective Intent of Authoring Formats to Perform Focused Web Crawling
| Content Provider | Semantic Scholar |
|---|---|
| Author | Leung, Hok Peng; Hsu, Wynne |
| Copyright Year | 2001 |
| Abstract | A successful web information retrieval system requires the ability to determine quickly and accurately whether a document or a link should be further explored. Current state-of-the-art web search engines typically use the meta-information in the HTML header to determine the relevancy of the documents. However, many documents on the web do not have such HTML header information. On the other hand, most web documents are formatted carefully to convey some message to the reader. The hidden information embedded in these formatting tags serves as a good source for determining the relevancy of a document with respect to the query context. In this paper, we propose a fast and accurate approach to determining the relevancy of a document by taking into account the information embedded within these formatting tags. Using such information, we are able to quickly narrow down the scope of our search to the most promising sites. In addition, a new query formulation strategy is proposed to further improve the accuracy of the new approach. Based on this new approach, a crawling strategy is proposed. A number of experiments have been conducted to test the effectiveness of the proposed approach and the crawling strategy. Experimental results indicate that we are able to achieve a significant improvement over the standard information retrieval algorithm based on tf*idf. Furthermore, our algorithm, unlike the tf*idf scheme, does not require the whole document space to be known in advance. This feature makes our algorithm suitable for use on the web, where it is impossible to know the entire document space in advance. |
| File Format | PDF HTM / HTML |
| Alternate Webpage(s) | http://www.comp.nus.edu.sg/~whsu/publication/2000/www9.pdf |
| Language | English |
| Access Restriction | Open |
| Content Type | Text |
| Resource Type | Article |
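
The abstract's central idea, scoring a document's relevance from the formatting tags that surround query terms, can be illustrated with a minimal sketch. This is not the authors' algorithm: the tag weights in `TAG_WEIGHTS`, the class `TagWeightedScorer`, and the function `relevance` are hypothetical names and values chosen only for illustration, since the abstract gives no formula. The sketch does share one property the abstract highlights: each document is scored on its own, with no corpus-wide statistics such as idf.

```python
import re
from html.parser import HTMLParser

# Hypothetical weights: the abstract does not specify any values.
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0,
               "b": 2.0, "strong": 2.0, "a": 1.5}
DEFAULT_WEIGHT = 1.0  # text outside any emphasized tag


class TagWeightedScorer(HTMLParser):
    """Accumulates a score for query terms, weighted by enclosing tags."""

    def __init__(self, query_terms):
        super().__init__()
        self.query_terms = {t.lower() for t in query_terms}
        self.open_tags = []  # currently open tags that carry a weight
        self.score = 0.0

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recently opened matching tag, if any.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break

    def handle_data(self, data):
        # Use the strongest weight among the enclosing emphasized tags.
        weight = max((TAG_WEIGHTS[t] for t in self.open_tags),
                     default=DEFAULT_WEIGHT)
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            if word in self.query_terms:
                self.score += weight


def relevance(html, query_terms):
    """Return a tag-weighted relevance score for a single HTML document."""
    scorer = TagWeightedScorer(query_terms)
    scorer.feed(html)
    scorer.close()
    return scorer.score
```

A focused crawler could, under the same assumptions, use such a score to decide which fetched pages are promising enough to have their outgoing links explored first.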