Size estimation of non-cooperative data collections.
| Content Provider | CiteSeerX |
|---|---|
| Author | Hiemstra, Djoerd; Khelghati, Mohammadreza; Keulen, Maurice Van |
| Abstract | As the amount of data in deep web sources (hidden from general search engines behind web forms) grows, accessing this data has gained more attention. In the algorithms applied for this purpose, knowing the size of a data source enables accurate decisions about when to stop crawling or sampling, processes that can be very costly in some cases [14]. Interest in the sizes of data sources is further increased by competition among businesses on the Web, where data coverage is critical. This information is also helpful for quality assessment of search engines [7], for search engine selection in federated search engines, and for resource/collection selection in distributed search [19]. In addition, it can provide insight into useful statistics for public sectors such as governments. In any of these scenarios, when facing a non-cooperative collection that does not publish its information, the size has to be estimated [17]. In this paper, the approaches suggested in the literature for this purpose are categorized and reviewed. The most recent approaches are implemented and compared in a real environment. Finally, four methods based on modifications of the available techniques are introduced and evaluated. In one of the modifications, the estimations from other approaches could be improved by 35 to 65 percent. |
| File Format | |
| Access Restriction | Open |
| Subject Keyword | Non-cooperative Data Collection; Size Estimation; Sampling Process; Real Environment; Data Source Size; Distributed Search Field; Public Sector; Search Engine Selection; Non-cooperative Collection; Quality Assessment; Data Coverage; Suggested Approach; Federated Search Engine; Search Engine; Recent Approach; Available Technique; Data Source; Deep Web Source; Accurate Decision; Web Form; Resource Collection Selection; Useful Statistic; General Search Engine |
| Content Type | Text |