Semi-structured data extraction from heterogeneous sources
| Content Provider | Semantic Scholar |
|---|---|
| Author | Gao, Xiaoying; Sterling, Leon |
| Copyright Year | 2000 |
| Abstract | This paper concerns the extraction of semi-structured data from Web pages generated by multiple on-line services. This task is addressed by representing the schemas for semi-structured data and crafting generic wrappers based on the schemas. We introduce a hybrid representation method for schemas of semi-structured data, consisting of a concept hierarchy and a set of knowledge unit frames. A content-based and structure-bounded information extraction algorithm is developed to build the generic wrapper, which utilizes the schemas and takes advantage of the semi-structured page layouts. The main advantages of the system are that a single wrapper can be applied to multiple Web sites, and that the wrapper can handle resources with missing data and data presented in free text, which cannot be wrapped by existing techniques. The hybrid representation has been used for writing schemas for seven domains. Experiments in two domains, on-line real estate advertisements and car advertisements, show that the generic wrapper is robust across many flexible data presentations and page structures. |
| Starting Page | 83 |
| Ending Page | 102 |
| Page Count | 20 |
| File Format | PDF HTM / HTML |
| DOI | 10.4018/978-1-878289-82-7.ch005 |
| Alternate Webpage(s) | http://www.cs.mu.oz.au/~xga/iiis99.ps |
| Language | English |
| Access Restriction | Open |
| Content Type | Text |
| Resource Type | Article |