Data Extraction from Structured HTML Sources
| Content Provider | Semantic Scholar |
|---|---|
| Author | Winston, A. A. |
| Copyright Year | 2005 |
| Abstract | Master's thesis in Computer Science by Alexis Winston, California State University, Chico, 2004. The Tree Mapping System (TMS) uses a template to automatically extract data from a set of HTML documents that share a common structure. The template is generated semi-automatically: the user provides example documents from the set and marks the regions of content to be extracted, and the system creates a template encoding the document structure. During extraction, the system maps the template onto each document in the set to locate the target data. TMS employs an original mapping algorithm that calculates the similarity between nodes in the trees representing the documents by comparing node properties and tree structures. The algorithm finds the mapping from the nodes of one tree to the nodes of the other that maximizes this similarity measure. The same mapping algorithm is used in three places: to generate a template from a set of example documents, to locate repeated regions within a single document, and to map a document to a template during extraction. |
| File Format | PDF HTM / HTML |
| Alternate Webpage(s) | http://www.ecst.csuchico.edu/~bjuliano/Papers/PDF/thesis_winston.pdf |
| Language | English |
| Access Restriction | Open |
| Content Type | Text |
| Resource Type | Thesis |
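
The abstract describes the mapping step only in outline, so the following is a minimal sketch of the general idea rather than the TMS algorithm itself. It assumes a node's only property is its tag name, that children are matched in document order, and that similarity is the count of matched nodes; the names `Node`, `node_score`, and `map_trees` are hypothetical and do not come from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A simplified HTML element: a tag name and its ordered children."""
    tag: str
    children: tuple = ()


def node_score(a, b):
    # Assumption: nodes are compared by tag name only.
    return 1.0 if a.tag == b.tag else 0.0


def map_trees(a, b):
    """Return (similarity, pairs) for the best order-preserving mapping
    of the subtree rooted at `a` onto the subtree rooted at `b`."""
    own = node_score(a, b)
    if own == 0.0:
        return 0.0, []  # mismatched roots: nothing below them is mapped
    n, m = len(a.children), len(b.children)
    # best[i][j] holds the best (score, pairs) for aligning the first i
    # children of `a` with the first j children of `b`.
    best = [[(0.0, [])] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s, pairs = map_trees(a.children[i - 1], b.children[j - 1])
            matched = (best[i - 1][j - 1][0] + s, best[i - 1][j - 1][1] + pairs)
            best[i][j] = max(best[i - 1][j], best[i][j - 1], matched,
                             key=lambda t: t[0])
    score, pairs = best[n][m]
    return own + score, [(a, b)] + pairs


# A two-row "template" mapped onto a document with an extra <caption>.
template = Node("table", (Node("tr", (Node("td"), Node("td"))),
                          Node("tr", (Node("td"),))))
doc = Node("table", (Node("caption"),
                     Node("tr", (Node("td"), Node("td"))),
                     Node("tr", (Node("td"),))))
score, pairs = map_trees(template, doc)
```

In this sketch the unmatched `caption` element is simply skipped, which illustrates how a template can still be mapped onto a document whose structure differs slightly. A fuller similarity measure would also weigh attributes and text content.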