An unsupervised approach for product record normalization across different web sites
| Content Provider | CiteSeerX |
|---|---|
| Author | Lam, Wai; Wong, Tik-Shun; Wong, Tak-Lam |
| Abstract | An unsupervised probabilistic learning framework for normalizing product records across different retailer Web sites is presented. Our framework decomposes the problem into two tasks. The first task extracts attribute values of products from different sites and normalizes them into appropriate reference attributes. This task is challenging because the set of reference attributes is unknown in advance and because layout formats differ across Web sites. The second task is product record normalization: identifying product records that refer to the same reference product, based on the results of the first task. We develop a graphical model for the generation of text fragments in Web pages to accomplish the two tasks. One characteristic of our model is that the product attributes to be extracted need not be specified in advance, and an unlimited number of previously unseen product attributes can be handled. We compare our framework with existing methods. Extensive experiments on over 300 Web pages from over 150 real-world Web sites in three different domains demonstrate the effectiveness of our framework. |
| File Format | |
| Access Restriction | Open |
| Subject Keyword | Unsupervised Probabilistic Learning Framework; Different Retailer Web; First Task; Different Domain; Second Task; Unseen Product Attribute; Real-world Web Site; Appropriate Reference Attribute; Unlimited Number; Web Page; Product Record; First Task Aim; Graphical Model; Layout Format; Attribute Value; Different Site; Extensive Experiment; Unsupervised Approach; Product Record Normalization; Reference Product; Reference Attribute; Text Fragment; Different Web Site |
| Content Type | Text |