Loading...
Please wait, while we are loading the content...
Similar Documents
Evaluating Continuous Top-k Queries over Text Streams
| Content Provider | Semantic Scholar |
|---|---|
| Author | Vouzoukidou, Nelly |
| Copyright Year | 2012 |
| Abstract | Web 2.0 technologies have transformed the Web from a publishing-only environment into a vibrant information place where yesterday’s end users become content generators themselves. Besides traditional information sources, such as press websites, nowadays, social networks, blogs and forums are publishing on the web millions of items on daily basis. Given the immense volume and vast diversity of the information generated in Web 2.0, there is a vital need for efficient real-time filtering methods across streams of information which allow millions of users to effectively follow personally interesting information. In this respect, users usually issue keyword-based queries, which can either be directly evaluated by the underlying search engines or submitted to an alerting service and be continuously notified about newly published items matching their filtering criteria. In both settings, scoring functions are used both to measure the relevance of an item to the keywords employed by a query as well as the importance of the item according to queryindependent quality criteria. The latter considers parameters such as information novelty, diversity, authority, as well as the significance of the thematic collection in which they belong (e.g. clustered to describe the same real-word event). The effectiveness of the employed scoring function lies on how well are combined a keywordbased similarity with item importance usually using a weighted sum over both scores. Moreover, to guarantee freshness of information and deliver as recent information as possible, combination of time decay and sliding windows techniques are considered over the employed scoring functions. In this thesis, we are interested in efficient algorithms and data-structures for online monitoring of web 2.0 content and more precisely for efficiently evaluating continuous top-k queries over textual information streams. It should be stressed that existing commercial alerting essentially transform a continuous query to a series of periodically executed snapshot queries. This approach incurs serious limitations: for large numbers of users queries and high arrival information rates it is practical impossible to repeatedly evaluate all queries at almost all new information items. For this reason, commercial systems usually decrease the frequency of snapshot query evaluation and thus important news may be missed. Unlike existing research work on top-k continuous textual queries, in our work we consider complex scoring functions that capture both the relevance of an information item with respect to the keywords of a user query as well as its importance and freshness with respect to what has been already published. Then, we are proposing a representation of queries based on their scores that enable us to resolve efficiently the problem of user query selection: given an incoming information item determine which top-k query result lists need to be updated (i.e., have to insert the item). The novelty of our approach lies on the effectiveness of the induced boundary conditions to drastically prune this search space of queries during matching, and with a small update overhead when top-k lists are refreshed, for a fairly large class of scoring functions (weighed sum of query dependent scores and query independent scores modulated by decay functions). Supervisor: Vassilis Christophides Professor |
| File Format | PDF HTM / HTML |
| Alternate Webpage(s) | http://www.ics.forth.gr/_publications/Vouzoukidou_MasterThesis.pdf |
| Language | English |
| Access Restriction | Open |
| Content Type | Text |
| Resource Type | Article |