Diversity-based Interestingness Measures for Association Rule Mining
| Content Provider | Semantic Scholar |
|---|---|
| Author | Huebner |
| Copyright Year | 2009 |
| Abstract | Association rule interestingness measures are used to help select and rank association rule patterns. Diversity-based measures have been used to determine the relative interestingness of summaries. However, little work has been done that investigates diversity measures with association rule mining. Besides support, confidence, and lift, there are other interestingness measures, which include generality (also known as coverage), reliability, peculiarity, novelty, surprisingness, utility, and applicability. This paper investigates the application of diversity-based measures to association rule mining. |

INTRODUCTION

Interestingness measures are necessary to help select and rank association rule patterns. Each interestingness measure produces different results, and experts have different opinions of what constitutes a good rule (Lenca, Meyer, Vaillant, & Lallich, 2008). The interestingness of discovered association rules is an important and active area within data mining research (Geng & Hamilton, 2006). The primary problem is the selection of interestingness measures for a given application domain, and there is no formal agreement on a definition of what makes a rule interesting.

Association rule algorithms produce thousands of rules, many of which are redundant (Li & Zhang, 2003; McGarry, 2005). To filter the rules, the user generally supplies minimum thresholds for support and confidence, the most basic and most commonly used measures of association rule interestingness. However, rules that meet minimum support and confidence thresholds may still be uninteresting, because they are often already known to a user who is familiar with the application domain. The challenge in association rule mining (ARM) therefore becomes one of determining which rules are the most interesting. With so many interestingness measures to choose from, it is difficult to determine which one to use for a given domain. This problem is exacerbated by the fact that different interestingness measures produce different results for the same data set, making it difficult for the user to interpret the measures (McGarry, 2005).

The purpose of this paper is to review a few of the interestingness measures based on diversity. Diversity is used comparatively: when two data sets are compared, the one containing the more diverse rules is considered the more interesting. Even though diversity is an established criterion for measuring summaries, little work has been done that focuses on the diversity of association rules. Interestingness measures can be applied to summaries, association rules, or classification rules; this paper focuses exclusively on association rule interestingness measures that are based on diversity.

ASSOCIATION RULE MINING

Association rule mining is a category of data mining tasks that correlate a set of items with other sets of items in a database. Association rules "aim to extract interesting correlations, frequent patterns, associations or causal structures among sets of items in the transaction databases or other repositories" (Kotsiantis & Kanellopoulos, 2006, p. 71).
Association rule mining is one of the most important data mining techniques used today and is a mature field of research (Ceglar & Roddick, 2006; Xu & Li, 2007). Association rules were first proposed by Agrawal et al. (Agrawal, Imielinski, & Swami, 1993), and the main driver for the early research was the analysis of customer market-basket transactions. An example of an association rule is: 60% of customers that purchase potato chips also purchase soda in the same transaction. Agrawal et al.'s work established a formal model for association rules and algorithms that find large itemsets and the support and confidence of each rule discovered in the itemset. Association rules have since been applied to a wide variety of application areas, which are covered later in this paper.

Association rule algorithms can generate thousands of rules, many of which are redundant. These redundant rules are essentially useless, so researchers have addressed the problem by defining new interestingness measures, incorporating constraints, or designing templates to mine for restricted rules (Xu & Li, 2007). A primary goal of knowledge discovery in databases is to produce interesting rules that can be interpreted by a user (Lenca et al., 2008).

One research team (Lee & Siau, 2001) outlined the requirements and challenges associated with data mining. First, data mining must be able to handle different types of data. Second, data mining algorithms must be scalable and efficient. Third, data mining must be able to handle noisy and missing data. Fourth, data mining techniques should present results in a way that is easy to understand. Fifth, data mining techniques should support requests at different levels of granularity; that is, data mining can be done at different levels of abstraction. Sixth, data mining algorithms should be flexible enough to deal with data from different sources. Finally, a major concern within data mining today is the threat to privacy and data security, because data mining makes it easy to establish profiles of individuals based on data from multiple sources (Lee & Siau, 2001).

General issues related to data mining include the identification of missing information, dealing with noise or missing values, and operating with very large databases (VLDBs). Additionally, data mining is normally used to access data contained in a data warehouse, which has a high degree of dimensionality, making data mining more complex (Marakas, 2003). To produce accurate data mining results, the underlying data must be complete; without complete data, accurate rules cannot be produced. The field of privacy-preserving data mining (PPDM) investigates the issues pertaining to mining association rules when there are missing values in the database or data warehouse. Several privacy-preserving association rule algorithms have been proposed to address this issue (Chen & Weng, 2008; Zhan, Matwin, & Chang, 2007). However, there are still many open issues related to privacy-preserving association rule mining (PPARM).

INTERESTINGNESS MEASURES

Two important measures within association rule mining are support and confidence. For a rule X ⇒ Y, support is the percentage of transactions in the database that contain X ∪ Y. Confidence for a rule X ⇒ Y (sometimes denoted as strength, or α) is the ratio of the number of transactions that contain X ∪ Y to the number of transactions that contain X (Dunham, 2003).
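As a concrete illustration of these two definitions, the short Python sketch below computes support and confidence for a single rule over a toy list of market-basket transactions. The transactions and the chips ⇒ soda rule are invented for this example; they are not data from the paper.

```python
# A minimal sketch, not from the paper: the toy transactions and the
# rule {chips} => {soda} are invented purely to illustrate the two
# definitions of support and confidence given above.

def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(X ∪ Y) / support(X) for the rule X => Y."""
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

transactions = [
    {"chips", "soda", "bread"},
    {"chips", "soda"},
    {"chips", "milk"},
    {"soda", "eggs"},
    {"chips", "soda", "eggs"},
]

antecedent, consequent = {"chips"}, {"soda"}
print("support:", support(transactions, antecedent | consequent))      # 3/5 = 0.6
print("confidence:", confidence(transactions, antecedent, consequent)) # 3/4 = 0.75
```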
In other words, support describes how often the rule appears in the database, while confidence measures the strength of the rule. A user establishes a minimum support (minsup) and a minimum confidence (minconf), and rules are then generated based on those criteria; the minsup and minconf parameters can be selected before or after rule generation. For example, given a database of supermarket transaction data, a rule might be generated that infers milk ⇒ eggs, with support = 40% and confidence = 75%. This means that milk and eggs occurred together in 40% of the transactions in the database, and that 75% of the time that milk occurs, so do eggs. The antecedent of this rule is milk, while the consequent is eggs. Larger values of confidence and smaller values of support are normally selected when determining which association rules to keep.

A third measure of interestingness is lift. Lift relates the rule to the probability of finding the consequent in any random basket; in other words, lift "measures how well the associative rule performs by comparing its performance to the 'null' rule" (Marakas, 2003, p. 342).

What makes a rule interesting? One way to define interestingness is that a rule must be valid, new, and comprehensible (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). Commonly used measures of interestingness include support and confidence, as described above. Many other interestingness measures for association rules have been established, but there is no formal agreement on how interestingness should be defined. Such measures include conciseness, generality (also known as coverage), reliability, peculiarity, diversity, novelty, surprisingness, utility, and applicability (Geng & Hamilton, 2006).

Interestingness measures can be classified into two categories: objective and subjective. Objective measures are based on statistics, while subjective measures are based on an understanding of the user's domain knowledge. The objective measures include generality, reliability, conciseness, peculiarity, diversity, and surprisingness; the subjective measures include novelty, utility, and applicability. These measures assist in validating association rule results (Tamir & Singer, 2006). Interestingness measures based on diversity have received little attention in the literature (Geng & Hamilton, 2006), hence the need for further research in this particular area.

The generality (or coverage) of a pattern is determined by how comprehensive the pattern is, i.e., the fraction of records that match the pattern. General patterns include frequent itemsets, the most frequently studied type of pattern in association rule mining (Geng & Hamilton, 2006). The reliability of a pattern is determined by the percentage of cases found in the itemset; in other words, a rule might be interesting if a high percentage of cases contain the rule. One study applied the reliability measure to evaluate clinical datasets.
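To make the lift comparison concrete, here is a minimal Python sketch assuming the common formulation lift(X ⇒ Y) = confidence(X ⇒ Y) / support(Y); the paper describes lift only informally, so this formula, and the assumed 50% baseline frequency for eggs, are illustrative assumptions rather than values from the text (only the 75% confidence figure comes from the milk ⇒ eggs example above).

```python
# A minimal sketch, assuming the common formulation
#     lift(X => Y) = confidence(X => Y) / support(Y),
# since the paper describes lift only informally. The 75% confidence comes
# from the milk => eggs example in the text; the 50% support for eggs is an
# assumed baseline, chosen only for illustration.

def lift(rule_confidence, consequent_support):
    """Compare the rule's confidence against the 'null' rule, i.e. the
    probability of finding the consequent in any random basket."""
    return rule_confidence / consequent_support

conf_milk_eggs = 0.75   # from the paper's supermarket example (milk => eggs)
support_eggs = 0.50     # assumption: eggs appear in half of all baskets
print(lift(conf_milk_eggs, support_eggs))  # 1.5 -> the rule beats the 'null' rule
```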
| File Format | PDF, HTM / HTML |
|---|---|
| Alternate Webpage(s) | http://asbbs.org/files/2009/PDF/H/HuebnerR.pdf |
| Language | English |
| Access Restriction | Open |
| Content Type | Text |
| Resource Type | Article |