First of all, I would like to mention that I am not familiar with data mining or its technology, so please take my review as a summary of the book, with my personal opinion (not a professional one) added where needed.
Now, I'm reading:
**Unit 6: Mining Frequent Patterns**, dealing with finding all frequent itemsets and generating strong association rules.
Every rule holds in the transaction set D with support s and confidence c:
Support = the probability that a transaction contains both A and B.
Confidence = the conditional probability that a transaction containing A also contains B, i.e. P(B|A).
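To make the two definitions concrete, here is a small sketch in Python over a toy transaction set of my own (not from the book):

```python
# Toy transaction database (my own hypothetical example).
D = [
    {"bread", "milk"},
    {"bread", "diaper", "beer"},
    {"milk", "diaper", "beer"},
    {"bread", "milk", "diaper"},
    {"bread", "milk", "beer"},
]

def support(itemset, D):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in D if itemset <= t) / len(D)

def confidence(A, B, D):
    """P(B | A): support of A and B together, divided by support of A."""
    return support(A | B, D) / support(A, D)

s = support({"bread", "milk"}, D)        # 3 of 5 transactions -> 0.6
c = confidence({"bread"}, {"milk"}, D)   # 3 of the 4 bread transactions -> 0.75
```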
The Apriori algorithm is the fundamental method for finding frequent itemsets by confined candidate generation (which is time consuming). P:248.
The efficiency of Apriori can be improved using several variations: P:255-256
- Hash-based Techniques.
- Transaction reduction.
- Partitioning.
- Sampling.
- Dynamic itemset counting.
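The level-wise idea can be sketched as a minimal Apriori in Python (my own toy implementation, with none of the efficiency variations listed above):

```python
from itertools import combinations

def apriori(D, min_sup):
    """Minimal Apriori sketch: level-wise candidate generation + pruning.
    D is a list of frozensets; min_sup is an absolute support count."""
    # L1: count single items and keep the frequent ones.
    counts = {}
    for t in D:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_sup}
    Lk = set(frequent)
    k = 2
    while Lk:
        # Join step: merge (k-1)-itemsets into k-candidates.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        counts = {c: sum(1 for t in D if c <= t) for c in candidates}
        Lk = {c for c, n in counts.items() if n >= min_sup}
        frequent.update({c: counts[c] for c in Lk})
        k += 1
    return frequent

D = [frozenset(t) for t in ({"bread", "milk"}, {"bread", "diaper", "beer"},
                            {"milk", "diaper", "beer"}, {"bread", "milk", "diaper"},
                            {"bread", "milk", "beer"})]
result = apriori(D, min_sup=3)  # four frequent singletons plus {bread, milk}
```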
* Frequent Pattern Growth (FP-growth): a method for finding frequent itemsets without the costly candidate generation process. P:257.
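As I understand it, FP-growth first compresses the database into an FP-tree; a minimal sketch of just the tree-building step (my own toy code, the actual mining of the tree is omitted):

```python
class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(D, min_sup):
    """Build an FP-tree: count items, reorder each transaction by
    descending frequency, and insert it as a path, sharing prefixes."""
    counts = {}
    for t in D:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    frequent = {i for i, c in counts.items() if c >= min_sup}
    root = FPNode(None, None)
    for t in D:
        # Keep only frequent items, in a canonical frequency-descending order.
        path = sorted((i for i in t if i in frequent),
                      key=lambda i: (-counts[i], i))
        node = root
        for item in path:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root

D = [{"bread", "milk"}, {"bread", "diaper", "beer"}, {"milk", "diaper", "beer"},
     {"bread", "milk", "diaper"}, {"bread", "milk", "beer"}]
root = build_fp_tree(D, min_sup=3)
```

The compression comes from shared prefixes: all four bread-containing transactions share a single "bread" node with count 4.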
* Using the vertical data format (personally, I didn't find it interesting).
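The vertical format itself is simple enough to show in a few lines (my own toy example): each item maps to the set of transaction ids (tid-set) containing it, and support becomes tid-set intersection:

```python
# Hypothetical horizontal database: tid -> items.
horizontal = {
    1: {"a", "b", "c"},
    2: {"a", "c"},
    3: {"a", "d"},
    4: {"b", "e"},
}

# Vertical format: item -> set of tids containing it.
vertical = {}
for tid, items in horizontal.items():
    for item in items:
        vertical.setdefault(item, set()).add(tid)

# Support count of an itemset = size of the intersection of its tid-sets.
sup_ac = len(vertical["a"] & vertical["c"])  # tids {1, 2} -> count 2
```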
* Mining Closed and Max Patterns: this requires us to prune the search space as early as possible using one of the following strategies:
- Item merging.
- Sub-itemset pruning.
- Item skipping.
Section 6.3 (P:264) presents the important idea that support and confidence alone are not enough and can result in misleading "strong" association rules. For that reason, we are advised to use correlation measures:
- Lift(A,B) = 1 means A and B are independent.
  Lift(A,B) < 1 means A and B are negatively correlated.
  Lift(A,B) > 1 means A and B are positively correlated.
- X^2: the sum, over all cells of the contingency table, of the squared difference between observed and expected values, divided by the expected value.
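A small sketch of both measures, using hypothetical numbers of my own (supports as fractions for lift, absolute counts for X^2):

```python
def lift(sup_a, sup_b, sup_ab):
    """lift(A,B) = P(A and B) / (P(A) * P(B)); supports are fractions."""
    return sup_ab / (sup_a * sup_b)

def chi_square(n_ab, n_a, n_b, n):
    """X^2 over the 2x2 contingency table of A vs B.
    Arguments are absolute counts: both, A, B, and total transactions."""
    observed = [n_ab, n_a - n_ab, n_b - n_ab, n - n_a - n_b + n_ab]
    expected = [n_a * n_b / n, n_a * (n - n_b) / n,
                (n - n_a) * n_b / n, (n - n_a) * (n - n_b) / n]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical: P(A)=0.6, P(B)=0.75, P(A and B)=0.4 in 10000 transactions.
l = lift(0.6, 0.75, 0.4)                     # < 1: negatively correlated
chi2 = chi_square(4000, 6000, 7500, 10000)   # large: far from independence
```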
Other pattern evaluation measures that have gained interest lately are:
all_confidence, max_confidence, Kulczynski, and cosine. Each measure takes a value in [0, 1]; the higher the value, the closer the relationship between A and B.
all_conf(A,B) = min{P(A|B), P(B|A)}
max_conf(A,B) = max{P(A|B), P(B|A)}
Kulc(A,B) = 1/2 {P(A|B) + P(B|A)}
cosine(A,B) = sqrt{P(A|B) * P(B|A)}
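The four formulas translate directly into code (my own sketch). Note that they depend only on sup(A), sup(B), and sup(A&B), never on the total number of transactions, which is exactly why null-transactions cannot affect them:

```python
from math import sqrt

def null_invariant_measures(sup_a, sup_b, sup_ab):
    """all_conf, max_conf, Kulc, cosine from supports (counts or fractions).
    P(B|A) = sup(A&B)/sup(A), P(A|B) = sup(A&B)/sup(B)."""
    p_b_given_a = sup_ab / sup_a
    p_a_given_b = sup_ab / sup_b
    return {
        "all_conf": min(p_a_given_b, p_b_given_a),
        "max_conf": max(p_a_given_b, p_b_given_a),
        "kulc": 0.5 * (p_a_given_b + p_b_given_a),
        "cosine": sqrt(p_a_given_b * p_b_given_a),
    }

# Hypothetical symmetric case: all four measures agree at 0.1.
m = null_invariant_measures(1000, 1000, 100)
```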
The previous six measures were examined on six typical datasets (P:269).
Lift and X^2 are strongly influenced by the number of null-transactions (transactions that contain neither A nor B).
A measure is null-invariant if its value is free from the influence of null-transactions; the four measures above are null-invariant.
Imbalance Ratio (IR): assesses the imbalance of two itemsets, A and B, in rule implications.
IR(A,B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A&B)), taking values in [0, 1).
For imbalanced data, where the four null-invariant measures can give confusing values, we use the two measures IR and Kulc together.
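A sketch of the combination, on hypothetical counts of my own where Kulc alone sits near the ambiguous middle (about 0.5) but IR exposes a strong imbalance:

```python
def kulc(sup_a, sup_b, sup_ab):
    """Kulczynski measure: average of the two conditional probabilities."""
    return 0.5 * (sup_ab / sup_a + sup_ab / sup_b)

def ir(sup_a, sup_b, sup_ab):
    """Imbalance Ratio: 0 for perfectly balanced A and B,
    approaching 1 for extreme imbalance."""
    return abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab)

# Hypothetical counts: A in 1000 transactions, B in 10000, both in 900.
k = kulc(1000, 10000, 900)  # about 0.495 -- looks ambiguous on its own
r = ir(1000, 10000, 900)    # about 0.891 -- reveals the skew
```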