Sunday, March 22, 2009

Association and Correlation Analysis

1. What is Association and Correlation Analysis?
Association analysis is closely related to frequent pattern analysis. It generates association rules, which specify how items or objects are associated with each other. A simple example is {milk} -> {bread}, an association rule stating that people who buy milk also tend to buy bread.
Frequent pattern mining can typically discover thousands of association rules, and many of them might not be interesting or useful. A correlation measure helps narrow them down to the useful ones.

2. Motivation
Suppose you are the VP of Marketing at a huge retail store like Walmart and want to know how to personalize ads and marketing campaigns for your customers. One way is to run association and correlation analysis on your transaction data. Association analysis would give you rules like {diapers, baby food} -> {baby toys}, which says that someone buying diapers and baby food is more likely to buy baby toys. You already know who buys diapers and baby food together from the little "discount card" your customers use, so you can target them with ads for baby toys. Cool, right? :)

3. Basic Idea
The general format of an association rule is
buys(X,"laptop") => buys(X,"printer") [support = 10%, confidence = 70%]
where a support of 10% means that 10% of all transactions contain both a laptop and a printer. A confidence of 70% means that 70% of the transactions that contain a laptop also contain a printer, i.e. the conditional probability that someone buying a laptop also buys a printer is 70%.
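To make the two numbers concrete, here is a minimal sketch of how support and confidence could be computed from raw transactions. The transaction list and the function names `support` and `confidence` are made up for illustration; they are not from any particular library.

```python
# Hypothetical toy data: each transaction is the set of items in one purchase.
transactions = [
    {"laptop", "printer"},
    {"laptop", "printer", "mouse"},
    {"laptop"},
    {"printer"},
    {"mouse"},
    {"laptop", "printer"},
    {"mouse", "keyboard"},
    {"laptop", "mouse"},
    {"keyboard"},
    {"printer", "keyboard"},
]


def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions.

    Assumes the antecedent occurs in at least one transaction.
    """
    antecedent = set(antecedent)
    both = antecedent | set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if both <= t)
    return n_both / n_antecedent


rule_support = support(transactions, {"laptop", "printer"})
rule_confidence = confidence(transactions, {"laptop"}, {"printer"})
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

On this toy data, 3 of the 10 transactions contain both items and 5 contain a laptop, so the rule laptop => printer comes out with support = 30% and confidence = 60%.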
So why do we need correlation analysis? The reason is that not all generated rules are meaningful in the real world. For example, suppose a university database gives us the association rule play basketball => eat cereal [support = 40%, confidence = 66.7%]. This rule is misleading because the overall percentage of students who eat cereal is 75%, which is higher than 66.7%: playing basketball actually makes a student less likely to eat cereal, not more. Hence we need correlation measures to augment the support-confidence framework for association rules.
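One common way to check this numerically is lift, which divides the rule's confidence by the overall support of the consequent. Lift is not one of the measures named in this post, so treat it as an extra illustration; the function name `lift` is my own. A value below 1 suggests negative correlation, a value above 1 positive correlation, and a value of 1 independence.

```python
def lift(rule_confidence, consequent_support):
    """Lift of A => B, computed as P(B | A) / P(B)."""
    return rule_confidence / consequent_support


# Numbers taken from the basketball/cereal example above.
rule_confidence = 0.667  # confidence of play basketball => eat cereal
cereal_support = 0.75    # overall fraction of students who eat cereal

value = lift(rule_confidence, cereal_support)
print(f"lift = {value:.3f}")
```

This prints lift = 0.889, which is below 1, so playing basketball and eating cereal are negatively correlated even though the rule passes typical support and confidence thresholds.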

4. Applications
1. Market basket data analysis
2. Bioinformatics
3. Web Mining
4. Scientific data analysis

5. Important papers / Algorithms
Frequent pattern mining techniques are the underlying framework for generating association rules; Apriori and FP-growth are two well-known algorithms of this kind.

6. Math concepts used
1. Statistics
Statistical correlation measures like all-confidence, the cosine measure and the Jaccard coefficient help in correlation analysis.
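
As a rough sketch, the three measures can be written in terms of supports using their usual definitions for a pair of itemsets A and B. The function names are hypothetical, and the basketball support of 0.60 is not stated in the example above; it is derived as 40% / 66.7%, rounded.

```python
import math


def all_confidence(sup_a, sup_b, sup_ab):
    """sup(A and B) divided by the larger of sup(A) and sup(B)."""
    return sup_ab / max(sup_a, sup_b)


def cosine(sup_a, sup_b, sup_ab):
    """sup(A and B) divided by the geometric mean of sup(A) and sup(B)."""
    return sup_ab / math.sqrt(sup_a * sup_b)


def jaccard(sup_a, sup_b, sup_ab):
    """sup(A and B) divided by the support of 'A or B'."""
    return sup_ab / (sup_a + sup_b - sup_ab)


# Basketball/cereal example: sup(basketball) = 0.60 is an assumption
# derived from support / confidence = 0.40 / 0.667.
sup_basketball, sup_cereal, sup_both = 0.60, 0.75, 0.40

print(f"all-confidence = {all_confidence(sup_basketball, sup_cereal, sup_both):.3f}")
print(f"cosine         = {cosine(sup_basketball, sup_cereal, sup_both):.3f}")
print(f"jaccard        = {jaccard(sup_basketball, sup_cereal, sup_both):.3f}")
```

All three take values between 0 and 1, and none of them depends on the total number of transactions, only on the supports of A, B and both together.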
