Sunday, March 15, 2009

DM Series - Frequent Pattern Mining

I am planning to follow the below pattern for each of the DM functionalities / techniques I had mentioned in the preamble post of this series.

1. What is it
2. Motivation
3. Basic Idea
4. Applications
5. Important papers / Algorithms
6. Math concepts widely used

1. What is Frequent Pattern Mining ?
Frequent Patterns are those patterns that occur frequently in data. Depending on the data set, there are 3 main types of frequent patterns.
a. Frequent itemset - set of items that occur frequently generally in a transaction database like milk and bread
b. Frequent sequential pattern - It is a subsequence that occurs frequently. Eg: People who buy computers tend to buy printers.
c. Frequent structured pattern - when structures like graphs, trees etc occur frequently they form a frequent structured pattern. Eg: DNA structure analysis
The techniques used to mine / find knowledge in frequent patterns are called frequent pattern mining

2. Motivation:
Some of the motivating thoughts that leads to the need for FP mining are
- What items are generally bought together
- What is generally bought after buying say a PC
- What DNA is sensitive to this new drug

3. Basic Idea:
For performing FP Mining, we need to find the frequent patterns. A k-itemset is frequent if it satisfies a minimum support(min_sup) threshold quantity ie if this k-itemset occurs more than the min_sup quantity then this itemset is a frequent item set. For eg, if min_sup is 1000 and we scan a walmart shopping database, then the itemset {bread,milk} would have occured more than 1000 and qualify as a frequent itemset.
The issue with identifying the frequent itemset is combinatorics. For example, if walmart as 10,000 products then imagine the number of combination of items that would be purchased

4. Applications:
1. It is the basis for many DM tasks like Association analysis, Classification, Clustering etc.
2. Market basket analysis
3. Web log analysis
4. DNA sequence Analysis

5. Important papers / Algorithms:
a. Apriori Algorithm
b. FP-Tree algorithm -> Mining Frequent Patterns without Candidate Generation by Jiawei Han, Jian Pei, Yiwen Yin.
On a personal note, FP-Tree is developed by Prof.Han who is my DM course professor. He is just too good and am proud to learn from him.

6. Math concepts widely used
Statistics, Probability and Combinatorics are pretty widely used.

No comments: