Saturday, April 25, 2009

Classification

1. What is Classification?
Classification is the technique of assigning class labels to new data based on past observations (the training data set). It is a form of supervised learning: you guide the machine learning process by training the system on a data set whose labels are already known.

2. Motivation
Let's take spam email as the problem domain. Say we want to automatically identify spam mails and move them to the spam folder. How does the mail system know that a certain mail is spam? We apply a classification algorithm to teach the machine to tell spam from legitimate mail, using mails that have already been labelled.

3. Basic Idea
Classification is a two-step process.

Step 1 - Model Construction:

Our goal is to train a system on a training data set in which every record already has a class label. For example, our data lists the designation of each faculty member and their years of experience, and specifies whether they have tenure or not. The classifier learns a rule from this data set that identifies when a faculty member gets tenure. Once the training is done, we have a trained model that can be used for classification.
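To make the construction step concrete, here is a minimal sketch in Python. The training records, the rank names, and the form of the rule ("rank is R, or years of experience exceed T") are all assumptions made up for illustration; a real classifier such as a decision tree searches a much richer space of rules.

```python
# Hypothetical training data: (rank, years of experience, tenured?)
TRAINING_DATA = [
    ("Assistant Prof", 3, False),
    ("Assistant Prof", 7, True),
    ("Professor", 2, True),
    ("Associate Prof", 7, True),
    ("Assistant Prof", 6, False),
    ("Associate Prof", 3, False),
]


def train(rows):
    """Try every rule of the form 'rank == R or years > T' and keep
    the first one that makes the fewest mistakes on the training data."""
    ranks = sorted({rank for rank, _, _ in rows})
    thresholds = sorted({years for _, years, _ in rows})
    best = None
    for r in ranks:
        for t in thresholds:
            # Count the records where the rule's prediction differs from the label.
            errors = sum(
                ((rank == r) or (years > t)) != label
                for rank, years, label in rows
            )
            if best is None or errors < best[0]:
                best = (errors, r, t)
    errors, r, t = best
    return {"rank": r, "years_threshold": t, "training_errors": errors}


model = train(TRAINING_DATA)
print(model)
```

On this toy data the search settles on "rank is Professor, or more than 6 years of experience", which fits all six records. Zero training errors says nothing yet about how the rule behaves on new data; that is what the next step checks.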


Step 2 - Model Usage:

We first test the constructed model for accuracy, using labelled test data that was not used for training.

Once the accuracy is acceptable, we use the model to classify new data. So when a new faculty member arrives, the model assigns a class label ("Tenured" or "Not Tenured") according to the rule that was learned during the Model Construction phase.
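The usage step can be sketched the same way. The rule inside classify, the test records, and the 0.70 acceptance threshold below are hypothetical values chosen for illustration, not taken from any real data set.

```python
def classify(rank, years):
    """Hypothetical rule, assumed to be the output of the training step."""
    if rank == "Professor" or years > 6:
        return "Tenured"
    return "Not Tenured"


# Hypothetical labelled test data, kept separate from the training data.
TEST_DATA = [
    ("Assistant Prof", 2, "Not Tenured"),
    ("Associate Prof", 7, "Tenured"),
    ("Professor", 5, "Tenured"),
    ("Assistant Prof", 7, "Not Tenured"),  # the rule gets this one wrong
]


def accuracy(rows):
    """Fraction of test records whose predicted label matches the true label."""
    correct = sum(classify(rank, years) == label for rank, years, label in rows)
    return correct / len(rows)


acc = accuracy(TEST_DATA)
print(f"Test accuracy: {acc:.2f}")

ACCEPTABLE_ACCURACY = 0.70  # assumed threshold; depends on the application
if acc >= ACCEPTABLE_ACCURACY:
    # Classify a new, unlabelled faculty member.
    print(classify("Associate Prof", 4))
```

Here the rule is right on three of the four test records, so the accuracy is 0.75, and the new faculty member is labelled "Not Tenured". What counts as acceptable accuracy is a judgment call that depends on the cost of a wrong label.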


4. Practical Applications
1. Spam Filtering
2. Anomaly Detection
3. Loan Approval

5. Important papers / Algorithms
1. Decision Tree - Ross Quinlan
2. Bayesian classifier
3. Support Vector Machines - often seems to be the preferred classifier in practice
4. Neural Networks

6. Math concepts widely used
Probability (e.g. in Bayesian classifiers) and linear algebra (e.g. in Support Vector Machines) are widely used.

References:
Pictures are taken from the class notes of Prof. Han (UIUC).
