A. Model a classification rule directly
B. Model the probability of class memberships given input data
C. Make a probabilistic model of the data within each class
We have two six-sided dice. When they are rolled, the following events could occur:
A. Dice 1 lands on side "3"
B. Dice 2 lands on side "1"
C. The two dice sum to eight
We assign the probabilities:
a. P(A) = 1/6
b. P(B) = 1/6
c. P(C) = 5/36
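Assuming both dice are fair, these values can be checked by brute force: the short Python sketch below enumerates the 36 equally likely outcomes and counts how often each event occurs.

```python
from itertools import product

# All 36 equally likely outcomes of rolling two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

p_a = sum(1 for d1, _ in outcomes if d1 == 3) / len(outcomes)        # dice 1 lands on "3"
p_b = sum(1 for _, d2 in outcomes if d2 == 1) / len(outcomes)        # dice 2 lands on "1"
p_c = sum(1 for d1, d2 in outcomes if d1 + d2 == 8) / len(outcomes)  # the two dice sum to 8

print(p_a, p_b, p_c)  # 1/6 ≈ 0.167, 1/6 ≈ 0.167, 5/36 ≈ 0.139
```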
MAP classification rule: for $\mathbf{x} = (x_1, x_2, \dots, x_n)$, assign the label $c^*$ such that
$[P(x_1 \mid c^*) \cdots P(x_n \mid c^*)]\, P(c^*) > [P(x_1 \mid c) \cdots P(x_n \mid c)]\, P(c)$, for every $c \neq c^*$, $c = c_1, \dots, c_L$.
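This rule is Bayes' theorem combined with the naïve conditional-independence assumption; because the evidence $P(\mathbf{x})$ is the same for every class, it can be dropped when comparing posteriors:

$$
c^* = \arg\max_{c_i} P(c_i \mid \mathbf{x})
    = \arg\max_{c_i} \frac{P(x_1, \dots, x_n \mid c_i)\, P(c_i)}{P(x_1, \dots, x_n)}
    = \arg\max_{c_i} \big[ P(x_1 \mid c_i) \cdots P(x_n \mid c_i) \big] P(c_i)
$$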
Algorithm: Discrete-Valued Features
Learning Phase: Given a training set S,
for each target value $c_i$ ($c_i = c_1, \dots, c_L$)
  $\hat{P}(C = c_i) \leftarrow$ estimate $P(C = c_i)$ with examples in S;
  for every feature value $x_{jk}$ of each feature $X_j$ ($j = 1, \dots, n$; $k = 1, \dots, N_j$)
    $\hat{P}(X_j = x_{jk} \mid C = c_i) \leftarrow$ estimate $P(X_j = x_{jk} \mid C = c_i)$ with examples in S;
Output: conditional probability tables; for each $X_j$, $N_j \times L$ elements.
Test Phase: Given an unknown instance $\mathbf{x}' = (a_1', \dots, a_n')$, look up the tables to assign the label $c^*$ to $\mathbf{x}'$ if
$[\hat{P}(a_1' \mid c^*) \cdots \hat{P}(a_n' \mid c^*)]\, \hat{P}(c^*) > [\hat{P}(a_1' \mid c) \cdots \hat{P}(a_n' \mid c)]\, \hat{P}(c)$, for every $c \neq c^*$, $c = c_1, \dots, c_L$.
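A minimal Python sketch of the two phases for discrete features, assuming the training set is given as a list of (feature-tuple, label) pairs; the names train_nb and classify are illustrative, not part of the original algorithm statement.

```python
from collections import Counter

def train_nb(examples):
    """Learning phase: estimate P(C = c_i) and P(X_j = x_jk | C = c_i) by counting."""
    class_counts = Counter(label for _, label in examples)
    cond_counts = Counter()                      # keyed by (feature index j, value x_jk, class c_i)
    for features, label in examples:
        for j, value in enumerate(features):
            cond_counts[(j, value, label)] += 1
    priors = {c: k / len(examples) for c, k in class_counts.items()}
    likelihoods = {key: k / class_counts[key[2]] for key, k in cond_counts.items()}
    return priors, likelihoods                   # the conditional probability tables

def classify(priors, likelihoods, features):
    """Test phase: pick the class maximizing [prod_j P(a_j | c)] * P(c) (the MAP rule)."""
    best_label, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for j, value in enumerate(features):
            score *= likelihoods.get((j, value, c), 0.0)   # unseen (value, class) pair -> 0
        if score > best_score:
            best_label, best_score = c, score
    return best_label
```

Note that classify returns a zero score whenever a feature value was never seen with a class; this is exactly the zero conditional probability problem addressed at the end of this section.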
Given a new instance x′ = (Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong), predict its label.
Look up the tables obtained in the learning phase:
P(Outlook=Sunny | Play=Yes) = 2/9    P(Outlook=Sunny | Play=No) = 3/5
P(Temperature=Cool | Play=Yes) = 3/9    P(Temperature=Cool | Play=No) = 1/5
P(Humidity=High | Play=Yes) = 3/9    P(Humidity=High | Play=No) = 4/5
P(Wind=Strong | Play=Yes) = 3/9    P(Wind=Strong | Play=No) = 3/5
P(Play=Yes) = 9/14    P(Play=No) = 5/14
Decision making with the MAP rule:
P(Yes | x′) ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = (2/9)(3/9)(3/9)(3/9)(9/14) ≈ 0.0053
P(No | x′) ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = (3/5)(1/5)(4/5)(3/5)(5/14) ≈ 0.0206
Final Result
Since P(Yes | x′) < P(No | x′), we label x′ as "No".
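The two scores can be reproduced directly from the look-up table entries; the snippet below is nothing more than arithmetic on the fractions listed above.

```python
# Unnormalized posterior scores for x' = (Sunny, Cool, High, Strong).
score_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)
score_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)

print(round(score_yes, 4), round(score_no, 4))  # 0.0053 0.0206 -> predict "No"
```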
Algorithm: Continuous-Valued Features
A continuous feature has infinitely many possible values, so a conditional probability table cannot be built for it.
The conditional probability is often modeled with the normal distribution:
$\hat{P}(X_j \mid C = c_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{ji}} \exp\!\left(-\frac{(X_j - \mu_{ji})^2}{2\sigma_{ji}^2}\right)$
$\mu_{ji}$: mean (average) of the feature values $X_j$ of examples for which $C = c_i$
$\sigma_{ji}$: standard deviation of the feature values $X_j$ of examples for which $C = c_i$
Learning Phase: for $\mathbf{X} = (X_1, \dots, X_n)$, $C = c_1, \dots, c_L$, estimate the mean $\mu_{ji}$ and standard deviation $\sigma_{ji}$ of each feature $X_j$ within each class $c_i$.
Output: $n \times L$ normal distributions and $P(C = c_i)$, $i = 1, \dots, L$.
Test Phase: Given an unknown instance $\mathbf{x}' = (a_1', \dots, a_n')$,
instead of looking up tables, calculate the conditional probabilities with the normal distributions obtained in the learning phase and apply the MAP rule as before.
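A corresponding sketch for continuous features, fitting one normal distribution per (feature, class) pair; as above, the names (train_gaussian_nb, normal_pdf, classify_gaussian) are illustrative, and the standard deviation is floored at a tiny constant only to keep the density well defined.

```python
import math
from collections import defaultdict

def train_gaussian_nb(examples):
    """Learning phase: one (mu_ji, sigma_ji) per feature j and class c_i, plus class priors."""
    by_class = defaultdict(list)
    for features, label in examples:
        by_class[label].append(features)
    priors, params = {}, {}
    for c, rows in by_class.items():
        priors[c] = len(rows) / len(examples)
        for j in range(len(rows[0])):
            values = [row[j] for row in rows]
            mu = sum(values) / len(values)
            var = sum((v - mu) ** 2 for v in values) / len(values)
            params[(j, c)] = (mu, math.sqrt(var) or 1e-9)   # avoid a zero standard deviation
    return priors, params

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def classify_gaussian(priors, params, features):
    """Test phase: compare [prod_j N(a_j; mu_ji, sigma_ji)] * P(c_i) across classes."""
    scores = {
        c: prior * math.prod(normal_pdf(a, *params[(j, c)]) for j, a in enumerate(features))
        for c, prior in priors.items()
    }
    return max(scores, key=scores.get)
```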
For many real-world tasks, $P(X_1, \dots, X_n \mid C) \neq P(X_1 \mid C) \cdots P(X_n \mid C)$
Nevertheless, naïve Bayes works surprisingly well anyway!
Zero Conditional Probability Problem
If no example contains the attribute value $X_j = a_{jk}$, the count-based estimate gives $\hat{P}(X_j = a_{jk} \mid C = c_i) = 0$.
In this circumstance, the whole product $\hat{P}(x_1 \mid c_i) \cdots \hat{P}(a_{jk} \mid c_i) \cdots \hat{P}(x_n \mid c_i) = 0$ during test, regardless of the other feature values.
As a remedy, the conditional probabilities are estimated with the m-estimate:
$\hat{P}(X_j = a_{jk} \mid C = c_i) = \dfrac{n_c + mp}{n + m}$
$n_c$: number of training examples for which $X_j = a_{jk}$ and $C = c_i$
$n$: number of training examples for which $C = c_i$
$p$: prior estimate (usually $p = 1/t$ for $t$ possible values of $X_j$)
$m$: weight given to the prior (number of "virtual" examples, $m \geq 1$)
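A sketch of this estimate in Python; the function name m_estimate is illustrative. Note that choosing m = t with p = 1/t reduces it to the familiar Laplace smoothing (n_c + 1)/(n + t).

```python
def m_estimate(n_c, n, t, m=1):
    """m-estimate of P(X_j = a_jk | C = c_i).

    n_c : number of training examples with X_j = a_jk and C = c_i
    n   : number of training examples with C = c_i
    t   : number of possible values of X_j (prior estimate p = 1/t)
    m   : weight given to the prior, i.e. the number of "virtual" examples (m >= 1)
    """
    p = 1.0 / t
    return (n_c + m * p) / (n + m)

# A value never observed with the class (n_c = 0) no longer gives a zero probability:
print(m_estimate(n_c=0, n=9, t=3, m=3))  # 1/12 ≈ 0.083 instead of 0
```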