The math behind the Hard Margin classifier — Support Vector Machines.
First of all, a Support Vector Machine is a machine learning algorithm used for both classification and regression problems. The main theme of this algorithm is to create a decision boundary that maximizes the margin between the classes in the data we have; the margin is measured with respect to the nearest data points, and those nearest data points are known as the support vectors. So whenever a new data point is seen, it will be correctly classified by that decision boundary, because it separates the classes with the maximum margin. Despite these advantages, the hard margin classifier in the SVM is limited to linearly separable problems, because it forces some constraints that every data point must satisfy.
Now let’s dive into the math behind those constraints.
Equation of the hyperplane:
We know that the general form of the equation of a line is y = mx + b, where m is the slope of the line and b is the bias (intercept) term, which gives the offset of the line from the origin.
The other general form is ax + by + c = 0. This is the form we will use the most when working with multi-dimensional data.
Now, from that equation we get y = (-a/b)*x - (c/b), so the slope is m = -a/b and the intercept is -c/b.
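As a quick sanity check, here is a tiny Python sketch (the coefficients are made up for illustration) that converts a line from the general form ax + by + c = 0 into slope-intercept form:

```python
# Convert a line from general form a*x + b*y + c = 0
# to slope-intercept form y = m*x + k.
a, b, c = 2.0, 4.0, 8.0      # illustrative coefficients

m = -a / b                   # slope m = -a/b      -> -0.5
k = -c / b                   # intercept k = -c/b  -> -2.0
print(f"y = {m}*x + {k}")    # prints: y = -0.5*x + -2.0
```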
Now we will modify that equation a little: w1*x1 + w2*x2 + w0 = 0. This is the equation of a hyperplane for 2-dimensional data. See the below image for a clear understanding.
If we observe carefully, the equation is nothing but the dot product of two vectors W and X, so it can be written more compactly as W^T*X + W0 = 0, where W^T is the transpose of the weight vector W that is dotted with X, and W0 is the bias term, which is zero when the hyperplane passes through the origin.
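To make the vector form concrete, here is a minimal NumPy sketch (the weight vector, bias, and point below are made-up values, not from any fitted model) that evaluates W^T*X + W0 for a 2-dimensional point:

```python
import numpy as np

# Illustrative hyperplane parameters
W = np.array([1.0, 2.0])   # weight vector [w1, w2]
W0 = -3.0                  # bias term

x = np.array([2.0, 1.5])   # a 2-dimensional data point [x1, x2]

# W^T*X + W0 is just a dot product plus the bias
value = np.dot(W, x) + W0
print(value)               # 2.0 -> this point lies on the positive side of the hyperplane
```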
Now let’s look into the hard margin classifier.
As I said earlier, the hard margin classifier is limited to linearly separable problems. That is why I took an example of a linearly separable problem with two classes, a positive class and a negative class. If you observe, there are three equations. The negative margin boundary has the equation W^T*X+b = -1: whenever a data point falls in the negative region, W^T*X+b must be less than or equal to -1. The decision boundary has the equation W^T*X+b = 0: a data point falling exactly on it gives zero, since it lies midway between the positive class and the negative class. The positive margin boundary has the equation W^T*X+b = +1: whenever a data point falls in the positive region, W^T*X+b must be greater than or equal to +1.
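Those three equations translate directly into a simple region check based on the value of W^T*X + b. Here is a small sketch with made-up parameters (not from a trained model):

```python
import numpy as np

W = np.array([1.0, 1.0])   # illustrative weight vector
b = -4.0                   # illustrative bias

def region(x):
    """Report where a point falls relative to the margin hyperplanes."""
    score = np.dot(W, x) + b
    if score >= 1:
        return "positive class (on or beyond W^T*X+b = +1)"
    if score <= -1:
        return "negative class (on or beyond W^T*X+b = -1)"
    return "on the decision boundary or inside the margin"

print(region(np.array([3.0, 3.0])))   # score = +2 -> positive class
print(region(np.array([1.0, 1.0])))   # score = -2 -> negative class
print(region(np.array([2.0, 2.0])))   # score =  0 -> on the decision boundary
```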
Now, in the above picture, I took a data point that falls on the decision boundary, but it has to be classified as the positive class. For that to happen, the point must move some units towards the positive side; that distance is exactly the margin. So we need to calculate the distance from the decision boundary hyperplane to the nearest data points, which are the support vectors.
See the below image for a clear understanding.
So we finally calculated the total size of the margin: the distance between the planes W^T*X+b = +1 and W^T*X+b = -1 is 2/||W||. If we look carefully, the margin size is inversely proportional to the magnitude of the weight vector, so in order to increase the margin we need to minimize that magnitude.
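A minimal sketch of that relationship, with illustrative weight vectors, shows how shrinking ||W|| widens the margin:

```python
import numpy as np

def margin_width(W):
    """Total margin between W^T*X+b = +1 and W^T*X+b = -1, i.e. 2 / ||W||."""
    return 2.0 / np.linalg.norm(W)

print(margin_width(np.array([3.0, 4.0])))   # ||W|| = 5 -> margin = 0.4
print(margin_width(np.array([0.6, 0.8])))   # ||W|| = 1 -> margin = 2.0 (smaller norm, wider margin)
```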
Decision rules:
For y = +1 -> W^T*X+b ≥ +1. The reason it should always be greater than or equal to +1 is that whenever a data point is labeled as the positive class, it is either a support vector, for which W^T*X+b equals exactly +1, or a point farther from the boundary, for which it is greater than +1.
For y = -1 -> W^T*X+b ≤ -1. The reason it should always be less than or equal to -1 is that whenever a data point is labeled as the negative class, it is either a support vector, for which W^T*X+b equals exactly -1, or a point farther from the boundary, for which it is less than -1.
We can combine these two rules into a single rule that covers both conditions: y*(W^T*X+b) ≥ +1. This holds for either value of y. If y = +1, then W^T*X+b is at least +1, so the product is at least +1. If y = -1, then W^T*X+b is at most -1, and multiplying two negative quantities gives a product that is again at least +1. So one inequality describes every correctly classified point.
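To see the combined rule in action, here is a small sketch that checks y*(W^T*X+b) ≥ 1 for every training point at once (the toy points and the candidate W, b are made up for illustration):

```python
import numpy as np

# Toy linearly separable data (illustrative only)
X = np.array([[3.0, 3.0],
              [4.0, 2.0],
              [1.0, 1.0],
              [0.0, 2.0]])
y = np.array([+1, +1, -1, -1])

# Candidate hyperplane parameters (illustrative)
W = np.array([1.0, 1.0])
b = -4.0

# Hard margin constraint: y_i * (W^T x_i + b) >= 1 for every point
margins = y * (X @ W + b)
print(margins)               # [2. 2. 2. 2.]
print(np.all(margins >= 1))  # True -> this (W, b) satisfies all the constraints
```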
Final Formulation
So our main aim in the hard margin classifier is to find the best W and b: we minimize the magnitude of the weight vector, which maximizes the margin between the classes and prevents misclassification. Formally, we minimize (1/2)*||W||^2 subject to the constraint y_i*(W^T*X_i + b) ≥ 1 for every training point i.
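In practice you rarely solve this optimization by hand. One common shortcut, sketched below under my own assumptions, is to approximate the hard margin by fitting scikit-learn’s linear SVC with a very large C, which makes any margin violation prohibitively expensive:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (illustrative)
X = np.array([[3.0, 3.0], [4.0, 2.0], [1.0, 1.0], [0.0, 2.0]])
y = np.array([+1, +1, -1, -1])

# A huge C penalizes margin violations so heavily that the soft margin
# solution effectively behaves like the hard margin one.
clf = SVC(kernel="linear", C=1e10)
clf.fit(X, y)

W = clf.coef_[0]
b = clf.intercept_[0]
print("W =", W, "b =", b)
print("support vectors:\n", clf.support_vectors_)
print("margin width =", 2.0 / np.linalg.norm(W))
```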
So this is all about the hard margin classifier and the math behind it. The main problem is that it can’t handle non-linearly separable data: it forces these constraints on every data point, which can only be satisfied when the data is linearly separable.
In the next blog, I will post about the soft margin classifier. So stay tuned!