Support Vector Machines (1)

The Support Vector Machine (SVM) is a widely used supervised learning algorithm for learning both linear and non-linear functions.

Motivation

Classifying data is a common task in machine learning. Given a set of training data, each example marked as belonging to one of two categories, the goal is to decide which class a new data point will be in. In the Support Vector Machine (SVM) training algorithm, a data point is viewed as an n-dimensional vector, and we want to know whether we can separate such points with an (n-1)-dimensional hyperplane; we call this a Linear Classifier. Usually there are many hyperplanes that might classify the data. We can choose the best hyperplane as the one that represents the largest separation, or margin, between the two classes. That is, the distance from it to the nearest data point on each side is maximized. The SVM algorithm learns (finds) such a maximum-margin hyperplane, or maximum-margin binary classifier.

Linear Classifiers and Margin

In Logistic Regression, we have the logistic regression hypothesis:

\displaystyle h(\theta, X) = S(\theta^TX) = \frac{1}{1+e^{-\theta^TX}}

and make the following classification based on the estimated probability:

  • When the probability of y = 1 is greater than 0.5, we predict y = 1;
  • Else we predict y = 0.

From the properties of the Sigmoid function, the equivalent Linear Decision Boundary, or Linear Classifier, is:

\displaystyle \theta^TX=\theta_0+\theta_1x_1+\theta_2x_2 +...+ \theta_nx_n\ge 0

Let us rewrite

\theta^T = (\theta_0,\theta_1,\theta_2,...,\theta_n), X^T=(1, x_1, x_2, ..., x_n)

Then the Linear Decision Boundary or Linear Classifier is

z = \theta^TX \ge 0

Now we can modify the Linear Classifier a little bit by adding decision margins to get the SVM. This is why we refer to the SVM as a Large Margin Classifier:


  • If Y = 1, we need z \ge 1, not just z \ge 0
  • If Y = 0, we need z \le -1, not just z \le 0

We can translate these Sigmoid-function terms into hyperplane language. If the training data are linearly separable, we can select two hyperplanes that separate the data with no points between them, and then try to maximize their distance. The region bounded by them is called “the margin”. These hyperplanes can be described by the equations

  • Classifier Boundary:    \{X: \theta^TX = 0\}
  • Plus-Plane:   \{X: \theta^TX = +1\}
  • Minus-Plane:   \{X: \theta^TX = -1\}


Then the classification decision for a new incoming example is:

  • Positive Class (i.e. y=1):  if  \theta^TX \ge +1
  • Negative Class (i.e. y=0):   if  \theta^TX \le -1

How to Calculate the Margin Width

The distance between the Plus-Plane and Minus-Plane is called the margin width. How to calculate the margin width M in terms of \theta = (\theta_0, \theta_1,\theta_2,...,\theta_n)?


From the Classifier Boundary equation {X: \theta^TX = 0} we know the vector \theta is perpendicular to the Plus-Plane and Minus-Plane.

Let

  • X^- be any point on the Minus-Plane
  • X^+ be the closest Plus-Plane point to X^-.

They are not necessarily example data points. Obviously, the vector X^+ - X^- is perpendicular to the Plus-Plane and Minus-Plane, and the margin width M = ||X^+ - X^-||.

Therefore, we have the following equations:

  • 1.  \theta^TX^+ = +1
  • 2.  \theta^TX^- = -1
  • 3.  X^+ - X^- =k \theta    (for some value k).
  • 4.  M = ||X^+ - X^-||

Now it is easy to calculate the margin width M: Subtracting Eq. 2 from Eq. 1, we have

 \theta^T(X^+ -X^-) = 2

and then applying Eq. 3, we get

k \theta^T\theta = 2

So we have

\displaystyle k = \frac{2}{\theta^T\theta}=\frac{2}{||\theta||^2}

Finally from Eq.4 we have

\displaystyle M = ||X^+ - X^-|| = ||k \theta|| = k ||\theta|| =\frac{2}{||\theta||}
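As a quick sanity check of this formula, here is a tiny numeric sketch (the value of \theta is purely illustrative, and \theta_0 = 0 is assumed so that points on the planes are easy to construct):

[code language="python"]
# A tiny numeric check of M = 2/||theta||, using an illustrative theta
# with theta_0 = 0, so theta^T X reduces to w . x on the feature part.
import numpy as np

theta = np.array([0.0, 3.0, 4.0])   # (theta_0, theta_1, theta_2)
w = theta[1:]                       # direction perpendicular to the planes
x_minus = -w / (w @ w)              # a point with theta^T X = -1 (Minus-Plane)
x_plus = w / (w @ w)                # the closest Plus-Plane point to x_minus

M = np.linalg.norm(x_plus - x_minus)
print(M, 2 / np.linalg.norm(w))     # both print 0.4
[/code]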


The data points lying on either the Minus-Plane or the Plus-Plane are called Support Vectors.

SVM – Optimization

Therefore, learning the SVM can be formulated as an optimization problem: search the \theta space to find the widest margin that matches all the training example data points. The constraints are

  • \theta^TX_j \ge +1,  if y_j = 1
  • \theta^TX_j \le -1,  if y_j = 0

for j=1,2,...,m over the whole training dataset.

We can write these constraints compactly as

y_j\theta^TX_j + (y_j-1)\theta^TX_j = (2y_j -1)\theta^TX_j \ge 1, j=1,2,...,m

Equivalently,  from the margin width formula above, we have the optimization problem

\displaystyle \min_{\theta_0,\theta_1,...,\theta_n} \frac{1}{2}\sum_{i=1}^n\theta_i^2 = \frac{1}{2}||\theta||^2

s.t.     (2y_j -1)\theta^TX_j\ge 1, j=1,2,...,m.

This is a quadratic optimization problem subject to linear constraints, and it has a unique minimum.

Quadratic Programming (QP) is a well-studied class of optimization problems: optimizing a quadratic function of real-valued variables subject to linear constraints (linear equality and/or inequality constraints). There exist algorithms for finding such constrained quadratic optima much more efficiently and reliably than the gradient descent method.
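As an illustration only, here is a minimal sketch that feeds the primal problem above to a generic constrained solver (scipy's SLSQP) on a made-up, linearly separable toy set. A real SVM implementation would use a dedicated QP solver instead:

[code language="python"]
# A sketch of the hard-margin SVM primal: minimize (1/2) sum_{i>=1} theta_i^2
# (bias theta_0 excluded, as in the objective above) subject to
# (2y_j - 1) theta^T X_j >= 1. Toy data and solver choice are illustrative.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 2.0, 2.0], [1.0, 3.0, 3.0],    # examples with y = 1
              [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])   # examples with y = 0
y = np.array([1, 1, 0, 0])

objective = lambda theta: 0.5 * theta[1:] @ theta[1:]
constraints = {"type": "ineq",                     # SLSQP wants fun(theta) >= 0
               "fun": lambda theta: (2 * y - 1) * (X @ theta) - 1}

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
theta = res.x
print("theta =", theta, " margin width M =", 2 / np.linalg.norm(theta[1:]))
[/code]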

 

 

 

Tips for Writing & Editing in WordPress.com

1. How to Post Source Code — Code Snippet

Use the square bracket syntax [ code] … [ /code], NOT the angle bracket syntax < code> … < /code>. For example,

[ code]
your code line 1
your code line 2
[ /code]

will yield

your code line 1
your code line 2

You can also use the parameter “language” or “lang” to specify how to highlight your code. See here for the details.

 

2. How to Write LaTeX and Avoid the “Does Not Parse” Error

To type a formula in WordPress, type $ latex [Your LaTeX code here] $. For example, $ latex \sum_{i=1}^{n}X_i$ produces \sum_{i=1}^{n}X_i.

To display a formula in math mode, type <p align=”center”> $ latex \displaystyle <Your LaTeX code> $</p>.

For example, <p align=”center”>$ latex \displaystyle \sum_{n=1}^\infty \frac{1}{n^2} = \frac{\pi^2}{6}. $</p> will yield

\displaystyle \sum_{n=1}^\infty \frac{1}{n^2} = \frac{\pi^2}{6}.

See here for the details.

Notes: If you type LaTeX code in the WordPress.com Visual editor, DO NOT insert any extra spaces between $ latex and [your code], or any extra spaces inside your LaTeX code. Otherwise, it will yield the “does not parse” error. For example, $ latex x^2$ works fine and produces x^2, but $ latex  x^2$ will not work and yields the raw text instead, since there are two spaces between $ latex and x^2 in the code. In the Text editor, it is OK to insert more spaces.

3. How to Write Case Functions and Aligned Equations

 

You can use \begin{cases} … \end{cases} to write a case function. For example,

$ latex \begin{cases}
x^2, & \text{ if } x>2 \\
1,& \text{ if } x \le 2
\end{cases}$ 

 

 

will yield

\begin{cases}  x^2, & \text{ if } x>2 \\  1,& \text{ if } x \le 2  \end{cases}

You can use \begin{aligned} … \end{aligned} to align equations. Do not use \begin{align*} … \end{align*}, since it will not work in WordPress.com. For example,

$ latex \begin{aligned}
x &=2(a-b)+b \\
&=2a -b
\end{aligned}$

will produce

\begin{aligned}  x &=2(a-b)+b \\  &=2a -b  \end{aligned}

Overfitting and Regularization

In a regression or logistic regression model, fitting a linear function to the data might be too simple and lead to high bias. A higher-order polynomial can fit the training set better, but it may not produce a better model, since it fails to generalize to new examples. This is known as the overfitting issue.

Overfitting usually goes along with very large parameter values. To prevent the overfitting issue, we can penalize the θ parameters and make some of them really small.

Regularization

For linear regression with regularization, we modify the cost function and add a regularization term to shrink all the parameters (except θ0). That is, we add the penalty term and the cost function becomes

\displaystyle J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h(\theta, X_i)-y_i)^2+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2

Therefore, the regularized Gradient Descent update rule for linear regression becomes:

Simultaneously Update:

\displaystyle \theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^m[ h(\theta, X_i)-y_i]x_{i0}

\displaystyle \theta_j := \theta_j (1-\alpha \frac{\lambda}{m} ) - \alpha\frac{1}{m}\sum_{i=1}^m[ h(\theta, X_i)-y_i]x_{ij}, j= 1,2,...,n.

where \alpha is the learning rate and \lambda is the regularization parameter.

The \lambda should be chosen carefully, not too big, so that the regularized update factor (1-\alpha \frac{\lambda}{m}) stays around 0.95 to 0.99. The second term is exactly the same as in the original gradient descent.
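For illustration, here is a minimal numpy sketch of one pass of this regularized update (the function name and the default values for \alpha and \lambda are placeholders, not tuned choices):

[code language="python"]
# A sketch of one regularized gradient descent step for linear regression,
# mirroring the two update rules above. X is the m x (n+1) design matrix
# with a leading column of ones; theta_0 (index 0) is not shrunk.
import numpy as np

def regularized_gd_step(theta, X, y, alpha=0.01, lam=1.0):
    m = X.shape[0]
    grad = X.T @ (X @ theta - y) / m                 # (1/m) sum [h - y] x_ij
    new_theta = theta - alpha * grad                 # ordinary gradient step
    new_theta[1:] -= alpha * (lam / m) * theta[1:]   # shrink factor (1 - alpha*lam/m)
    return new_theta
[/code]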

For the logistic regression with the convex cost function, the regularized cost function becomes

\displaystyle J(\theta) = -\frac{1}{m}\sum_{i=1}^m [y_i\log (h(\theta, X_i))+(1-y_i)\log (1-h(\theta,X_i))] +\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2

In the same way, the regularized update factor (1-\alpha \frac{\lambda}{m}) is applied to the regularized Gradient Descent update rule for logistic regression.

Logistic Regression

When the target variable y is a continuous value, the linear regression model can be used to solve the prediction problem.

When the target variable y is a discrete value, we can modify the linear regression model to develop the logistic regression algorithm and solve classification problems.

Binary Class Problem

The target variable y is either  1 or 0:

  • 1 – positive class (presence)
  • 0 – negative class (absence)

We could modify the linear regression model as the classifier by

  • Rescaling the model output into a certain range, say [0, 1].
  • Then setting an appropriate threshold value (say 0.5 or 0.7) for the scaled model output to assign the positive or negative class.

Logistic Hypothesis

In linear regression, we have the linear hypothesis:

h(\theta, X) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ....+ \theta_n x_n = \theta^TX

For Logistic Regression, we can rescale the hypothesis

h(\theta, X) = S (\theta^T X) : \rightarrow [0, 1]

by applying the Sigmoid function (aka  Logistic function):

\displaystyle S(t) = \frac{1}{1+e^{-t}}

That is,

\displaystyle h(\theta, X) = S(\theta^TX) = \frac{1}{1+e^{-\theta^TX}}

Since h(\theta, X) \in [0, 1], we can interpret that value as the estimated probability that y=1 on input X for the given parameters \theta. That is,

\displaystyle P(y=1|X; \theta) = h(\theta, X)

Then we have

\displaystyle P(y=0|X; \theta) = 1-P(y=1|X; \theta) = 1- h(\theta, X)

and furthermore,

\displaystyle \frac{P(y=1|X;\theta)}{P(y=0|X; \theta) } = \frac{h(\theta, X)}{1- h(\theta, X)} = e^{\theta^TX}

That is

\displaystyle \log\frac{P(y=1|X;\theta) }{P(y=0|X; \theta) }=\theta^TX

Decision Boundary and Classifier

Therefore it is reasonable to make the following prediction based on the estimated probability:

  • When the probability of y = 1 is greater than 0.5, we predict y = 1;
  • Else we predict y = 0.

From the properties of the Sigmoid function, the equivalent Linear Decision Boundary, or Linear Classifier, is:

\displaystyle \theta^TX=\theta_0+\theta_1x_1+\theta_2x_2 +...+ \theta_nx_n\ge 0


Convex Cost Function

In the linear regression model, we use the Squared-Sum cost function J(\theta) for the linear hypothesis h(\theta, X) = \theta^TX.

\displaystyle J( \theta_0, \theta_1, \theta_2,....,\theta_n)=\frac{1}{2m}\sum_{i=1}^m (h(\theta, X_i) - y_i) ^2 =\frac{1}{2m}\sum_{i=1}^m (\theta^TX_i - y_i) ^2

The Squared-Sum cost function is convex.

However, for the logistic hypothesis here, the Squared-Sum cost function J(\theta) is no longer convex. It does not guarantee the global optimum. We need a different, convex cost function to which we can apply gradient descent.

New Cost Function

Intuitively, we can define

\displaystyle \text{Cost}(h(\theta, X), y) = - \log P(Y=y|X;\theta), \quad y = 0 \text{ or } 1

as the cost or error function for the data point (X, y).


That is, 

\displaystyle \text{Cost}(h(\theta, X),y)=\begin{cases}-\log h(\theta, X), & \text{ if } y =1 \\ -\log (1-h(\theta, X)), & \text{ if } y=0\end{cases}

Since y is always 0 or 1, we can  rewrite the piece-wise cost function as

\displaystyle \text{Cost}(h(\theta, X),y)= -y \log (h(\theta, X) ) - (1-y) \log (1-h(\theta, X))

Therefore the logistic regression cost function is:

\displaystyle J(\theta) = \frac{1}{m}\sum_{i=1}^m \text{Cost}(h(\theta, X_i),y_i)\\= -\frac{1}{m}\sum_{i=1}^m [y_i\log (h(\theta, X_i) ) + (1-y_i) \log (1-h(\theta, X_i))]

where

\displaystyle h(\theta,X)=\frac{1}{1+e^{-\theta^TX}}

 
Good News: J(\theta) is a convex function of \theta. We then can use the gradient descent method to minimize the cost function \displaystyle \min_{\theta}J(\theta):

Simultaneously update all \theta_j, j=0,1,2,...,n:

\displaystyle \theta_j :=\theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m} [h(\theta, X_i) - y_i]x_{ij}

This update rule is the same as the linear regression rule; the only difference is that our definition of the hypothesis has changed.
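A minimal numpy sketch of this rule follows (a sketch only: X is assumed to be the m x (n+1) design matrix with a leading bias column, y holds 0/1 labels, and \alpha and the iteration count are illustrative):

[code language="python"]
# A sketch of batch gradient descent for logistic regression.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd(X, y, alpha=0.1, n_iters=1000):
    theta = np.zeros(X.shape[1])
    m = X.shape[0]
    for _ in range(n_iters):
        h = sigmoid(X @ theta)                  # h(theta, X_i) for all i at once
        theta -= alpha * X.T @ (h - y) / m      # same form as the linear rule
    return theta
[/code]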

Multiclass classification problems

Given a training dataset with three classes C1, C2, and C3, how do we get a logistic regression learning algorithm to classify a new sample?

One vs. All Classification

    1. Replicate the training set into three new training sets:
      • TS1: Set all the C1 target values = 1, and all other target values = 0.
      • TS2: Set all the C2 target values = 1, and all other target values = 0.
      • TS3: Set all the C3 target values = 1, and all other target values = 0.
    2. Use each of the new training sets TS1, TS2, TS3 to independently train the logistic regression classifier, getting three hypotheses h(\theta^{(1)}, X), h(\theta^{(2)}, X), h(\theta^{(3)}, X).
    3. For a new input X, pick the class that maximizes the hypothesis (probability) h(\theta^{(k)}, X), as sketched below.
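Building on the logistic_gd sketch above (again a sketch: class labels are assumed to be the integers 0, 1, ..., K-1):

[code language="python"]
# A sketch of one-vs-all classification: one binary logistic classifier per
# class, then pick the class with the largest estimated probability.
import numpy as np

def one_vs_all(X, y, K, alpha=0.1, n_iters=1000):
    # row k is theta^(k), trained on the relabeled set "class k vs. the rest"
    return np.array([logistic_gd(X, (y == k).astype(float), alpha, n_iters)
                     for k in range(K)])

def predict_class(thetas, X):
    # h(theta^(k), X) for every class k; argmax picks the winner
    return np.argmax(sigmoid(X @ thetas.T), axis=1)
[/code]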

Linear Regression with Multiple Variables

Linear regression is a supervised learning algorithm; it attempts to model the relationship between variables by fitting a linear equation to the training or observed data. It can be used to solve prediction or classification problems.

Problem Formulation

Let

  • X=(x_0,x_1,x_2, ...., x_n ) \in R^{n+1} be an (n+1)-dimensional column vector, where we always have x_0 = 1 as the “bias”, and x_1, x_2, ..., x_n are the n features (the n independent variables).

  • y \in R^1 be the dependent “target” variable.
  • (X, y) is a single training example
  • \{(X_i, y_i), i= 1,2,....,m\} the training set with m examples. The  i-th sample is X_i = (x_{i0}, x_{i1}, x_{i2}, ...,x_{in})^T, where x_{ij} is the j-th feature value.
  • Hypothesis of the linear relationship y = h(\theta, X), where \theta = (\theta_0,\theta_1,\theta_2, ...., \theta_n) \in R^{n+1} is the vector of n+1 parameters to be determined, and

h(\theta, X) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ....+ \theta_n x_n = \theta^TX

The Cost Function is:

\displaystyle J( \theta_0, \theta_1, \theta_2, ...., \theta_n) = \frac{1}{2m} \sum_{i=1}^m (h(\theta, X_i) - y_i) ^2

Then the algorithm's objective is to minimize the cost function J() over the \theta-parameter space. That is:

\displaystyle \min_{\theta_0,\theta_1,...,\theta_n }J( \theta_0, \theta_1, \theta_2, ...., \theta_n)

Gradient Descent

Gradient Descent is an optimization method that is used all over machine learning for minimization.

To minimize the cost function J( \theta_0, \theta_1, \theta_2, ...., \theta_n), do the following until convergence:

Simultaneously Update:

\displaystyle \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j}J(\theta_0, \theta_1, \theta_2, ...., \theta_n), j= 0,1,2,...,n.

where \alpha is the learning rate.

Warning: If you implement the non-simultaneous update it’s not gradient descent, and will behave weirdly.

Linear Regression Algorithm

For the linear regression cost function

\displaystyle J( \theta_0, \theta_1, \theta_2, ...., \theta_n) = \frac{1}{2m} \sum_{i=1}^m (h(\theta, X_i) - y_i ) ^2

It is easy to get:

\displaystyle \frac{\partial}{\partial \theta_j}J(\theta)=\frac{1}{m}\sum_{i=1}^m[h(\theta, X_i)-y_i]x_{ij},j=0,1,2,...,n.

So we can use this to simultaneously update:

\displaystyle \theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^m[ h(\theta, X_i)-y_i]x_{ij}, j= 0,1,2,...,n.
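A minimal numpy sketch of this simultaneous update (X is assumed to be the m x (n+1) design matrix with a leading column of ones; \alpha and the iteration count are illustrative, not tuned):

[code language="python"]
# A sketch of batch gradient descent for linear regression.
import numpy as np

def linear_gd(X, y, alpha=0.01, n_iters=2000):
    theta = np.zeros(X.shape[1])
    m = X.shape[0]
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / m   # (1/m) sum [h - y] x_ij for all j
        theta -= alpha * grad              # simultaneous update of every theta_j
    return theta
[/code]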

Feature Scaling

If you have a problem with multiple features, you should make sure those features have similar scales so that gradient descent can converge more quickly. We rescale the features so the contours of the cost function become more like circles, and the algorithm is more effective.

Scaling Methods:

  • Max-Min Normalization: x_i \leftarrow (x_i - min) / (max - min)
  • Mean Normalization: x_i \leftarrow (x_i - mean) / max
  • Standardization: x_i \leftarrow (x_i - mean) / std
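Sketches of the three methods above, applied column-wise to a matrix F of feature columns (bias column excluded; the function names are illustrative):

[code language="python"]
# Column-wise implementations of the scaling methods above; the mean
# normalization mirrors the text's formula (dividing by the column max).
import numpy as np

def max_min_normalize(F):
    return (F - F.min(axis=0)) / (F.max(axis=0) - F.min(axis=0))

def mean_normalize(F):
    return (F - F.mean(axis=0)) / F.max(axis=0)

def standardize(F):
    return (F - F.mean(axis=0)) / F.std(axis=0)
[/code]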

Learning Rate α

  • How to choose α? Make sure the gradient descent is working.
  • Plot min J(θ) vs. the number of iterations. If gradient descent is working, then J(θ) should decrease after every iteration.
  • If you plot J(θ) vs. iterations and see the value increasing, you probably need a smaller α.
  • Use a smaller α. If α is small enough, J(θ) will decrease on every iteration.
  • BUT, if α is too small, then the convergence is too slow.
  • Typically:
    1. Try a range of alpha values.
    2. Plot J(θ) vs. number of iterations for each version of alpha.
    3. Go for roughly threefold increases: 0.001, 0.003, 0.01, 0.03, 0.1, 0.3

Least-squares Estimation and Normal Equation

In statistics, we have the normal equation below for parameter estimation:

   \hat{\boldsymbol\beta} = (\mathbf{X}^{\rm T}\mathbf{X})^{-1} \mathbf{X}^{\rm T}\mathbf{y}  = \big(\,{\textstyle\sum} \mathbf{x}_i \mathbf{x}^{\rm T}_i \,\big)^{-1}  \big(\,{\textstyle\sum} \mathbf{x}_i y_i \,\big).

where X is the m×(n+1) design matrix that contains all the training data features (including the bias column), and y is the m×1 column vector formed by the m target values from the training data.
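A one-line numpy sketch of the normal equation (solving the linear system rather than forming the inverse explicitly, which is the numerically preferred route):

[code language="python"]
# theta_hat = (X^T X)^{-1} X^T y, computed via a linear solve rather than
# an explicit matrix inverse. X is the m x (n+1) design matrix.
import numpy as np

def normal_equation(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)
[/code]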

Gradient Descent or Normal Equation

Gradient Descent

  • Need to choose a learning rate
  • Needs many iterations – could make it slower
  • Works well even when n is massive (millions)
  • Better suited to big data
  • What is a big n, though? 100 or even 1000 is still (relatively) small

Normal Equation

  • No need to choose a learning rate
  • No need to iterate, check for convergence, etc.
  • The normal equation needs to compute the inverse of an n×n matrix (X^TX).
    • Matrix inverse computation grows as O(n^3), so it is not great
  • Slow when n is large; can be much slower

Polynomial Regression

By choosing appropriate features you can get different learning algorithms.

By performing an appropriate feature mapping, we can get polynomial regression for non-linear functions from the Linear Regression developed above:

Polynomial Regression:

Hypothesis of polynomial relationship for the single variable x:

h(\theta, x) = \theta_0 + \theta_1 x + \theta_2 x^2 + ....+ \theta_n x^n

We can use the feature mapping x^i \rightarrow x_i (i=1,2,...,n) to convert Polynomial Regression into a Linear Regression problem with multiple variables, as sketched below.
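A small sketch of the mapping (np.vander builds the columns x^0, x^1, ..., x^n from a vector of scalar inputs; the function name is illustrative):

[code language="python"]
# Feature mapping x^i -> x_i: turn single-variable polynomial regression
# into multivariate linear regression on the design matrix below.
import numpy as np

def polynomial_design_matrix(x, n):
    # columns are x^0 (the bias x_0 = 1), x^1, ..., x^n
    return np.vander(np.asarray(x), N=n + 1, increasing=True)
[/code]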

Clustering: K-Means and K-Nearest Neighbor

K-Means and K-Nearest Neighbor (aka K-NN) are two commonly used clustering algorithms. Both automatically group the data into k coherent clusters, but they belong to two different learning categories:

  • K-Means — Unsupervised Learning: learning from unlabeled data
  • K-NN — Supervised Learning: learning from labeled data

K-Means

Input:

  • K (the number of clusters in the data).
  • Training set: \{X_i, i=1,2,..., N\}.

Algorithm:

  1. Initialization: Randomly choose K points \{C_k, k=1,2,...,K \} from the training set as the initial cluster centroids.
  2. Repeat: {

a.  Assign Cluster Step:

for each example \{X_i \}

Calculate the K distances ||X_i - C_k||, k=1,2,...,K.

Set the class label cl_i := the index k (from 1 to K) of the cluster centroid having the minimum distance (cl_i \in \{1,2,...,K\}).

        b. Move Centroids Step:

for k = 1 \text{ to } K (for each class)

C_k := mean of the examples assigned to cluster k

} Until convergence, i.e., the centroids and the cluster assignments no longer change.
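A minimal numpy sketch of this loop (a sketch only: X is an N x d array, and for simplicity no cluster is assumed to become empty):

[code language="python"]
# A sketch of K-Means: random initialization, then alternate the
# Assign Cluster and Move Centroids steps until nothing changes.
import numpy as np

def k_means(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=K, replace=False)]    # initial centroids
    for _ in range(n_iters):
        # Assign Cluster Step: label each example with its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)  # N x K
        cl = dists.argmin(axis=1)
        # Move Centroids Step: each centroid becomes its cluster's mean
        C_new = np.array([X[cl == k].mean(axis=0) for k in range(K)])
        if np.allclose(C, C_new):                       # converged
            break
        C = C_new
    return C, cl
[/code]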

 

K-Means Optimization Objective

Let \{C_k, k=1,2,...,K \} be the K centroids and \{cl_i , i =1, 2,..., N \} be the cluster indices, each in {1,2,…,K}. So the optimization objective of K-Means is:

\displaystyle \min_{C_k, cl_i} J(C_1,C_2, ..., C_K, cl_1, cl_2, ..., cl_N) = \frac{1}{N} \sum_{i=1}^N||X_i - C(cl_i) ||^2

where C(cl_i) is the centroid to which the example X_i has been assigned.

Therefore, for the K-means algorithm,

  • Assign Cluster Step: Minimizing J(…) with respect to the cluster index cl_i without changing the centroid  C_k.
  • Move Centroid Step: Minimizing J(…) with respect to the cluster centroid  C_k without changing the cluster index  cl_i.

How to Choose the Number K

Choosing K?

  • There is not a great way to do this automatically
  • Normally we use visualizations to do it manually

Elbow Method

  • Vary K and compute the cost function J(…) for each K.
  • As K increases, the minimum value of J(…) should decrease (i.e., with finer granularity the centroids can fit the data more closely).
  • Plot this (K vs. J(…)).
  • Look for the “elbow” on the graph.
  • Choose the “elbow” number of clusters.
  • Not really that helpful. If you get a nice plot, this is a reasonable way of choosing K. However, normally you do not get a nice curve, and there is no clear elbow.


K-Nearest Neighbor

Input:

  • K (the number of nearest neighbors chosen, typically a small positive integer).
  • Training set with N samples and M known clusters (N >> M): \{(X_i, c_i), i=1,2,...,N\}, where c_i \in \{1, 2, ...,M\}.

Algorithm:

     For a given new unlabeled sample,

  1. Calculate its distances to all the training samples
  2. Choose the K nearest samples based on the calculated distance
  3. Count the votes for the chosen clusters.
  4. Assign a cluster to the new unlabeled sample using the simple majority vote, as sketched below.
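A minimal numpy sketch of these four steps with Euclidean distance (a sketch only: the labels in c_train are assumed to be integers 0, 1, ..., M-1 so they can be counted with bincount):

[code language="python"]
# A sketch of K-NN classification for a single new sample x_new.
import numpy as np

def knn_predict(X_train, c_train, x_new, K=3):
    dists = np.linalg.norm(X_train - x_new, axis=1)  # 1. distances to all samples
    nearest = np.argsort(dists)[:K]                  # 2. the K nearest samples
    votes = np.bincount(c_train[nearest])            # 3. count the votes
    return votes.argmax()                            # 4. simple majority vote
[/code]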

Distance:

K-NN is distance-based learning, so choosing an appropriate distance is very important. Below are the most commonly used distances:

Let X = (x_1, x_2, ..., x_n), Y = (y_1, y_2, ..., y_n) be two points in R^n. Then the Minkowski distance of order p is defined as

D_p(X, Y) = \displaystyle \left ( \sum_{i=1}^n |x_i - y_i|^p \right ) ^{1/p}

  • p = 1:  Manhattan distance
  • p = 2:  Euclidean distance
  • p = \infty:  Infinity distance  D_{\infty}(X, Y) = \max_i |x_i - y_i|
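A small sketch covering all three special cases above (the function name is illustrative):

[code language="python"]
# Minkowski distance of order p; p=1, p=2, and p=np.inf give the Manhattan,
# Euclidean, and infinity distances respectively.
import numpy as np

def minkowski(X, Y, p):
    diff = np.abs(np.asarray(X, float) - np.asarray(Y, float))
    if np.isinf(p):
        return diff.max()                  # D_inf = max |x_i - y_i|
    return (diff ** p).sum() ** (1.0 / p)
[/code]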

It should also be noted that all three distance measures above are only valid for continuous variables. For categorical variables, the Hamming distance must be used:

D_H(X, Y) = \displaystyle \sum_{i=1}^{n} |x_i - y_i|

Standardized Distance

One major drawback of calculating distance measures directly from the training set arises when variables have different measurement scales, or when there is a mixture of numerical and categorical variables. For example, if one variable is based on annual income in dollars and another on age in years, then income will have a much higher influence on the calculated distance. One solution is to standardize the training set as shown below:

X_s = \displaystyle \frac{X-X_{min}}{X_{max} - X_{min}}