Starting with the linear classifier
Model
A linear classifier computes scores $s = W x$ for $k$ different visual categories given an image, where $W$ is a matrix and $x$ is an input column vector containing all the pixel data of the image. In the case of CIFAR-10, $x$ is a [3072x1] column vector and $W$ is a [10x3072] matrix, so the output is a vector of 10 class scores.
Mathematical formulation. Given a labeled data set $\{\bx_i, y_i\}_{1 \le i \le n}$, where $\bx_i \in \mathbb{R}^p$ is the $p$-dimensional feature vector of sample $i$ and $y_i \in \{1, \cdots, K\}$ is the ground-truth label, we wish to learn a classifier that labels a new observation $\bx \in \mathbb{R}^p$ with a label $y$.
A linear classifier computes a $K$ dimensional score vector $ \bs = \bW \bx $. Each entry $s_k = \bW_{k, \cdot}^\top \bx$, $1 \le k \le K$ is a score for the $k$-th class. Every row of $\bW$ is a classifier for one of the classes. In statistics, logistic regression can be used to build a 2-class classifier, which corresponds to one row of $\bW$.
The predicted label is $\hat{y} = \underset{k}{\arg\max} \; s_k$.
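As a minimal sketch of the model (the weights and input below are random stand-ins purely for illustration), prediction amounts to one matrix-vector multiply followed by an argmax:

import numpy as np

K, p = 10, 3072                      # e.g. CIFAR-10: 10 classes, 3072 pixel features
W = 0.01 * np.random.randn(K, p)     # stand-in weights (normally learned from data)
x = np.random.randn(p)               # one input column vector
s = W.dot(x)                         # K class scores, s = W x
y_hat = np.argmax(s)                 # predicted label, argmax_k s_k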
Bias trick. The general score function is defined as
$$ f(\bx_i, \bW, \bb) = \bW \bx_i + \bb. $$
However, it is a little cumbersome to keep track of two sets of parameters (the biases (intercept) $\bb$ and the weights (slope) $\bW$) separately. A commonly used trick is to combine the two sets of parameters into a single matrix that holds both of them, by extending the vector $\bx_i$ with one additional dimension that always holds the constant 1, a default bias dimension. With the extra dimension, the new score function simplifies to a single matrix multiply:
$$ f(\bx_i, \bW) = \bW \bx_i. $$
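The bias trick can be checked numerically. The following sketch (with stand-in random numbers, not a prescribed implementation) appends a constant 1 to $\bx_i$ and folds the bias vector into an extra column of $\bW$:

import numpy as np

K, p = 10, 3072
W = 0.01 * np.random.randn(K, p)
b = 0.01 * np.random.randn(K)
x = np.random.randn(p)

scores_two_params = W.dot(x) + b            # f(x, W, b) = W x + b
W_ext = np.hstack([W, b.reshape(-1, 1)])    # [K x (p+1)]: bias folded into the last column
x_ext = np.append(x, 1.0)                   # extra dimension that always holds 1
scores_one_param = W_ext.dot(x_ext)         # f(x, W) = W x

assert np.allclose(scores_two_params, scores_one_param)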
Data preprocessing. In machine learning, it is very common practice to normalize your input features (in the case of images, every pixel is thought of as a feature). In particular, it is important to center your data by subtracting the mean from every feature. A further common preprocessing step is to scale each input feature so that its values range in $[-1, 1]$. Of these, zero-mean centering is arguably more important, but we will have to wait for its justification until we understand the dynamics of gradient descent.
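For concreteness, a sketch of this preprocessing on an assumed [N x p] matrix of raw pixel values in [0, 255] (the data below is a random stand-in):

import numpy as np

X_train = np.random.randint(0, 256, size=(500, 3072)).astype(np.float64)  # stand-in pixel data
mean_image = X_train.mean(axis=0)   # per-feature mean, estimated on the training set
X_train -= mean_image               # zero-center every feature
X_train /= 128.0                    # roughly scale each feature into [-1, 1]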
Estimation
Building a classifier boils down to estimating the parameter $\bW$ from the observed samples $\{\bx_i, y_i\}_{1 \le i \le n}$. A common method in machine learning is to minimize some loss function of the discrepancy between the predicted label $\hat{y}_i$ and the ground truth $y_i$. The loss function quantifies our unhappiness with predictions on the training set.
Different definitions of the loss function lead to different classification methods. In the following, we consider two commonly used classifiers – the SVM and the Softmax classifier.
Multiclass Support Vector Machine loss
A commonly used loss is called the Multiclass Support Vector Machine (SVM) loss. The Multiclass SVM loss for the i-th sample is formalized as follows:
$$ L_i = \sum_{j\neq y_i} \max(0, s_j - s_{y_i} + \Delta) $$
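As a minimal numpy sketch of this per-example loss (the function name and the toy scores below are illustrative, not taken from any particular code base):

import numpy as np

def svm_loss_i(s, y_i, delta=1.0):
    # s: [K] vector of class scores for one example; y_i: index of the correct class
    margins = np.maximum(0, s - s[y_i] + delta)
    margins[y_i] = 0                 # the sum runs over j != y_i
    return margins.sum()

print(svm_loss_i(np.array([3.2, 5.1, -1.7]), y_i=0))   # only class 1 violates the margin: 5.1 - 3.2 + 1 = 2.9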
Remark.
- Note that
$$ \max(0, s_j - s_{y_i} + \Delta) = \begin{cases} 0, & \text{if } s_{y_i} \ge s_j + \Delta, \\ s_j - s_{y_i} + \Delta > 0, & \text{if } s_{y_i} < s_j + \Delta. \end{cases} $$
Hence $L_i$ achieves its minimal value 0 when the classifier scores the correct class higher than every other class by at least the margin, that is, when $s_{y_i} \ge s_j + \Delta$ for all $j \ne y_i$.
- The threshold-at-zero $\max(0,-)$ function is often called the hinge loss. The squared hinge loss SVM (or L2-SVM) uses the form $\max(0,-)^2$, which penalizes violated margins more strongly (quadratically instead of linearly). The unsquared version is more standard, but on some datasets the squared hinge loss can work better. This can be determined during cross-validation.
Regularization. Note that the $\hat{\bW}$ minimizing $L_i$ is not unique when a classifier with $\hat{\bW}$ correctly classifies every example (i.e. all scores are such that all the margins are met, and $L_i = 0$ for all $i$). Any multiple $\lambda \bW$ of these parameters with $\lambda > 1$ also gives zero loss, because this transformation uniformly stretches all score magnitudes and hence also their absolute differences. We wish to encode some preference for a certain set of weights $\bW$ over others to remove this ambiguity. We can do so by extending the loss function with a regularization penalty $R(\bW)$. The most common regularization penalty is the L2 norm, which discourages large weights through an elementwise quadratic penalty over all parameters:
$$ R(\bW) = \norm{\bW}^2_F. $$
Including the regularization penalty completes the full Multiclass Support Vector Machine loss, which is made up of two components: the data loss (which is the average loss $L_i$ over all examples) and the regularization loss. That is, the full Multiclass SVM loss becomes:
$$ L = \underbrace{ \frac{1}{N} \sum_i L_i }_{\text{data loss}} + \underbrace{ \lambda R(\bW) }_{\text{regularization loss}}, $$
where \(N\) is the number of training examples. Hyperparameter $\lambda$ is usually determined by cross-validation.
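A possible vectorized sketch of this full loss over a batch, assuming X holds one example per row and lam plays the role of $\lambda$ (function and variable names are illustrative):

import numpy as np

def full_svm_loss(W, X, y, lam, delta=1.0):
    # W: [K x p] weights, X: [N x p] data, y: [N] correct-class indices
    N = X.shape[0]
    scores = X.dot(W.T)                            # [N x K] class scores
    correct = scores[np.arange(N), y][:, None]     # [N x 1] correct-class scores
    margins = np.maximum(0, scores - correct + delta)
    margins[np.arange(N), y] = 0                   # exclude the j = y_i terms
    data_loss = margins.sum() / N                  # (1/N) sum_i L_i
    reg_loss = lam * np.sum(W * W)                 # lambda * ||W||_F^2
    return data_loss + reg_loss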
Remark.
- In addition to the motivation above, there are many desirable properties that come with including the regularization penalty. For example, it turns out that including the L2 penalty leads to the appealing max margin property in SVMs (see CS229 lecture notes for full details).
- The most appealing property is that penalizing large weights tends to improve generalization, because it means that no input dimension can have a very large influence on the scores all by itself. Since the L2 penalty prefers smaller and more diffuse weight vectors, the final classifier is encouraged to take all input dimensions into account in small amounts rather than rely on a few input dimensions very strongly. As we will see later in the class, this effect can improve the generalization performance of the classifiers on test images and lead to less overfitting.
- The biases do not have the same effect since, unlike the weights, they do not control the strength of influence of an input dimension.
- Due to the regularization penalty we can never achieve a loss of exactly 0.0 on all examples, because this would only be possible in the pathological setting of $\bW = 0$.
- Making good predictions on the training set is equivalent to minimizing the loss. All we have to do now is to come up with a way to find the weights that minimize the loss.
Later. We leave the detailed code and practical considerations of the SVM to later study.
Softmax classifier
The Softmax classifier is a generalization of the binary Logistic Regression classifier to multiple classes. Unlike the SVM, which treats the outputs $f(x_i,W)$ as (uncalibrated and possibly difficult to interpret) scores for each class, the Softmax classifier gives a slightly more intuitive output (normalized class probabilities) and also has a probabilistic interpretation explained later. In the Softmax classifier, the scores $f(x_i; W) = W x_i$ are interpreted as the unnormalized log probabilities for each class, that is, $\frac{e^{f_{y_i}}}{ \sum_j e^{f_j} }$ is the probability assigned to the correct label $y_i$. Thus, to make the classifier favor the ground-truth class, we want to maximize $\frac{e^{f_{y_i}}}{ \sum_j e^{f_j} }$. This is achieved by minimizing the cross-entropy loss
$$ L_i = -\log\left(\frac{e^{f_{y_i}}}{ \sum_j e^{f_j} }\right) \hspace{0.5in} \text{or equivalently} \hspace{0.5in} L_i = -f_{y_i} + \log\sum_j e^{f_j} $$
As before, the full loss for the dataset is the mean of \(L_i\) over all training examples together with a regularization term \(R(W)\). The function \(f_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}} \) is called the softmax function: It takes a vector of arbitrary real-valued scores (in \(z\)) and squashes it to a vector of values between zero and one that sum to one. The full cross-entropy loss that involves the softmax function might look scary if you’re seeing it for the first time but it is relatively easy to motivate.
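A minimal sketch of the per-example cross-entropy loss in the second (log-sum-exp) form above; the max shift anticipates the numerical-stability trick discussed below, and the toy scores are illustrative:

import numpy as np

def softmax_loss_i(f, y_i):
    # f: [K] vector of class scores for one example; y_i: index of the correct class
    f = f - np.max(f)                            # shift for numerical stability (see below)
    return -f[y_i] + np.log(np.sum(np.exp(f)))   # L_i = -f_{y_i} + log sum_j e^{f_j}

print(softmax_loss_i(np.array([3.2, 5.1, -1.7]), y_i=0))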
Information theory view. The cross-entropy between a “true” distribution \(p\) and an estimated distribution \(q\) is defined as:
$$ H(p,q) = - \sum_x p(x) \log q(x) $$
The Softmax classifier is hence minimizing the cross-entropy between the estimated class probabilities ( \(q = e^{f_{y_i}} / \sum_j e^{f_j} \) as seen above) and the “true” distribution, which in this interpretation is the distribution where all probability mass is on the correct class (i.e. \(p = [0, \ldots 1, \ldots, 0]\) contains a single 1 at the \(y_i\) -th position.). Moreover, since the cross-entropy can be written in terms of entropy and the Kullback-Leibler divergence as \(H(p,q) = H(p) + D_{KL}(p||q)\), and the entropy of the delta function \(p\) is zero, this is also equivalent to minimizing the KL divergence between the two distributions (a measure of distance). In other words, the cross-entropy objective wants the predicted distribution to have all of its mass on the correct answer.
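To make the link explicit: with the one-hot distribution $p$ placing all of its mass on $y_i$, the general cross-entropy reduces exactly to the loss $L_i$ defined above:
$$ H(p, q) = - \sum_k p_k \log q_k = -\log q_{y_i} = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right) = L_i. $$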
Probabilistic interpretation. Looking at the expression, we see that
$$ P(y_i \mid x_i; W) = \frac{e^{f_{y_i}}}{\sum_j e^{f_j} } $$
can be interpreted as the (normalized) probability assigned to the correct label \(y_i\) given the image \(x_i\) and parameterized by \(W\). To see this, remember that the Softmax classifier interprets the scores inside the output vector \(f\) as the unnormalized log probabilities. Exponentiating these quantities therefore gives the (unnormalized) probabilities, and the division performs the normalization so that the probabilities sum to one. In the probabilistic interpretation, we are therefore minimizing the negative log likelihood of the correct class, which can be interpreted as performing Maximum Likelihood Estimation (MLE). A nice feature of this view is that we can now also interpret the regularization term \(R(W)\) in the full loss function as coming from a Gaussian prior over the weight matrix \(W\), where instead of MLE we are performing the Maximum a posteriori (MAP) estimation. We mention these interpretations to help your intuitions, but the full details of this derivation are beyond the scope of this class.
Practical issues: Numeric stability. When you’re writing code for computing the Softmax function in practice, the intermediate terms \(e^{f_{y_i}}\) and \(\sum_j e^{f_j}\) may be very large due to the exponentials. Dividing large numbers can be numerically unstable, so it is important to use a normalization trick. Notice that if we multiply the top and bottom of the fraction by a constant \(C\) and push it into the sum, we get the following (mathematically equivalent) expression:
$$ \frac{e^{f_{y_i}}}{\sum_j e^{f_j}} = \frac{Ce^{f_{y_i}}}{C\sum_j e^{f_j}} = \frac{e^{f_{y_i} + \log C}}{\sum_j e^{f_j + \log C}} $$
We are free to choose the value of \(C\). This will not change any of the results, but we can use this value to improve the numerical stability of the computation. A common choice for \(C\) is to set \(\log C = -\max_j f_j \). This simply states that we should shift the values inside the vector \(f\) so that the highest value is zero. In code:
import numpy as np

f = np.array([123, 456, 789]) # example with 3 classes and each having large scores
p = np.exp(f) / np.sum(np.exp(f)) # Bad: numeric problem, potential blowup
# instead: first shift the values of f so that the highest number is 0:
f -= np.max(f) # f becomes [-666, -333, 0]
p = np.exp(f) / np.sum(np.exp(f)) # safe to do, gives the correct answer
Possibly confusing naming conventions. To be precise, the SVM classifier uses the hinge loss, also sometimes called the max-margin loss. The Softmax classifier uses the cross-entropy loss. The Softmax classifier gets its name from the softmax function, which is used to squash the raw class scores into normalized positive values that sum to one, so that the cross-entropy loss can be applied. In particular, note that technically it does not make sense to talk about the "softmax loss", since softmax is just the squashing function, but it is a relatively commonly used shorthand.