Preface

I'm finally getting to the machine-learning experiments I had put off for ages, and went through the whole SVM derivation from definition to solution: deriving the optimization objective, forming the dual problem, using the KKT conditions to recover the solution of the primal problem, and the SMO algorithm.

Note: in some places, such as the KKT conditions and the SMO algorithm, this post uses results without proving them (too tedious, meow); interested readers can consult the references.

References:

Video walkthrough of the derivation: (系列六) 支持向量机1-硬间隔SVM-模型定义_哔哩哔哩_bilibili

Conditions for Lagrangian duality: 拉格朗日对偶性详解(手推笔记)-CSDN博客

A geometric view of the KKT conditions and duality: 机器学习笔记(8)-对偶关系和KKT条件 - Epir - 博客园 (cnblogs.com)

Code reference: 统计学习:线性可分支持向量机(Cvxpy实现) - orion-orion - 博客园 (cnblogs.com)

I. The Optimization Problem

When a classification problem is linearly separable, we want to find the optimal hyperplane that separates the classes. This optimal hyperplane should satisfy the following property:

the margin between the hyperplane and the sample points closest to it should be as large as possible

A simple example:

For a binary classification problem whose samples have 2 features (so each point lives in a 2-D plane), the hyperplane is a one-dimensional line. We want the line separating the two classes whose distance to the points closest to it is as large as possible.

The optimization problem can then be stated as follows:

Given samples $X = (x_1, x_2, \dots, x_m)^T$, $X \in R^{m \times n}$, $x_i \in R^n$,
and labels $y = (y_1, y_2, \dots, y_m)^T$, $y \in R^m$, $y_i \in \{-1, 1\}$ denoting the class of sample $x_i$,
we want to find an optimal hyperplane $w^T x + b = 0$ that separates the samples with different values of $y_i$, i.e. we want to solve

$max_{w, b}\, min_{x_i} \frac{|w^T x_i + b|}{\|w\|} \quad s.t. \quad y_i(w^T x_i + b) > 0$

Since $y_i(w^T x_i + b) = |w^T x_i + b|$ under this constraint, the problem is equivalent to

$max_{w, b}\, min_{x_i} \frac{y_i(w^T x_i + b)}{\|w\|} \quad s.t. \quad y_i(w^T x_i + b) > 0$

Since the inner minimization is over $x_i$, the factor $\frac{1}{\|w\|}$ does not depend on it and can be pulled out front:

$max_{w, b} \frac{1}{\|w\|}\, min_{x_i}\, y_i(w^T x_i + b) \quad s.t. \quad y_i(w^T x_i + b) > 0$

From $y_i(w^T x_i + b) > 0$ it follows that

there exists $\gamma > 0$ such that $min_{x_i}(y_i(w^T x_i + b)) = \gamma$,

that is, $y_i(w^T x_i + b) \ge \gamma$

Since rescaling $w$ and $b$ does not change the geometric distance from the samples to the hyperplane (we can simply replace $w, b$ by $w/\gamma, b/\gamma$), we may take $\gamma = 1$ to simplify the discussion.

Substituting $min_{x_i}(y_i(w^T x_i + b)) = 1$ into the optimization problem above and updating the constraint accordingly:

$max_{w, b} \frac{1}{\|w\|} \quad s.t. \quad y_i(w^T x_i + b) \ge 1, \quad i = 1, \dots, m$

This is equivalent to

$min_{w, b} \frac{1}{2} w^T w \quad s.t. \quad y_i(w^T x_i + b) \ge 1, \quad i = 1, \dots, m$
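
To make the objective concrete, here is a minimal numpy sketch (the 2-D points and the candidate $w$, $b$ are made up for illustration) that evaluates the quantity being maximized, i.e. the smallest geometric margin over the samples:

import numpy as np

# made-up toy data: two 2-D point clouds and a candidate hyperplane w^T x + b = 0
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 1.0])
b = 0.0

functional_margins = y * (X @ w + b)                            # y_i (w^T x_i + b)
geometric_margin = functional_margins.min() / np.linalg.norm(w)
print(geometric_margin)  # the hard-margin SVM searches for the (w, b) that maximizes this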

II. The Dual Problem

1. Deriving the dual problem

Construct the generalized Lagrangian of the problem above:

$\mathcal{L}(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^m \alpha_i \left( y_i(w^T x_i + b) - 1 \right), \quad \alpha_i \ge 0$

By Lagrangian duality:

Lagrangian duality

For the problem $min_x f(x) \quad s.t. \quad c_i(x) \le 0,\ i = 1, \dots, k; \quad h_j(x) = 0,\ j = 1, \dots, l$,

construct the generalized Lagrangian as follows:

$\mathcal{L}(x, \alpha, \beta) = f(x) + \sum_{i=1}^k \alpha_i c_i(x) + \sum_{j=1}^l \beta_j h_j(x), \quad \alpha_i \ge 0$

Then the original problem is equivalent to

$min_x\, max_{\alpha, \beta} \mathcal{L}(x, \alpha, \beta)$

Proof:

$max_{\alpha, \beta} \mathcal{L}(x, \alpha, \beta) = \begin{cases} f(x), & x \text{ satisfies the constraints} \\ \infty, & x \text{ violates the constraints} \end{cases}$

$min_x\, max_{\alpha, \beta} \mathcal{L}(x, \alpha, \beta) \Leftrightarrow min_x f(x), \quad x \text{ subject to the constraints}$

For any min-max problem $min_a\, max_b f(a, b)$, the corresponding max-min problem $max_b\, min_a f(a, b)$ is its weak dual, i.e.

$max_b\, min_a f(a, b) \le min_a\, max_b f(a, b)$

Proof:

Since $min_a f(a, b) \le f(a, b) \le max_b f(a, b)$,

we get $min_a f(a, b) \le max_b f(a, b)$.

This holds for every $a$ and $b$, and the left-hand side depends only on $b$ while the right-hand side depends only on $a$, so it still holds when we take the maximum over $b$ on the left and the minimum over $a$ on the right:

$max_b\, min_a f(a, b) \le min_a\, max_b f(a, b)$
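
As a quick sanity check of weak duality, here is a tiny numeric illustration with a made-up table of values $f(a, b)$:

import numpy as np

# made-up values of f(a, b): rows indexed by a, columns by b
f = np.array([[3, 1, 4],
              [2, 5, 0],
              [6, 2, 2]])

max_min = f.min(axis=0).max()   # max over b of (min over a of f(a, b))
min_max = f.max(axis=1).min()   # min over a of (max over b of f(a, b))
print(max_min, min_max)         # prints 2 4, consistent with max-min <= min-max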

When the problem satisfies the following two conditions, the inequality above holds with equality, i.e. we have strong duality:

  1. the primal problem is a convex optimization problem
  2. the primal problem satisfies Slater's condition

Convex optimization problem: the objective is a convex function and the feasible set is a convex set.
Slater's condition: the inequality constraint functions are convex and there exists a point that strictly satisfies all of the inequality constraints.

Since $min_x\, max_{\alpha, \beta} \mathcal{L}(x, \alpha, \beta)$ is an optimization problem in $x$,
while the dual problem $max_{\alpha, \beta}\, min_x \mathcal{L}(x, \alpha, \beta)$ is a problem in $\alpha$ and $\beta$,
when strong duality holds we can use the KKT conditions to map a solution of the dual problem to a solution of the primal problem.

The KKT conditions can be stated as follows:

Suppose $\alpha^*, \beta^*$ is a solution of the dual problem and $x^*$ is a solution of the primal problem. Then they satisfy the following conditions:

1. Feasibility

$c_i(x^*) \le 0,\ i = 1, \dots, k; \qquad h_j(x^*) = 0,\ j = 1, \dots, l; \qquad \alpha_i^* \ge 0$

2. Complementary slackness

$\alpha_i^* c_i(x^*) = 0, \quad i = 1, \dots, k$

3. Stationarity (zero gradient)

$\nabla_x \mathcal{L}(x, \alpha^*, \beta^*) \big|_{x = x^*} = 0$

Putting this together, the dual of our primal problem is

$max_{\alpha}\, min_{w, b} \mathcal{L}(w, b, \alpha) \quad s.t. \quad \alpha_i \ge 0$

For the inner problem $min_{w, b} \mathcal{L}(w, b, \alpha)$, take the partial derivatives with respect to $w$ and $b$:

$\nabla_w \mathcal{L}(w, b, \alpha) = w - \sum_{i=1}^m \alpha_i y_i x_i = 0$

$\nabla_b \mathcal{L}(w, b, \alpha) = - \sum_{i=1}^m \alpha_i y_i = 0$

which gives

$w = \sum_{i=1}^m \alpha_i y_i x_i$

$\sum_{i=1}^m \alpha_i y_i = 0$

Substituting these two identities back into the dual problem (the term involving $b$ drops out because $\sum_{i=1}^m \alpha_i y_i = 0$) gives

$max_{\alpha}\ -\frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j x_i^T x_j + \sum_{i=1}^m \alpha_i$
$s.t. \quad \sum_{i=1}^m \alpha_i y_i = 0, \quad \alpha_i \ge 0$

This maximization problem is equivalent to the following minimization problem:

$min_{\alpha}\ \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j x_i^T x_j - \sum_{i=1}^m \alpha_i$
$s.t. \quad \sum_{i=1}^m \alpha_i y_i = 0, \quad \alpha_i \ge 0$
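
For later reference, the dual also has a compact matrix form. Writing $Q_{ij} = y_i y_j x_i^T x_j$ (i.e. $Q = \mathrm{diag}(y)\, X X^T\, \mathrm{diag}(y)$, a Gram matrix and hence positive semidefinite), the problem reads

$min_{\alpha}\ \frac{1}{2} \alpha^T Q \alpha - \mathbf{1}^T \alpha \quad s.t. \quad y^T \alpha = 0, \quad \alpha \ge 0$

so the dual is a quadratic program in $\alpha$, which is what generic QP solvers (and the cvxpy code in Section V) exploit.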

2. Solution procedure

Solve the above problem for the optimal $\alpha^*$ (e.g. with the SMO algorithm).

Then, by the KKT conditions, we have

stationarity:

$w^* = \sum_{i=1}^m \alpha_i^* y_i x_i, \qquad \sum_{i=1}^m \alpha_i^* y_i = 0$

complementary slackness:

$\alpha_i^* \left( y_i(w^{*T} x_i + b^*) - 1 \right) = 0, \quad i = 1, \dots, m$

Suppose every $\alpha_i^* = 0$; then by the above $w^* = 0$, but $w^* = 0$ is not a solution of the primal problem, so the assumption fails and there exists some $\alpha_j^* > 0$.

Now pick any $\alpha_j^* > 0$. By complementary slackness,

$y_j(w^{*T} x_j + b^*) - 1 = 0$

Since $y_j^2 = 1$, multiplying both sides by $y_j$ gives

$b^* = y_j - w^{*T} x_j = y_j - \sum_{i=1}^m \alpha_i^* y_i x_i^T x_j$

So the optimal hyperplane can be written as

$\sum_{i=1}^m \alpha_i^* y_i x_i^T x + b^* = 0$

and the decision function as

$f(x) = sign\left( \sum_{i=1}^m \alpha_i^* y_i x_i^T x + b^* \right)$

III. Soft Margin

1. Deriving the optimization problem

When the two classes are slightly mixed together and the data is not linearly separable, the original optimization problem no longer has a feasible solution. We therefore relax it to allow some error:

$min_{w, b} \frac{1}{2} w^T w + loss$

If we were to define $loss$ (for each sample) as

$loss = \begin{cases} 1, & y_i(w^T x_i + b) < 1 \\ 0, & y_i(w^T x_i + b) \ge 1 \end{cases}$

it would be discontinuous and have poor mathematical properties.

So instead we define $loss$ as

$loss = \begin{cases} 1 - y_i(w^T x_i + b), & y_i(w^T x_i + b) < 1 \\ 0, & y_i(w^T x_i + b) \ge 1 \end{cases}$

that is, $loss = \max\{0,\ 1 - y_i(w^T x_i + b)\}$ (the hinge loss).

Write the slack variable $\xi_i = \max\{0,\ 1 - y_i(w^T x_i + b)\}$. Since the samples can no longer all satisfy the original constraint $y_i(w^T x_i + b) \ge 1$, we relax it to:

$y_i(w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$

The soft-margin SVM optimization problem is therefore:

$min_{w, b, \xi} \frac{1}{2} w^T w + C \sum_{i=1}^m \xi_i$
$s.t. \quad y_i(w^T x_i + b) \ge 1 - \xi_i, \quad i = 1, 2, \dots, m$
$\qquad \xi_i \ge 0, \quad i = 1, 2, \dots, m$
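
The constrained form above is equivalent to minimizing $\frac{1}{2} w^T w + C \sum_{i=1}^m \max\{0,\ 1 - y_i(w^T x_i + b)\}$ directly. A minimal numpy sketch (the function and variable names are my own) for evaluating that unconstrained objective:

import numpy as np

def soft_margin_objective(w, b, X, y, C):
    # 0.5 * ||w||^2 + C * sum of per-sample hinge losses
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * w @ w + C * hinge.sum()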

2. Solution procedure

As before, construct the generalized Lagrangian:

$\mathcal{L}(w, b, \xi, \alpha, \beta) = \frac{1}{2} w^T w + C \sum_{i=1}^m \xi_i - \sum_{i=1}^m \alpha_i \left( y_i(w^T x_i + b) - 1 + \xi_i \right) - \sum_{i=1}^m \beta_i \xi_i$
$s.t. \quad \alpha_i \ge 0, \quad \beta_i \ge 0$

Taking the partial derivatives with respect to $w$, $b$ and $\xi_i$:

$\nabla_w \mathcal{L} = w - \sum_{i=1}^m \alpha_i y_i x_i = 0$

$\nabla_b \mathcal{L} = - \sum_{i=1}^m \alpha_i y_i = 0$

$\nabla_{\xi_i} \mathcal{L} = C - \alpha_i - \beta_i = 0$

As before, substituting these into the dual problem $max_{\alpha, \beta}\, min_{w, b, \xi} \mathcal{L}$ gives

$max_{\alpha}\ -\frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j x_i^T x_j + \sum_{i=1}^m \alpha_i$
$s.t. \quad \sum_{i=1}^m \alpha_i y_i = 0, \quad C - \alpha_i - \beta_i = 0, \quad \alpha_i \ge 0, \quad \beta_i \ge 0$

From $C - \alpha_i - \beta_i = 0$ and $\beta_i \ge 0$ we get $C - \alpha_i \ge 0$, i.e. $\alpha_i \le C$.

So the dual problem can be written as:

$max_{\alpha}\ -\frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j x_i^T x_j + \sum_{i=1}^m \alpha_i$
$s.t. \quad \sum_{i=1}^m \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C$

Solve this problem for the optimal $\alpha^*$.

As before, by the KKT conditions, pick any $\alpha_j^*$ with $0 < \alpha_j^* < C$ (so that $\beta_j^* = C - \alpha_j^* > 0$, which forces $\xi_j^* = 0$ and places $x_j$ exactly on the margin). We obtain

$w^* = \sum_{i=1}^m \alpha_i^* y_i x_i, \qquad b^* = y_j - \sum_{i=1}^m \alpha_i^* y_i x_i^T x_j$

IV. Kernel Functions

When the samples are not linearly separable, we can map the low-dimensional data into a higher-dimensional space and look for a separating hyperplane there.

The dual problem only needs inner products between pairs of samples, but when the feature dimension is very large, computing these inner products explicitly is expensive.

So we introduce a kernel function that directly computes the inner product of two samples after mapping them into the high-dimensional space:

$K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$
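
For example, two commonly used kernels are the polynomial kernel and the Gaussian (RBF) kernel:

$K(x_i, x_j) = (x_i^T x_j + c)^d, \qquad K(x_i, x_j) = \exp\left( -\gamma \| x_i - x_j \|^2 \right)$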

The objective of the dual problem can then be written as:

$W(\alpha) = \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^m \alpha_i$

The optimal separating hyperplane can be written as:

$\sum_{i=1}^m \alpha_i^* y_i K(x_i, x) + b^* = 0$

and the decision function as:

$f(x) = sign\left( \sum_{i=1}^m \alpha_i^* y_i K(x_i, x) + b^* \right)$
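
As a small illustration, here is how the kernelized decision function could be evaluated in numpy, given a solved $\alpha^*$ and $b^*$. The function names, the choice of RBF kernel and its default $\gamma$ are my own picks for this sketch; the same kernel callable could also be passed to the KF argument of the implementation in Section V:

import numpy as np

def rbf_kernel(x1, x2, gamma=0.05):
    # Gaussian (RBF) kernel; gamma chosen arbitrarily for illustration
    return np.exp(-gamma * np.linalg.norm(x1 - x2) ** 2)

def kernel_decision(x, X_train, y_train, alpha_star, b_star, kernel=rbf_kernel):
    # f(x) = sign( sum_i alpha_i^* y_i K(x_i, x) + b^* )
    s = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha_star, y_train, X_train))
    return np.sign(s + b_star)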

V. Code Implementation

1. Cvxpy implementation

import copy
from random import choice

import numpy as np
import cvxpy as cp
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


class SVM:
    def __init__(self, C=None, KF=None):
        # when self.C is not None, this is the soft-margin SVM
        self.C = C
        # KF: optional kernel function K(x_i, x_j); defaults to the linear kernel
        self.KF = KF

    def K(self, i, j):
        if self.KF and callable(self.KF):
            return self.KF(self.X_train[i], self.X_train[j])
        else:
            return self.X_train[i].T @ self.X_train[j]

    def object_func(self, alpha):
        # dual objective: 0.5 * sum_ij alpha_i alpha_j y_i y_j K(x_i, x_j) - sum_i alpha_i
        # (built with an O(m^2) Python loop, which is the main bottleneck)
        quad = 0
        for i in range(self.n):
            for j in range(self.n):
                quad += alpha[i] * alpha[j] * self.y_train[i] * self.y_train[j] * self.K(i, j)
        return 0.5 * quad - cp.sum(alpha)

    def fit(self, X_train, y_train):
        self.X_train = copy.deepcopy(X_train)
        self.y_train = copy.deepcopy(y_train)
        self.n = self.X_train.shape[0]

        print("begin to construct the convex problem...")
        alpha = cp.Variable(self.n)
        objective = cp.Minimize(self.object_func(alpha))
        if self.C:
            constraint = [alpha >= 0, alpha <= self.C, self.y_train @ alpha == 0]
        else:
            constraint = [alpha >= 0, self.y_train @ alpha == 0]

        print("convex problem has been built...")
        prob = cp.Problem(objective, constraint)
        prob.solve(solver='CVXOPT')
        self.alpha_star = alpha.value
        print("dual problem has been solved!")

        # recover w and b from the KKT conditions
        self.w = np.zeros(self.X_train.shape[1])
        for i in range(self.n):
            self.w += X_train[i] * (self.alpha_star[i] * y_train[i])

        # pick a support vector to compute b; eps is a numerical tolerance,
        # since the solver returns tiny nonzero values instead of exact zeros
        eps = 1e-6
        if self.C:
            S_with_idx = [(alpha_star_i, idx)
                          for idx, alpha_star_i in enumerate(self.alpha_star)
                          if eps < alpha_star_i < self.C - eps]
        else:
            S_with_idx = [(alpha_star_i, idx)
                          for idx, alpha_star_i in enumerate(self.alpha_star)
                          if alpha_star_i > eps]

        (_, s) = choice(S_with_idx)
        self.b = y_train[s]
        for i in range(self.n):
            self.b -= self.alpha_star[i] * y_train[i] * self.K(i, s)

    def pred(self, x):
        if self.KF and callable(self.KF):
            # f(x) = sum_i alpha_i^* y_i K(x_i, x) + b
            y = 0.0
            for i in range(self.n):
                y += self.alpha_star[i] * self.y_train[i] * self.KF(self.X_train[i], x)
            y += self.b
            return y
        else:
            return self.w.T @ x + self.b

    def acc(self, X_test, y_test):
        y_pred = []
        for x in X_test:
            y_pred.append(np.sign(self.pred(x)))
        y_pred = np.array(y_pred)
        return accuracy_score(y_test, y_pred)

Load the dataset and run the SVM instances:

X, y = datasets.load_digits(return_X_y=True)
# X, y = datasets.load_breast_cancer(return_X_y=True)
# X, y = datasets.load_wine(return_X_y=True)
# X, y = datasets.load_iris(return_X_y=True)


# turn the multi-class labels into a binary +1 / -1 problem
y = np.where(y == 1, y, -1)

print(X.shape, y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

classicSVM = SVM()
classicSVM.fit(X_train, y_train)
print("acc of classic SVM: ", classicSVM.acc(X_test, y_test))

C = 0.1
softSVM = SVM(C=C)
softSVM.fit(X_train, y_train)
print("acc of soft SVM: ", softSVM.acc(X_test, y_test))


# search for the best C
# for i in range(-10, 10):
#     C = pow(10, i)
#     softSVM = SVM(C=C)
#     softSVM.fit(X_train, y_train)
#     print("when C = %e, acc of softSVM: %.4f"
#           % (C, softSVM.acc(X_test, y_test)))

No SMO here... it runs painfully slowly, so I gave up on that.
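
Much of the cost likely comes from building the objective with an O(m²) Python double loop of cvxpy expressions rather than from the solver itself. Here is a hedged sketch of a vectorized alternative for the linear kernel, using the matrix form of the dual noted in Section II (the function and variable names are my own, and this only covers the linear kernel):

import numpy as np
import cvxpy as cp

def solve_dual_vectorized(X_train, y_train, C=None):
    # same dual problem as above, built as a single vectorized expression
    m = X_train.shape[0]
    alpha = cp.Variable(m)
    M = X_train * y_train[:, None]      # row i is y_i * x_i, so (M M^T)_ij = y_i y_j x_i^T x_j
    # alpha^T Q alpha = || M^T alpha ||^2, which keeps the problem DCP-compliant
    objective = cp.Minimize(0.5 * cp.sum_squares(M.T @ alpha) - cp.sum(alpha))
    constraints = [alpha >= 0, y_train @ alpha == 0]
    if C is not None:
        constraints.append(alpha <= C)  # soft-margin box constraint
    cp.Problem(objective, constraints).solve()
    return alpha.value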

2. sklearn implementation

Haha, this time I'm just calling a library.

import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = datasets.load_digits(return_X_y=True)
# X, y = datasets.load_breast_cancer(return_X_y=True)
# X, y = datasets.load_wine(return_X_y=True)
# X, y = datasets.load_iris(return_X_y=True)
y = np.where(y == 1, y, -1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# build the SVM models
svm_classic = SVC(kernel='linear')           # classic SVM
svm_soft = SVC(kernel='linear', C=0.1)       # soft-margin SVM
svm_kernel = SVC(kernel='rbf', gamma=0.001)  # SVM with a kernel function

# fit the models
svm_classic.fit(X_train, y_train)
svm_soft.fit(X_train, y_train)
svm_kernel.fit(X_train, y_train)

y_pred_classic = svm_classic.predict(X_test)
y_pred_soft = svm_soft.predict(X_test)
y_pred_kernel = svm_kernel.predict(X_test)

accuracy_classic = accuracy_score(y_test, y_pred_classic)
accuracy_soft = accuracy_score(y_test, y_pred_soft)
accuracy_kernel = accuracy_score(y_test, y_pred_kernel)

print("accuracy of classic svm: %.2f\n"
      "accuracy of soft svm: %.2f\n"
      "accuracy of kernel svm: %.2f\n"
      % (accuracy_classic, accuracy_soft, accuracy_kernel))

The results:

D:\.py\PythonProject\ml2024\svm\mySVM>python svm.py
accuracy of classic svm: 0.98
accuracy of soft svm: 0.99
accuracy of kernel svm: 1.00