草根之明

Management and technology blog

Data Normalization

Solution: map all of the data onto the same scale.

Min-max normalization: map all of the data into the range 0 to 1.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

Best suited to data whose distribution has clear bounds; it is strongly affected by outliers.
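
To illustrate the outlier sensitivity, here is a minimal sketch with made-up values. The array `values` and the helper `min_max_scale` are hypothetical names used only for this illustration. A single extreme value squeezes all the other points into a narrow band near 0.

```python
import numpy as np


def min_max_scale(v):
    """Map a 1-D array into [0, 1] using its own min and max."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min())


# Made-up values with clear bounds: the result spreads evenly over [0, 1]
values = np.array([10, 20, 30, 40, 50])
print(min_max_scale(values))

# The same values plus one outlier: the original points are compressed
# into a small interval near 0, and only the outlier maps to 1
with_outlier = np.append(values, 1000)
print(min_max_scale(with_outlier))
```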

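import numpy as np
import matplotlib.pyplot as plt

# Min-max normalization of a 1-D array
x = np.random.randint(0, 100, size=100)
jg = (x - np.min(x)) / (np.max(x) - np.min(x))
print(jg)

# Min-max normalization of a 2-D array, one column (feature) at a time.
# Convert to float first: an integer array would truncate the scaled values.
X = np.random.randint(0, 100, (50, 2))
X = np.array(X, dtype=float)
X[:, 0] = (X[:, 0] - np.min(X[:, 0])) / (np.max(X[:, 0]) - np.min(X[:, 0]))
X[:, 1] = (X[:, 1] - np.min(X[:, 1])) / (np.max(X[:, 1]) - np.min(X[:, 1]))
print(X)

# Both features now lie in [0, 1]
plt.scatter(X[:, 0], X[:, 1])
plt.show()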
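
The same scaling is available through scikit-learn's `MinMaxScaler`, the class documented at the URL above. The following is a minimal sketch, assuming scikit-learn is installed; the variable names and the random seed are illustrative choices, not part of the original example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)  # fixed seed, chosen arbitrarily
X = rng.randint(0, 100, (50, 2)).astype(float)

# Manual min-max normalization, computed per column with axis=0
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# MinMaxScaler learns each column's min and max in fit() and applies
# the same mapping in transform(); the default feature_range is (0, 1)
scaler = MinMaxScaler()
X_sklearn = scaler.fit_transform(X)

# Check that both approaches agree up to floating-point error
print(np.allclose(X_manual, X_sklearn))
```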

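# Mean
# np.mean(X[:, 0])
# Standard deviation (the square root of the variance): measures how far the data deviate from the mean.
# np.std(X[:, 0])

import numpy as np
import matplotlib.pyplot as plt

# Mean-variance normalization: for each column (feature),
# subtract the mean and divide by the standard deviation
X2 = np.random.randint(0, 100, (50, 2))
X2 = np.array(X2, dtype=float)
X2[:, 0] = (X2[:, 0] - np.mean(X2[:, 0])) / np.std(X2[:, 0])
X2[:, 1] = (X2[:, 1] - np.mean(X2[:, 1])) / np.std(X2[:, 1])

plt.scatter(X2[:, 0], X2[:, 1])
plt.show()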
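
As a quick sanity check, after mean-variance normalization each column should have a mean close to 0 and a standard deviation close to 1. The sketch below uses illustrative variable names and an arbitrary seed, and writes the per-column computation in one step with `axis=0`.

```python
import numpy as np

rng = np.random.RandomState(42)  # arbitrary seed for repeatability
X2 = rng.randint(0, 100, (50, 2)).astype(float)

# Standardize every column at once: axis=0 gives per-column statistics
X2_std = (X2 - X2.mean(axis=0)) / X2.std(axis=0)

print(X2_std.mean(axis=0))  # each entry is close to 0
print(X2_std.std(axis=0))   # each entry is close to 1
```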

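The test set has to be normalized with the same statistics as the training set, so the mean and standard deviation computed from the training set must be saved.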
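
A minimal sketch of reusing the training statistics, with made-up data and illustrative names (`train`, `test`, `train_mean`, `train_std`): the statistics are computed once on the training set and then applied to the test set, rather than being recomputed from the test data.

```python
import numpy as np

rng = np.random.RandomState(0)  # arbitrary seed
train = rng.randint(0, 100, (40, 2)).astype(float)
test = rng.randint(0, 100, (10, 2)).astype(float)

# Compute the statistics on the training set only, and keep them
train_mean = train.mean(axis=0)
train_std = train.std(axis=0)

# Apply the saved training statistics to both sets
train_scaled = (train - train_mean) / train_std
test_scaled = (test - train_mean) / train_std

# The test set is placed on the training scale, so its own mean and
# standard deviation are generally not exactly 0 and 1
print(test_scaled.mean(axis=0), test_scaled.std(axis=0))
```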

Using a Scaler in scikit-learn

from sklearn import datasets

# Load the data
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

# Mean-variance normalization
# StandardScaler is scikit-learn's preprocessing class for this
from sklearn.preprocessing import StandardScaler

standardScaler = StandardScaler()
standardScaler.fit(X_train)  # learn the statistics from the training set only
print(standardScaler.mean_)
print(standardScaler.scale_)  # per-feature standard deviation (named std_ in old scikit-learn versions)

# Normalize both sets with the training-set statistics
X_train = standardScaler.transform(X_train)
X_test_standard = standardScaler.transform(X_test)

# Classify with kNN
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)

# Score on the test set, which must be normalized the same way
res = knn_clf.score(X_test_standard, y_test)
print(res)

A custom StandardScaler

import numpy as np

class StandardScaler:
    def __init__(self):
        self.mean_ = None
        self.scale_ = None

    def fit(self, X):
        """Compute the per-feature mean and standard deviation from the training set X."""
        self.mean_ = np.array([np.mean(X[:, i]) for i in range(X.shape[1])])
        self.scale_ = np.array([np.std(X[:, i]) for i in range(X.shape[1])])

        return self

    def transform(self, X):
        """Apply mean-variance normalization to X using this StandardScaler's statistics."""
        assert self.mean_ is not None and self.scale_ is not None, \
            "must fit before transform"
        resX = np.empty(shape=X.shape, dtype=float)
        for col in range(X.shape[1]):
            resX[:, col] = (X[:, col] - self.mean_[col]) / self.scale_[col]
        return resX
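
One way to check a hand-written scaler like this against scikit-learn is to compare the two on the same data. The sketch below is an illustration under stated assumptions: `SimpleStandardScaler` is a hypothetical, vectorized stand-in for the class above (redefined so the snippet runs on its own), scikit-learn's class is imported under the alias `SklearnStandardScaler` to avoid the name clash, and the iris split mirrors the earlier example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler as SklearnStandardScaler


class SimpleStandardScaler:
    """Hypothetical vectorized stand-in for the custom scaler."""

    def fit(self, X):
        # Per-feature mean and population standard deviation (ddof=0)
        self.mean_ = np.mean(X, axis=0)
        self.scale_ = np.std(X, axis=0)
        return self

    def transform(self, X):
        return (X - self.mean_) / self.scale_


iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=666)

# Fit both scalers on the training set only
mine = SimpleStandardScaler().fit(X_train)
theirs = SklearnStandardScaler().fit(X_train)

# Both use the population standard deviation, so the transformed test
# sets should agree up to floating-point error
print(np.allclose(mine.transform(X_test), theirs.transform(X_test)))
```

Because `fit` returns `self`, fitting and storing the scaler can be written in one expression, as in the last lines of the sketch.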