With a HuberRegressor model, you can decide which samples count as outliers by setting a threshold on the absolute residuals. The steps are as follows:
from sklearn.linear_model import HuberRegressor
import numpy as np
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3], [3, 4], [4, 5], [5, 6], [6, 7]])
y = np.array([1, 2, 2, 3, 4, 5, 6, 7])
outliers_fraction = 0.001  # expected fraction of outliers
n_samples = len(X)
# int(0.001 * 8) is 0, and sorted_res[-0] is sorted_res[0] (the smallest residual),
# which would flag every sample; round up and keep at least one.
n_outliers = max(1, int(np.ceil(outliers_fraction * n_samples)))
model = HuberRegressor()
model.fit(X, y)
residuals = np.abs(y - model.predict(X))
sorted_res = np.sort(residuals)
threshold = sorted_res[-n_outliers]
outliers_index = np.where(residuals >= threshold)
print("Outlier indices:", outliers_index)
The complete code example, with comments, is as follows:
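from sklearn.linear_model import HuberRegressor
import numpy as np
# Build the toy dataset
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3], [3, 4], [4, 5], [5, 6], [6, 7]])
y = np.array([1, 2, 2, 3, 4, 5, 6, 7])
# Number of samples to flag, derived from the expected outlier fraction
outliers_fraction = 0.001  # expected fraction of outliers
n_samples = len(X)
# int(0.001 * 8) is 0, and sorted_res[-0] is sorted_res[0] (the smallest residual),
# which would flag every sample; round up and keep at least one.
n_outliers = max(1, int(np.ceil(outliers_fraction * n_samples)))
# Train the HuberRegressor model
model = HuberRegressor()
model.fit(X, y)
# Absolute residual of every sample
residuals = np.abs(y - model.predict(X))
# Sort the residuals in ascending order
sorted_res = np.sort(residuals)
# Pick the threshold: the n_outliers-th largest residual
threshold = sorted_res[-n_outliers]
# Indices of samples whose residual reaches the threshold
outliers_index = np.where(residuals >= threshold)
# Print the outlier indices
print("Outlier indices:", outliers_index)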
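The printed value is a tuple holding an array of zero-based indices, for example (array([7]),). Index 7 refers to the eighth (last) sample, not the seventh. Note that in this toy dataset y is exactly equal to the second column of X, so all residuals should be close to zero and whichever sample is flagged reflects numerical noise rather than a genuine outlier.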
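The threshold-selection step can also be looked at separately from the model. The following is a minimal sketch, not part of the original answer: the helper name `flag_largest_residuals` and the `example_residuals` values are hypothetical, and the helper assumes a non-empty array of residuals that has already been computed (for instance as `y - model.predict(X)`). It rounds the outlier count up and clamps it to at least one, which avoids the case where a count of 0 makes `sorted_res[-0]` select the smallest residual.

```python
import math

import numpy as np


def flag_largest_residuals(residuals, fraction):
    """Return the indices of the samples with the largest absolute residuals.

    `fraction` is the assumed share of outliers. At least one sample is
    always flagged, and at most all of them.
    """
    residuals = np.abs(np.asarray(residuals, dtype=float))
    # Round up and clamp to [1, n_samples] so the count is never 0:
    # np.sort(residuals)[-0] would be the smallest residual, not the largest.
    n_outliers = min(residuals.size, max(1, math.ceil(fraction * residuals.size)))
    threshold = np.sort(residuals)[-n_outliers]
    # Ties at the threshold are all kept, so more than n_outliers
    # indices can be returned.
    return np.flatnonzero(residuals >= threshold)


# Hypothetical residuals: the sample at index 3 lies far from the fitted line.
example_residuals = [0.1, -0.2, 0.05, 4.0, 0.15, -0.1, 0.3, 0.2]
flagged = flag_largest_residuals(example_residuals, fraction=0.001)
print("Outlier indices:", flagged)  # Outlier indices: [3]
```

This criterion always flags a fixed share of the data, even when no sample is really anomalous. A fitted HuberRegressor also exposes a boolean `outliers_` mask, which marks samples whose absolute residual exceeds `epsilon * scale_`; that is a different rule from the fraction-based threshold used here, and the two need not agree.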