
The Complete Hands-On Machine Learning Handbook


1. Introduction: Overview, Value, Use Cases, Prerequisites, Installation and Verification

1.1 What Is Machine Learning

Key ideas
Machine learning is the discipline of having models learn patterns from data and use those patterns for prediction or decision-making. The core skill is not memorizing algorithms but building the full loop: data -> features -> model -> evaluation -> optimization -> deployment.

1.2 Why It Matters

  • Data-driven decisions: use historical data to guide business actions
  • Automated modeling: reduce the cost of maintaining hand-written rules
  • Predictive power: estimate future trends and risks

1.3 Use Cases

  • Data mining and user profiling
  • Image recognition and computer vision
  • Natural language processing (text classification, sentiment analysis)
  • Recommender systems
  • Financial risk control and fraud detection
  • Computer-aided medical diagnosis

1.4 Prerequisites

  • Python basics (functions, classes, file handling)
  • Linear algebra basics (vectors, matrices, linear transformations)
  • Probability and statistics basics (distributions, mean and variance, hypothesis testing)

1.5 Installing the Core Tools

Install with pip

bash
pip install numpy pandas matplotlib scikit-learn xgboost lightgbm

Install from a mirror (Tsinghua PyPI mirror)

bash
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple numpy pandas matplotlib scikit-learn xgboost lightgbm

1.6 Verifying the Environment

python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn

print("numpy:", np.__version__)
print("pandas:", pd.__version__)
print("sklearn:", sklearn.__version__)

x = np.array([1, 2, 3, 4])
print("mean:", x.mean())
# Expected: prints the library versions and the mean, confirming the environment works

2. Getting Started: Core Concepts, Basic Workflow, Tooling, Common Pitfalls

2.1 Core Concepts

Key ideas

  • Supervised learning: labeled data; classification/regression
  • Unsupervised learning: unlabeled data; clustering/dimensionality reduction
  • Semi-supervised learning: a few labeled samples plus many unlabeled ones
  • Reinforcement learning: learn a policy from rewards by interacting with an environment
  • Features: the input variables (X)
  • Label: the target variable (y)
  • Train/validation/test sets: used for training, tuning, and final evaluation respectively
  • Overfitting: good on training data, poor on test data
  • Underfitting: poor on both
  • Generalization: how well the model performs on unseen data
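The three-way train/validation/test split above can be sketched with two calls to train_test_split (a minimal sketch; the 60/20/20 ratio is just an illustrative choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split

np.random.seed(0)
X = np.random.randn(100, 4)
y = np.random.randint(0, 2, size=100)

# First carve off the held-out test set (20%)...
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# ...then split the remaining 80% into 60% train / 20% validation
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The test set is set aside first, so nothing about it can influence training or tuning.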

2.2 The Basic ML Workflow

Key ideas
The standard pipeline: data collection -> preprocessing -> feature engineering -> model selection -> training -> evaluation -> optimization -> deployment

Minimal runnable end-to-end example

python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1) Generate synthetic data
np.random.seed(42)
X = np.random.randn(500, 6)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# 2) Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3) Preprocess
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 4) Build and train the model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# 5) Evaluate
pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))

2.3 Basic Tooling

2.3.1 NumPy (data processing)

python
import numpy as np

a = np.array([[1, 2], [3, 4]])
print("shape:", a.shape)
print("mean:", a.mean())

2.3.2 Pandas (loading and cleaning)

python
import pandas as pd

df = pd.DataFrame({"age": [18, 20, None, 25], "score": [88, 92, 95, None]})
df["age"] = df["age"].fillna(df["age"].median())
df["score"] = df["score"].fillna(df["score"].mean())
print(df)

2.3.3 Matplotlib (visualization)

python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [2, 4, 3, 5]
plt.plot(x, y, marker="o")
plt.title("Simple Trend")
plt.show()

2.3.4 Scikit-learn (quick modeling)

python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X, y)
print("train score:", clf.score(X, y))

2.4 Common Beginner Pitfalls

  • Looking only at accuracy and ignoring recall/F1/AUC
  • Standardizing before splitting the data (split first; fitting the scaler on the full dataset leaks test information)
  • Assuming a model is production-ready because it performs well on the training set
  • Using accuracy as the only metric on imbalanced classes
  • Doing feature engineering that is disconnected from the business context
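The accuracy-only pitfall is easy to demonstrate: on a 95:5 imbalanced label set, a degenerate model that always predicts the majority class still scores 95% accuracy while catching zero positives (toy numbers, for illustration only):

```python
from sklearn.metrics import accuracy_score, recall_score, f1_score

# 95 negatives, 5 positives; a "model" that always predicts class 0
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print("accuracy:", accuracy_score(y_true, y_pred))              # 0.95
print("recall:", recall_score(y_true, y_pred, zero_division=0))  # 0.0
print("f1:", f1_score(y_true, y_pred, zero_division=0))          # 0.0
```

Accuracy looks great; recall and F1 reveal the model is useless on the minority class.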

3. Core Features and Basic Algorithms (Supervised Learning)

3.1 Regression

3.1.1 Linear Regression

How it works
The goal is to fit the linear relationship ŷ = wᵀx + b.

python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X, y = make_regression(n_samples=400, n_features=5, noise=10, random_state=42)
model = LinearRegression()
model.fit(X, y)
pred = model.predict(X)
print("MSE:", mean_squared_error(y, pred))
print("R2:", r2_score(y, pred))

3.1.2 Polynomial Regression

python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = X[:, 0] ** 2 + 0.5 * np.random.randn(200)

poly_model = Pipeline([
    ("poly", PolynomialFeatures(degree=2)),
    ("lr", LinearRegression())
])
poly_model.fit(X, y)
print("poly train score:", poly_model.score(X, y))

3.1.3 Ridge / Lasso Regression

python
from sklearn.linear_model import Ridge, Lasso

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.01).fit(X, y)

Parameters

  • alpha: regularization strength; larger values constrain the weights more
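To see alpha in action, the sketch below (synthetic data, illustrative alpha values) compares fits as alpha grows: Ridge shrinks the weights toward zero, while Lasso drives some of them exactly to zero:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# 8 features, only 3 of which actually drive the target
X, y = make_regression(n_samples=200, n_features=8, n_informative=3, noise=5, random_state=42)

results = {}
for alpha in [0.1, 10.0, 1000.0]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    lasso = Lasso(alpha=alpha).fit(X, y)
    # total Ridge weight magnitude, and how many Lasso weights are exactly zero
    results[alpha] = (np.abs(ridge.coef_).sum(), int((lasso.coef_ == 0).sum()))
    print(f"alpha={alpha}: ridge |w| sum={results[alpha][0]:.1f}, lasso zeroed={results[alpha][1]}/8")
```

As alpha increases, the Ridge weight-magnitude sum shrinks and Lasso zeroes out more coefficients, which is why Lasso doubles as a feature selector.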

3.1.4 Logistic Regression (classification)

python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
clf = LogisticRegression(max_iter=300, C=1.0)
clf.fit(X, y)
print("score:", clf.score(X, y))

Parameters

  • C: inverse of regularization strength; larger C means weaker regularization
  • max_iter: maximum number of solver iterations

3.2 Classification

3.2.1 Decision Tree

python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X, y)

3.2.2 Random Forest

python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=42)
rf.fit(X, y)

3.2.3 SVM

python
from sklearn.svm import SVC

svm = SVC(C=1.0, kernel="rbf", probability=True, random_state=42)
svm.fit(X, y)

3.2.4 KNN

python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

3.2.5 Naive Bayes

python
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X, y)

3.3 Evaluation Metrics

3.3.1 Regression Metrics

python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

y_true = np.array([3.0, 2.5, 4.0])
y_pred = np.array([2.8, 2.7, 3.9])
print("MAE:", mean_absolute_error(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))
print("RMSE:", mean_squared_error(y_true, y_pred) ** 0.5)
print("R2:", r2_score(y_true, y_pred))

3.3.2 Classification Metrics

python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
y_prob = [0.1, 0.9, 0.4, 0.2, 0.8]

print("Acc:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_prob))

3.4 Data Preprocessing

python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({
    "age": [18, 22, None, 30],
    "city": ["BJ", "SH", "BJ", "SZ"],
    "income": [5000, 8000, 7000, None],
    "label": [0, 1, 1, 0],
})

# Fill missing values
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].mean())

# Encode categorical features
df = pd.get_dummies(df, columns=["city"], drop_first=True)

X = df.drop(columns=["label"])
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

4. Advanced Algorithms and Techniques: Unsupervised Learning, Feature Engineering, Model Optimization, Reinforcement Learning Basics

4.1 Unsupervised Learning

4.1.1 K-Means

python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=400, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)
print("cluster centers shape:", kmeans.cluster_centers_.shape)
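A natural follow-up question is how to choose n_clusters. The silhouette coefficient is one quick check (a sketch on the same kind of synthetic blobs; higher is better, and the true k=4 should stand out):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=4, random_state=42)

# Score each candidate cluster count; silhouette ranges from -1 to 1
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")
```

In practice this check is paired with the elbow method on inertia and with domain knowledge about how many segments are actionable.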

4.1.2 Hierarchical Clustering / DBSCAN

python
from sklearn.cluster import AgglomerativeClustering, DBSCAN

agg = AgglomerativeClustering(n_clusters=4)
db = DBSCAN(eps=0.5, min_samples=5)

4.1.3 Dimensionality Reduction with PCA

python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_)

4.2 Feature Engineering, Continued

python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import PolynomialFeatures

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Feature selection: keep the 5 features most associated with the label
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

# Feature crossing: quadratic and pairwise interaction terms on the selected features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_cross = poly.fit_transform(X_selected)

Tip

  • Filter down to strongly related features first, then build crosses; this limits the risk of dimensionality blow-up.

4.3 Model Optimization

4.3.1 Grid Search / Random Search

python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

param_grid = {"n_estimators": [100, 200], "max_depth": [5, 10]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3, scoring="f1")
grid.fit(X, y)
print("best params:", grid.best_params_)

4.3.2 Ensembles (Bagging / Boosting / Stacking)

python
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression

bag = BaggingClassifier(n_estimators=50, random_state=42)
gbdt = GradientBoostingClassifier(random_state=42)
rf = RandomForestClassifier(n_estimators=200, random_state=42)

stack = StackingClassifier(
    estimators=[("rf", rf), ("gbdt", gbdt)],
    final_estimator=LogisticRegression(max_iter=200)
)

4.4 Reinforcement Learning Basics (Q-Learning)

python
import numpy as np

n_states, n_actions = 6, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def step(state, action):
    next_state = min(n_states - 1, state + 1) if action == 1 else max(0, state - 1)
    reward = 1 if next_state == n_states - 1 else 0
    done = next_state == n_states - 1
    return next_state, reward, done

for _ in range(200):
    s = 0
    done = False
    while not done:
        a = np.random.randint(n_actions) if np.random.rand() < epsilon else np.argmax(Q[s])
        ns, r, done = step(s, a)
        Q[s, a] += alpha * (r + gamma * np.max(Q[ns]) - Q[s, a])
        s = ns

print("Q-table:\n", Q)

5. Hands-On Scenarios: Regression, Classification, Clustering, Feature Engineering, Deployment

5.1 Regression in Practice (house-price prediction template)

python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1)
model.fit(X_train, y_train)
pred = model.predict(X_test)

print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)
print("R2:", r2_score(y_test, pred))

5.2 Classification in Practice (churn-prediction template)

python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=300, random_state=42)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(classification_report(y_test, pred))

5.3 Clustering in Practice (customer segmentation)

python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "age": [20, 22, 45, 47, 26, 28, 50, 52],
    "spend": [200, 220, 800, 780, 260, 250, 900, 880],
})

X = StandardScaler().fit_transform(df)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
df["cluster"] = kmeans.fit_predict(X)
print(df)

5.4 Feature Engineering in Practice (boosting model performance)

python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("clf", LogisticRegression(max_iter=500))
])

# Reuses X_train/y_train/X_test/y_test from the churn example in 5.2
pipe.fit(X_train, y_train)
print("pipeline score:", pipe.score(X_test, y_test))

5.5 Deployment Basics (save, load, simple API)

python
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

# Save the trained model (clf, from 5.2)
joblib.dump(clf, "churn_model.joblib")

# Load it back
loaded_model = joblib.load("churn_model.joblib")

app = FastAPI(title="ML Inference API")

class InputData(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(data: InputData):
    pred = loaded_model.predict([data.features])[0]
    return {"prediction": int(pred)}

# Run with: uvicorn main:app --reload

6. Going Deeper: Advanced Algorithms, Complex Scenarios, Custom Models, Engineering

6.1 Advanced Algorithms

python
# pip install xgboost lightgbm catboost
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

xgb = XGBClassifier(n_estimators=300, learning_rate=0.05, max_depth=6, random_state=42)
lgb = LGBMClassifier(n_estimators=300, learning_rate=0.05, random_state=42)
cat = CatBoostClassifier(iterations=300, learning_rate=0.05, depth=6, verbose=0)

Comparison

  • XGBoost: stable and mature
  • LightGBM: fast; well suited to large datasets
  • CatBoost: handles categorical features natively

6.2 Handling Complex Scenarios

  • Imbalanced data: class_weight, SMOTE, threshold moving
  • High-dimensional data: PCA, feature selection, sparsification
  • Time series: sliding-window features, time-based validation splits
  • Multi-task learning: shared features with multiple output heads
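Two of the imbalance tactics above, class_weight and threshold moving, can be sketched together (synthetic data; the 0.3 threshold is an arbitrary illustrative value, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# 90:10 imbalanced binary problem
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# class_weight="balanced" upweights the minority class during training
clf = LogisticRegression(max_iter=500, class_weight="balanced").fit(X_train, y_train)

# Threshold moving: classify as positive above 0.3 instead of the default 0.5
proba = clf.predict_proba(X_test)[:, 1]
pred_default = (proba >= 0.5).astype(int)
pred_moved = (proba >= 0.3).astype(int)
print("recall@0.5:", recall_score(y_test, pred_default))
print("recall@0.3:", recall_score(y_test, pred_moved))
```

Lowering the threshold can only raise recall (it flags more positives); the cost is lower precision, so the threshold should be tuned against the business trade-off.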

6.3 Custom Models and Algorithms (sklearn estimators)

python
from sklearn.base import BaseEstimator, ClassifierMixin
import numpy as np

class MeanThresholdClassifier(BaseEstimator, ClassifierMixin):
    """Toy classifier: predict 1 when a row's mean exceeds the global training mean."""

    def fit(self, X, y):
        self.threshold_ = X.mean()  # learn a single global threshold
        return self

    def predict(self, X):
        return (X.mean(axis=1) > self.threshold_).astype(int)
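Because the estimator inherits from BaseEstimator and ClassifierMixin, it plugs directly into standard sklearn utilities such as cross_val_score. A usage sketch (the class is repeated here so the snippet runs standalone):

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import cross_val_score

class MeanThresholdClassifier(BaseEstimator, ClassifierMixin):
    def fit(self, X, y):
        self.threshold_ = X.mean()  # single global threshold learned from training data
        return self

    def predict(self, X):
        return (X.mean(axis=1) > self.threshold_).astype(int)

np.random.seed(42)
X = np.random.randn(100, 4)
y = (X.mean(axis=1) > 0).astype(int)  # labels the toy rule can actually recover

# ClassifierMixin supplies .score (accuracy), so cross-validation just works
scores = cross_val_score(MeanThresholdClassifier(), X, y, cv=3)
print("cv accuracy:", scores.mean())
```

Following the sklearn conventions (learned attributes end in `_`, `fit` returns `self`, hyperparameters go in `__init__`) is what makes cloning, pipelines, and grid search work with a custom estimator.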

6.4 ML Engineering

  • Batch processing: offline training plus scheduled retraining
  • Model monitoring: watch for feature-distribution drift and metric drift in production
  • Model iteration: A/B testing, canary releases
  • Data pipelines: feature computation, sample backflow, automated evaluation

7. Core Tools and Resources: Parameter Reference, Templates, Learning Resources

7.1 Parameter Quick Reference

Scikit-learn

  • train_test_split(test_size, random_state, stratify)
  • GridSearchCV(cv, scoring, n_jobs)
  • Pipeline(steps)
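The stratify argument deserves a concrete example: it keeps the class ratio identical across both splits, which matters on imbalanced data (toy numbers):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 80 negatives, 20 positives (a 20% positive rate)
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
print("train positive ratio:", y_tr.mean())  # 0.2
print("test positive ratio:", y_te.mean())   # 0.2
```

Without stratify, a small split can end up with a skewed (or even empty) minority class in the test set.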

XGBoost

  • n_estimators: number of trees
  • max_depth: maximum tree depth
  • learning_rate: learning rate (step shrinkage)
  • subsample / colsample_bytree: row and column sampling ratios

LightGBM

  • num_leaves: number of leaves
  • max_depth: depth limit
  • min_child_samples: minimum samples per leaf

7.2 Reusable Templates

Template 1: classification training

python
def train_classifier(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return pred

Template 2: regression training

python
def train_regressor(model, X_train, y_train, X_test):
    model.fit(X_train, y_train)
    return model.predict(X_test)

Template 3: grid search

python
from sklearn.model_selection import GridSearchCV

def tune_model(model, param_grid, X, y):
    gs = GridSearchCV(model, param_grid, cv=5, scoring="f1", n_jobs=-1)
    gs.fit(X, y)
    return gs.best_estimator_, gs.best_params_

7.3 Recommended Resources

  • Classic books: 统计学习方法 (Statistical Learning Methods), Hands-On Machine Learning
  • Official docs: the Scikit-learn documentation
  • Dataset platforms: Kaggle, UCI
  • Open source: the awesome-ml collections on GitHub

8. FAQ and Pitfall Guide

8.1 Frequent Problems (with fixes)

Problem 1: training does not converge

Broken example

python
model = LogisticRegression(max_iter=10)  # too few iterations

Fixed example

python
model = LogisticRegression(max_iter=500, solver="lbfgs")

Why: too few iterations or unstandardized features can keep the solver from converging.

Problem 2: poor generalization (overfitting)

Fixes

  • Increase regularization
  • Reduce model complexity
  • Add training data or use data augmentation
  • Use cross-validation

Problem 3: hyperparameter tuning has no effect

Common causes

  • Search space too narrow
  • Metric mismatched with the business objective
  • Flawed validation setup

Problem 4: data leakage

Wrong: call fit_transform on the full dataset, then split.
Right: split first; fit on the training set and only transform the test set.
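In code, the correct order looks like this (a minimal sketch with synthetic data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
X = np.random.randn(200, 5)
y = (X[:, 0] > 0).astype(int)

# Wrong (leakage): StandardScaler().fit_transform(X) on all data, then split --
# the scaler's mean/std would already contain test-set information.

# Right: split first, fit the scaler on the training set only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit on training data only
X_test = scaler.transform(X_test)        # apply, never re-fit
print(X_train.shape, X_test.shape)
```

Wrapping the scaler and model in a Pipeline enforces this order automatically, including inside cross-validation.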

8.2 Advanced Tips

  • Prevent leakage: split data strictly along time and business-process boundaries
  • Prevent bias: evaluate per segment (gender, region, age group, etc.)
  • Algorithm choice: start with a baseline model before reaching for complex ones
  • Compute: sample sensibly, parallelize, and reduce dimensionality

9. Case Studies (five complete examples)

Case 1: house-price prediction (regression)

Goal: predict house prices.
Flow: load data -> handle missing values -> standardize features -> fit a regressor -> evaluate with RMSE/R² -> save the model.
Takeaway: compare the performance of linear regression vs. random-forest regression.

Case 2: spam detection (classification)

Goal: flag spam emails.
Flow: vectorize text (TF-IDF) -> Naive Bayes / logistic regression -> evaluate with F1/AUC.
Takeaway: observe the precision/recall trade-off.
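The flow in case 2 can be sketched end to end on a toy corpus (the example messages and labels are invented purely for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = [
    "win a free prize now", "limited offer click here",
    "meeting at noon tomorrow", "please review the attached report",
    "free cash win now", "lunch with the team today",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = spam, 0 = ham

# TF-IDF vectorization feeding a multinomial Naive Bayes classifier
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
pipe.fit(texts, labels)

preds = pipe.predict(["free prize click now", "see you at the meeting"])
print(preds)  # -> [1 0]
```

On real data, the same pipeline would be split into train/test sets and scored with F1/AUC as the case description says.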

Case 3: customer segmentation (clustering)

Goal: group users by spending behavior.
Flow: standardize data -> KMeans clustering -> silhouette-score evaluation -> visualize the clusters.
Takeaway: produce a profile per segment to support marketing strategy.

Case 4: a feature-engineering uplift project

Goal: improve a mediocre baseline model.
Flow: feature selection + feature crossing + hyperparameter tuning -> compare before/after metrics.
Takeaway: quantify how feature engineering improves generalization.

Case 5: deploying a model as a local API

Goal: let frontend/business systems call the model.
Flow: save the model -> wrap it with FastAPI -> test the deployment locally.
Takeaway: a complete, deliverable inference-service loop.


10. Environment and Dependencies

10.1 Recommended Versions

  • Python 3.9+
  • numpy 1.24+
  • pandas 2.0+
  • matplotlib 3.7+
  • scikit-learn 1.3+
  • xgboost / lightgbm / catboost (optional)

10.2 One-Command Install

bash
pip install numpy pandas matplotlib scikit-learn xgboost lightgbm catboost fastapi uvicorn joblib

10.3 How to Run

  1. Install the dependencies
  2. Run the example scripts
  3. Check the printed metrics and plots
  4. Save a model and start the API to verify inference

11. Summary

Key takeaways

  • Machine learning is not "pick an algorithm and done"; it is an end-to-end systems effort.
  • What usually caps your results: data quality + feature engineering + evaluation methodology.
  • Start from an interpretable baseline model, then layer in complex models and engineering capability.