NVIDIA GeForce RTX 3050: Setting Up a PyTorch Environment

Check the CUDA version

$ nvidia-smi
Wed Jan 21 16:40:58 2026       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 546.30                 Driver Version: 546.30       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3050 ...  WDDM  | 00000000:01:00.0  On |                  N/A |
| N/A   47C    P8               6W /  60W |    807MiB /  4096MiB |      6%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2804    C+G   ...4034\office6\promecefpluginhost.exe    N/A      |
|    0   N/A  N/A      2820    C+G   ...ekyb3d8bbwe\PhoneExperienceHost.exe    N/A      |
|    0   N/A  N/A      4688    C+G   ...oogle\Chrome\Application\chrome.exe    N/A      |
|    0   N/A  N/A      6344    C+G   ...__8wekyb3d8bbwe\WindowsTerminal.exe    N/A      |
|    0   N/A  N/A      6444    C+G   ...n\143.0.3650.139\msedgewebview2.exe    N/A      |
|    0   N/A  N/A     10564    C+G   ...siveControlPanel\SystemSettings.exe    N/A      |
|    0   N/A  N/A     10688    C+G   ...t.LockApp_cw5n1h2txyewy\LockApp.exe    N/A      |
|    0   N/A  N/A     11740    C+G   ...s\System32\ApplicationFrameHost.exe    N/A      |
|    0   N/A  N/A     13708    C+G   C:\Windows\explorer.exe                   N/A      |
|    0   N/A  N/A     13780    C+G   C:\Windows\System32\ShellHost.exe         N/A      |
|    0   N/A  N/A     14856    C+G   ...1\extracted\runtime\WeChatAppEx.exe    N/A      |
|    0   N/A  N/A     15552    C+G   ...2txyewy\StartMenuExperienceHost.exe    N/A      |
|    0   N/A  N/A     17604    C+G   ...4657\office6\promecefpluginhost.exe    N/A      |
|    0   N/A  N/A     18544    C+G   ...CBS_cw5n1h2txyewy\TextInputHost.exe    N/A      |
|    0   N/A  N/A     18748    C+G   ...5n1h2txyewy\ShellExperienceHost.exe    N/A      |
|    0   N/A  N/A     22696    C+G   ...n\143.0.3650.139\msedgewebview2.exe    N/A      |
|    0   N/A  N/A     24724    C+G   ...nt.CBS_cw5n1h2txyewy\SearchHost.exe    N/A      |
|    0   N/A  N/A     25112    C+G   ...ndows\System32\DataExchangeHost.exe    N/A      |
|    0   N/A  N/A     28452    C+G   ...oogle\Chrome\Application\chrome.exe    N/A      |
+---------------------------------------------------------------------------------------+

The output shows the CUDA version: CUDA Version: 12.3 (the highest CUDA runtime this driver supports).

Pick the PyTorch install command that matches this CUDA version

Because the driver is backward compatible with older CUDA runtimes, the current PyTorch build for "CUDA 12.1" works fine under the 12.3 driver. The install commands are:

# Create a new conda environment with a fixed Python version
conda create -n pytorch2.5 python=3.10 -y
conda activate pytorch2.5
# Install PyTorch
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.1 -c pytorch -c nvidia -y
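
After the installation, a quick sanity check confirms that this PyTorch build can see the RTX 3050 (a minimal sketch; the exact version strings depend on your install):

```python
import torch

print(torch.__version__)          # e.g. 2.5.1
print(torch.version.cuda)         # CUDA version PyTorch was built against, e.g. 12.1
print(torch.cuda.is_available())  # True if the driver and GPU are usable
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. NVIDIA GeForce RTX 3050 ...
```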

PyTorch basics: tensors and computation

Tensor operations and GPU acceleration

ts1.py
import torch

# Create tensors on the GPU, using a consistent float32 dtype
tensor_a = torch.tensor([[1, 2], [3, 4]], device='cuda', dtype=torch.float32)  # explicit float32
tensor_b = torch.randn(2, 2).cuda()  # random tensor moved to the GPU; defaults to float32

# Check the dtypes
print(f"tensor_a dtype: {tensor_a.dtype}")
print(f"tensor_b dtype: {tensor_b.dtype}")

# Common operations
result = torch.matmul(tensor_a, tensor_b)  # matrix multiplication
print(f"Result:\n{result}")

# The @ operator is equivalent to torch.matmul; here it multiplies by the transpose, so the values differ from result
result2 = tensor_a @ tensor_b.T
print(f"Result with transpose:\n{result2}")
Run ts1.py
$ python ts1.py
tensor_a dtype: torch.float32
tensor_b dtype: torch.float32
Result:
tensor([[ 3.6182, -2.2474],
        [ 9.4193, -3.6061]], device='cuda:0')
Result with transpose:
tensor([[ 3.9601, -2.4183],
        [10.1031, -4.1190]], device='cuda:0')
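
Note that the printed values differ from run to run (and from the ts2.py run below), because tensor_b comes from torch.randn. Seeding the random number generators makes them reproducible (a minimal sketch):

```python
import torch

torch.manual_seed(0)            # seed the CPU RNG
torch.cuda.manual_seed_all(0)   # seed the GPU RNGs
print(torch.randn(2, 2))        # identical values on every run with the same seed
```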

Automatic differentiation (Autograd)

x = torch.tensor(3.0, requires_grad=True)
y = x**2 + 2*x + 1
y.backward()  # compute the gradient automatically
print(x.grad)  # prints tensor(8.)  (dy/dx = 2x + 2 = 8 at x = 3)
ts2.py
import torch

# Create tensors on the GPU, using a consistent float32 dtype
tensor_a = torch.tensor([[1, 2], [3, 4]], device='cuda', dtype=torch.float32)  # explicit float32
tensor_b = torch.randn(2, 2).cuda()  # random tensor moved to the GPU; defaults to float32

# Check the dtypes
print(f"tensor_a dtype: {tensor_a.dtype}")
print(f"tensor_b dtype: {tensor_b.dtype}")

# Common operations
result = torch.matmul(tensor_a, tensor_b)  # matrix multiplication
print(f"Result:\n{result}")

# The @ operator is equivalent to torch.matmul; here it multiplies by the transpose, so the values differ from result
result2 = tensor_a @ tensor_b.T
print(f"Result with transpose:\n{result2}")

x = torch.tensor(3.0, requires_grad=True)
y = x**2 + 2*x + 1
y.backward()  # compute the gradient automatically
print(x.grad)  # prints tensor(8.)  (dy/dx = 2x + 2 = 8 at x = 3)

Key point: requires_grad=True turns on tracking of the computation graph, and backward() backpropagates through it to compute the gradient.
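
Related to this, gradients accumulate in .grad across backward() calls, which is why the training loops later call optimizer.zero_grad() on every step. A short illustration (not part of ts2.py):

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

(x**2).backward()
print(x.grad)   # tensor(6.)
(x**2).backward()
print(x.grad)   # tensor(12.) -- the second gradient was added to the first

x.grad.zero_()  # what optimizer.zero_grad() does for every parameter
(x**2).backward()
print(x.grad)   # tensor(6.)
```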

Run ts2.py
$ python ts2.py
tensor_a dtype: torch.float32
tensor_b dtype: torch.float32
Result:
tensor([[2.9388, 1.2983],
        [6.5159, 3.5078]], device='cuda:0')
Result with transpose:
tensor([[2.4608, 1.5373],
        [5.5599, 4.2248]], device='cuda:0')
tensor(8.)

Model training and saving

Install the transformers library

# Install the transformers library. Version-conflict warnings may appear, but they do not affect basic use;
# minor sympy version differences normally do not break PyTorch or transformers.
pip install transformers -i https://pypi.tuna.tsinghua.edu.cn/simple
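
A one-line check confirms the install and shows which version was resolved (the output will vary with your environment):

```python
import transformers
print(transformers.__version__)
```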

Use a Hugging Face dataset

# Using Hugging Face datasets requires the datasets package
pip install datasets -i https://pypi.tuna.tsinghua.edu.cn/simple
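
A quick look at what load_dataset("imdb") returns also explains the three tokenization progress bars in the ts3.py run output below: the DatasetDict holds a train split (25k), a test split (25k), and an unlabeled unsupervised split (50k), and map() runs over all three (a minimal sketch):

```python
from datasets import load_dataset

dataset = load_dataset("imdb")
print(dataset)  # DatasetDict with 'train' (25000), 'test' (25000), 'unsupervised' (50000)
print(dataset["train"][0]["label"], dataset["train"][0]["text"][:60])
```
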
ts3.py
import os
import torch
import torch.optim as optim
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset
from torch.utils.data import DataLoader
import warnings

# Suppress warnings
warnings.filterwarnings("ignore", message="Xet Storage is enabled")
os.environ["TRANSFORMERS_NO_ADVISORY_WARNINGS"] = "1"

print("Loading BERT model and tokenizer...")
# Load the model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print("Loading IMDb dataset...")
# Load the IMDb movie-review dataset (sentiment analysis)
dataset = load_dataset("imdb")

# Preprocessing function
def preprocess_function(examples):
    # Tokenize the text with truncation and padding
    result = tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=128
    )
    result["labels"] = examples["label"]
    return result

print("Tokenizing data...")
# Apply tokenization
tokenized_datasets = dataset.map(preprocess_function, batched=True)

# Format the dataset as PyTorch tensors
tokenized_datasets.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

# Create data loaders (only a small subset is used for the demo, to avoid running out of memory)
print("Creating data loaders...")
# NOTE: the IMDb splits are stored grouped by label, so these unshuffled subsets end up effectively
# single-class; shuffle before selecting for a meaningful demo (see the note after the run output)
train_dataset = tokenized_datasets["train"].select(range(100))  # only 100 training examples
test_dataset = tokenized_datasets["test"].select(range(20))     # only 20 test examples

train_dataloader = DataLoader(
    train_dataset,
    batch_size=8,
    shuffle=True
)

test_dataloader = DataLoader(
    test_dataset,
    batch_size=4
)

print(f"Training samples: {len(train_dataset)}")
print(f"Test samples: {len(test_dataset)}")

# Move the model to the GPU (if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
print(f"Model moved to: {device}")

# Create the optimizer
optimizer = optim.AdamW(model.parameters(), lr=5e-5)

# Training loop
print("\nStarting training...")
model.train()
for epoch in range(3):
    total_loss = 0
    for i, batch in enumerate(train_dataloader):
        # Move the batch to the device
        batch = {k: v.to(device) for k, v in batch.items()}

        # Forward pass
        outputs = model(**batch)
        loss = outputs.loss
        total_loss += loss.item()

        # Backward pass and parameter update
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if i % 5 == 0:  # print every 5 batches
            print(f"Epoch {epoch+1}, Batch {i+1}/{len(train_dataloader)}, Loss: {loss.item():.4f}")

    avg_loss = total_loss / len(train_dataloader)
    print(f"Epoch {epoch+1} completed, Average Loss: {avg_loss:.4f}")

# Save the model
torch.save(model.state_dict(), "model.pt")
print("\nModel saved as 'model.pt'")

# Evaluate the model
print("\nTesting model...")
model.eval()
correct = 0
total = 0

with torch.no_grad():
    for batch in test_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        predictions = torch.argmax(outputs.logits, dim=-1)
        labels = batch["labels"]

        correct += (predictions == labels).sum().item()
        total += labels.size(0)

accuracy = correct / total * 100
print(f"Test Accuracy: {accuracy:.2f}% ({correct}/{total})")

# Single-sample inference example
print("\nSingle sample inference:")
test_text = "This movie was absolutely fantastic! I loved every minute of it."
print(f"Input: {test_text}")

# Preprocess the input
inputs = tokenizer(test_text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)
    probabilities = torch.softmax(outputs.logits, dim=-1)
    prediction = torch.argmax(outputs.logits, dim=-1)

    print(f"Positive score: {probabilities[0][1].item():.4f}")
    print(f"Negative score: {probabilities[0][0].item():.4f}")
    print(f"Prediction: {'Positive' if prediction.item() == 1 else 'Negative'}")

print("\nTraining completed successfully!")
What the ts3.py code does
This script is a complete **training and evaluation pipeline for a text sentiment classifier**. In detail:

## 🎯 **Goal**
Fine-tune a BERT-based model to judge whether a movie review is positive or negative.

## 📊 **What it does**

### 1. **Model loading**
- Loads the pretrained BERT model (bert-base-uncased) via Hugging Face `transformers`
- Adds a classification head that maps BERT's output to two classes (positive/negative)
- Loads the matching tokenizer

### 2. **Data preparation**
- **Dataset**: the IMDb movie-review dataset (25k training + 25k test examples)
- **Preprocessing** (see the short tokenizer sketch below):
  - Converts text into token IDs that BERT understands
  - Adds an attention mask (distinguishing real tokens from padding)
  - Uses a fixed length of 128 tokens: longer texts are truncated, shorter ones padded
  - Carries over the sentiment label (0 = negative, 1 = positive)
- **Subsetting**: for the demo, only 100 training and 20 test examples are used
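
A minimal sketch of what the preprocessing step produces for a single sentence (illustrative only):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("A short review.", truncation=True, padding="max_length", max_length=16)
print(enc["input_ids"])       # token IDs, padded with 0 up to max_length
print(enc["attention_mask"])  # 1 for real tokens, 0 for padding
print(tokenizer.convert_ids_to_tokens(enc["input_ids"])[:6])
# ['[CLS]', 'a', 'short', 'review', '.', '[SEP]']
```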

### 3. **Training**
- **Optimizer**: AdamW (the usual choice for BERT)
- **Learning rate**: 5e-5 (a standard BERT fine-tuning rate)
- **Training loop**:
  - 3 epochs (three full passes over the subset)
  - Each batch: forward pass, loss computation, backward pass, parameter update
  - The loss is tracked and printed

### 4. **Evaluation**
- Measures accuracy on the test subset
- Runs inference on a single sample and prints:
  - the positive-class probability
  - the negative-class probability
  - the final prediction

### 5. **Saving**
- Saves the trained weights to "model.pt"

## 🔧 **Key techniques**
1. **Transfer learning**: fine-tuning on top of pretrained BERT
2. **GPU acceleration**: CUDA is detected and used automatically
3. **Batching**: a DataLoader feeds the data efficiently
4. **Autograd**: PyTorch's automatic differentiation
5. **Deployment**: the trained model is saved for later use

## 📈 **Application scenarios**
- E-commerce: analyzing user reviews
- Social media: monitoring public sentiment
- Customer service: automatically analyzing feedback
- Content platforms: as part of a recommender system

## ⚠️ **Caveats**
- The demo uses very little data, and the unshuffled 100-example subset appears to be single-class (see the note after the run output); real applications need far more, and balanced, data
- Model quality depends on the amount and quality of the training data
- Hyperparameters (learning rate, batch size, number of epochs, ...) can be tuned for better results

The script walks through **the full workflow from data preparation to training, evaluation, and saving**, a classic introductory NLP example.
Run ts3.py
$ python ts3.py
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Loading BERT model and tokenizer...
Loading IMDb dataset...
Tokenizing data...
Map: 100%|▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒| 25000/25000 [00:04<00:00, 5796.14 examples/s]
Map: 100%|▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒| 25000/25000 [00:04<00:00, 5359.47 examples/s]
Map: 100%|▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒| 50000/50000 [00:10<00:00, 4998.96 examples/s]
Creating data loaders...
Training samples: 100
Test samples: 20
Model moved to: cuda

Starting training...
Epoch 1, Batch 1/13, Loss: 0.6865
Epoch 1, Batch 6/13, Loss: 0.0891
Epoch 1, Batch 11/13, Loss: 0.0209
Epoch 1 completed, Average Loss: 0.1602
Epoch 2, Batch 1/13, Loss: 0.0118
Epoch 2, Batch 6/13, Loss: 0.0051
Epoch 2, Batch 11/13, Loss: 0.0033
Epoch 2 completed, Average Loss: 0.0058
Epoch 3, Batch 1/13, Loss: 0.0027
Epoch 3, Batch 6/13, Loss: 0.0020
Epoch 3, Batch 11/13, Loss: 0.0017
Epoch 3 completed, Average Loss: 0.0020

Model saved as 'model.pt'

Testing model...
Test Accuracy: 100.00% (20/20)

Single sample inference:
Input: This movie was absolutely fantastic! I loved every minute of it.
Positive score: 0.0017
Negative score: 0.9983
Prediction: Negative

Training completed successfully!
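
The result above looks paradoxical: 100% test accuracy, yet an obviously positive sentence is classified as Negative with 99.8% confidence. The likely cause is that the IMDb splits are grouped by label, so the unshuffled range(100)/range(20) subsets contain only negative examples and the model simply learned to answer "Negative". A hypothetical one-line fix (not in ts3.py as written) is to shuffle before selecting:

```python
# Shuffle so the small demo subsets contain both classes (seed chosen arbitrarily)
train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(100))
test_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(20))
```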

Model loading and inference

Load the fine-tuned weights from model.pt

ts4.py
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import warnings

# Suppress warnings
warnings.filterwarnings("ignore")

def load_model(model_path="model.pt", model_name="bert-base-uncased"):
    """
    Load trained model
    """
    print("Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    print("Loading base model...")
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    print(f"Loading fine-tuned weights from {model_path}...")
    model.load_state_dict(torch.load(model_path, weights_only=True))

    # Move to GPU if available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()

    print(f"Model loaded and moved to {device}")
    return model, tokenizer, device

def predict_single(text, model, tokenizer, device, max_length=128):
    """
    Predict single text
    """
    # Preprocess input
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding="max_length",
        max_length=max_length
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Inference
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probabilities = torch.softmax(logits, dim=-1)
        prediction = torch.argmax(logits, dim=1).item()

    return {
        "text": text,
        "prediction": prediction,  # 0=negative, 1=positive
        "confidence": probabilities[0][prediction].item(),
        "negative_score": probabilities[0][0].item(),
        "positive_score": probabilities[0][1].item()
    }

def print_result(result):
    """
    Print result - English only
    """
    sentiment = "Positive" if result['prediction'] == 1 else "Negative"
    print(f"\nText: {result['text']}")
    print(f"Prediction: {sentiment}")
    print(f"Confidence: {result['confidence']:.2%}")
    print(f"Positive score: {result['positive_score']:.4f}")
    print(f"Negative score: {result['negative_score']:.4f}")
    print("-" * 50)

def main():
    try:
        # 1. Load model
        model, tokenizer, device = load_model("model.pt")

        # 2. Single sample inference
        print("\n" + "="*50)
        print("Single Sample Inference Test")
        print("="*50)

        test_cases = [
            "This movie was absolutely fantastic! I loved every minute of it.",
            "Waste of time and money. Terrible acting and boring story.",
            "It was okay, not great but not bad either.",
            "The special effects were amazing but the plot was weak.",
            "I've never seen such a wonderful performance in my life!"
        ]

        for text in test_cases:
            result = predict_single(text, model, tokenizer, device)
            print_result(result)

        # 3. Interactive inference
        print("\n" + "="*50)
        print("Interactive Inference (Type 'quit' or Ctrl+C to exit)")
        print("="*50)

        while True:
            try:
                user_input = input("\nEnter English text to analyze: ")
                if user_input.lower() == 'quit':
                    break

                if user_input.strip():
                    result = predict_single(user_input, model, tokenizer, device)
                    print_result(result)
            except KeyboardInterrupt:
                print("\n\nExited by user")
                break

    except FileNotFoundError:
        print("Error: model.pt not found. Please run training script first.")
    except Exception as e:
        print(f"Error occurred: {e}")

if __name__ == "__main__":
    main()
What ts4.py does
This script is a **complete system for loading a trained model and running inference**, used here for sentiment analysis. A detailed breakdown:

## 🎯 **Core function**
**Sentiment classification**: decide whether an English text (here, movie reviews) is positive or negative.

## 🛠️ **Architecture**

### 1. **Model management**
```python
def load_model(model_path="model.pt", model_name="bert-base-uncased"):
```
- **Load the pretrained base**: Hugging Face's `bert-base-uncased`
- **Load the fine-tuned weights**: task-specific weights from the local file `model.pt`
- **Device management**: detect and use a CUDA GPU automatically
- **Mode switch**: put the model into evaluation mode (`model.eval()`)

### 2. **Inference engine**
```python
def predict_single(text, model, tokenizer, device, max_length=128):
```
- **Text preprocessing**:
  - Tokenization
  - Truncation
  - Padding
  - Conversion to tensors
- **Model inference**:
  - Gradient tracking disabled (`torch.no_grad()`)
  - Forward pass
  - Softmax to turn logits into probabilities
  - Argmax to pick the class

### 3. **Interaction layer**
- **Batch test**: predefined test sentences to sanity-check the model
- **Interactive mode**: the user can type text and get a prediction in real time
- **Error handling**: missing model file, keyboard interrupt, and other exceptions

## 🔄 **Workflow**
```
input text → tokenize/encode → model inference → probabilities → class output
     ↓              ↓                 ↓                ↓              ↓
 "I love it" → [101, 1045, ...] →  Logits    →     Softmax   → "Positive (99%)"
```

## 📊 **Output format**
```python
{
    "text": "This movie was fantastic!",      # original text
    "prediction": 1,                          # 0 = negative, 1 = positive
    "confidence": 0.9983,                     # confidence of the prediction
    "negative_score": 0.0017,                 # probability of the negative class
    "positive_score": 0.9983                  # probability of the positive class
}
```

## 🔧 **Key techniques**

### 1. **Transfer learning**
```python
# pretrained model + fine-tuned weights
base_model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
base_model.load_state_dict(torch.load("model.pt"))
```
- **Pretraining**: general language understanding learned on a large corpus
- **Fine-tuning**: further training on the specific task (sentiment analysis)

### 2. **GPU acceleration**
```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
```
- Detects GPU availability automatically
- Moves the model and inputs to the GPU for faster computation

### 3. **Inference optimization**
```python
with torch.no_grad():  # disable gradient tracking
    outputs = model(**inputs)
```
- **Memory**: no intermediate gradients are kept
- **Speed**: unnecessary computation is skipped

## 🎪 **Usage scenarios**

### 1. **Quality monitoring**
```python
# batch analysis of user reviews
reviews = ["Great product!", "Terrible service", "Average experience"]
for review in reviews:
    result = predict_single(review, model, tokenizer, device)
```

### 2. **Real-time analysis**
```python
# real-time sentiment analysis in a support system
while user_is_talking:
    text = get_latest_message()
    sentiment = predict_single(text, model, tokenizer, device)
    if sentiment["prediction"] == 0:
        escalate_to_supervisor()
```

### 3. **Annotation assistance**
```python
# semi-automatic labeling tool
unlabeled_texts = [...]  # large pool of unlabeled data
for text in unlabeled_texts:
    prediction = predict_single(text, model, tokenizer, device)
    if prediction["confidence"] > 0.95:
        auto_label(text, prediction["prediction"])
```

## ⚡ **Performance characteristics**
- **Latency**: roughly 10-50 ms per inference (depending on text length and hardware)
- **Throughput**: batching raises it considerably
- **Memory**: BERT-base is about 440 MB; the fine-tuned classification head adds very little

## 🔍 **Possible extensions**

### 1. **Multilingual support**
```python
# switch to a multilingual model
model_name = "bert-base-multilingual-cased"
```

### 2. **Multi-class tasks**
```python
# change to 5 classes (very negative, negative, neutral, positive, very positive)
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", 
    num_labels=5
)
```

### 3. **Serving as an API**
```python
from flask import Flask, request, jsonify
app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict_api():
    text = request.json.get('text', '')
    result = predict_single(text, model, tokenizer, device)
    return jsonify(result)
```

## 📈 **Assessing the model**
From the run output below:
- **Suspiciously high confidence**: every prediction is above 98% confident, and every input, including clearly positive ones, is classified as Negative
- **Likely cause**: not ordinary overfitting but the single-class training subset used in ts3.py, so the model has effectively learned only one class
- **How to improve**:
  1. Use more training data
  2. Shuffle and balance the positive/negative examples before subsetting
  3. Add regularization
  4. Tune the learning rate

## 💡 **Summary**
The script shows the core components of a **production-style NLP inference system**:
- **Modular design**: model loading, inference, and interaction are separated
- **Production concerns**: error handling, device management, inference optimization
- **Easy to extend**: batch processing, API integration, and more

It is a typical example of turning a research model into a usable application.
Run ts4.py

Type quit (or Ctrl+C) to exit

$ python ts4.py
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Loading tokenizer...
Loading base model...
Loading fine-tuned weights from model.pt...
Model loaded and moved to cuda

==================================================
Single Sample Inference Test
==================================================

Text: This movie was absolutely fantastic! I loved every minute of it.
Prediction: Negative
Confidence: 99.83%
Positive score: 0.0017
Negative score: 0.9983
--------------------------------------------------

Text: Waste of time and money. Terrible acting and boring story.
Prediction: Negative
Confidence: 99.85%
Positive score: 0.0015
Negative score: 0.9985
--------------------------------------------------

Text: It was okay, not great but not bad either.
Prediction: Negative
Confidence: 99.83%
Positive score: 0.0017
Negative score: 0.9983
--------------------------------------------------

Text: The special effects were amazing but the plot was weak.
Prediction: Negative
Confidence: 99.85%
Positive score: 0.0015
Negative score: 0.9985
--------------------------------------------------

Text: I've never seen such a wonderful performance in my life!
Prediction: Negative
Confidence: 99.83%
Positive score: 0.0017
Negative score: 0.9983
--------------------------------------------------

==================================================
Interactive Inference (Type 'quit' or Ctrl+C to exit)
==================================================

Enter English text to analyze: quit

Using custom data: data loading and preprocessing, model saving and loading, and inference

ts5.py

# 在代码开头添加
import sys
import io
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import numpy as np
from typing import List, Dict, Optional

# ==================== 6.2 数据加载与预处理 ====================

class TextDataset(Dataset):
    """自定义文本数据集类"""
    def __init__(self, texts: List[str], labels: List[int], tokenizer, max_length: int = 128):
        """
        初始化数据集
        
        参数:
            texts: 文本列表
            labels: 标签列表
            tokenizer: 分词器
            max_length: 最大序列长度
        """
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
        
    def __len__(self) -> int:
        """返回数据集大小"""
        return len(self.texts)
    
    def __getitem__(self, idx: int) -> Dict[str, torch.Tensor]:
        """获取单个样本"""
        text = str(self.texts[idx])
        label = self.labels[idx]
        
        # 对文本进行编码
        encoding = self.tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )
        
        # 移除batch维度
        item = {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }
        
        return item

# ==================== 模型训练与评估类 ====================

class TextClassifierTrainer:
    """文本分类训练器"""
    def __init__(self, 
                 model_name: str = "bert-base-uncased",
                 num_labels: int = 2,
                 device: Optional[str] = None):
        """
        初始化训练器
        
        参数:
            model_name: 预训练模型名称
            num_labels: 标签数量
            device: 设备 (cpu/cuda)
        """
        self.device = device if device else ('cuda' if torch.cuda.is_available() else 'cpu')
        print(f"使用设备: {self.device}")
        
        # 加载分词器和模型
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name, 
            num_labels=num_labels
        ).to(self.device)
        
        self.num_labels = num_labels
        
    def create_dataloader(self, 
                         texts: List[str], 
                         labels: List[int], 
                         batch_size: int = 32,
                         shuffle: bool = True,
                         max_length: int = 128) -> DataLoader:
        """
        创建数据加载器
        
        参数:
            texts: 文本列表
            labels: 标签列表
            batch_size: 批次大小
            shuffle: 是否打乱数据
            max_length: 最大序列长度
        """
        dataset = TextDataset(texts, labels, self.tokenizer, max_length)
        dataloader = DataLoader(
            dataset, 
            batch_size=batch_size, 
            shuffle=shuffle,
            pin_memory=True if self.device == 'cuda' else False
        )
        return dataloader
    
    def train(self, 
              train_loader: DataLoader,
              val_loader: Optional[DataLoader] = None,
              epochs: int = 3,
              learning_rate: float = 5e-5,
              save_path: str = "model.pt"):
        """
        训练模型
        
        参数:
            train_loader: 训练数据加载器
            val_loader: 验证数据加载器
            epochs: 训练轮数
            learning_rate: 学习率
            save_path: 模型保存路径
        """
        # 设置优化器
        optimizer = optim.AdamW(self.model.parameters(), lr=learning_rate)
        
        # 训练循环
        for epoch in range(epochs):
            print(f"\n{'='*50}")
            print(f"Epoch {epoch + 1}/{epochs}")
            print(f"{'='*50}")
            
            # 训练模式
            self.model.train()
            total_loss = 0
            correct = 0
            total = 0
            
            for batch_idx, batch in enumerate(train_loader):
                # 将数据移动到设备
                batch = {k: v.to(self.device) for k, v in batch.items()}
                
                # 前向传播
                outputs = self.model(**batch)
                loss = outputs.loss
                
                # 反向传播
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()
                
                # 计算统计信息
                total_loss += loss.item()
                predictions = torch.argmax(outputs.logits, dim=-1)
                correct += (predictions == batch['labels']).sum().item()
                total += batch['labels'].size(0)
                
                # 每100个batch打印一次进度
                if (batch_idx + 1) % 100 == 0:
                    print(f"  Batch {batch_idx + 1}/{len(train_loader)}, "
                          f"Loss: {loss.item():.4f}, "
                          f"Accuracy: {correct/total:.4f}")
            
            # 计算epoch平均损失和准确率
            avg_loss = total_loss / len(train_loader)
            train_acc = correct / total
            print(f"训练结果 - Loss: {avg_loss:.4f}, Accuracy: {train_acc:.4f}")
            
            # 验证(如果提供了验证集)
            if val_loader:
                val_loss, val_acc = self.evaluate(val_loader)
                print(f"验证结果 - Loss: {val_loss:.4f}, Accuracy: {val_acc:.4f}")
        
        # 保存模型
        self.save_model(save_path)
        print(f"\n模型已保存到: {save_path}")
    
    def evaluate(self, dataloader: DataLoader) -> tuple:
        """
        评估模型
        
        参数:
            dataloader: 评估数据加载器
            
        返回:
            (平均损失, 准确率)
        """
        self.model.eval()
        total_loss = 0
        correct = 0
        total = 0
        
        with torch.no_grad():
            for batch in dataloader:
                batch = {k: v.to(self.device) for k, v in batch.items()}
                outputs = self.model(**batch)
                
                total_loss += outputs.loss.item()
                predictions = torch.argmax(outputs.logits, dim=-1)
                correct += (predictions == batch['labels']).sum().item()
                total += batch['labels'].size(0)
        
        avg_loss = total_loss / len(dataloader)
        accuracy = correct / total if total > 0 else 0
        
        return avg_loss, accuracy
    
    def predict(self, text: str) -> Dict:
        """
        预测单个文本
        
        参数:
            text: 输入文本
            
        返回:
            预测结果字典
        """
        self.model.eval()
        
        # 编码文本
        inputs = self.tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=128,
            return_tensors='pt'
        )
        
        # 移动到设备
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        
        # 预测
        with torch.no_grad():
            outputs = self.model(**inputs)
            probabilities = torch.softmax(outputs.logits, dim=-1)
            prediction = torch.argmax(outputs.logits, dim=1).item()
        
        return {
            'text': text,
            'prediction': prediction,
            'probabilities': probabilities.cpu().numpy()[0].tolist(),
            'confidence': probabilities.max().item()
        }
    
    def save_model(self, path: str):
        """保存模型"""
        torch.save({
            'model_state_dict': self.model.state_dict(),
            'tokenizer_config': self.tokenizer.init_kwargs,
            'model_config': self.model.config.to_dict(),
            'num_labels': self.num_labels
        }, path)
    
    def load_model(self, path: str):
        """加载模型"""
        checkpoint = torch.load(path, map_location=self.device)
        self.model.load_state_dict(checkpoint['model_state_dict'])
        self.num_labels = checkpoint['num_labels']

# ==================== 使用示例 ====================

def main():
    """主函数:演示完整流程"""
    print("="*60)
    print("文本分类完整流程演示")
    print("="*60)
    
    # 1. 准备示例数据
    print("\n1. 准备数据...")
    train_texts = [
        "This product is amazing! I really love it.",
        "Terrible experience, would not recommend.",
        "The quality is good but delivery was late.",
        "Excellent service and fast shipping.",
        "Poor customer support, very disappointed.",
        "Good value for money.",
        "Not as described, very misleading.",
        "Perfect! Exactly what I wanted."
    ]
    train_labels = [1, 0, 1, 1, 0, 1, 0, 1]  # 1: 正面, 0: 负面
    
    test_texts = [
        "I'm very satisfied with this purchase.",
        "Waste of money, completely useless."
    ]
    test_labels = [1, 0]
    
    # 2. 初始化训练器
    print("\n2. 初始化训练器...")
    trainer = TextClassifierTrainer(
        model_name="bert-base-uncased",
        num_labels=2,
        device='cpu'  # 使用cpu,如需GPU请改为'cuda'
    )
    
    # 3. 创建数据加载器
    print("\n3. 创建数据加载器...")
    train_loader = trainer.create_dataloader(
        train_texts, train_labels, 
        batch_size=2, shuffle=True, max_length=64
    )
    
    test_loader = trainer.create_dataloader(
        test_texts, test_labels,
        batch_size=2, shuffle=False, max_length=64
    )
    
    # 4. 训练模型
    print("\n4. 开始训练模型...")
    trainer.train(
        train_loader=train_loader,
        val_loader=test_loader,
        epochs=3,
        learning_rate=5e-5,
        save_path="text_classifier_model.pt"
    )
    
    # 5. 加载模型并进行推理
    print("\n5. 加载模型并进行推理...")
    trainer.load_model("text_classifier_model.pt")
    
    # 6. 测试推理
    print("\n6. 测试推理...")
    test_samples = [
        "This is the best product I've ever bought!",
        "I'm very disappointed with the quality.",
        "It's okay, nothing special.",
        "The service was absolutely fantastic!"
    ]
    
    for text in test_samples:
        result = trainer.predict(text)
        sentiment = "正面" if result['prediction'] == 1 else "负面"
        print(f"\n文本: {text}")
        print(f"情感: {sentiment}")
        print(f"置信度: {result['confidence']:.2%}")
        print(f"概率分布: [负面: {result['probabilities'][0]:.4f}, 正面: {result['probabilities'][1]:.4f}]")
    
    print("\n" + "="*60)
    print("流程完成!")
    print("="*60)

# ==================== 模型加载与推理的独立函数 ====================

def load_and_predict(model_path: str, text: str, device: str = None):
    """
    加载模型并进行预测的独立函数
    
    参数:
        model_path: 模型路径
        text: 输入文本
        device: 设备类型
        
    返回:
        预测结果
    """
    if device is None:
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    # 加载检查点
    checkpoint = torch.load(model_path, map_location=device)
    
    # 重新创建模型
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased",
        num_labels=checkpoint['num_labels']
    ).to(device)
    
    # 加载权重
    model.load_state_dict(checkpoint['model_state_dict'])
    model.eval()
    
    # 重新创建分词器
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    
    # 编码文本
    inputs = tokenizer(
        text,
        truncation=True,
        padding='max_length',
        max_length=128,
        return_tensors='pt'
    ).to(device)
    
    # 预测
    with torch.no_grad():
        outputs = model(**inputs)
        probabilities = torch.softmax(outputs.logits, dim=-1)
        prediction = torch.argmax(outputs.logits, dim=1).item()
    
    return {
        'prediction': prediction,
        'probabilities': probabilities.cpu().numpy()[0],
        'confidence': probabilities.max().item()
    }

if __name__ == "__main__":
    # 运行完整演示
    main()
    
    # 示例:使用独立的加载和预测函数
    print("\n\n独立函数调用示例:")
    result = load_and_predict("text_classifier_model.pt", "This product is excellent!")
    print(f"预测结果: {result}")

ts5.py run output

$ python ts5.py
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
D:\temp\temp\ts8.py:260: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint = torch.load(path, map_location=self.device)
D:\temp\temp\ts8.py:365: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint = torch.load(model_path, map_location=device)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
============================================================
文本分类完整流程演示
============================================================

1. 准备数据...

2. 初始化训练器...
使用设备: cpu

3. 创建数据加载器...

4. 开始训练模型...

==================================================
Epoch 1/3
==================================================
训练结果 - Loss: 0.6826, Accuracy: 0.6250
验证结果 - Loss: 0.7525, Accuracy: 0.5000

==================================================
Epoch 2/3
==================================================
训练结果 - Loss: 0.6311, Accuracy: 0.6250
验证结果 - Loss: 0.6439, Accuracy: 0.5000

==================================================
Epoch 3/3
==================================================
训练结果 - Loss: 0.4965, Accuracy: 0.7500
验证结果 - Loss: 0.5111, Accuracy: 1.0000

模型已保存到: text_classifier_model.pt

5. 加载模型并进行推理...

6. 测试推理...

文本: This is the best product I've ever bought!
情感: 正面
置信度: 71.61%
概率分布: [负面: 0.2839, 正面: 0.7161]

文本: I'm very disappointed with the quality.
情感: 正面
置信度: 63.62%
概率分布: [负面: 0.3638, 正面: 0.6362]

文本: It's okay, nothing special.
情感: 正面
置信度: 59.90%
概率分布: [负面: 0.4010, 正面: 0.5990]

文本: The service was absolutely fantastic!
情感: 正面
置信度: 69.58%
概率分布: [负面: 0.3042, 正面: 0.6958]

============================================================
流程完成!
============================================================


独立函数调用示例:
预测结果: {'prediction': 1, 'probabilities': array([0.28677344, 0.7132265 ], dtype=float32), 'confidence': 0.7132264971733093}