简介&快速上手
https://huggingface.co/meta-llama/Llama-Prompt-Guard-2-86M
PromptGuard ,meta开源的模型,bert架构,可以用于检测Prompt注入、越狱攻击等。一个简单的测试如下:
from transformers import pipeline
classifier = pipeline("text-classification", model="meta-llama/Llama-Prompt-Guard-2-86M")
classifier("Ignore your previous instructions.")
#输出为 [{'label': 'LABEL_1', 'score': 0.9995731711387634}]
其模型架构如下
基模为DebertaV2架构
词表大小为251000,支持多种语言,词向量维度为768
DebertaV2ForSequenceClassification(
(deberta): DebertaV2Model(
(embeddings): DebertaV2Embeddings(
(word_embeddings): Embedding(251000, 768, padding_idx=0)
(LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): DebertaV2Encoder(
(layer): ModuleList(
(0-11): 12 x DebertaV2Layer(
(attention): DebertaV2Attention(
(self): DisentangledSelfAttention(
(query_proj): Linear(in_features=768, out_features=768, bias=True)
(key_proj): Linear(in_features=768, out_features=768, bias=True)
(value_proj): Linear(in_features=768, out_features=768, bias=True)
(pos_dropout): Dropout(p=0.1, inplace=False)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): DebertaV2SelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): DebertaV2Intermediate(
...
(dropout): Dropout(p=0, inplace=False)
)
(classifier): Linear(in_features=768, out_features=2, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
微调
官方有专门的微调手册,可直接参考进行学习
微调用的中文数据集可以选择https://huggingface.co/datasets/CTCT-CT2/ChangeMore-prompt-injection-eval
数据集加载&查看
dataset = load_dataset("CTCT-CT2/ChangeMore-prompt-injection-eval")
train_dataset = dataset["test"].select(range(8000))
test_dataset = dataset["test"].select(range(8000, 10000))

query 即为输入, label为对应的标签, 1为 prompt注入样本 0为正常样本
- 选用8000条作为训练集
- 选用2000条作为测试集
测试当前模型效果
def evaluate_batch(texts, batch_size=32, positive_label=1, temperature=1.0, device='cpu'):
model.to(device)
model.eval()
# Prepare the data loader
print(type(texts))
#embedding
encoded_texts = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
dataset = torch.utils.data.TensorDataset(encoded_texts['input_ids'], encoded_texts['attention_mask'])
data_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size)
scores = []
#调用模型进行分类
for batch in tqdm(data_loader, desc="Evaluating"):
input_ids, attention_mask = [b.to(device) for b in batch]
with torch.no_grad():
# 输入模型,获取logits [batchsize,2]
logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
scaled_logits = logits / temperature
probabilities = softmax(scaled_logits, dim=-1)
positive_class_probabilities = probabilities[:, positive_label].cpu().numpy()
scores.extend(positive_class_probabilities)
return scores
# Convert test_dataset['query'] to a list before passing to evaluate_batch
test_queries = list(test_dataset['query'])
test_scores = evaluate_batch(test_queries, positive_label=1, temperature=3.0, device='cuda:0')
绘制ROC曲线
plt.figure(figsize=(8, 6))
test_labels = [int(elt) for elt in test_dataset['label']]
fpr, tpr, _ = roc_curve(test_labels, test_scores)
roc_auc = roc_auc_score(test_labels, test_scores)
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
## 打印具体的数值
print(f"ROC AUC: {roc_auc:.3f}")

Precision: 0.970, Recall: 0.344, F1-score: 0.508, 有改进的空间
模型微调
数据集大小为8k,batchsize为16,epochs为1 学习率5e-6, 在3090 24GB上进行微调,用时3.5分钟
def train_model(train_dataset, model, tokenizer, batch_size=16, epochs=1, lr=5e-6, device='cuda:0'):
# Adjust the model's classifier to have two output labels
model.classifier = torch.nn.Linear(model.classifier.in_features, 2)
model.num_labels = 2
model.to(device)
model.train()
# Prepare optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
# Prepare data loader
def collate_fn(batch):
#提取文本标签
texts = [item['query'] for item in batch]
labels = torch.tensor([int(item['label']) for item in batch]) # Convert string labels to integers
encodings = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
# 返回文本embedding、掩码、标签
return encodings.input_ids, encodings.attention_mask, labels
data_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
# Training loop
for epoch in range(epochs):
total_loss = 0
for batch in tqdm(data_loader, desc=f"Epoch {epoch + 1}"):
input_ids, attention_mask, labels = [x.to(device) for x in batch]
outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)# 提供了标签,会自动计算交叉熵损失
loss = outputs.loss
# Backpropagation
optimizer.zero_grad()#清空梯度
loss.backward()# 反向传播,计算梯度
optimizer.step()# 参数更新
total_loss += loss.item()
print(f"Average loss in epoch {epoch + 1}: {total_loss / len(data_loader)}")
# Example usage
train_model(train_dataset, model, tokenizer, device='cuda:0')
微调后验证模型

微调前:Precision: 0.970, Recall: 0.344, F1-score: 0.508
微调后:Precision: 0.994, Recall: 0.971, F1-score: 0.982
小结
Bert利用注意力机制学习query的含义,很适用于二分类任务。其特点是轻量高效、微调成本低、延迟低。
然而其本身参数量少也导致很难去学习到越狱意图。仅匹配了恶意关键字、恶意样本结构。对于OOD的样本,分类准确率并不理想。不过微调成本很低,每天都可以迭代,需要定期维护更新数据集。
从整体来看,(Bert)二分类模型只能作为大模型防火墙中的一个小环节。通过训练其可以检测有明显恶意意图的输入。但做不到意图识别。
文章参考:
https://huggingface.co/meta-llama/Llama-Prompt-Guard-2-86M
博客地址: qwrdxer.github.io
欢迎交流: qq1944270374
转载请注明来源,欢迎对文章中的引用来源进行考证,欢迎指出任何有错误或不够清晰的表达。可以在下面评论区评论,也可以邮件至 1944270374@qq.com