
[Open Source Internship] Publish a MindSpore-native paper on arXiv (26) #204

Open · wants to merge 3 commits into base: master
20 changes: 20 additions & 0 deletions research/arxiv_papers/ragat/.gitattributes
@@ -0,0 +1,20 @@
# Auto detect text files and perform LF normalization
* text=auto

# Python bytecode files
*.pyc
*.pyo
__pycache__/

# Jupyter notebook checkpoints
.ipynb_checkpoints/

# Virtualenv directories
env/
venv/

# Data files
data/

# Logs
*.log
44 changes: 44 additions & 0 deletions research/arxiv_papers/ragat/README.md
@@ -0,0 +1,44 @@
# **RAGAT-Mind: A Multi-Granularity Modeling Approach for Rumor Detection**

## **Overview**
This project implements **RAGAT-Mind**, a multi-granularity modeling approach for rumor detection on Chinese social media. The model combines **TextCNN** for extracting local features, a **GRU** for capturing sequential dependencies, **multi-head self-attention** (MHA) for aggregating global features, and a **bidirectional graph convolutional network** (BiGCN) for structured learning over word co-occurrences. It is implemented on the **MindSpore** deep learning framework.

## **Dataset**
The model is evaluated on the **Weibo1-Rumor** dataset, which contains real social media posts from Weibo: 3,387 samples in total (1,538 rumors and 1,849 non-rumors). The data is split into a training set (80%) and a test set (20%). Preprocessing includes word segmentation, vocabulary indexing, and construction of the adjacency matrices required by the graph convolution, as sketched below.
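For concreteness, here is a minimal sketch of the preprocessing for a single post, assuming jieba for segmentation and the sliding-window co-occurrence graph used in `train.py` (the example text and the tiny vocabulary are placeholders):

```python
import jieba
import numpy as np

text = "这是一条示例微博文本"
tokens = jieba.lcut(text)                        # word segmentation
vocab = {"<pad>": 0, "<unk>": 1}                 # in practice, built over the training corpus
ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

# Co-occurrence adjacency: self-loops plus edges between neighboring tokens.
n = len(tokens)
adj = np.eye(n, dtype=np.float32)
for i in range(n - 1):
    adj[i, i + 1] = adj[i + 1, i] = 1.0
```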

## **Network Architecture**
RAGAT-Mind consists of the following core modules (a wiring sketch follows the list):
1. **TextCNN**: extracts local n-gram semantic features with convolution kernels of several sizes.
2. **GRU**: models sequential dependencies in the text, strengthening the model's sensitivity to contextual change.
3. **MHA**: multi-head self-attention focuses the model on the most informative semantic regions and enhances global semantic modeling.
4. **BiGCN**: a bidirectional graph convolutional network learns structural dependencies over the word co-occurrence graph.
5. **Fusion layer**: concatenates the outputs of the semantic path and the structural path into a unified feature representation used for the final classification.
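As a quick orientation, the sketch below instantiates the model with the hyperparameters used in `train.py`; the vocabulary size shown is a placeholder and should come from the preprocessing step.

```python
from model import TextCNN_GRU_MHA_BiGCN

# Hyperparameters mirror train.py; vocab_size depends on the corpus.
model = TextCNN_GRU_MHA_BiGCN(
    vocab_size=50000,          # placeholder; use len(vocab) in practice
    embed_dim=128,
    num_channels=100,
    kernel_sizes=[3, 4, 5],
    rnn_hidden_dim=128,
    num_heads=4,
    num_classes=2,
    dropout=0.5,
)
# Inputs: token ids (batch, seq_len) and adjacency matrices (batch, seq_len, seq_len);
# output: class logits of shape (batch, 2).
```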

## **Training Procedure**
1. **Data loading**: load the **Weibo1-Rumor** dataset from the specified path and apply data augmentation.
2. **Model training**: train for 50 epochs with the Adam optimizer; at the end of each epoch, compute evaluation metrics on the training and validation sets, including accuracy, precision, recall, and F1 score (see the metric sketch after this list).
3. **Saving results**: all training and validation metrics are recorded during training and saved to `training_results.txt`.
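The released `train.py` only prints the loss and test accuracy; a minimal sketch of how per-epoch metrics could be computed with scikit-learn and appended to `training_results.txt` is shown below (the helper name `log_epoch_metrics` is ours, not part of the original code):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def log_epoch_metrics(epoch, y_true, y_pred, path="training_results.txt"):
    # Accuracy plus macro-averaged precision/recall/F1 for one epoch.
    acc = accuracy_score(y_true, y_pred)
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"epoch={epoch} acc={acc:.4f} precision={p:.4f} recall={r:.4f} f1={f1:.4f}\n")
```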

## **Training Results**
### Output
At the end of each epoch, the results of the training and validation passes are printed and saved. The output includes:
- **Training set**:
  - Loss
  - Accuracy
  - Precision
  - Recall
  - F1-Score
  - Training Time

- **Validation set**:
  - Test Accuracy
  - Test Precision
  - Test Recall
  - Test F1-Score

### Performance Comparison
We compared RAGAT-Mind against several baselines, including **TextCNN**, **GRU-ATT**, **TextGCN**, and **BERT-FT**. The comparison covers accuracy, precision, recall, macro-averaged F1, and inference latency. **RAGAT-Mind** clearly outperforms all baselines on accuracy and macro F1, reaching a test accuracy of **99.2%** and a macro F1 of **0.9919**.

---

> **Note**: training and test results are saved to `training_results.txt`.
500 changes: 500 additions & 0 deletions research/arxiv_papers/ragat/data/test.csv

Large diffs are not rendered by default.

3,388 changes: 3,388 additions & 0 deletions research/arxiv_papers/ragat/data/train.csv

Large diffs are not rendered by default.

86 changes: 86 additions & 0 deletions research/arxiv_papers/ragat/model.py
@@ -0,0 +1,86 @@
import mindspore
import mindspore.nn as nn
import mindspore.ops as ops
from mindspore import context
from mindspore.common.initializer import Normal

# Set the MindSpore execution mode (graph mode on CPU)
context.set_context(mode=context.GRAPH_MODE, device_target="CPU")

# Multi-head self-attention module
class MultiHeadSelfAttention(nn.Cell):
def __init__(self, hidden_dim, num_heads):
super(MultiHeadSelfAttention, self).__init__()
assert hidden_dim % num_heads == 0
self.num_heads = num_heads
self.head_dim = hidden_dim // num_heads
self.q_linear = nn.Dense(hidden_dim, hidden_dim)
self.k_linear = nn.Dense(hidden_dim, hidden_dim)
self.v_linear = nn.Dense(hidden_dim, hidden_dim)
self.out_linear = nn.Dense(hidden_dim, hidden_dim)
self.softmax = nn.Softmax(axis=-1)
self.batch_matmul = ops.BatchMatMul()

    def construct(self, x):
        # x: (batch, seq_len, hidden_dim); project to Q/K/V and split into heads.
        batch_size, seq_len, hidden_dim = x.shape
        Q = self.q_linear(x).reshape(batch_size, seq_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3)
        K = self.k_linear(x).reshape(batch_size, seq_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3)
        V = self.v_linear(x).reshape(batch_size, seq_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3)
        scores = self.batch_matmul(Q, K.transpose(0, 1, 3, 2)) * (1.0 / self.head_dim ** 0.5)
        attn_weights = self.softmax(scores)
        # Renamed from `context` to avoid shadowing the imported mindspore.context module.
        attn_out = self.batch_matmul(attn_weights, V).transpose(0, 2, 1, 3).reshape(batch_size, seq_len, hidden_dim)
        return self.out_linear(attn_out)

# BiGCN module
class BiGCN(nn.Cell):
def __init__(self, input_dim, hidden_dim, output_dim):
super(BiGCN, self).__init__()
self.gcn_forward = nn.Dense(input_dim, hidden_dim)
self.gcn_backward = nn.Dense(input_dim, hidden_dim)
self.out_proj = nn.Dense(hidden_dim * 2, output_dim)
self.relu = nn.ReLU()

def construct(self, x, adj):
h_fwd = ops.BatchMatMul()(adj, x)
h_fwd = self.relu(self.gcn_forward(h_fwd))
h_bwd = ops.BatchMatMul()(adj.transpose(0, 2, 1), x)
h_bwd = self.relu(self.gcn_backward(h_bwd))
return self.out_proj(ops.Concat(axis=-1)((h_fwd, h_bwd)))

# TextCNN module
class TextCNN(nn.Cell):
def __init__(self, embed_dim, num_channels, kernel_sizes):
super(TextCNN, self).__init__()
self.conv_layers = nn.CellList([
nn.Conv1d(embed_dim, num_channels, k, pad_mode='same', weight_init=Normal(0.1))
for k in kernel_sizes])
self.relu = nn.ReLU()

def construct(self, x):
x_conv_input = x.transpose(0, 2, 1)
conv_outs = [self.relu(conv(x_conv_input)) for conv in self.conv_layers]
return ops.Concat(axis=1)(conv_outs)

# Full model: TextCNN + GRU + MHA + BiGCN
class TextCNN_GRU_MHA_BiGCN(nn.Cell):
def __init__(self, vocab_size, embed_dim, num_channels, kernel_sizes, rnn_hidden_dim,
num_heads, num_classes, dropout=0.5):
super(TextCNN_GRU_MHA_BiGCN, self).__init__()
self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0,
embedding_table=Normal(0.1))
self.textcnn = TextCNN(embed_dim, num_channels, kernel_sizes)
self.gru = nn.GRU(input_size=num_channels * len(kernel_sizes), hidden_size=rnn_hidden_dim,
num_layers=1, batch_first=True)
self.mha = MultiHeadSelfAttention(rnn_hidden_dim, num_heads)
self.bigcn = BiGCN(embed_dim, embed_dim, rnn_hidden_dim)
self.dropout = nn.Dropout(keep_prob=1 - dropout)
self.fc = nn.Dense(rnn_hidden_dim * 2, num_classes)

    def construct(self, x, adj):
        x_embed = self.embedding(x)
        # TextCNN returns (batch, channels, seq_len); transpose back for the batch-first GRU.
        cnn_out = self.textcnn(x_embed).transpose(0, 2, 1)
        gru_out, _ = self.gru(cnn_out)
        mha_out = self.mha(gru_out)
        gcn_out = self.bigcn(x_embed, adj)
        # Mean-pool both paths over the sequence dimension before fusion, matching train.py.
        mha_pooled = ops.ReduceMean(keep_dims=False)(mha_out, 1)
        gcn_pooled = ops.ReduceMean(keep_dims=False)(gcn_out, 1)
        features = ops.Concat(axis=-1)((mha_pooled, gcn_pooled))
        return self.fc(self.dropout(features))
6 changes: 6 additions & 0 deletions research/arxiv_papers/ragat/requirements.txt
@@ -0,0 +1,6 @@
mindspore==1.6.0       # MindSpore framework
jieba==0.42.1          # Chinese word segmentation
scikit-learn==0.24.2   # classification metrics and reports
numpy==1.21.0          # numerical computing
pandas==1.3.0          # data handling
tqdm==4.61.0           # progress bars
201 changes: 201 additions & 0 deletions research/arxiv_papers/ragat/train.py
@@ -0,0 +1,201 @@
import os
import numpy as np
import pandas as pd
import jieba
from tqdm import tqdm
from sklearn.metrics import classification_report

import mindspore
import mindspore.dataset as ds
import mindspore.nn as nn
import mindspore.ops as ops
from mindspore import context
from mindspore.common.initializer import Normal

# Set the MindSpore execution mode (graph mode on CPU)
context.set_context(mode=context.GRAPH_MODE, device_target="CPU")

# Model and training hyperparameters
max_seq_len = 128
embed_dim = 128
num_channels = 100
kernel_sizes = [3, 4, 5]
rnn_hidden_dim = 128
num_heads = 4
num_classes = 2
batch_size = 32
num_epochs = 3
learning_rate = 0.001
weight_decay = 1e-4
dropout_rate = 0.5

# Build the word co-occurrence adjacency matrix (self-loops plus edges within a small window)
def build_adj_matrix(tokens, max_seq_len, window_size=2):
adj = np.eye(max_seq_len, dtype=np.float32)
for i in range(len(tokens)):
for j in range(i + 1, min(i + window_size, len(tokens))):
adj[i, j] = 1.0
adj[j, i] = 1.0
return adj

# Build the vocabulary from the training texts
def build_vocab(texts):
vocab = {"<pad>": 0, "<unk>": 1}
for text in texts:
tokens = jieba.lcut(text)
for token in tokens:
if token not in vocab:
vocab[token] = len(vocab)
return vocab

# Preprocess one text: returns the padded token-id sequence and its adjacency matrix
def preprocess_with_adj(text, vocab, max_seq_len):
tokens = jieba.lcut(text)
seq = [vocab.get(token, vocab["<unk>"]) for token in tokens]
if len(seq) < max_seq_len:
seq = seq + [vocab["<pad>"]] * (max_seq_len - len(seq))
else:
seq = seq[:max_seq_len]
adj = build_adj_matrix(tokens[:max_seq_len], max_seq_len)
return seq, adj

# Multi-head self-attention module
class MultiHeadSelfAttention(nn.Cell):
def __init__(self, hidden_dim, num_heads):
super(MultiHeadSelfAttention, self).__init__()
assert hidden_dim % num_heads == 0
self.num_heads = num_heads
self.head_dim = hidden_dim // num_heads
self.q_linear = nn.Dense(hidden_dim, hidden_dim)
self.k_linear = nn.Dense(hidden_dim, hidden_dim)
self.v_linear = nn.Dense(hidden_dim, hidden_dim)
self.out_linear = nn.Dense(hidden_dim, hidden_dim)
self.softmax = nn.Softmax(axis=-1)
self.batch_matmul = ops.BatchMatMul()

    def construct(self, x):
        # x: (batch, seq_len, hidden_dim); project to Q/K/V and split into heads.
        batch_size, seq_len, hidden_dim = x.shape
        Q = self.q_linear(x).reshape(batch_size, seq_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3)
        K = self.k_linear(x).reshape(batch_size, seq_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3)
        V = self.v_linear(x).reshape(batch_size, seq_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3)
        scores = self.batch_matmul(Q, K.transpose(0, 1, 3, 2)) * (1.0 / self.head_dim ** 0.5)
        attn_weights = self.softmax(scores)
        # Renamed from `context` to avoid shadowing the imported mindspore.context module.
        attn_out = self.batch_matmul(attn_weights, V).transpose(0, 2, 1, 3).reshape(batch_size, seq_len, hidden_dim)
        return self.out_linear(attn_out)

# BiGCN module
class BiGCN(nn.Cell):
def __init__(self, input_dim, hidden_dim, output_dim):
super(BiGCN, self).__init__()
self.gcn_forward = nn.Dense(input_dim, hidden_dim)
self.gcn_backward = nn.Dense(input_dim, hidden_dim)
self.out_proj = nn.Dense(hidden_dim * 2, output_dim)
self.relu = nn.ReLU()

def construct(self, x, adj):
h_fwd = ops.BatchMatMul()(adj, x)
h_fwd = self.relu(self.gcn_forward(h_fwd))
h_bwd = ops.BatchMatMul()(adj.transpose(0, 2, 1), x)
h_bwd = self.relu(self.gcn_backward(h_bwd))
return self.out_proj(ops.Concat(axis=-1)((h_fwd, h_bwd)))

# Full model: TextCNN + GRU + MHA + BiGCN
class TextCNN_GRU_MHA_BiGCN(nn.Cell):
def __init__(self, vocab_size, embed_dim, num_channels, kernel_sizes, rnn_hidden_dim,
num_heads, num_classes, dropout=0.5):
super(TextCNN_GRU_MHA_BiGCN, self).__init__()
self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0,
embedding_table=Normal(0.1))
self.conv_layers = nn.CellList([
nn.Conv1d(embed_dim, num_channels, k, pad_mode='same', weight_init=Normal(0.1))
for k in kernel_sizes])
self.relu = nn.ReLU()
self.concat = ops.Concat(axis=1)
self.gru = nn.GRU(input_size=num_channels * len(kernel_sizes), hidden_size=rnn_hidden_dim,
num_layers=1, batch_first=True)
self.mha = MultiHeadSelfAttention(rnn_hidden_dim, num_heads)
self.bigcn = BiGCN(embed_dim, embed_dim, rnn_hidden_dim)
self.dropout = nn.Dropout(keep_prob=1 - dropout)
self.fc = nn.Dense(rnn_hidden_dim * 2, num_classes)

def construct(self, x, adj):
x_embed = self.embedding(x)
x_conv_input = x_embed.transpose(0, 2, 1)
conv_outs = [self.relu(conv(x_conv_input)) for conv in self.conv_layers]
x_conv_cat = self.concat(conv_outs).transpose(0, 2, 1)
gru_out, _ = self.gru(x_conv_cat)
mha_out = self.mha(gru_out)
mha_pooled = ops.ReduceMean(keep_dims=False)(mha_out, 1)
gcn_out = self.bigcn(x_embed, adj)
gcn_pooled = ops.ReduceMean(keep_dims=False)(gcn_out, 1)
features = ops.Concat(axis=-1)((mha_pooled, gcn_pooled))
return self.fc(self.dropout(features))

# Custom with-loss cell: wraps the backbone so the loss can be computed from (x, adj, label)
class WithLossCell_Custom(nn.Cell):
def __init__(self, backbone, loss_fn):
super(WithLossCell_Custom, self).__init__()
self.backbone = backbone
self.loss_fn = loss_fn

def construct(self, x, adj, label):
logits = self.backbone(x, adj)
loss = self.loss_fn(logits, label)
return loss

# Data loading (the CSVs live under data/ in this directory)
df_train = pd.read_csv("data/train.csv")
df_test = pd.read_csv("data/test.csv")
df_train.columns = df_train.columns.str.strip()
df_test.columns = df_test.columns.str.strip()
texts_train = df_train['text_a'].tolist()
labels_train = df_train['label'].tolist()
texts_test = df_test['text_a'].tolist()
labels_test = df_test['label'].tolist()

vocab = build_vocab(texts_train)
X_train, A_train, Y_train = [], [], []
X_test, A_test, Y_test = [], [], []
for text, label in zip(texts_train, labels_train):
seq, adj = preprocess_with_adj(text, vocab, max_seq_len)
X_train.append(seq)
A_train.append(adj)
Y_train.append(label)
for text, label in zip(texts_test, labels_test):
seq, adj = preprocess_with_adj(text, vocab, max_seq_len)
X_test.append(seq)
A_test.append(adj)
Y_test.append(label)
X_train = np.array(X_train, dtype=np.int32)
A_train = np.array(A_train, dtype=np.float32)
Y_train = np.array(Y_train, dtype=np.int32)
X_test = np.array(X_test, dtype=np.int32)
A_test = np.array(A_test, dtype=np.float32)
Y_test = np.array(Y_test, dtype=np.int32)

train_dataset = ds.NumpySlicesDataset({"data": X_train, "adj": A_train, "label": Y_train}, shuffle=True).batch(batch_size)
test_dataset = ds.NumpySlicesDataset({"data": X_test, "adj": A_test, "label": Y_test}, shuffle=False).batch(batch_size)
train_steps = train_dataset.get_dataset_size()  # batches per epoch (used by the progress bar and loss averaging)

# Build the model, loss, optimizer, and training network
model = TextCNN_GRU_MHA_BiGCN(len(vocab), embed_dim, num_channels, kernel_sizes,
rnn_hidden_dim, num_heads, num_classes, dropout=dropout_rate)
loss_fn = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
optimizer = nn.Adam(model.trainable_params(), learning_rate=learning_rate, weight_decay=weight_decay)
net_with_loss = WithLossCell_Custom(model, loss_fn)
train_network = nn.TrainOneStepCell(net_with_loss, optimizer)
train_network.set_train()

# Evaluation helper: the original script calls evaluate() without defining it,
# so this is a minimal accuracy-only implementation over the test set.
def evaluate(net, dataset):
    net.set_train(False)
    correct, total = 0, 0
    for batch in dataset.create_dict_iterator():
        logits = net(batch['data'], batch['adj'])
        preds = logits.asnumpy().argmax(axis=1)
        labels = batch['label'].asnumpy()
        correct += int((preds == labels).sum())
        total += labels.shape[0]
    net.set_train(True)
    return correct / total

# Training and evaluation loop
for epoch in range(num_epochs):
total_loss = 0
pbar = tqdm(train_dataset.create_dict_iterator(), total=train_steps, desc=f"Epoch {epoch+1}")
for batch in pbar:
data, adj, label = batch['data'], batch['adj'], batch['label']
loss = train_network(data, adj, label)
total_loss += loss.asnumpy()
pbar.set_postfix({"loss": f"{loss.asnumpy():.4f}"})
avg_loss = total_loss / train_steps
acc = evaluate(model, test_dataset)
print(f"Epoch {epoch+1}: Loss={avg_loss:.4f}, Test Acc={acc*100:.2f}%")

print("训练完成 ✅")