[Open Source Internship] Technical Open Course Content Test Task 14 #2051

Open: wants to merge 1 commit into base: 0.4
94 changes: 94 additions & 0 deletions examples/deepseek1.3b_code_generation/README.md
@@ -0,0 +1,94 @@
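# DeepSeek Coder Code Generation Tutorial and Examples

This directory contains a tutorial and examples for code generation with the DeepSeek Coder model. DeepSeek Coder is a code generation model optimized for programming tasks; it produces code from natural-language descriptions.

## Contents

- `deepseek_coder_tutorial.ipynb`: Jupyter Notebook tutorial showing how to use the DeepSeek Coder model for a range of code generation tasks
- `deepseek_coder_code_generation.py`: command-line tool for generating code
- `deepseek_coder_finetuning.py`: script for fine-tuning the DeepSeek Coder model on a custom dataset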

## Basic Usage

### Install Dependencies

Make sure the latest version of MindNLP is installed:

```bash
pip install mindnlp
```

### Generate Code with the Command-Line Tool

```bash
python deepseek_coder_code_generation.py --prompt "Implement a quicksort algorithm" --max_length 500
```

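Parameters:
- `--prompt`: natural-language description of the code to generate
- `--max_length`: maximum length of the generated output
- `--temperature`: sampling temperature (default: 0.7)
- `--top_p`: nucleus-sampling probability threshold (default: 0.95)
- `--top_k`: top-k sampling cutoff (default: 50)
- `--model_name_or_path`: name or path of the model to use (default: "deepseek-ai/deepseek-coder-1.3b-base")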
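
To make the roles of `--top_k` and `--top_p` concrete, here is a minimal pure-Python sketch of the two filters applied to a toy next-token distribution. It is an illustration only: the function name `top_k_top_p_filter` and the toy tokens are assumptions, and the real filtering happens inside the model's `generate` call, which may differ in detail.

```python
def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Return renormalised {token: prob} after top-k, then top-p, filtering.

    Hypothetical helper for illustration; not part of MindNLP.
    """
    # Rank tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    # Top-k: keep at most the k most likely tokens.
    ranked = ranked[:top_k]
    # Top-p (nucleus): keep the shortest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalise so the surviving probabilities sum to 1.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Toy distribution over four candidate tokens (made-up numbers).
toy = {"def": 0.5, "class": 0.3, "import": 0.15, "print": 0.05}
filtered = top_k_top_p_filter(toy, top_k=3, top_p=0.7)
print(filtered)
```

With these toy numbers, top-k drops `print`, and top-p then keeps only `def` and `class`, because together they already cover 0.8 of the probability mass.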

### Fine-Tune the DeepSeek Coder Model

If you have a domain-specific code dataset, you can fine-tune the DeepSeek Coder model on it with the provided script:

```bash
python deepseek_coder_finetuning.py \
--train_file path/to/train.txt \
--validation_file path/to/validation.txt \
--output_dir ./deepseek-coder-finetuned \
--num_train_epochs 3 \
--per_device_train_batch_size 4
```

For large models, parameter-efficient fine-tuning with LoRA is recommended:

```bash
python deepseek_coder_finetuning.py \
--train_file path/to/train.txt \
--output_dir ./deepseek-coder-finetuned \
--use_lora \
--lora_rank 8 \
--lora_alpha 16
```

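## Advanced Tutorial

See `deepseek_coder_tutorial.ipynb` for a more detailed tutorial covering:

1. Basic code generation
2. Advanced code generation examples
3. Tuning generation parameters
4. Extracting the generated code
5. Practical use cases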
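
For the code-extraction step, a minimal sketch might look like the following. It reuses the fence pattern from `code_assistant_bot.py` in this directory; whether the notebook extracts code the same way is an assumption, and the sample reply is made up.

```python
import re

FENCE = "`" * 3  # a Markdown code fence, built here to avoid nesting literal fences

# An optional language tag after the opening fence, then the body up to the closing fence.
FENCE_PATTERN = re.compile(FENCE + r"(?:\w+)?\n(.*?)" + FENCE, re.DOTALL)


def extract_code_blocks(text):
    """Return the bodies of all fenced code blocks found in text."""
    return [block.rstrip() for block in FENCE_PATTERN.findall(text)]


# Made-up model reply containing one fenced block.
sample = (
    "Here is a helper:\n"
    + FENCE + "python\n"
    + "def add(a, b):\n    return a + b\n"
    + FENCE + "\nDone."
)
blocks = extract_code_blocks(sample)
print(blocks)
```

The non-greedy `(.*?)` together with `re.DOTALL` lets one block span several lines without swallowing the text between two separate blocks.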

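## Data Format

For fine-tuning, the training data should be a plain-text file in which code samples are separated by the line `# ---NEW SAMPLE---`. For example: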

```
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
# ---NEW SAMPLE---
def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quick_sort(left) + middle + quick_sort(right)
```
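
A file in this format can be split back into individual samples along the delimiter line. The sketch below shows one way to do that; the function `split_samples` is hypothetical, and the actual parsing in `deepseek_coder_finetuning.py` is not shown here and may differ.

```python
SAMPLE_DELIMITER = "# ---NEW SAMPLE---"


def split_samples(raw_text, delimiter=SAMPLE_DELIMITER):
    """Split a training file's text into individual, non-empty code samples."""
    samples = []
    current = []
    for line in raw_text.splitlines():
        if line.strip() == delimiter:
            # A delimiter line closes the sample collected so far.
            samples.append("\n".join(current).strip("\n"))
            current = []
        else:
            current.append(line)
    # The last sample has no trailing delimiter.
    samples.append("\n".join(current).strip("\n"))
    return [s for s in samples if s.strip()]


raw = "def a():\n    return 1\n# ---NEW SAMPLE---\ndef b():\n    return 2\n"
samples = split_samples(raw)
print(len(samples))
```

Matching the delimiter as a whole line, rather than as a substring, keeps code that merely mentions the marker inside a string from being split.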

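## Notes

- DeepSeek Coder can generate code in many programming languages, but it performs best on widely used ones such as Python, JavaScript, Java, and C++
- More detailed and specific prompts usually produce better code
- For complex tasks, try increasing the `max_length` parameter
- Lowering `temperature` gives more deterministic results; raising it gives more varied output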
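
The effect of `temperature` can be seen on a toy example. The sketch below applies a temperature-scaled softmax to made-up logits; it illustrates the general technique and is not the model's actual sampling code.

```python
import math


def softmax_with_temperature(logits, temperature=0.7):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature nearly all probability goes to the highest-scoring token, so sampling becomes close to deterministic; at a high temperature the distribution flattens and less likely tokens are sampled more often.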
262 changes: 262 additions & 0 deletions examples/deepseek1.3b_code_generation/code_assistant_bot.py
@@ -0,0 +1,262 @@
#!/usr/bin/env python
# coding=utf-8
"""
Code assistant bot based on the DeepSeek Coder model
"""

import os
import argparse
import re
import time
from rich.console import Console
from rich.markdown import Markdown
from rich.panel import Panel
from rich.syntax import Syntax
from prompt_toolkit import PromptSession
from prompt_toolkit.history import FileHistory
from prompt_toolkit.auto_suggest import AutoSuggestFromHistory
from mindnlp.transformers import AutoModelForCausalLM, AutoTokenizer

console = Console()

class CodeAssistant:
    """Code assistant that uses the DeepSeek Coder model to generate and explain code."""

    def __init__(self, model_name="deepseek-ai/deepseek-coder-1.3b-base"):
        """Initialize the code assistant."""
        self.model_name = model_name

        # Load the model and tokenizer
        console.print(f"Loading model [bold]{model_name}[/bold]...", style="yellow")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        console.print("Model loaded!", style="green")

        # Conversation history
        self.conversation_history = []

        # Command table: maps a slash command to its handler
        self.commands = {
            "/help": self.show_help,
            "/clear": self.clear_history,
            "/save": self.save_conversation,
            "/exit": lambda: "exit",
            "/examples": self.show_examples
        }

    def start(self):
        """Start the interactive code assistant."""
        console.print(Panel.fit(
            "[bold]DeepSeek Coder Code Assistant[/bold]\n\n"
            "A code generation and explanation tool based on the DeepSeek Coder model\n"
            "Type [bold blue]/help[/bold blue] for help\n"
            "Type [bold blue]/exit[/bold blue] to quit",
            title="Welcome",
            border_style="green"
        ))

        # Set up the prompt history file
        history_file = os.path.expanduser("~/.code_assistant_history")
        session = PromptSession(history=FileHistory(history_file),
                                auto_suggest=AutoSuggestFromHistory())

        while True:
            try:
                user_input = session.prompt("\n[User] > ")

                # Handle commands
                if user_input.strip().startswith("/"):
                    command = user_input.strip().split()[0]
                    if command in self.commands:
                        result = self.commands[command]()
                        if result == "exit":
                            break
                        continue

                if not user_input.strip():
                    continue

                # Add the user input to the history
                self.conversation_history.append(f"[User] {user_input}")

                # Get the reply
                start_time = time.time()
                console.print("[AI thinking...]", style="yellow")

                response = self.generate_response(user_input)

                # Extract code blocks
                code_blocks = self.extract_code_blocks(response)

                # Format the output
                console.print("\n[AI Assistant]", style="bold green")

                # Render code blocks with syntax highlighting
                if code_blocks:
                    parts = re.split(r'```(?:\w+)?\n|```', response)
                    # Splitting on fences alternates text (even index) and code
                    # (odd index); enumerate keeps that parity even when a text
                    # part is empty, e.g. when the reply starts with a fence.
                    for i, part in enumerate(parts):
                        if not part.strip():
                            continue
                        if i % 2 == 0:  # text part
                            console.print(Markdown(part.strip()))
                        else:  # code part
                            code = part.rstrip()
                            lang = self.detect_language(code)
                            console.print(Syntax(code, lang, theme="monokai",
                                                 line_numbers=True, word_wrap=True))
                else:
                    # No code blocks: render the whole reply as Markdown
                    console.print(Markdown(response))

                elapsed_time = time.time() - start_time
                console.print(f"[Generation time: {elapsed_time:.2f}s]", style="dim")

                # Add the reply to the history
                self.conversation_history.append(f"[AI] {response}")

            except (KeyboardInterrupt, EOFError):
                # Ctrl-C or Ctrl-D ends the session
                console.print("\nInterrupted...", style="bold red")
                break
            except Exception as e:
                console.print(f"\nError: {str(e)}", style="bold red")

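    def generate_response(self, prompt, max_length=1000, temperature=0.7):
        """Generate a reply."""
        # Prompts that ask for code are wrapped as a comment inside an open
        # Python fence. The Chinese keywords mean "code", "function",
        # "implement", and "write"; the English ones are added for English prompts.
        keywords = ("代码", "函数", "实现", "编写", "code", "function", "implement", "write")
        if any(keyword in prompt.lower() for keyword in keywords):
            # Skip prompts that already contain a code fence
            if "```" not in prompt:
                prompt = f"```python\n# {prompt}\n"

        inputs = self.tokenizer(prompt, return_tensors="ms")

        # Generate the reply
        generated_ids = self.model.generate(
            inputs.input_ids,
            max_length=max_length,
            do_sample=True,
            temperature=temperature,
            top_p=0.95,
            top_k=50,
        )

        response = self.tokenizer.decode(generated_ids[0], skip_special_tokens=True)

        # Strip the echoed prompt from the reply, if present
        if prompt in response:
            response = response.replace(prompt, "", 1).strip()

        return response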

    def extract_code_blocks(self, text):
        """Extract fenced code blocks from text."""
        pattern = r'```(?:\w+)?\n(.*?)```'
        matches = re.findall(pattern, text, re.DOTALL)
        return matches

    def detect_language(self, code):
        """Guess the code language with simple heuristics."""
        if "def " in code and ":" in code:
            return "python"
        elif "{" in code and "}" in code and ";" in code:
            if "public class" in code or "private" in code:
                return "java"
            elif "function" in code or "var" in code or "let" in code or "const" in code:
                return "javascript"
            else:
                return "cpp"
        elif "<" in code and ">" in code and ("</" in code or "/>" in code):
            return "html"
        else:
            return "text"

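    def show_help(self):
        """Show help information."""
        help_text = """
# Available commands:

- `/help` - Show this help message
- `/clear` - Clear the current conversation history
- `/save` - Save the current conversation to a file
- `/examples` - Show example prompts
- `/exit` - Quit the program

# Tips:

1. Describe your requirements in detail to get better generated code
2. If the generated code is not satisfactory, ask for changes or optimizations
3. You can ask for an explanation of existing code or help debugging a problem
4. For complex features, request the implementation step by step
"""
        console.print(Markdown(help_text))

    def clear_history(self):
        """Clear the conversation history."""
        self.conversation_history = []
        console.print("Conversation history cleared", style="green")

    def save_conversation(self):
        """Save the conversation to a file."""
        if not self.conversation_history:
            console.print("No conversation to save", style="yellow")
            return

        filename = f"code_assistant_conversation_{int(time.time())}.md"
        with open(filename, "w", encoding="utf-8") as f:
            f.write("# DeepSeek Coder Code Assistant Conversation Log\n\n")
            for entry in self.conversation_history:
                if entry.startswith("[AI] "):
                    # Drop the 5-character "[AI] " tag from assistant replies
                    f.write(f"{entry[5:]}\n\n")
                else:
                    # User entries become section headings
                    f.write(f"## {entry}\n\n")

        console.print(f"Conversation saved to {filename}", style="green")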

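    def show_examples(self):
        """Show example prompts."""
        examples = """
# Example prompts:

1. "Implement a Python function that counts the working days between two dates"

2. "Write a simple Flask API with user registration and login"

3. "Create a JavaScript implementation of binary search"

4. "Use pandas to analyze CSV data and generate a statistical report"

5. "Implement a simple React component that displays a to-do list"

6. "Explain what the following code does:
```python
def mystery(arr):
    return [x for x in arr if x == x[::-1]]
```"

7. "Optimize the sorting algorithm below:
```python
def sort(arr):
    for i in range(len(arr)):
        for j in range(len(arr)):
            if arr[i] < arr[j]:
                arr[i], arr[j] = arr[j], arr[i]
    return arr
```"
"""
        console.print(Markdown(examples))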


def main():
    """Entry point."""
    parser = argparse.ArgumentParser(description="DeepSeek Coder Code Assistant")
    parser.add_argument("--model", type=str, default="deepseek-ai/deepseek-coder-1.3b-base",
                        help="Name or path of the model to use")
    args = parser.parse_args()

    # Create and start the code assistant
    assistant = CodeAssistant(model_name=args.model)
    assistant.start()


if __name__ == "__main__":
    main()