
Commit c3aa4e1

Merge branch 'feature/refactor-project'

2 parents a04e9d3 + a691026 · commit c3aa4e1

34 files changed: +862 −940 lines

.gitignore (1 addition, 0 deletions)

```diff
@@ -212,6 +212,7 @@ gold.txt
 generated.txt
 *.pkl
 *.png
+!assets/service_pipeline.png
 
 example_data.json
 example_data_2.json
```
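The `!assets/service_pipeline.png` negation re-includes the new pipeline diagram that the broader `*.png` pattern would otherwise ignore, so the image referenced by the README below can be committed.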

README.md (173 additions, 1 deletion)

The placeholder `# Final Project` is replaced with the full project README:

# 매일메일: A Daily Email Summary Assistant
A Chrome extension service built on an LLM agent that summarizes your email day by day.

## 📌 Project Overview

Help users quickly grasp the essentials of the mail that piles up all day, set priorities without missing information, and get their work done efficiently.

> Project progress and detailed experiment logs are available on [Notion](https://www.notion.so/gamchan/Upstage-234368a08ffd4965aad55b1a93b3cc3d?pvs=4).
## 🏅 Final Results

Demo video link

## 🏛️ System Structure

![service_pipeline](./assets/service_pipeline.png)
## 💯 Evaluation Metrics and Results

- [Results summary](https://www.notion.so/gamchan/195815b39d3980078aa1c8e645bf435c?pvs=4)
- [Experiment details](https://www.notion.so/gamchan/18c815b39d39805e916ad56f39fa2c6b?pvs=4)
- [Prompt versioning](https://www.notion.so/gamchan/c77dbeb277fd476bbc08d3ecab3ce3a2?v=398efc762f394868a3f241dd62ec48e0&pvs=4)
### Individual Mail Summarization

| Condition | ROUGE-1 Recall | ROUGE-1 Precision | ROUGE-1 F1 | BERTScore Recall | BERTScore Precision | BERTScore F1 | G-Eval Conciseness |
| ---------------------- | -------------- | ----------------- | ---------- | ---------------- | ------------------- | ------------ | ------------------ |
| Baseline | 0.0667 | 0.0042 | 0.1678 | 0.8223 | 0.8789 | 0.8494 | 4.3958 |
| + refine | 0.2618 | 0.2049 | 0.4649 | 0.8740 | 0.9146 | 0.8932 | 4.8750 |
| + one-shot | 0.2288 | 0.2005 | 0.3661 | 0.8325 | 0.8905 | 0.8588 | 4.9375 |
| **+ refine, one-shot** | **0.3062** | **0.2691** | **0.4690** | **0.8905** | **0.9319** | **0.0901** | **4.9167** |

Compared with the baseline, `ROUGE-1` improved by **24.0 ~ 30.1%p**, `BERTScore` by **5.3 ~ 6.8%p**, and the `G-Eval` conciseness score (out of 5) by **0.52 points**.
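For reference, a minimal sketch of how these per-mail metrics can be computed with the `rouge-score` and `bert-score` packages. The sample strings are hypothetical, and the `lang="ko"` model choice is our assumption rather than the project's recorded setting:

```python
# pip install rouge-score bert-score
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "회의는 금요일 오후 3시로 변경되었습니다."  # hypothetical gold summary
candidate = "금요일 오후 3시로 회의가 변경되었습니다."  # hypothetical model summary

# ROUGE-1: unigram overlap between the candidate and the reference
r1 = rouge_scorer.RougeScorer(["rouge1"]).score(reference, candidate)["rouge1"]
print(f"ROUGE-1 R={r1.recall:.4f} P={r1.precision:.4f} F1={r1.fmeasure:.4f}")

# BERTScore: semantic similarity from contextual embeddings
P, R, F1 = bert_score([candidate], [reference], lang="ko")
print(f"BERTScore R={R.item():.4f} P={P.item():.4f} F1={F1.item():.4f}")
```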
### Classification

| Condition | Accuracy | Tokens | Accuracy per Token |
| ------------------------ | ---------- | ---------- | ------------------ |
| Baseline | 0.8104 | 97,436 | 8.32e-6 |
| **summary based** | 0.7708 | **52,477** | **1.47e-5** |
| summary based + 1-shot | 0.8021 | 63,599 | 1.27e-5 |
| summary based + 5-shots | 0.7708 | 86,878 | 8.87e-6 |
| summary based + 10-shots | **0.8146** | 115,558 | 7.05e-6 |

We adopted the current prompt based on the `accuracy per token usage` metric (a worked example follows the links below).

- [Classification by purpose](prompt/template/classification/category.yaml)
- [Classification of whether follow-up action is needed](prompt/template/classification/action.yaml)
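The table's last column is simply accuracy divided by total token usage. A worked example against the first two rows (the helper function name is ours):

```python
def accuracy_per_token(accuracy: float, total_tokens: int) -> float:
    """Cost-adjusted quality: classification accuracy per token spent."""
    return accuracy / total_tokens

# Baseline row: highest raw accuracy, but at roughly twice the token cost
print(f"{accuracy_per_token(0.8104, 97_436):.2e}")  # 8.32e-06
# Summary-based row: slightly lower accuracy for far fewer tokens
print(f"{accuracy_per_token(0.7708, 52_477):.2e}")  # 1.47e-05
```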
### Full Mail Summary

| Condition | G-Eval Score |
| ---------------------------------------------------------- | ------------ |
| Baseline (Self-Refine) | 3.75 |
| Baseline (Reflexion) | 4.00 |
| Detailed Instructions (Self-Refine) | 3.50 |
| Detailed Instructions (Reflexion) | 3.50 |
| Detailed Instructions + Formatting Penalty (Self-Refine) | 3.94 |
| **Detailed Instructions + Formatting Penalty (Reflexion)** | **4.19** |

The best configuration raised the average `G-Eval` score (out of 5) by **0.44 points** over the Self-Refine baseline; the score is the mean over per-aspect evaluations (see the sketch after the links below).

- [G-Eval prompts by evaluation aspect](prompt/template/reflexion/g_eval/)
- [Full-summary system prompt](prompt/template/summary/final_summary_system.txt)
- [Full-summary user prompt](prompt/template/summary/final_summary_user.txt)
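As sketched below, the reported number is the mean of the four aspect scores (consistency, coherence, fluency, relevance) that `ReflexionEvaluator` produces in this commit; the aspect values here are hypothetical:

```python
# Hypothetical per-aspect G-Eval scores (1-5 scale) for one generated report
eval_result = {"consistency": 4.0, "coherence": 4.5, "fluency": 4.0, "relevance": 4.5}

# Mean over aspects, rounded to one decimal place as in ReflexionFramework
score = round(sum(eval_result.values()) / len(eval_result), 1)
print(score)  # 4.2

# The Reflexion loop stops early once the average reaches the configured threshold
threshold = 4.0  # hypothetical value for Config.config["reflexion"]["threshold"]
print(score >= threshold)  # True
```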
## ⚙️ Project Quick Setup

### 1. Git Clone

```shell
$ git clone git@github.com:boostcampaitech7/level4-nlp-finalproject-hackathon-nlp-06-lv3.git
$ cd level4-nlp-finalproject-hackathon-nlp-06-lv3
```

### 2. Create Virtual Environment

```shell
$ python -m venv .venv
$ source .venv/bin/activate
(.venv) $
```

### 3. Install Packages

```shell
(.venv) $ pip install -r requirements.txt
(.venv) $ sudo apt-get install build-essential
```
### 4. Set Up Environment Variables

4.1. Create a `.env` file and fill in the environment variables.

```shell
(.venv) $ cp .env.example .env
```

- Issue an Upstage API key [here](https://console.upstage.ai/api-keys?api=chat) and an OpenAI API key [here](https://platform.openai.com/welcome?step=create).
- For the Google Client ID and Google Client Secret, see [this guide](https://www.notion.so/gamchan/OAuth-179815b39d398017aeb8f6a8172e6e76?pvs=4).

```shell
# AI Service
UPSTAGE_API_KEY=your_upstage_api_key
OPENAI_API_KEY=your_openai_api_key

# Google OAuth 2.0 (with Gmail)
GOOGLE_CLIENT_ID=1234567890.apps.googleusercontent.com
GOOGLE_CLIENT_SECRET=1234567890
```

4.2. To run `main.py`, rename the downloaded `client_secret_...usercontent.com.json` file to `credentials.json`.
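As a sketch of how these variables might be consumed at runtime, assuming a `python-dotenv` loader (an assumption on our part; the commit itself reads keys through its `Config` class):

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # read key=value pairs from .env into the process environment

upstage_api_key = os.getenv("UPSTAGE_API_KEY")
google_client_id = os.getenv("GOOGLE_CLIENT_ID")

if not upstage_api_key:
    raise RuntimeError("UPSTAGE_API_KEY is not set; check your .env file")
```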
### 5. Execute the Pipeline

```shell
(.venv) $ python main.py
```

### (Optional) Execute with a DB Connection

```shell
(.venv) $ docker-compose -f server/docker-compose.yml up -d
(.venv) $ python batch_main.py
```
## 🔬 References

- Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, Peter Clark, "Self-Refine: Iterative Refinement with Self-Feedback", 25 May 2023. https://arxiv.org/abs/2303.17651
- Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao, "Reflexion: Language Agents with Verbal Reinforcement Learning", 10 Oct 2023. https://arxiv.org/abs/2303.11366
- Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica, "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena", 24 Dec 2023. https://arxiv.org/abs/2306.05685
- Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, Chenguang Zhu, "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment", 23 May 2023. https://arxiv.org/abs/2303.16634
- Yukyung Lee, Joonghoon Kim, Jaehee Kim, Hyowon Cho, Pilsung Kang, "CheckEval: Robust Evaluation Framework using Large Language Model via Checklist", 27 Mar 2024. https://arxiv.org/abs/2403.18771
## 👥 Collaborators

<div align="center">

| Member | Role |
| :-------------------------------------------------------------------------------------------------------: | :--------------------------------------------------------------------: |
| <a href="https://github.com/gsgh3016"><img src="https://github.com/gsgh3016.png" width="100"></a> | Streamlit app development, data observation and analysis, data reconstruction and augmentation |
| <a href="https://github.com/eyeol"><img src="https://github.com/eyeol.png" width="100"></a> | Streamlit app development, RAG implementation and performance evaluation |
| <a href="https://github.com/jagaldol"><img src="https://github.com/jagaldol.png" width="100"></a> | Initial collaboration environment setup and code modularization, CoT experiment design and performance evaluation |
| <a href="https://github.com/Usunwoo"><img src="https://github.com/Usunwoo.png" width="100"></a> | Baseline modularization, memory usage optimization, model search and experiments |
| <a href="https://github.com/canolayoo78"><img src="https://github.com/canolayoo78.png" width="100"></a> | Streamlit app development, data analysis and cleaning, RAG implementation and performance evaluation |
| <a href="https://github.com/chell9999"><img src="https://github.com/chell9999.png" width="100"></a> | Documentation, RAG-dedicated vector DB construction, benchmark-dataset-based data augmentation |

</div>
## 🛠️ Tools and Technologies

<div align="center">

![Python](https://img.shields.io/badge/-Python-3776AB?style=for-the-badge&logo=python&logoColor=white)
![jupyter](https://img.shields.io/badge/-jupyter-F37626?style=for-the-badge&logo=jupyter&logoColor=white)
![PyTorch](https://img.shields.io/badge/-PyTorch-EE4C2C?style=for-the-badge&logo=PyTorch&logoColor=white)
![huggingface](https://img.shields.io/badge/-huggingface-FFD21E?style=for-the-badge&logo=huggingface&logoColor=black)

![unsloth](https://img.shields.io/badge/-unsloth-14B789?style=for-the-badge&logo=unsloth&logoColor=white)
![BitsandBytes](https://img.shields.io/badge/BitsandBytes-36474F?style=for-the-badge&logo=BitsandBytes&logoColor=white)
![LoRA](https://img.shields.io/badge/LoRA-40B5A4?style=for-the-badge&logo=LoRA&logoColor=white)
![langchain](https://img.shields.io/badge/-langchain-1C3C3C?style=for-the-badge&logo=langchain&logoColor=white)

![RAG](https://img.shields.io/badge/RAG-1868F2?style=for-the-badge&logo=RAG&logoColor=white)
![pinecone](https://img.shields.io/badge/pinecone-000000?style=for-the-badge&logo=pinecone&logoColor=white)
![CoT](https://img.shields.io/badge/cot-535051?style=for-the-badge&logo=cot&logoColor=white)
![GitHub Actions](https://img.shields.io/badge/GITHUB%20ACTIONS-2088FF?style=for-the-badge&logo=github-actions&logoColor=white)

</div>

agents/reflexion/evaluator.py (59 additions, 15 deletions)

```diff
@@ -1,27 +1,71 @@
-from evaluation.gpt_eval import calculate_g_eval
+import re
+
+from openai import OpenAI
+
+from utils.configuration import Config
+from utils.decorators import retry_with_exponential_backoff
+from utils.token_usage_counter import TokenUsageCounter
 
 
 class ReflexionEvaluator:
-    def __init__(self, task: str):
-        self.task = task
+    def __init__(self):
+        self.model_name = "solar-pro"
+        self.client = OpenAI(api_key=Config.user_upstage_api_key, base_url="https://api.upstage.ai/v1/solar")
 
-    def get_geval_scores(self, source_text: str, output_text: str):
+        self.prompt_path: str = Config.config["report"]["g_eval"]["prompt_path"]
+        self.aspects = ["consistency", "coherence", "fluency", "relevance"]
+
+    @retry_with_exponential_backoff()
+    def get_geval_scores(self, source_text: str, output_text: str) -> dict:
         """Take the reference text and the generated text and score the generation.
 
         Args:
             source_text (str): the text the generation referenced
             output_text (str): the generated text
 
         Returns:
-            (dict, str)
-            index 0 holds a dict, index 1 a one-line str of per-aspect scores.
+            g_eval_result (dict): dictionary of per-aspect G-Eval scores
         """
-        eval_type = "summary" if self.task == "single" else "report"
-        model_name = "solar-pro"  # TODO: move into Config
-
-        return calculate_g_eval(
-            source_texts=[source_text],
-            generated_texts=[output_text],
-            eval_type=eval_type,
-            model_name=model_name,
-        )
+
+        total_token_usage = 0
+
+        aspect_scores = {}
+        for aspect in self.aspects:
+            cur_prompt = self._create_aspect_prompt(aspect, source_text, output_text)
+
+            # Call the OpenAI-compatible Upstage endpoint
+            response = self.client.chat.completions.create(
+                model=self.model_name,
+                messages=[{"role": "system", "content": cur_prompt}],
+                temperature=0.7,
+                max_tokens=50,
+                n=1,
+            )
+
+            try:
+                aspect_scores[aspect] = self._extract_score(response.choices[0].message.content.strip())
+            except Exception as e:
+                print(f"[Error] eval_type=report, aspect={aspect}, error={e}")
+                aspect_scores[aspect] = 0.0
+                continue
+
+            total_token_usage += response.usage.total_tokens
+
+        TokenUsageCounter.add_usage("reflexion", "evaluator", total_token_usage)
+
+        return aspect_scores
+
+    def _create_aspect_prompt(self, aspect: str, source_text: str, output_text: str) -> str:
+        with open(f"{self.prompt_path}{aspect}.txt", "r", encoding="utf-8") as f:
+            base_prompt = f.read()
+
+        # Substitute the {Document} and {Summary} placeholders
+        return base_prompt.format(Document=source_text, Summary=output_text)
+
+    def _extract_score(self, gpt_text: str):
+        # Extract the digits with a regex, e.g. "abc123def" -> numbers = ['1', '2', '3']
+        numbers = re.findall(r"\d", gpt_text)
+        score_value = float(numbers[-1])
+        if score_value > 5:
+            return 1.0
+        return score_value
```
agents/reflexion/reflexion.py (58 additions, 49 deletions)

```diff
@@ -5,12 +5,19 @@
 
 
 class ReflexionFramework:
-    def __init__(self, task: str):
-        self.task = task
-        self.evaluator = ReflexionEvaluator(task)
-        self.self_reflection = ReflexionSelfReflection(task)
+    def __init__(self):
+        self.summary_agent = SummaryAgent(
+            model_name="solar-pro",
+            summary_type="final",
+            temperature=Config.config["temperature"]["summary"],
+            seed=Config.config["seed"],
+        )
+        self.evaluator = ReflexionEvaluator()
+        self.self_reflection = ReflexionSelfReflection()
+        self.threshold = Config.config["reflexion"]["threshold"]
+        self.max_iteration = Config.config["reflexion"]["max_iteration"]
 
-    def process(self, origin_mail, summary_agent: SummaryAgent) -> str:
+    def process(self, origin_mail) -> str:
         """
         Runs Reflexion.
 
@@ -19,58 +26,60 @@ def process(self, origin_mail, summary_agent: SummaryAgent) -> str:
         Returns:
             the text with the highest average score across all aspects.
         """
-        threshold_type = Config.config["self_reflection"]["reflexion"]["threshold_type"]
-        threshold = Config.config["self_reflection"]["reflexion"]["threshold"]
-
-        scores = []
         outputs = []
-        output_text = summary_agent.process(origin_mail, 3, ["start"])
-        print("\n\nINITIATE REFLEXION\n")
-        print(f"{'=' * 25}\n" f"Initial output:\n{output_text}\n" f"{'=' * 25}\n")
-        for i in range(Config.config["self_reflection"]["max_iteration"]):
+        eval_results = []
+        final_output = ""
+        max_score = 0
+
+        for i in range(self.max_iteration):
+            # Regenerate the output
+            output_text = self.summary_agent.process_with_reflection(
+                origin_mail, self.self_reflection.reflection_memory
+            )
+            outputs.append(output_text)
+
             # Evaluate
-            eval_result_list = self.evaluator.get_geval_scores(origin_mail, output_text)
-            eval_result_str = ""
-            aspect_score = 0
-            for eval_result in eval_result_list:
-                aspect_len = len(eval_result)
-                for aspect, score in eval_result.items():
-                    eval_result_str += f"aspect: {aspect} score: {score}\n"
-                    aspect_score += score
+            eval_result: dict = self.evaluator.get_geval_scores(origin_mail, output_text)
+            eval_results.append(eval_result)
+            score = round(sum(eval_result.values()) / len(eval_result), 1)
 
-            # Reflect
-            self.self_reflection.generate_reflection(origin_mail, output_text, eval_result_str)
+            if max_score < score:
+                max_score = score
+                final_output = output_text
+            if round(sum(eval_result.values()) / len(eval_result), 1) >= self.threshold:
+                break
 
-            # Regenerate the output
-            previous_reflections = self.self_reflection.reflection_memory
-            output_text = summary_agent.process(origin_mail, 3, previous_reflections)
+            # If the score fell short of the threshold, reflect on why
+            self.self_reflection.generate_reflection(
+                origin_mail, output_text, self._create_eval_result_str(eval_result)
+            )
 
-            eval_average = round(aspect_score / aspect_len, 1)
-            scores.append(eval_average)
-            outputs.append(output_text)
-            previous_reflections_msg = "\n".join(previous_reflections)
+        self._print_result(eval_results, outputs)
+
+        return final_output
+
+    def _create_eval_result_str(self, eval_result: dict):
+        return "\n".join([f"aspect: {aspect} score: {score}" for aspect, score in eval_result.items()])
+
+    def _print_result(self, eval_results: list[dict], outputs: list[str]):
+        max_score = 0
+        final_output = ""
+        max_index = 0
+
+        for i, (eval_result, output) in enumerate(zip(eval_results, outputs)):
+            score = round(sum(eval_result.values()) / len(eval_result), 1)
             print(
                 f"{'=' * 25}\n"
-                f"iteration {i + 1}\n"
+                f"iteration {i+1}, average {score}\n"
+                f"{self._create_eval_result_str(eval_result)}\n"
+                f"Reflection memory:\n{self.self_reflection.get_reflection_memory_str()}\n"
                 f"{'-' * 25}\n"
-                f"{eval_result_str}, average {eval_average}\n"
+                f"Generated text:\n{output}\n"
                 f"{'-' * 25}\n"
-                f"Reflection memory:\n{previous_reflections_msg}\n\n"
-                f"{'-' * 25}\n"
-                f"Text regenerated after reflection:\n{output_text}"
             )
+            if max_score < score:
+                max_score = score
+                final_output = output
+                max_index = i
 
-        if (threshold_type == "all" and all(value > threshold for value in scores)) or (
-            threshold_type == "average" and eval_average >= threshold
-        ):
-            print(f"{'=' * 25}\n" "Evaluation score satisfied; exiting Reflexion loop\n" f"{'=' * 25}")
-            break
-
-        for i, score in enumerate(scores):
-            print(f"iteration {i+1}, average {score}")
-        print("=" * 25)
-        print(f"\nFinal output (iteration {scores.index(max(scores)) + 1}, average: {max(scores)})")
-        final = outputs[scores.index(max(scores))]
-        print(f"{final}")
-
-        return final
+        print(f"{'=' * 25}\nFinal output: iteration {max_index+1}, average {max_score}\n{final_output}\n{'=' * 25}\n")
```