
Commit c3aa4e1

Merge branch 'feature/refactor-project'

2 parents a04e9d3 + a691026 · commit c3aa4e1

34 files changed: +862 −940 lines

.gitignore (1 addition, 0 deletions)

```diff
@@ -212,6 +212,7 @@ gold.txt
 generated.txt
 *.pkl
 *.png
+!assets/service_pipeline.png
 
 example_data.json
 example_data_2.json
```
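The `!assets/service_pipeline.png` negation re-includes the new pipeline diagram that the broader `*.png` pattern would otherwise ignore, so the image referenced by the README below can be committed.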

README.md (173 additions, 1 deletion)

The placeholder `# Final Project` is replaced with the full project README:

# 매일메일: A Daily Email Summary Assistant
A Chrome extension service built on an LLM agent that summarizes your email day by day.

## 📌 Project Overview

Help users quickly grasp the essentials of the mail that piles up all day, set priorities without missing information, and get their work done efficiently.

> Project progress and detailed experiment logs are available on [Notion](https://www.notion.so/gamchan/Upstage-234368a08ffd4965aad55b1a93b3cc3d?pvs=4).
## 🏅 Final Results

Demo video link

## 🏛️ System Structure

![service_pipeline](./assets/service_pipeline.png)
## 💯 Evaluation Metrics and Results

- [Results summary](https://www.notion.so/gamchan/195815b39d3980078aa1c8e645bf435c?pvs=4)
- [Experiment details](https://www.notion.so/gamchan/18c815b39d39805e916ad56f39fa2c6b?pvs=4)
- [Prompt versioning](https://www.notion.so/gamchan/c77dbeb277fd476bbc08d3ecab3ce3a2?v=398efc762f394868a3f241dd62ec48e0&pvs=4)
### Individual Mail Summarization

| Condition | ROUGE-1 Recall | ROUGE-1 Precision | ROUGE-1 F1 | BERTScore Recall | BERTScore Precision | BERTScore F1 | G-Eval Conciseness |
| ---------------------- | -------------- | ----------------- | ---------- | ---------------- | ------------------- | ------------ | ------------------ |
| Baseline | 0.0667 | 0.0042 | 0.1678 | 0.8223 | 0.8789 | 0.8494 | 4.3958 |
| + refine | 0.2618 | 0.2049 | 0.4649 | 0.8740 | 0.9146 | 0.8932 | 4.8750 |
| + one-shot | 0.2288 | 0.2005 | 0.3661 | 0.8325 | 0.8905 | 0.8588 | 4.9375 |
| **+ refine, one-shot** | **0.3062** | **0.2691** | **0.4690** | **0.8905** | **0.9319** | **0.0901** | **4.9167** |

Compared with the baseline, `ROUGE-1` improved by **24.0 ~ 30.1%p**, `BERTScore` by **5.3 ~ 6.8%p**, and the `G-Eval` conciseness score (out of 5) by **0.52 points**.
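For reference, a minimal sketch of how these per-mail metrics can be computed with the `rouge-score` and `bert-score` packages. The sample strings are hypothetical, and the `lang="ko"` model choice is our assumption rather than the project's recorded setting:

```python
# pip install rouge-score bert-score
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "회의는 금요일 오후 3시로 변경되었습니다."  # hypothetical gold summary
candidate = "금요일 오후 3시로 회의가 변경되었습니다."  # hypothetical model summary

# ROUGE-1: unigram overlap between the candidate and the reference
r1 = rouge_scorer.RougeScorer(["rouge1"]).score(reference, candidate)["rouge1"]
print(f"ROUGE-1 R={r1.recall:.4f} P={r1.precision:.4f} F1={r1.fmeasure:.4f}")

# BERTScore: semantic similarity from contextual embeddings
P, R, F1 = bert_score([candidate], [reference], lang="ko")
print(f"BERTScore R={R.item():.4f} P={P.item():.4f} F1={F1.item():.4f}")
```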
### Classification

| Condition | Accuracy | Tokens | Accuracy per Token |
| ------------------------ | ---------- | ---------- | ------------------ |
| Baseline | 0.8104 | 97,436 | 8.32e-6 |
| **summary based** | 0.7708 | **52,477** | **1.47e-5** |
| summary based + 1-shot | 0.8021 | 63,599 | 1.27e-5 |
| summary based + 5-shots | 0.7708 | 86,878 | 8.87e-6 |
| summary based + 10-shots | **0.8146** | 115,558 | 7.05e-6 |

We adopted the current prompt based on the `accuracy per token usage` metric (a worked example follows the links below).

- [Classification by purpose](prompt/template/classification/category.yaml)
- [Classification of whether follow-up action is needed](prompt/template/classification/action.yaml)
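The table's last column is simply accuracy divided by total token usage. A worked example against the first two rows (the helper function name is ours):

```python
def accuracy_per_token(accuracy: float, total_tokens: int) -> float:
    """Cost-adjusted quality: classification accuracy per token spent."""
    return accuracy / total_tokens

# Baseline row: highest raw accuracy, but at roughly twice the token cost
print(f"{accuracy_per_token(0.8104, 97_436):.2e}")  # 8.32e-06
# Summary-based row: slightly lower accuracy for far fewer tokens
print(f"{accuracy_per_token(0.7708, 52_477):.2e}")  # 1.47e-05
```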
### Full Mail Summary

| Condition | G-Eval Score |
| ---------------------------------------------------------- | ------------ |
| Baseline (Self-Refine) | 3.75 |
| Baseline (Reflexion) | 4.00 |
| Detailed Instructions (Self-Refine) | 3.50 |
| Detailed Instructions (Reflexion) | 3.50 |
| Detailed Instructions + Formatting Penalty (Self-Refine) | 3.94 |
| **Detailed Instructions + Formatting Penalty (Reflexion)** | **4.19** |

The best configuration raised the average `G-Eval` score (out of 5) by **0.44 points** over the Self-Refine baseline; the score is the mean over per-aspect evaluations (see the sketch after the links below).

- [G-Eval prompts by evaluation aspect](prompt/template/reflexion/g_eval/)
- [Full-summary system prompt](prompt/template/summary/final_summary_system.txt)
- [Full-summary user prompt](prompt/template/summary/final_summary_user.txt)
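As sketched below, the reported number is the mean of the four aspect scores (consistency, coherence, fluency, relevance) that `ReflexionEvaluator` produces in this commit; the aspect values here are hypothetical:

```python
# Hypothetical per-aspect G-Eval scores (1-5 scale) for one generated report
eval_result = {"consistency": 4.0, "coherence": 4.5, "fluency": 4.0, "relevance": 4.5}

# Mean over aspects, rounded to one decimal place as in ReflexionFramework
score = round(sum(eval_result.values()) / len(eval_result), 1)
print(score)  # 4.2

# The Reflexion loop stops early once the average reaches the configured threshold
threshold = 4.0  # hypothetical value for Config.config["reflexion"]["threshold"]
print(score >= threshold)  # True
```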
## ⚙️ Project Quick Setup

### 1. Git Clone

```shell
$ git clone git@github.com:boostcampaitech7/level4-nlp-finalproject-hackathon-nlp-06-lv3.git
$ cd level4-nlp-finalproject-hackathon-nlp-06-lv3
```

### 2. Create Virtual Environment

```shell
$ python -m venv .venv
$ source .venv/bin/activate
(.venv) $
```

### 3. Install Packages

```shell
(.venv) $ pip install -r requirements.txt
(.venv) $ sudo apt-get install build-essential
```
### 4. Set Up Environment Variables

4.1. Create a `.env` file and fill in the environment variables.

```shell
(.venv) $ cp .env.example .env
```

- Issue an Upstage API key [here](https://console.upstage.ai/api-keys?api=chat) and an OpenAI API key [here](https://platform.openai.com/welcome?step=create).
- For the Google Client ID and Google Client Secret, see [this guide](https://www.notion.so/gamchan/OAuth-179815b39d398017aeb8f6a8172e6e76?pvs=4).

```shell
# AI Service
UPSTAGE_API_KEY=your_upstage_api_key
OPENAI_API_KEY=your_openai_api_key

# Google OAuth 2.0 (with Gmail)
GOOGLE_CLIENT_ID=1234567890.apps.googleusercontent.com
GOOGLE_CLIENT_SECRET=1234567890
```

4.2. To run `main.py`, rename the downloaded `client_secret_...usercontent.com.json` file to `credentials.json`.
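As a sketch of how these variables might be consumed at runtime, assuming a `python-dotenv` loader (an assumption on our part; the commit itself reads keys through its `Config` class):

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # read key=value pairs from .env into the process environment

upstage_api_key = os.getenv("UPSTAGE_API_KEY")
google_client_id = os.getenv("GOOGLE_CLIENT_ID")

if not upstage_api_key:
    raise RuntimeError("UPSTAGE_API_KEY is not set; check your .env file")
```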
### 5. Execute the Pipeline

```shell
(.venv) $ python main.py
```

### (Optional) Execute with a DB Connection

```shell
(.venv) $ docker-compose -f server/docker-compose.yml up -d
(.venv) $ python batch_main.py
```
## 🔬 References

- Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, Peter Clark, "Self-Refine: Iterative Refinement with Self-Feedback", 25 May 2023. https://arxiv.org/abs/2303.17651
- Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao, "Reflexion: Language Agents with Verbal Reinforcement Learning", 10 Oct 2023. https://arxiv.org/abs/2303.11366
- Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica, "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena", 24 Dec 2023. https://arxiv.org/abs/2306.05685
- Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, Chenguang Zhu, "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment", 23 May 2023. https://arxiv.org/abs/2303.16634
- Yukyung Lee, Joonghoon Kim, Jaehee Kim, Hyowon Cho, Pilsung Kang, "CheckEval: Robust Evaluation Framework using Large Language Model via Checklist", 27 Mar 2024. https://arxiv.org/abs/2403.18771
## 👥 Collaborators

<div align="center">

| Member | Role |
| :-------------------------------------------------------------------------------------------------------: | :--------------------------------------------------------------------: |
| <a href="https://github.com/gsgh3016"><img src="https://github.com/gsgh3016.png" width="100"></a> | Streamlit app development, data observation and analysis, data reconstruction and augmentation |
| <a href="https://github.com/eyeol"><img src="https://github.com/eyeol.png" width="100"></a> | Streamlit app development, RAG implementation and performance evaluation |
| <a href="https://github.com/jagaldol"><img src="https://github.com/jagaldol.png" width="100"></a> | Initial collaboration environment setup and code modularization, CoT experiment design and performance evaluation |
| <a href="https://github.com/Usunwoo"><img src="https://github.com/Usunwoo.png" width="100"></a> | Baseline modularization, memory usage optimization, model search and experiments |
| <a href="https://github.com/canolayoo78"><img src="https://github.com/canolayoo78.png" width="100"></a> | Streamlit app development, data analysis and cleaning, RAG implementation and performance evaluation |
| <a href="https://github.com/chell9999"><img src="https://github.com/chell9999.png" width="100"></a> | Documentation, RAG-dedicated vector DB construction, benchmark-dataset-based data augmentation |

</div>
## 🛠️ Tools and Technologies

<div align="center">

![Python](https://img.shields.io/badge/-Python-3776AB?style=for-the-badge&logo=python&logoColor=white)
![jupyter](https://img.shields.io/badge/-jupyter-F37626?style=for-the-badge&logo=jupyter&logoColor=white)
![PyTorch](https://img.shields.io/badge/-PyTorch-EE4C2C?style=for-the-badge&logo=PyTorch&logoColor=white)
![huggingface](https://img.shields.io/badge/-huggingface-FFD21E?style=for-the-badge&logo=huggingface&logoColor=black)

![unsloth](https://img.shields.io/badge/-unsloth-14B789?style=for-the-badge&logo=unsloth&logoColor=white)
![BitsandBytes](https://img.shields.io/badge/BitsandBytes-36474F?style=for-the-badge&logo=BitsandBytes&logoColor=white)
![LoRA](https://img.shields.io/badge/LoRA-40B5A4?style=for-the-badge&logo=LoRA&logoColor=white)
![langchain](https://img.shields.io/badge/-langchain-1C3C3C?style=for-the-badge&logo=langchain&logoColor=white)

![RAG](https://img.shields.io/badge/RAG-1868F2?style=for-the-badge&logo=RAG&logoColor=white)
![pinecone](https://img.shields.io/badge/pinecone-000000?style=for-the-badge&logo=pinecone&logoColor=white)
![CoT](https://img.shields.io/badge/cot-535051?style=for-the-badge&logo=cot&logoColor=white)
![GitHub Actions](https://img.shields.io/badge/GITHUB%20ACTIONS-2088FF?style=for-the-badge&logo=github-actions&logoColor=white)

</div>

agents/reflexion/evaluator.py (59 additions, 15 deletions)

```diff
@@ -1,27 +1,71 @@
-from evaluation.gpt_eval import calculate_g_eval
+import re
+
+from openai import OpenAI
+
+from utils.configuration import Config
+from utils.decorators import retry_with_exponential_backoff
+from utils.token_usage_counter import TokenUsageCounter
 
 
 class ReflexionEvaluator:
-    def __init__(self, task: str):
-        self.task = task
+    def __init__(self):
+        self.model_name = "solar-pro"
+        self.client = OpenAI(api_key=Config.user_upstage_api_key, base_url="https://api.upstage.ai/v1/solar")
 
-    def get_geval_scores(self, source_text: str, output_text: str):
+        self.prompt_path: str = Config.config["report"]["g_eval"]["prompt_path"]
+        self.aspects = ["consistency", "coherence", "fluency", "relevance"]
+
+    @retry_with_exponential_backoff()
+    def get_geval_scores(self, source_text: str, output_text: str) -> dict:
         """Take the reference text and the generated text and score the generation.
 
         Args:
             source_text (str): the text the generation referenced
             output_text (str): the generated text
 
         Returns:
-            (dict, str)
-            index 0 holds a dict, index 1 a one-line str of per-aspect scores.
+            g_eval_result (dict): dictionary of per-aspect G-Eval scores
         """
-        eval_type = "summary" if self.task == "single" else "report"
-        model_name = "solar-pro"  # TODO: move into Config
-
-        return calculate_g_eval(
-            source_texts=[source_text],
-            generated_texts=[output_text],
-            eval_type=eval_type,
-            model_name=model_name,
-        )
+
+        total_token_usage = 0
+
+        aspect_scores = {}
+        for aspect in self.aspects:
+            cur_prompt = self._create_aspect_prompt(aspect, source_text, output_text)
+
+            # Call the OpenAI-compatible Upstage endpoint
+            response = self.client.chat.completions.create(
+                model=self.model_name,
+                messages=[{"role": "system", "content": cur_prompt}],
+                temperature=0.7,
+                max_tokens=50,
+                n=1,
+            )
+
+            try:
+                aspect_scores[aspect] = self._extract_score(response.choices[0].message.content.strip())
+            except Exception as e:
+                print(f"[Error] eval_type=report, aspect={aspect}, error={e}")
+                aspect_scores[aspect] = 0.0
+                continue
+
+            total_token_usage += response.usage.total_tokens
+
+        TokenUsageCounter.add_usage("reflexion", "evaluator", total_token_usage)
+
+        return aspect_scores
+
+    def _create_aspect_prompt(self, aspect: str, source_text: str, output_text: str) -> str:
+        with open(f"{self.prompt_path}{aspect}.txt", "r", encoding="utf-8") as f:
+            base_prompt = f.read()
+
+        # Substitute the {Document} and {Summary} placeholders
+        return base_prompt.format(Document=source_text, Summary=output_text)
+
+    def _extract_score(self, gpt_text: str):
+        # Extract the digits with a regex, e.g. "abc123def" -> numbers = ['1', '2', '3']
+        numbers = re.findall(r"\d", gpt_text)
+        score_value = float(numbers[-1])
+        if score_value > 5:
+            return 1.0
+        return score_value
```
agents/reflexion/reflexion.py (58 additions, 49 deletions)

```diff
@@ -5,12 +5,19 @@
 
 
 class ReflexionFramework:
-    def __init__(self, task: str):
-        self.task = task
-        self.evaluator = ReflexionEvaluator(task)
-        self.self_reflection = ReflexionSelfReflection(task)
+    def __init__(self):
+        self.summary_agent = SummaryAgent(
+            model_name="solar-pro",
+            summary_type="final",
+            temperature=Config.config["temperature"]["summary"],
+            seed=Config.config["seed"],
+        )
+        self.evaluator = ReflexionEvaluator()
+        self.self_reflection = ReflexionSelfReflection()
+        self.threshold = Config.config["reflexion"]["threshold"]
+        self.max_iteration = Config.config["reflexion"]["max_iteration"]
 
-    def process(self, origin_mail, summary_agent: SummaryAgent) -> str:
+    def process(self, origin_mail) -> str:
         """
         Runs Reflexion.
 
@@ -19,58 +26,60 @@ def process(self, origin_mail, summary_agent: SummaryAgent) -> str:
         Returns:
             the text with the highest average score across all aspects.
         """
-        threshold_type = Config.config["self_reflection"]["reflexion"]["threshold_type"]
-        threshold = Config.config["self_reflection"]["reflexion"]["threshold"]
-
-        scores = []
         outputs = []
-        output_text = summary_agent.process(origin_mail, 3, ["start"])
-        print("\n\nINITIATE REFLEXION\n")
-        print(f"{'=' * 25}\n" f"Initial output:\n{output_text}\n" f"{'=' * 25}\n")
-        for i in range(Config.config["self_reflection"]["max_iteration"]):
+        eval_results = []
+        final_output = ""
+        max_score = 0
+
+        for i in range(self.max_iteration):
+            # Regenerate the output
+            output_text = self.summary_agent.process_with_reflection(
+                origin_mail, self.self_reflection.reflection_memory
+            )
+            outputs.append(output_text)
+
             # Evaluate
-            eval_result_list = self.evaluator.get_geval_scores(origin_mail, output_text)
-            eval_result_str = ""
-            aspect_score = 0
-            for eval_result in eval_result_list:
-                aspect_len = len(eval_result)
-                for aspect, score in eval_result.items():
-                    eval_result_str += f"aspect: {aspect} score: {score}\n"
-                    aspect_score += score
+            eval_result: dict = self.evaluator.get_geval_scores(origin_mail, output_text)
+            eval_results.append(eval_result)
+            score = round(sum(eval_result.values()) / len(eval_result), 1)
 
-            # Reflect
-            self.self_reflection.generate_reflection(origin_mail, output_text, eval_result_str)
+            if max_score < score:
+                max_score = score
+                final_output = output_text
+            if round(sum(eval_result.values()) / len(eval_result), 1) >= self.threshold:
+                break
 
-            # Regenerate the output
-            previous_reflections = self.self_reflection.reflection_memory
-            output_text = summary_agent.process(origin_mail, 3, previous_reflections)
+            # If the score fell short of the threshold, reflect on why
+            self.self_reflection.generate_reflection(
+                origin_mail, output_text, self._create_eval_result_str(eval_result)
+            )
 
-            eval_average = round(aspect_score / aspect_len, 1)
-            scores.append(eval_average)
-            outputs.append(output_text)
-            previous_reflections_msg = "\n".join(previous_reflections)
+        self._print_result(eval_results, outputs)
+
+        return final_output
+
+    def _create_eval_result_str(self, eval_result: dict):
+        return "\n".join([f"aspect: {aspect} score: {score}" for aspect, score in eval_result.items()])
+
+    def _print_result(self, eval_results: list[dict], outputs: list[str]):
+        max_score = 0
+        final_output = ""
+        max_index = 0
+
+        for i, (eval_result, output) in enumerate(zip(eval_results, outputs)):
+            score = round(sum(eval_result.values()) / len(eval_result), 1)
             print(
                 f"{'=' * 25}\n"
-                f"iteration {i + 1}\n"
+                f"iteration {i+1}, average {score}\n"
+                f"{self._create_eval_result_str(eval_result)}\n"
+                f"Reflection memory:\n{self.self_reflection.get_reflection_memory_str()}\n"
                 f"{'-' * 25}\n"
-                f"{eval_result_str}, average {eval_average}\n"
+                f"Generated text:\n{output}\n"
                 f"{'-' * 25}\n"
-                f"Reflection memory:\n{previous_reflections_msg}\n\n"
-                f"{'-' * 25}\n"
-                f"Text regenerated after reflection:\n{output_text}"
             )
+            if max_score < score:
+                max_score = score
+                final_output = output
+                max_index = i
 
-        if (threshold_type == "all" and all(value > threshold for value in scores)) or (
-            threshold_type == "average" and eval_average >= threshold
-        ):
-            print(f"{'=' * 25}\n" "Evaluation score satisfied; exiting Reflexion loop\n" f"{'=' * 25}")
-            break
-
-        for i, score in enumerate(scores):
-            print(f"iteration {i+1}, average {score}")
-        print("=" * 25)
-        print(f"\nFinal output (iteration {scores.index(max(scores)) + 1}, average: {max(scores)})")
-        final = outputs[scores.index(max(scores))]
-        print(f"{final}")
-
-        return final
+        print(f"{'=' * 25}\nFinal output: iteration {max_index+1}, average {max_score}\n{final_output}\n{'=' * 25}\n")
```