boostcampaitech7 · github-classroom · Sep 30, 2024 · Oct 1, 2024
diff --git a/.github/ISSUE_TEMPLATE/config.yml b/.github/ISSUE_TEMPLATE/config.yml
@@ -0,0 +1,8 @@
+blank_issues_enabled: false
+issue_templates:
+  - name: Experiment Template
+    description: For new experiments
+    file: experiment.md
+  - name: Feature Template
+    description: For new features
+    file: feature.md
diff --git a/.github/ISSUE_TEMPLATE/experiment.md b/.github/ISSUE_TEMPLATE/experiment.md
@@ -0,0 +1,26 @@
+---
+name: Experiment Issue
+about: For new experiments
+title: "[EXP]"
+labels: 
+assignees: ''
+---
+
+# ISSUE: Experiment
+
+## Notion link
+
+
+## Experiment goals
+
+- Goal 1
+- Goal 2
+
+## Checklist
+
+- [ ] Task 1
+- [ ] Task 2
+- [ ] Task 3
+
+## Experiment details
+
diff --git a/.github/ISSUE_TEMPLATE/feature.md b/.github/ISSUE_TEMPLATE/feature.md
@@ -0,0 +1,26 @@
+---
+name: Feature Request
+about: Suggest a new feature for this project
+title: "[FEAT]"
+labels: 
+assignees: ''
+---
+
+# ISSUE: Feature
+
+## Notion link
+
+
+## Feature goals
+
+- Goal 1
+- Goal 2
+
+## Checklist
+
+- [ ] Task 1
+- [ ] Task 2
+- [ ] Task 3
+
+## Feature details
+
diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md
@@ -0,0 +1,13 @@
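+## PR information
+
+- Purpose: 
+- Issue number: 
+- Notion task card link:
+
+## Changes
+
+- Briefly describe the work done in this PR
+
+## Notes for reviewers
+
+- Add any context reviewers need, or point out the parts you would like them to look at closely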
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,35 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+.idea/
+# macOS
+*DS_Store
+
+# custom
+data/*
+!data/.gitkeep
+wandb
+models
+outputs
+/eda/eda_ignore/
+ensemble/results_hard
+ensemble/results_soft
+
+**/.ipynb_checkpoints
+**/lightning_logs
+*.ckpt
+*.pt
+*.arrow
+*.bin
+*/nohup.out
+code/config/exp/*
diff --git a/README.md b/README.md
@@ -0,0 +1,109 @@
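+# ODQA (Open-Domain Question Answering Competition)
+
+## Overview
+
+| Item | Description |
+| --- | --- |
+| Project topic | Perform ODQA (Open-Domain Question Answering) on an MRC (machine reading comprehension) dataset. <br>This setup is more widely known as RAG (Retrieval-Augmented Generation). |
+| Project structure | A Retriever finds documents relevant to the question, and a Reader extracts the answer to the question from the retrieved documents. |
+| Evaluation metric | Exact Match (EM) score: a prediction earns credit only when it exactly matches the ground-truth answer. <br>F1 score is used for reference only. |
+| Project period | October 2, 2024 to October 24, 2024 |
+
+## Final leaderboard (Private)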
+<img width="1216" alt="image" src="./assets/leaderboard.png">
+
+## Team members
+
+|[이예서](https://github.com/yeseoLee)|[김수진](https://github.com/Sujinkim-625)|[김민서](https://github.com/luckyvickyricky)|[홍성재](https://github.com/koreannn)|[양가연](https://github.com/gayeon7877)|[홍성민](https://github.com/hsmin9809)|
+|:-:|:-:|:-:|:-:|:-:|:-:|
+|<a href="https://github.com/yeseoLee"><img src="https://github.com/yeseoLee.png" width='300px' style="border-radius: 50%;"></a>|<a href="https://github.com/Sujinkim-625"><img src="https://github.com/Sujinkim-625.png" width='300px' style="border-radius: 50%;"></a>|<a href="https://github.com/luckyvickyricky"><img src="https://github.com/luckyvickyricky.png" width='300px' style="border-radius: 50%;"></a>|<a href="https://github.com/koreannn"><img src="https://github.com/koreannn.png" width='300px' style="border-radius: 50%;"></a>|<a href="https://github.com/gayeon7877"><img src="https://github.com/gayeon7877.png" width='300px' style="border-radius: 50%;"></a>|<a href="https://github.com/hsmin9809"><img src="https://github.com/hsmin9809.png" width='300px' style="border-radius: 50%;"></a>|
+
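+## Roles
+
+| Name | Role |
+| --- | --- |
+| 김민서 | Data analysis, LLM-based data augmentation, model experiments (comparing augmented data), ensembling |
+| 김수진 | External data research and training, data preprocessing, model research, model improvement, model experiments, <br>retrieval implementation and experiments (ColBERT), ensembling |
+| 양가연 | Retrieval (BM25) implementation and experiments, model research, model improvement (custom layer, distillation), <br>model experiments, ensembling (hard ensemble, soft ensemble, weighted ensemble) |
+| 이예서 | PM (milestone and issue management), infrastructure (scripted development environment setup), <br>templating the baseline code, data preprocessing (cleansing), data augmentation (negative passages), <br>retrieval implementation and experiments (Elasticsearch, reranking), ensembling (merged ensemble) |
+| 홍성민 | Model research, model experiments, retrieval implementation and experiments (dense), ensembling |
+| 홍성재 | EDA, data preprocessing, research on improving Reader performance (Retrospective Reader), ensembling |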
+
+## Wrap-Up Report
+### [MRC_NLP_Report (Team 04).pdf](./assets/MRC_NLP_04_report.pdf)  
+The wrap-up report covers the trial and error, solutions, and retrospective for the whole project, from data EDA through ensembling.
+
+## Folder structure
+```bash
+level2-mrc-nlp-04
+├── code
+│   ├── BertEncoder.py
+│   ├── README.MD
+│   ├── arguments.py
+│   ├── config
+│   │   ├── elastic_setting.json
+│   │   ├── eval.json
+│   │   ├── inference.json
+│   │   ├── inference_with_rerank.json
+│   │   ├── integration.json
+│   │   └── train.json
+│   ├── custom_model.py
+│   ├── dense_train.py
+│   ├── inference.py
+│   ├── integration_pipeline.py
+│   ├── retrieval
+│   │   ├── __init__.py
+│   │   ├── base.py
+│   │   ├── bm25.py
+│   │   ├── dense.py
+│   │   ├── dense_encoder
+│   │   ├── elastic.py
+│   │   ├── reranker.py
+│   │   └── tdidf.py
+│   ├── train.py
+│   └── utils
+│       ├── __init__.py
+│       ├── trainer_qa.py
+│       └── utils_qa.py
+├── data
+├── ensemble
+│   ├── gpt_voting.ipynb
+│   ├── hard_voting.py
+│   ├── results_hard
+│   ├── results_soft
+│   └── soft_voting.py
+├── external_dataset_processing
+│   ├── arrow_to_csv.py
+│   ├── csv_to_arrow.py
+│   ├── find_answer_start.py
+│   └── train_for_training.py
+├── notebooks
+│   ├── EDA_train,test.ipynb
+│   ├── EDA_wiki.ipynb
+│   ├── add_question_augmentation_GPT.ipynb
+│   ├── analyze_test_difficulty.ipynb
+│   ├── data_info1.png
+│   ├── dense_top-k_confirm.ipynb
+│   ├── eda_for_data_cleaning.ipynb
+│   ├── elasticsearch.ipynb
+│   ├── question_classification_GPT.ipynb
+│   ├── retriever_base_dataset.ipynb
+│   ├── retriever_evaluation.ipynb
+│   ├── shuffle_data.ipynb
+│   └── synonym_questions_augmentation_GPT.ipynb
+└── setup
+    ├── requirements.txt
+    ├── setup-elasticsearch.bash
+    ├── setup-git.bash
+    └── setup-gpu-server.bash
+```
+
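+- code: all code used for training and inference.
+- data: the data needed for training and inference.
+- ensemble: code that ensembles inference result files.
+- external_dataset_processing: preprocessing code for using external datasets.
+- notebooks: Jupyter notebooks covering data analysis and preprocessing, augmentation methods, and usage guides for the implemented code.
+- setup: scripts for setting up the GPU server development environment.
+
+## How to run
+[How to use the training and inference code](code/README.MD)  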
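The overview table in the README above says a prediction scores under Exact Match only when it matches the ground-truth answer exactly. A minimal sketch of that metric follows. The normalization step (lowercasing, stripping ASCII punctuation, collapsing whitespace) and the function names `normalize_answer`, `exact_match`, and `em_score` are assumptions for illustration; the competition's official scorer may normalize differently.

```python
import string
from typing import Dict, List


def normalize_answer(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction: str, ground_truths: List[str]) -> int:
    """Return 1 if the normalized prediction equals any normalized reference, else 0."""
    pred = normalize_answer(prediction)
    return int(any(pred == normalize_answer(gt) for gt in ground_truths))


def em_score(predictions: Dict[str, str], references: Dict[str, List[str]]) -> float:
    """Percentage of questions whose prediction exactly matches a reference answer."""
    if not references:
        return 0.0
    hits = sum(
        exact_match(predictions.get(qid, ""), answers)
        for qid, answers in references.items()
    )
    return 100.0 * hits / len(references)
```

A partially correct span such as "Seoul city" against the reference "Seoul" scores 0 here, which is why the README treats F1 as a reference-only signal.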
diff --git a/assets/MRC_NLP_04_report.pdf b/assets/MRC_NLP_04_report.pdf
diff --git a/assets/leaderboard.png b/assets/leaderboard.png
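The README lists `ensemble/hard_voting.py` and describes the `ensemble` folder as code that combines inference result files. The contents of that script are not part of this diff, so the following is only a sketch of what hard voting usually means: each model contributes one answer per question id and the most frequent answer wins. The function name `hard_vote` and the tie-break rule (earliest model in the list wins) are assumptions.

```python
from collections import Counter
from typing import Dict, List


def hard_vote(prediction_sets: List[Dict[str, str]]) -> Dict[str, str]:
    """Pick the most frequent answer per question id; ties go to the earliest model."""
    voted: Dict[str, str] = {}
    for qid in prediction_sets[0]:
        # Collect this question's answer from every model that produced one.
        answers = [preds[qid] for preds in prediction_sets if qid in preds]
        counts = Counter(answers)
        best = max(counts.values())
        # The first answer (in model order) that reaches the top count wins ties.
        voted[qid] = next(a for a in answers if counts[a] == best)
    return voted
```

Ordering the input list from strongest to weakest model makes the tie-break favor the strongest single model, which is one reasonable default when all answers differ.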
diff --git a/code/BertEncoder.py b/code/BertEncoder.py
@@ -0,0 +1,27 @@
+from transformers import BertModel, BertPreTrainedModel
+
+class BertEncoder(BertPreTrainedModel):
+    """Thin BERT wrapper whose forward pass returns the pooled [CLS] representation."""
+    def __init__(self,
+        config
+    ):
+        super(BertEncoder, self).__init__(config)
+
+        self.bert = BertModel(config)
+        self.init_weights()
+
+
+    def forward(self,
+            input_ids,
+            attention_mask=None,
+            token_type_ids=None
+        ):
+
+        outputs = self.bert(
+            input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids
+        )
+        # outputs[1] is BertModel's pooler output (the [CLS] state after a dense layer and tanh)
+        pooled_output = outputs[1]
+        return pooled_output
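`BertEncoder` above returns one pooled vector per input sequence. In a dense retriever, such vectors are commonly compared by inner product so that the passages closest to the question embedding are returned first. Whether `retrieval/dense.py` scores passages this way is not visible in this diff, so the sketch below is an assumption. It uses plain Python lists in place of real encoder outputs, and `dot` and `top_k_passages` are hypothetical helper names.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_passages(
    question_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs for the k highest inner-product scores."""
    scores = [(idx, dot(question_vec, vec)) for idx, vec in enumerate(passage_vecs)]
    # Sort by score, highest first; Python's sort is stable, so ties keep index order.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]
```

In practice the vectors would come from a question encoder and a passage encoder, and a library such as FAISS would replace the linear scan once the corpus grows large.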
diff --git a/code/README.MD b/code/README.MD
@@ -0,0 +1,70 @@
+# How to run
+## Run with the integrated pipeline
+```bash
+# integrated train-eval-inference pipeline
+python integration_pipeline.py ./config/integration.json
+```
+
+## Run with config.json
+```bash
+# Training
+python train.py ./config/train.json
+# Evaluation
+python train.py ./config/eval.json
+# Inference
+python inference.py ./config/inference.json
+```
+
+## Run with a shell script
+### Training or evaluation
+```bash
+python train.py \
+    --model_name_or_path klue/bert-base \
+    --config_name None \
+    --tokenizer_name None \
+    \
+    --dataset_name ../data/train_dataset \
+    --max_seq_length 384 \
+    --pad_to_max_length False \
+    --doc_stride 128 \
+    --max_answer_length 30 \
+    --overwrite_cache True \
+    --preprocessing_num_workers None \
+    \
+    --output_dir ../models/train_dataset \
+    --overwrite_output_dir True \
+    --do_train True \
+    --do_eval True \
+    --per_device_train_batch_size 8 \
+    --per_device_eval_batch_size 8 \
+    --learning_rate 5e-5 \
+    --weight_decay 0.0 \
+    --num_train_epochs 5.0 \
+    --warmup_ratio 0.0 \
+    --logging_steps 500 \
+    --save_steps 500 \
+    --seed 42
+```
+
+### Inference
+```bash
+python inference.py \
+    --model_name_or_path ../models/train_dataset/ \
+    \
+    --dataset_name ../data/test_dataset/ \
+    --overwrite_cache True \
+    --max_seq_length 384 \
+    --pad_to_max_length False \
+    --doc_stride 128 \
+    --max_answer_length 30 \
+    --eval_retrieval True \
+    --num_clusters 64 \
+    --top_k_retrieval 20 \
+    --use_faiss False \
+    \
+    --output_dir code/outputs/test_dataset/ \
+    --overwrite_output_dir True \
+    --do_eval False \
+    --do_predict True \
+    --seed 42
+```
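The README above shows two ways to start the same scripts: a single JSON config path or explicit command-line flags. How `train.py` and `inference.py` implement this is not shown in this diff (Hugging Face scripts often rely on `HfArgumentParser.parse_json_file`). The sketch below illustrates the general pattern with the standard library only. The parser, its three options, and the helper names `build_parser` and `parse_args` are assumptions; the defaults are copied from the example commands.

```python
import argparse
import json
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing a few options from the README commands."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name_or_path", default="klue/bert-base")
    parser.add_argument("--max_seq_length", type=int, default=384)
    parser.add_argument("--doc_stride", type=int, default=128)
    return parser


def parse_args(argv: List[str]) -> argparse.Namespace:
    """Accept either one JSON config path or ordinary command-line flags."""
    parser = build_parser()
    if len(argv) == 1 and argv[0].endswith(".json"):
        with open(argv[0], encoding="utf-8") as f:
            overrides = json.load(f)
        # Start from the defaults, then apply the values found in the file.
        args = parser.parse_args([])
        for key, value in overrides.items():
            if not hasattr(args, key):
                raise ValueError(f"unknown option in config: {key}")
            setattr(args, key, value)
        return args
    return parser.parse_args(argv)
```

A script would call `parse_args(sys.argv[1:])`. Rejecting unknown keys makes a misspelled option in a config file fail loudly instead of being silently ignored.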