#3 Classify mnemonics with OpenAI API
* Add mnemonic examples to README

* Refactor .gitignore and data_processing.py

- Refactor .gitignore to exclude /temp directory and ignore all .parquet and .csv files.
- Format mnemonics more consistently
- Drop mnemonics with two words or fewer.

* Separate prompts storage to YAML files

* Refactor prompt config and mnemonic processing module

* Switch to uv for Python package and project management

* Fix imports between Python modules inside src

* Relax .python-version and add more instructions for installation and API keys

* Add pre-commit and its configurations

* Refactor file paths and add error handling module

* Improve mnemonic classification prompts and instructions

* Sync dependencies in requirements.txt with virtual env

* Extend error handling and extend ruff safe-fixes in pyproject.toml

* Improve standardization of classification results (see the sketch after this list).

- Add JSON schema as response format for OpenAI API
- Handle errors when OpenAI response is too long/short
- Refactor function classify_mnemonics_api
- Improve logging

* Fix lint workflow
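
For illustration, here is a minimal sketch of the structured-output call these bullets describe. The schema shape, helper name, and prompt wiring are assumptions made for the sketch; the actual `classify_mnemonics_api` in `src` may differ.

```python
# Hedged sketch only: schema and helper names are assumptions, not the
# repository's actual classify_mnemonics_api implementation.
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# JSON schema forcing the model to return one integer label per mnemonic:
# shallow (0), deep (1), mixed (2), or unsure (-1).
CLASSIFICATION_SCHEMA = {
    "name": "mnemonic_classifications",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "labels": {
                "type": "array",
                "items": {"type": "integer", "enum": [-1, 0, 1, 2]},
            }
        },
        "required": ["labels"],
        "additionalProperties": False,
    },
}


def classify_batch(system_prompt: str, user_prompt: str, mnemonics: list[str]) -> list[int]:
    """Classify one batch and reject responses that are too long or too short."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.2,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt + "\n".join(mnemonics)},
        ],
        response_format={"type": "json_schema", "json_schema": CLASSIFICATION_SCHEMA},
    )
    labels = json.loads(response.choices[0].message.content)["labels"]
    if len(labels) != len(mnemonics):  # response too long or too short
        raise ValueError(f"expected {len(mnemonics)} labels, got {len(labels)}")
    return labels
```

Retries on transient API errors could be layered on top with `tenacity`, which is already among the project's dependencies.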
chiffonng authored Oct 27, 2024
1 parent cd700b0 commit fd18d23
Showing 19 changed files with 2,511 additions and 124 deletions.
2 changes: 1 addition & 1 deletion .editorconfig
@@ -6,4 +6,4 @@ indent_size = 2
trim_trailing_whitespace = true
insert_final_newline = true
[*.py]
indent_size = 4
indent_size = 4
2 changes: 2 additions & 0 deletions .env.example
@@ -0,0 +1,2 @@
OPENAI_API_KEY=sk-proj-something
HUGGINGFACE_ACCESS_TOKEN=hf_B-something
2 changes: 1 addition & 1 deletion .gitattributes
@@ -1,2 +1,2 @@
# Ignore Jupyter Notebooks from Github Linguist Stats
*.ipynb linguist-vendored
*.ipynb linguist-vendored
File renamed without changes.
7 changes: 4 additions & 3 deletions .gitignore
@@ -164,9 +164,10 @@ cython_debug/

# Data
/data
.parquet
.csv
/temp
*.parquet
*.csv

# Write up
/pdf
.pdf
*.pdf
16 changes: 16 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,16 @@
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.7.1
    hooks:
      - id: ruff
        name: Ruff Linter
        args: [--fix]
      - id: ruff-format
        name: Ruff Formatter
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace
      - id: check-added-large-files
1 change: 1 addition & 0 deletions .python-version
@@ -0,0 +1 @@
3.12
47 changes: 38 additions & 9 deletions README.md
@@ -1,16 +1,45 @@
# Keyword Mnemonic Generation for English Words
# Mnemonic Generation for English Words

Vocabulary acquisition poses a significant challenge for language learners, particularly at medium and advanced levels, where the complexity and volume of new words can hinder retention. One promising solution is mnemonics, which leverage associations between new vocabulary and memorable cues to enhance recall. Previous efforts to automate generating these mnemonics often focus primarily on _shallow-encoding mnemonics_ (spelling or phonological features of a word) and are limited in their ability to generate diverse and contextually relevant mnemonics.
Vocabulary acquisition poses a significant challenge for language learners, particularly at medium and advanced levels, where the complexity and volume of new words can hinder retention. One promising solution is mnemonics, which leverage associations between new vocabulary and memorable cues to enhance recall. Previous efforts to automate generating these mnemonics often focus primarily on _shallow-encoding mnemonics_ (spelling or phonological features of a word), which are less likely to improve retention than _deep-encoding_ mnemonics.

This project explores an alternative approach by instruction tuning the LLaMA 3 (8B) language model on a manually curated dataset of over 1,000 examples. Unlike prior methods, this dataset includes more _deep-encoding mnemonics_ (such as morphology and etymology, associations with synonyms, antonyms, or related words and concepts).
This project explores an alternative approach by instruction tuning the LLaMA 3 (8B) language model on a manually curated dataset of over 1,000 examples. Unlike prior methods, this dataset includes more _deep-encoding mnemonics_ (semantic information such as morphology and etymology, associations with synonyms, antonyms, or related words and concepts). By fine-tuning the model on this diverse dataset, we aim to improve the quality and variety of generated mnemonics and help language learners retain new vocabulary.

The fine-tuned model will generate diverse, contextually relevant and coherent mnemonics.
| **Shallow-Encoding Mnemonics** | **Deep-Encoding Mnemonics** |
| ----------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------ |
| **Homophonic:** olfactory sounds like "old factory." | **Etymology:** preposterous - pre (before) + post (after) + -erous, which implies absurd. |
| **Chunking:** obsequious sounds like "ob-se-ki-ass." Obedient servants kiss your ass. | **Morphology:** The suffix "-ate" usually marks verbs. The prefix "ab-" means "from, away." |
| **Keyword:** Phony sounds like “phone-y,” which means fraudulent (phone calls). | **Context/Story:** His actions of pouring acid on the environment are detrimental. |
| **Rhyming:** wistful/longing for the past but wishful for the future. | **Synonym/Antonym:** "benevolent" ~ "kind" or "generous," and "malevolent" is its antonym. |
| | **Image Association:** exuberant - The ex-Uber driver never ranted; he always seems ebullient and lively. |
| | **Related words**: Phantasm relates to an illusion or ghostly figure, closely tied to the synonym “phantom.” |

# Project goals
---

- [ ] Research: Compare performance between tuned and untuned models.
- [ ] Gradio: Create a web interface for the model.
## Project components

# Setup
- [ ] A web interface (using Gradio) for the tuned model.
- [ ] A dataset of 1200 examples (will be refined continually).
- [ ] This documented codebase.

Python >= 3.10 and `requirements.txt`.
## Setup

### Installation

```bash
bash setup.sh
```

The script attempts to install dependencies with [`uv`](https://docs.astral.sh/uv/) (a fast, Rust-based Python package and project manager), using the `.python-version` and `pyproject.toml` files. Otherwise, it falls back to a `pip` installation.

### Secrets

`setup.sh` already creates a `.env` file. You will need:

- OpenAI API key (optional: for some modules inside `src/data_pipeline`)
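
As a rough illustration (not necessarily how the `src` modules do it), the key can be read from `.env` with `python-dotenv`, which is already a project dependency:

```python
# Illustrative sketch only; the project's own loading code may differ.
import os

from dotenv import load_dotenv

load_dotenv()  # picks up the .env file created by setup.sh
openai_key = os.environ["OPENAI_API_KEY"]  # only needed for some src/data_pipeline modules
```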

## Development

```bash
pre-commit install
pre-commit run --all-files
```
17 changes: 17 additions & 0 deletions prompts/classify_mnemonics.yaml
@@ -0,0 +1,17 @@
prompts:
  system: |
    You are an expert in English mnemonics. Your task is to classify each mnemonic as one of the following: shallow-encoding (0), deep-encoding (1), mixed (2), or unsure (-1). Think through the reasoning for the classification yourself, and respond consistently with the response format. You have to classify every mnemonic in the prompt, no more, no less. If unsure, return -1. \n
    Classify the mnemonics below based on the following criteria:\n
    - Shallow (0): Focus on how the word sounds, looks, or rhymes.
    - Deep (1): Focus on semantics, morphology, etymology, context (inferred meaning, imagery), or related words (synonyms, antonyms, words with the same roots). Repeating the word or using a similar-sounding word is NOT deep-encoding.
    - Mixed (2): Contains both shallow and deep features.\n
    Examples:
    - olfactory: Sounds like "old factory." The old factory had a strong smell, reminding workers of its olfactory history. Classification: shallow (0), since it's based on the sound.
    - vacuous: Same Latin root "vacare" (empty) as "vacuum, vacant". His expression was as empty as a vacuum, showing no signs of thought. Classification: deep (1), since it only uses etymology and related words.
    - malevolent: From male 'ill' + volent 'wishing' (as in "benevolent"). These male species are so violent that they always have evil plans. Classification: mixed (2), since it uses etymology and antonyms (deep-encoding) and the sounds of "male" and "violent" (shallow-encoding).\n
  user: Mnemonics are separated by a newline character. Please classify each mnemonic in the same order as they appear in the prompt.\n
model: "gpt-4o-mini"
temperature: 0.2
num_outputs: 1
batch_size: 50
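
For illustration, a hedged sketch of how this config could be consumed. It assumes the key layout shown above; the batching variables are illustrative rather than the repository's actual pipeline code:

```python
# Hedged sketch: assumes the YAML layout above; names are illustrative.
import yaml

with open("prompts/classify_mnemonics.yaml") as f:
    config = yaml.safe_load(f)

system_prompt = config["prompts"]["system"]
user_prompt = config["prompts"]["user"]
batch_size = config["batch_size"]  # 50 mnemonics per API call

# The user prompt expects newline-separated mnemonics, classified in order.
mnemonics = ["olfactory: sounds like 'old factory' ...", "vacuous: from Latin 'vacare' ..."]
batches = [mnemonics[i : i + batch_size] for i in range(0, len(mnemonics), batch_size)]
```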
55 changes: 55 additions & 0 deletions pyproject.toml
@@ -0,0 +1,55 @@
[project]
name = "mnemonic-gen"
version = "0.1.0"
description = "Generate mnemonic sentences for English words"
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
"openai>=1.52.2",
"pandas>=2.2.3",
"pre-commit>=4.0.1",
"pyarrow>=17.0.0",
"python-dotenv>=1.0.1",
"pyyaml>=6.0.2",
"ruff",
"tenacity>=9.0.0",
"torch",
"tqdm>=4.66.5",
"transformers>=4.46.0",
]

[project.urls]
repository = "https://github.com/chiffonng/capstone"

[project.optional-dependencies]
plot = ["matplotlib>=3.9.2"]

[dependency-groups]
dev = ["pytest>=8.3.3"]
lint = ["ruff>=0.7.1"]
data = ["pandas>=2.2.3", "pyarrow>=17.0.0"]

[tool.ruff]
src = ["src"]
target-version = "py312"
extend-exclude = ["*__init__.py", "*.pyi"]

[tool.ruff.format]
docstring-code-format = true # format code in docstrings
docstring-code-line-length = 88

[tool.ruff.lint]
extend-select = [
  "D", # pydocstyle, all functions and classes must have docstrings
  "T", # flake8-print and flake8-debugger
]
extend-safe-fixes = ["D", "T"] # docstrings, print/debugger rules
extend-fixable = ["B"] # bugbear
ignore = ["T201", "F401"] # print statements OK, unused imports OK
pydocstyle.convention = "google"
pycodestyle.max-doc-length = 88

[tool.uv]
python-downloads = "manual" # change to "automatic" to download Python specified in .python-version
upgrade-package = ["ruff", "tqdm"]
managed = true
185 changes: 133 additions & 52 deletions requirements.txt
@@ -1,69 +1,150 @@
accelerate==0.34.2
aiohappyeyeballs==2.4.2
aiohttp==3.10.8
aiosignal==1.3.1
appnope==0.1.4
asttokens==2.4.1
attrs==24.2.0
# This file was autogenerated by uv via the following command:
# uv pip freeze | uv pip compile - -o requirements.txt
-e .
annotated-types==0.7.0
# via pydantic
anyio==4.6.2.post1
# via
# httpx
# openai
certifi==2024.8.30
charset-normalizer==3.3.2
comm==0.2.2
datasets==3.0.1
debugpy==1.8.6
decorator==5.1.1
dill==0.3.8
executing==2.1.0
# via
# httpcore
# httpx
# requests
cfgv==3.4.0
# via pre-commit
charset-normalizer==3.4.0
# via requests
distlib==0.3.9
# via virtualenv
distro==1.9.0
# via openai
filelock==3.16.1
frozenlist==1.4.1
fsspec==2024.6.1
huggingface-hub==0.25.1
# via
# huggingface-hub
# torch
# transformers
# virtualenv
fsspec==2024.10.0
# via
# huggingface-hub
# torch
h11==0.14.0
# via httpcore
httpcore==1.0.6
# via httpx
httpx==0.27.2
# via openai
huggingface-hub==0.26.1
# via
# tokenizers
# transformers
identify==2.6.1
# via pre-commit
idna==3.10
ipykernel==6.29.5
ipython==8.27.0
jedi==0.19.1
Jinja2==3.1.4
jupyter_client==8.6.3
jupyter_core==5.7.2
MarkupSafe==2.1.5
matplotlib-inline==0.1.7
# via
# anyio
# httpx
# requests
iniconfig==2.0.0
# via pytest
jinja2==3.1.4
# via torch
jiter==0.6.1
# via openai
markupsafe==3.0.2
# via jinja2
mpmath==1.3.0
multidict==6.1.0
multiprocess==0.70.16
nest-asyncio==1.6.0
networkx==3.3
numpy==2.1.1
# via sympy
networkx==3.4.2
# via torch
nodeenv==1.9.1
# via pre-commit
numpy==2.1.2
# via
# pandas
# pyarrow
# transformers
openai==1.52.2
# via mnemonic-gen
packaging==24.1
# via
# huggingface-hub
# pytest
# transformers
pandas==2.2.3
parso==0.8.4
pexpect==4.9.0
# via mnemonic-gen
platformdirs==4.3.6
prompt_toolkit==3.0.48
psutil==6.0.0
ptyprocess==0.7.0
pure_eval==0.2.3
# via virtualenv
pluggy==1.5.0
# via pytest
pre-commit==4.0.1
# via mnemonic-gen
pyarrow==17.0.0
Pygments==2.18.0
# via mnemonic-gen
pydantic==2.9.2
# via openai
pydantic-core==2.23.4
# via pydantic
pytest==8.3.3
python-dateutil==2.9.0.post0
# via pandas
python-dotenv==1.0.1
# via mnemonic-gen
pytz==2024.2
PyYAML==6.0.2
pyzmq==26.2.0
# via pandas
pyyaml==6.0.2
# via
# huggingface-hub
# mnemonic-gen
# pre-commit
# transformers
regex==2024.9.11
# via transformers
requests==2.32.3
ruff==0.6.8
# via
# huggingface-hub
# transformers
ruff==0.7.1
# via mnemonic-gen
safetensors==0.4.5
setuptools==75.1.0
# via transformers
setuptools==75.2.0
# via torch
six==1.16.0
stack-data==0.6.3
sympy==1.13.3
tokenizers==0.20.0
torch==2.4.1
tornado==6.4.1
# via python-dateutil
sniffio==1.3.1
# via
# anyio
# httpx
# openai
sympy==1.13.1
# via torch
tenacity==9.0.0
# via mnemonic-gen
tokenizers==0.20.1
# via transformers
torch==2.5.0
# via mnemonic-gen
tqdm==4.66.5
traitlets==5.14.3
transformers==4.45.1
typing_extensions==4.12.2
# via
# huggingface-hub
# mnemonic-gen
# openai
# transformers
transformers==4.46.0
# via mnemonic-gen
typing-extensions==4.12.2
# via
# huggingface-hub
# openai
# pydantic
# pydantic-core
# torch
tzdata==2024.2
# via pandas
urllib3==2.2.3
wcwidth==0.2.13
xxhash==3.5.0
yarl==1.13.1
# via requests
virtualenv==20.27.0
# via pre-commit
18 changes: 0 additions & 18 deletions ruff.toml

This file was deleted.

