#3 Classify mnemonics with OpenAI API
* Add mnemonic examples to README

* Refactor .gitignore and data_processing.py

- Refactor .gitignore to exclude /temp directory and ignore all .parquet and .csv files.
- Format mnemonics more consistently
- Drop mnemonics with two words or fewer.

* Separate prompts storage to YAML files

* Refactor prompt config and mnemonic processing module

* Switch to uv for Python package and project management

* Fix imports between Python modules inside src

* Relax .python-version and add more instructions for installation and API keys

* Add pre-commit and its configurations

* Refactor file paths and add error handling module

* Improve mnemonic classification prompts and instructions

* Sync dependencies in requirements.txt with virtual env

* Extend error handling and extend ruff safe-fixes in pyproject.toml

* Improve standardization of classification results (see the sketch after this list).

- Add JSON schema as response format for OpenAI API
- Handle errors when OpenAI response is too long/short
- Refactor function classify_mnemonics_api
- Improve logging

* Fix lint workflow
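
For illustration, here is a minimal sketch of the structured-output call these bullets describe. The schema shape, helper name, and prompt wiring are assumptions made for the sketch; the actual `classify_mnemonics_api` in `src` may differ.

```python
# Hedged sketch only: schema and helper names are assumptions, not the
# repository's actual classify_mnemonics_api implementation.
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# JSON schema forcing the model to return one integer label per mnemonic:
# shallow (0), deep (1), mixed (2), or unsure (-1).
CLASSIFICATION_SCHEMA = {
    "name": "mnemonic_classifications",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "labels": {
                "type": "array",
                "items": {"type": "integer", "enum": [-1, 0, 1, 2]},
            }
        },
        "required": ["labels"],
        "additionalProperties": False,
    },
}


def classify_batch(system_prompt: str, user_prompt: str, mnemonics: list[str]) -> list[int]:
    """Classify one batch and reject responses that are too long or too short."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.2,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt + "\n".join(mnemonics)},
        ],
        response_format={"type": "json_schema", "json_schema": CLASSIFICATION_SCHEMA},
    )
    labels = json.loads(response.choices[0].message.content)["labels"]
    if len(labels) != len(mnemonics):  # response too long or too short
        raise ValueError(f"expected {len(mnemonics)} labels, got {len(labels)}")
    return labels
```

Retries on transient API errors could be layered on top with `tenacity`, which is already among the project's dependencies.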
chiffonng authored Oct 27, 2024
1 parent cd700b0 commit fd18d23
Showing 19 changed files with 2,511 additions and 124 deletions.
2 changes: 1 addition & 1 deletion .editorconfig
@@ -6,4 +6,4 @@ indent_size = 2
trim_trailing_whitespace = true
insert_final_newline = true
[*.py]
indent_size = 4
indent_size = 4
2 changes: 2 additions & 0 deletions .env.example
@@ -0,0 +1,2 @@
OPENAI_API_KEY=sk-proj-something
HUGGINGFACE_ACCESS_TOKEN=hf_B-something
2 changes: 1 addition & 1 deletion .gitattributes
@@ -1,2 +1,2 @@
# Ignore Jupyter Notebooks from Github Linguist Stats
*.ipynb linguist-vendored
*.ipynb linguist-vendored
File renamed without changes.
7 changes: 4 additions & 3 deletions .gitignore
@@ -164,9 +164,10 @@ cython_debug/

# Data
/data
.parquet
.csv
/temp
*.parquet
*.csv

# Write up
/pdf
.pdf
*.pdf
16 changes: 16 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,16 @@
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.7.1
    hooks:
      - id: ruff
        name: Ruff Linter
        args: [--fix]
      - id: ruff-format
        name: Ruff Formatter
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace
      - id: check-added-large-files
1 change: 1 addition & 0 deletions .python-version
@@ -0,0 +1 @@
3.12
47 changes: 38 additions & 9 deletions README.md
@@ -1,16 +1,45 @@
# Keyword Mnemonic Generation for English Words
# Mnemonic Generation for English Words

Vocabulary acquisition poses a significant challenge for language learners, particularly at medium and advanced levels, where the complexity and volume of new words can hinder retention. One promising solution is mnemonics, which leverage associations between new vocabulary and memorable cues to enhance recall. Previous efforts to automate generating these mnemonics often focus primarily on _shallow-encoding mnemonics_ (spelling or phonological features of a word) and are limited in their ability to generate diverse and contextually relevant mnemonics.
Vocabulary acquisition poses a significant challenge for language learners, particularly at medium and advanced levels, where the complexity and volume of new words can hinder retention. One promising solution is mnemonics, which leverage associations between new vocabulary and memorable cues to enhance recall. Previous efforts to automate generating these mnemonics often focus primarily on _shallow-encoding mnemonics_ (spelling or phonological features of a word), which are less likely to improve retention than _deep-encoding_ mnemonics.

This project explores an alternative approach by instruction tuning the LLaMA 3 (8B) language model on a manually curated dataset of over 1,000 examples. Unlike prior methods, this dataset includes more _deep-encoding mnemonics_ (such as morphology and etymology, associations with synonyms, antonyms, or related words and concepts).
This project explores an alternative approach by instruction tuning the LLaMA 3 (8B) language model on a manually curated dataset of over 1,000 examples. Unlike prior methods, this dataset includes more _deep-encoding mnemonics_ (semantic information such as morphology and etymology, associations with synonyms, antonyms, or related words and concepts). By fine-tuning the model on this diverse dataset, we aim to improve the quality and variety of generated mnemonics and help language learners retain new vocabulary.

The fine-tuned model will generate diverse, contextually relevant and coherent mnemonics.
| **Shallow-Encoding Mnemonics** | **Deep-Encoding Mnemonics** |
| ----------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------ |
| **Homophonic:** olfactory sounds like "old factory." | **Etymology:** preposterous - pre (before) + post (after) + -erous, which implies absurd. |
| **Chunking:** obsequious sounds like "ob-se-ki-ass." Obedient servants kiss your ass. | **Morphology:** The suffix "-ate" usually marks verbs. The prefix "ab-" means "from, away." |
| **Keyword:** Phony sounds like “phone-y,” which means fraudulent (phone calls). | **Context/Story:** His actions of pouring acid on the environment are detrimental. |
| **Rhyming:** wistful/longing for the past but wishful for the future. | **Synonym/Antonym:** "benevolent" ~ "kind" or "generous," and "malevolent" is its antonym. |
| | **Image Association:** exuberant - The ex-Uber driver never ranted; he always seems ebullient and lively. |
| | **Related words**: Phantasm relates to an illusion or ghostly figure, closely tied to the synonym “phantom.” |

# Project goals
---

- [ ] Research: Compare performance between tuned and untuned models.
- [ ] Gradio: Create a web interface for the model.
## Project components

# Setup
- [ ] A web interface (using Gradio) for the tuned model.
- [ ] A dataset of 1200 examples (will be refined continually).
- [ ] This documented codebase.

Python >= 3.10 and `requirements.txt`.
## Setup

### Installation

```bash
bash setup.sh
```

The script attempts to install dependencies with [`uv`](https://docs.astral.sh/uv/) (a fast, Rust-based Python package and project manager), using the `.python-version` and `pyproject.toml` files. Otherwise, it falls back to a `pip` installation.

### Secrets

`setup.sh` already creates a `.env` file. You will need:

- OpenAI API key (optional: for some modules inside `src/data_pipeline`)
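
As a rough illustration (not necessarily how the `src` modules do it), the key can be read from `.env` with `python-dotenv`, which is already a project dependency:

```python
# Illustrative sketch only; the project's own loading code may differ.
import os

from dotenv import load_dotenv

load_dotenv()  # picks up the .env file created by setup.sh
openai_key = os.environ["OPENAI_API_KEY"]  # only needed for some src/data_pipeline modules
```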

## Development

```bash
pre-commit install
pre-commit run --all-files
```
17 changes: 17 additions & 0 deletions prompts/classify_mnemonics.yaml
@@ -0,0 +1,17 @@
prompts:
  system: |
    You are an expert in English mnemonics. Your task is to classify each mnemonic as one of the following: shallow-encoding (0), deep-encoding (1), mixed (2), or unsure (-1). Think through the reasoning for the classification yourself, and respond consistently with the response format. You have to classify every mnemonic in the prompt, no more, no less. If unsure, return -1. \n
    Classify the mnemonics below based on the following criteria:\n
    - Shallow (0): Focus on how the word sounds, looks, or rhymes.
    - Deep (1): Focus on semantics, morphology, etymology, context (inferred meaning, imagery), or related words (synonyms, antonyms, words with the same roots). Repeating the word or using a similar-sounding word is NOT deep-encoding.
    - Mixed (2): Contains both shallow and deep features.\n
    Examples:
    - olfactory: Sounds like "old factory." The old factory had a strong smell, reminding workers of its olfactory history. Classification: shallow (0), since it's based on the sound.
    - vacuous: Same Latin root "vacare" (empty) as "vacuum, vacant". His expression was as empty as a vacuum, showing no signs of thought. Classification: deep (1), since it only uses etymology and related words.
    - malevolent: From male 'ill' + volent 'wishing' (as in "benevolent"). These male species are so violent that they always have evil plans. Classification: mixed (2), since it uses etymology and antonyms (deep-encoding) and the sounds of "male" and "violent" (shallow-encoding).\n
  user: Mnemonics are separated by a newline character. Please classify each mnemonic in the same order as they appear in the prompt.\n
model: "gpt-4o-mini"
temperature: 0.2
num_outputs: 1
batch_size: 50
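
For illustration, a hedged sketch of how this config could be consumed. It assumes the key layout shown above; the batching variables are illustrative rather than the repository's actual pipeline code:

```python
# Hedged sketch: assumes the YAML layout above; names are illustrative.
import yaml

with open("prompts/classify_mnemonics.yaml") as f:
    config = yaml.safe_load(f)

system_prompt = config["prompts"]["system"]
user_prompt = config["prompts"]["user"]
batch_size = config["batch_size"]  # 50 mnemonics per API call

# The user prompt expects newline-separated mnemonics, classified in order.
mnemonics = ["olfactory: sounds like 'old factory' ...", "vacuous: from Latin 'vacare' ..."]
batches = [mnemonics[i : i + batch_size] for i in range(0, len(mnemonics), batch_size)]
```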
55 changes: 55 additions & 0 deletions pyproject.toml
@@ -0,0 +1,55 @@
[project]
name = "mnemonic-gen"
version = "0.1.0"
description = "Generate mnemonic sentences for English words"
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
"openai>=1.52.2",
"pandas>=2.2.3",
"pre-commit>=4.0.1",
"pyarrow>=17.0.0",
"python-dotenv>=1.0.1",
"pyyaml>=6.0.2",
"ruff",
"tenacity>=9.0.0",
"torch",
"tqdm>=4.66.5",
"transformers>=4.46.0",
]

[project.urls]
repository = "https://github.com/chiffonng/capstone"

[project.optional-dependencies]
plot = ["matplotlib>=3.9.2"]

[dependency-groups]
dev = ["pytest>=8.3.3"]
lint = ["ruff>=0.7.1"]
data = ["pandas>=2.2.3", "pyarrow>=17.0.0"]

[tool.ruff]
src = ["src"]
target-version = "py312"
extend-exclude = ["*__init__.py", "*.pyi"]

[tool.ruff.format]
docstring-code-format = true # format code in docstrings
docstring-code-line-length = 88

[tool.ruff.lint]
extend-select = [
  "D", # pydocstyle, all functions and classes must have docstrings
  "T", # flake8-print and flake8-debugger
]
extend-safe-fixes = ["D", "T"] # docstrings, print/debugger rules
extend-fixable = ["B"] # bugbear
ignore = ["T201", "F401"] # print statements OK, unused imports OK
pydocstyle.convention = "google"
pycodestyle.max-doc-length = 88

[tool.uv]
python-downloads = "manual" # change to "automatic" to download Python specified in .python-version
upgrade-package = ["ruff", "tqdm"]
managed = true
185 changes: 133 additions & 52 deletions requirements.txt
@@ -1,69 +1,150 @@
accelerate==0.34.2
aiohappyeyeballs==2.4.2
aiohttp==3.10.8
aiosignal==1.3.1
appnope==0.1.4
asttokens==2.4.1
attrs==24.2.0
# This file was autogenerated by uv via the following command:
# uv pip freeze | uv pip compile - -o requirements.txt
-e .
annotated-types==0.7.0
# via pydantic
anyio==4.6.2.post1
# via
# httpx
# openai
certifi==2024.8.30
charset-normalizer==3.3.2
comm==0.2.2
datasets==3.0.1
debugpy==1.8.6
decorator==5.1.1
dill==0.3.8
executing==2.1.0
# via
# httpcore
# httpx
# requests
cfgv==3.4.0
# via pre-commit
charset-normalizer==3.4.0
# via requests
distlib==0.3.9
# via virtualenv
distro==1.9.0
# via openai
filelock==3.16.1
frozenlist==1.4.1
fsspec==2024.6.1
huggingface-hub==0.25.1
# via
# huggingface-hub
# torch
# transformers
# virtualenv
fsspec==2024.10.0
# via
# huggingface-hub
# torch
h11==0.14.0
# via httpcore
httpcore==1.0.6
# via httpx
httpx==0.27.2
# via openai
huggingface-hub==0.26.1
# via
# tokenizers
# transformers
identify==2.6.1
# via pre-commit
idna==3.10
ipykernel==6.29.5
ipython==8.27.0
jedi==0.19.1
Jinja2==3.1.4
jupyter_client==8.6.3
jupyter_core==5.7.2
MarkupSafe==2.1.5
matplotlib-inline==0.1.7
# via
# anyio
# httpx
# requests
iniconfig==2.0.0
# via pytest
jinja2==3.1.4
# via torch
jiter==0.6.1
# via openai
markupsafe==3.0.2
# via jinja2
mpmath==1.3.0
multidict==6.1.0
multiprocess==0.70.16
nest-asyncio==1.6.0
networkx==3.3
numpy==2.1.1
# via sympy
networkx==3.4.2
# via torch
nodeenv==1.9.1
# via pre-commit
numpy==2.1.2
# via
# pandas
# pyarrow
# transformers
openai==1.52.2
# via mnemonic-gen
packaging==24.1
# via
# huggingface-hub
# pytest
# transformers
pandas==2.2.3
parso==0.8.4
pexpect==4.9.0
# via mnemonic-gen
platformdirs==4.3.6
prompt_toolkit==3.0.48
psutil==6.0.0
ptyprocess==0.7.0
pure_eval==0.2.3
# via virtualenv
pluggy==1.5.0
# via pytest
pre-commit==4.0.1
# via mnemonic-gen
pyarrow==17.0.0
Pygments==2.18.0
# via mnemonic-gen
pydantic==2.9.2
# via openai
pydantic-core==2.23.4
# via pydantic
pytest==8.3.3
python-dateutil==2.9.0.post0
# via pandas
python-dotenv==1.0.1
# via mnemonic-gen
pytz==2024.2
PyYAML==6.0.2
pyzmq==26.2.0
# via pandas
pyyaml==6.0.2
# via
# huggingface-hub
# mnemonic-gen
# pre-commit
# transformers
regex==2024.9.11
# via transformers
requests==2.32.3
ruff==0.6.8
# via
# huggingface-hub
# transformers
ruff==0.7.1
# via mnemonic-gen
safetensors==0.4.5
setuptools==75.1.0
# via transformers
setuptools==75.2.0
# via torch
six==1.16.0
stack-data==0.6.3
sympy==1.13.3
tokenizers==0.20.0
torch==2.4.1
tornado==6.4.1
# via python-dateutil
sniffio==1.3.1
# via
# anyio
# httpx
# openai
sympy==1.13.1
# via torch
tenacity==9.0.0
# via mnemonic-gen
tokenizers==0.20.1
# via transformers
torch==2.5.0
# via mnemonic-gen
tqdm==4.66.5
traitlets==5.14.3
transformers==4.45.1
typing_extensions==4.12.2
# via
# huggingface-hub
# mnemonic-gen
# openai
# transformers
transformers==4.46.0
# via mnemonic-gen
typing-extensions==4.12.2
# via
# huggingface-hub
# openai
# pydantic
# pydantic-core
# torch
tzdata==2024.2
# via pandas
urllib3==2.2.3
wcwidth==0.2.13
xxhash==3.5.0
yarl==1.13.1
# via requests
virtualenv==20.27.0
# via pre-commit
18 changes: 0 additions & 18 deletions ruff.toml

This file was deleted.

