#3 Classify mnemonics with OpenAI API
* Add mnemonic examples to README
* Refactor .gitignore and data_processing.py
  - Refactor .gitignore to exclude /temp directory and ignore all .parquet and .csv files.
  - Format mnemonics more consistently.
  - Drop mnemonics with two words or fewer.
* Separate prompts storage to YAML files
* Refactor prompt config and mnemonic processing module
* Switch to uv for Python package and project management
* Fix imports between Python modules inside src
* Relax .python-version and add more instructions for installation and API keys
* Add pre-commit and its configurations
* Refactor file paths and add error handling module
* Improve mnemonics classification prompts and instructions
* Sync dependencies in requirements.txt with virtual env
* Extend error handling and extend ruff safe-fixes in pyproject.toml
* Improve standardization of classification results:
  - Add JSON schema as response format for OpenAI API
  - Handle errors when the OpenAI response is too long or too short
  - Refactor function classify_mnemonics_api
  - Improve logging
* Fix lint workflow
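The refactored `classify_mnemonics_api` presumably sends mnemonics to the API in newline-joined batches, matching the `batch_size: 50` and newline separation described in the prompt config in this commit. A minimal sketch of that batching step; the helper name `batch_mnemonics` is hypothetical, not the repository's actual code:

```python
def batch_mnemonics(mnemonics: list[str], batch_size: int = 50) -> list[str]:
    """Join mnemonics into newline-separated chunks, one API request each.

    Hypothetical helper: mirrors the batch_size and newline separation
    described in the prompt config, not necessarily the real implementation.
    """
    return [
        "\n".join(mnemonics[i : i + batch_size])
        for i in range(0, len(mnemonics), batch_size)
    ]
```

Each returned string can then be appended to the user prompt of one chat-completion request.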
Showing 19 changed files with 2,511 additions and 124 deletions.
@@ -0,0 +1,2 @@
+OPENAI_API_KEY=sk-proj-something
+HUGGINGFACE_ACCESS_TOKEN=hf_B-something
@@ -1,2 +1,2 @@
 # Ignore Jupyter Notebooks from Github Linguist Stats
-*.ipynb linguist-vendored
+*.ipynb linguist-vendored
File renamed without changes.
@@ -164,9 +164,10 @@ cython_debug/

 # Data
 /data
-.parquet
-.csv
+/temp
+*.parquet
+*.csv

 # Write up
@@ -0,0 +1,16 @@
+repos:
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    rev: v0.7.1
+    hooks:
+      - id: ruff
+        name: Ruff Linter
+        args: [--fix]
+      - id: ruff-format
+        name: Ruff Formatter
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v5.0.0
+    hooks:
+      - id: check-yaml
+      - id: end-of-file-fixer
+      - id: trailing-whitespace
+      - id: check-added-large-files
@@ -0,0 +1 @@
+3.12
@@ -1,16 +1,45 @@
-# Keyword Mnemonic Generation for English Words
+# Mnemonic Generation for English Words

-Vocabulary acquisition poses a significant challenge for language learners, particularly at medium and advanced levels, where the complexity and volume of new words can hinder retention. One promising solution is mnemonics, which leverage associations between new vocabulary and memorable cues to enhance recall. Previous efforts to automate generating these mnemonics often focus primarily on _shallow-encoding mnemonics_ (spelling or phonological features of a word) and are limited in their ability to generate diverse and contextually relevant mnemonics.
+Vocabulary acquisition poses a significant challenge for language learners, particularly at medium and advanced levels, where the complexity and volume of new words can hinder retention. One promising solution is mnemonics, which leverage associations between new vocabulary and memorable cues to enhance recall. Previous efforts to automate generating these mnemonics often focus primarily on _shallow-encoding mnemonics_ (spelling or phonological features of a word), which have a lower likelihood of improving retention than _deep-encoding mnemonics_.

-This project explores an alternative approach by instruction tuning the LLaMA 3 (8B) language model on a manually curated dataset of over 1,000 examples. Unlike prior methods, this dataset includes more _deep-encoding mnemonics_ (such as morphology and etymology, associations with synonyms, antonyms, or related words and concepts).
+This project explores an alternative approach by instruction tuning the LLaMA 3 (8B) language model on a manually curated dataset of over 1,000 examples. Unlike prior methods, this dataset includes more _deep-encoding mnemonics_ (semantic information such as morphology and etymology, and associations with synonyms, antonyms, or related words and concepts). By fine-tuning the model on this diverse dataset, we aim to improve the quality and variety of the generated mnemonics, and to improve the retention of new vocabulary for language learners.

-The fine-tuned model will generate diverse, contextually relevant and coherent mnemonics.
+| **Shallow-Encoding Mnemonics** | **Deep-Encoding Mnemonics** |
+| --- | --- |
+| **Homophonic:** olfactory sounds like "old factory." | **Etymology:** preposterous - pre (before) + post (after) + erous, which implies absurd. |
+| **Chunking:** obsequious sounds like "ob-se-ki-ass." Obedient servants kiss your ass. | **Morphology:** Words with the suffix "-ate" are usually verbs. The prefix "ab-" means from, away. |
+| **Keyword:** Phony sounds like "phone-y," which means fraudulent (phone calls). | **Context/Story:** His actions of pouring acid on the environment are detrimental. |
+| **Rhyming:** wistful/longing for the past but wishful for the future. | **Synonym/Antonym:** "benevolent" ~ "kind" or "generous," and "malevolent" is its antonym. |
+| | **Image Association:** exuberant - The ex-Uber driver never ranted; he always seems ebullient and lively. |
+| | **Related words:** Phantasm relates to an illusion or ghostly figure, closely tied to the synonym "phantom." |

-# Project goals
+---

-- [ ] Research: Compare performance between tuned and untuned models.
-- [ ] Gradio: Create a web interface for the model.
+## Project components

-# Setup
+- [ ] A web interface (using Gradio) for the tuned model.
+- [ ] A dataset of 1,200 examples (to be refined continually).
+- [ ] This documented codebase.

-Python >= 3.10 and `requirements.txt`.
+## Setup
+
+### Installation
+
+```bash
+bash setup.sh
+```
+
+The script attempts to install with [uv](https://docs.astral.sh/uv/) (a fast, Rust-based Python package and project manager), using the `.python-version` and `pyproject.toml` files. Otherwise, it falls back to `pip` installation.
+
+### Secrets
+
+`setup.sh` already creates a `.env` file. You will need:
+
+- an OpenAI API key (optional: for some modules inside `src/data_pipeline`)
+
+## Development
+
+```bash
+pre-commit install
+pre-commit run --all-files
+```
@@ -0,0 +1,17 @@
+prompts:
+  system: |
+    You are an expert in English mnemonics. Your task is to classify each mnemonic as one of the following: shallow-encoding (0), deep-encoding (1), mixed (2), or unsure (-1). Think through the reasoning for classification yourself, and respond consistently with the response format. You have to classify every mnemonic in the prompt, no more no less. If unsure, return -1. \n
+    Classify the mnemonics below based on the following criteria:\n
+    - Shallow (0): Focus on how the word sounds, looks, or rhymes.
+    - Deep (1): Focus on semantics, morphology, etymology, context (inferred meaning, imagery), related words (synonyms, antonyms, words with the same roots). Repeating the word or using a similar-sounding word is NOT deep-encoding.
+    - Mixed (2): Contains both shallow and deep features.\n
+    Examples:
+    - olfactory: Sounds like "old factory." The old factory had a strong smell, reminding workers of its olfactory history. Classification: shallow (0), since it's based on the sound.
+    - vacuous: Same Latin root "vacare" (empty) as "vacuum, vacant". His expression was as empty as a vacuum, showing no signs of thought. Classification: deep (1), since it only uses etymology and related words.
+    - malevolent: From male 'ill' + volent 'wishing' (as in "benevolent"). These male species are so violent that they always have evil plans. Classification: mixed (2), since it uses etymology and antonyms (deep-encoding), and the sounds of "male" and "violent" (shallow-encoding).\n
+  user: Mnemonics are separated by a newline character. Please classify each mnemonic in the same order as they appear in the prompt.\n
+model: "gpt-4o-mini"
+temperature: 0.2
+num_outputs: 1
+batch_size: 50
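The commit message mentions adding a JSON schema as the OpenAI response format and handling responses that are too long or too short. A hedged sketch of that validation step; the `{"classifications": [...]}` shape and the function name are assumptions for illustration, not the repository's actual schema:

```python
import json

# Labels from the classification prompt above:
# unsure (-1), shallow (0), deep (1), mixed (2).
VALID_LABELS = {-1, 0, 1, 2}


def parse_classification_response(content: str, expected: int) -> list[int]:
    """Parse a structured model response and validate its length and labels.

    Assumes (hypothetically) the model was constrained via a JSON schema to
    return {"classifications": [<int>, ...]} with one label per mnemonic.
    Raises ValueError when the response is too long/short or contains
    out-of-range labels, mirroring the error handling this commit describes.
    """
    labels = json.loads(content)["classifications"]
    if len(labels) != expected:
        raise ValueError(f"Expected {expected} labels, got {len(labels)}")
    bad = [x for x in labels if x not in VALID_LABELS]
    if bad:
        raise ValueError(f"Invalid labels: {bad}")
    return labels
```

A caller could retry (e.g. with tenacity, which the project depends on) when `ValueError` is raised.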
@@ -0,0 +1,55 @@
+[project]
+name = "mnemonic-gen"
+version = "0.1.0"
+description = "Generate mnemonic sentences for English words"
+readme = "README.md"
+requires-python = ">=3.10"
+dependencies = [
+  "openai>=1.52.2",
+  "pandas>=2.2.3",
+  "pre-commit>=4.0.1",
+  "pyarrow>=17.0.0",
+  "python-dotenv>=1.0.1",
+  "pyyaml>=6.0.2",
+  "ruff",
+  "tenacity>=9.0.0",
+  "torch",
+  "tqdm>=4.66.5",
+  "transformers>=4.46.0",
+]
+
+[project.urls]
+repository = "https://github.com/chiffonng/capstone"
+
+[project.optional-dependencies]
+plot = ["matplotlib>=3.9.2"]
+
+[dependency-groups]
+dev = ["pytest>=8.3.3"]
+lint = ["ruff>=0.7.1"]
+data = ["pandas>=2.2.3", "pyarrow>=17.0.0"]
+
+[tool.ruff]
+src = ["src"]
+target-version = "py312"
+extend-exclude = ["*__init__.py", "*.pyi"]
+
+[tool.ruff.format]
+docstring-code-format = true # format code in docstrings
+docstring-code-line-length = 88
+
+[tool.ruff.lint]
+extend-select = [
+  "D", # pydocstyle, all functions and classes must have docstrings
+  "T", # mypy, type hints
+]
+extend-safe-fixes = ["D", "T"] # docstring, type hints
+extend-fixable = ["B"] # bugbear
+ignore = ["T201", "F401"] # print statements OK, unused imports OK
+pydocstyle.convention = "google"
+pycodestyle.max-doc-length = 88
+
+[tool.uv]
+python-downloads = "manual" # change to "automatic" to download Python specified in .python-version
+upgrade-package = ["ruff", "tqdm"]
+managed = true
@@ -1,69 +1,150 @@
accelerate==0.34.2
aiohappyeyeballs==2.4.2
aiohttp==3.10.8
aiosignal==1.3.1
appnope==0.1.4
asttokens==2.4.1
attrs==24.2.0
# This file was autogenerated by uv via the following command:
#    uv pip freeze | uv pip compile - -o requirements.txt
-e .
annotated-types==0.7.0
    # via pydantic
anyio==4.6.2.post1
    # via
    #   httpx
    #   openai
certifi==2024.8.30
charset-normalizer==3.3.2
comm==0.2.2
datasets==3.0.1
debugpy==1.8.6
decorator==5.1.1
dill==0.3.8
executing==2.1.0
    # via
    #   httpcore
    #   httpx
    #   requests
cfgv==3.4.0
    # via pre-commit
charset-normalizer==3.4.0
    # via requests
distlib==0.3.9
    # via virtualenv
distro==1.9.0
    # via openai
filelock==3.16.1
frozenlist==1.4.1
fsspec==2024.6.1
huggingface-hub==0.25.1
    # via
    #   huggingface-hub
    #   torch
    #   transformers
    #   virtualenv
fsspec==2024.10.0
    # via
    #   huggingface-hub
    #   torch
h11==0.14.0
    # via httpcore
httpcore==1.0.6
    # via httpx
httpx==0.27.2
    # via openai
huggingface-hub==0.26.1
    # via
    #   tokenizers
    #   transformers
identify==2.6.1
    # via pre-commit
idna==3.10
ipykernel==6.29.5
ipython==8.27.0
jedi==0.19.1
Jinja2==3.1.4
jupyter_client==8.6.3
jupyter_core==5.7.2
MarkupSafe==2.1.5
matplotlib-inline==0.1.7
    # via
    #   anyio
    #   httpx
    #   requests
iniconfig==2.0.0
    # via pytest
jinja2==3.1.4
    # via torch
jiter==0.6.1
    # via openai
markupsafe==3.0.2
    # via jinja2
mpmath==1.3.0
multidict==6.1.0
multiprocess==0.70.16
nest-asyncio==1.6.0
networkx==3.3
numpy==2.1.1
    # via sympy
networkx==3.4.2
    # via torch
nodeenv==1.9.1
    # via pre-commit
numpy==2.1.2
    # via
    #   pandas
    #   pyarrow
    #   transformers
openai==1.52.2
    # via mnemonic-gen
packaging==24.1
    # via
    #   huggingface-hub
    #   pytest
    #   transformers
pandas==2.2.3
parso==0.8.4
pexpect==4.9.0
    # via mnemonic-gen
platformdirs==4.3.6
prompt_toolkit==3.0.48
psutil==6.0.0
ptyprocess==0.7.0
pure_eval==0.2.3
    # via virtualenv
pluggy==1.5.0
    # via pytest
pre-commit==4.0.1
    # via mnemonic-gen
pyarrow==17.0.0
Pygments==2.18.0
    # via mnemonic-gen
pydantic==2.9.2
    # via openai
pydantic-core==2.23.4
    # via pydantic
pytest==8.3.3
python-dateutil==2.9.0.post0
    # via pandas
python-dotenv==1.0.1
    # via mnemonic-gen
pytz==2024.2
PyYAML==6.0.2
pyzmq==26.2.0
    # via pandas
pyyaml==6.0.2
    # via
    #   huggingface-hub
    #   mnemonic-gen
    #   pre-commit
    #   transformers
regex==2024.9.11
    # via transformers
requests==2.32.3
ruff==0.6.8
    # via
    #   huggingface-hub
    #   transformers
ruff==0.7.1
    # via mnemonic-gen
safetensors==0.4.5
setuptools==75.1.0
    # via transformers
setuptools==75.2.0
    # via torch
six==1.16.0
stack-data==0.6.3
sympy==1.13.3
tokenizers==0.20.0
torch==2.4.1
tornado==6.4.1
    # via python-dateutil
sniffio==1.3.1
    # via
    #   anyio
    #   httpx
    #   openai
sympy==1.13.1
    # via torch
tenacity==9.0.0
    # via mnemonic-gen
tokenizers==0.20.1
    # via transformers
torch==2.5.0
    # via mnemonic-gen
tqdm==4.66.5
traitlets==5.14.3
transformers==4.45.1
typing_extensions==4.12.2
    # via
    #   huggingface-hub
    #   mnemonic-gen
    #   openai
    #   transformers
transformers==4.46.0
    # via mnemonic-gen
typing-extensions==4.12.2
    # via
    #   huggingface-hub
    #   openai
    #   pydantic
    #   pydantic-core
    #   torch
tzdata==2024.2
    # via pandas
urllib3==2.2.3
wcwidth==0.2.13
xxhash==3.5.0
yarl==1.13.1
    # via requests
virtualenv==20.27.0
    # via pre-commit