Minor changes
chiffonng committed Oct 10, 2024
1 parent b78532c commit cd700b0
Showing 2 changed files with 11 additions and 5 deletions.
README.md: 11 changes (9 additions, 2 deletions)
@@ -1,8 +1,15 @@
 # Keyword Mnemonic Generation for English Words
 
-Vocabulary acquisition poses a significant challenge for language learners, particularly at medium and advanced levels, where the complexity and volume of new words can hinder retention. One promising solution is keyword mnemonics, which leverage associations between new vocabulary and memorable cues to enhance recall. Previous efforts to automate generating these mnemonics often lack diversity and structure in resulting mnemonics.
+Vocabulary acquisition poses a significant challenge for language learners, particularly at medium and advanced levels, where the complexity and volume of new words can hinder retention. One promising solution is mnemonics, which leverage associations between new vocabulary and memorable cues to enhance recall. Previous efforts to automate mnemonic generation have focused primarily on _shallow-encoding mnemonics_ (those based on the spelling or phonological features of a word) and are limited in their ability to produce diverse, contextually relevant mnemonics.
 
-This project explores an alternative approach by fine-tuning the LLaMA 3 (8B) language model using instruction tuning on a manually curated dataset of over 1,000 examples. Unlike prior methods that primarily focus on syllabic and phonetic mnemonics, this dataset is more representative of mnemonic types, including more etymological mnemonics, which research shows can deepen understanding and retention by linking new vocabulary to their roots and origins. The fine-tuned model will generate diverse, contextually relevant and coherent mnemonics.
+This project explores an alternative approach: instruction tuning the LLaMA 3 (8B) language model on a manually curated dataset of over 1,000 examples. Unlike the datasets used in prior work, this one includes more _deep-encoding mnemonics_, such as those based on morphology and etymology, or on associations with synonyms, antonyms, and related words and concepts.
+
+The fine-tuned model will generate diverse, contextually relevant, and coherent mnemonics.
+
+# Project goals
+
+- [ ] Research: Compare performance between tuned and untuned models.
+- [ ] Gradio: Create a web interface for the model (a skeleton is sketched below).
+
 # Setup

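The instruction-tuning setup the new README text describes is not part of this commit; as a rough illustration, a minimal sketch with Hugging Face TRL might look like the following. The data file, field names, prompt format, and hyperparameters are all assumptions, not taken from the repository.

```python
# Illustrative supervised fine-tuning sketch, assuming the `datasets` and `trl`
# packages. Nothing here is taken from the repository's actual training code.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical JSONL file with one {"term": ..., "mnemonic": ...} record per line.
dataset = load_dataset("json", data_files="data/mnemonics.jsonl", split="train")

def to_text(example):
    # Flatten each record into the single "text" field SFTTrainer reads by default.
    return {
        "text": (
            "Instruction: Generate a mnemonic for the word below.\n"
            f"Word: {example['term']}\n"
            f"Mnemonic: {example['mnemonic']}"
        )
    }

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3-8B",  # base model named in the README
    train_dataset=dataset.map(to_text),
    args=SFTConfig(output_dir="outputs"),
)
trainer.train()
```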
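For the Gradio goal, a starting skeleton might look like this; the function name, labels, and stubbed model call are placeholders rather than anything committed here.

```python
# Skeleton for the planned web interface, using the public `gradio` API.
import gradio as gr

def generate_mnemonic(term: str) -> str:
    # Placeholder: a real app would call the fine-tuned LLaMA 3 model here.
    return f"(mnemonic for '{term}' would appear here)"

demo = gr.Interface(
    fn=generate_mnemonic,
    inputs=gr.Textbox(label="English word"),
    outputs=gr.Textbox(label="Generated mnemonic"),
    title="Keyword Mnemonic Generator",
)

if __name__ == "__main__":
    demo.launch()
```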
src/data_pipeline/data_processing.py: 5 changes (2 additions, 3 deletions)
@@ -74,15 +74,14 @@ def load_clean_txt_csv_data(path: Path | str) -> pd.DataFrame:

     logger.info(f"Read {df.shape[0]} rows from txt/csv files.")
 
-    # Column names
-    assert df["term"].str.islower().all(), "All terms should be lower case."
-
     # Drop empty mnemonics
     df.dropna(subset=["mnemonic"], inplace=True)
     logger.info(f"From txt/csv files, kept {df.shape[0]} rows with mnemonics.")
 
     # Remove leading and trailing double quotes from mnemonics
     df["mnemonic"] = df["mnemonic"].str.strip('"')
+
+    assert df["term"].str.islower().all(), "All terms should be lower case."
     return df


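To see the cleaning logic this diff reorders in isolation, here is a self-contained sketch; the `clean_mnemonics` wrapper and the sample rows are illustrative, and only the three pandas operations come from the function above.

```python
import pandas as pd

def clean_mnemonics(df: pd.DataFrame) -> pd.DataFrame:
    # Drop rows with empty mnemonics.
    df = df.dropna(subset=["mnemonic"]).copy()
    # Remove leading and trailing double quotes from mnemonics.
    df["mnemonic"] = df["mnemonic"].str.strip('"')
    # As in the commit, validate terms after the transforms rather than before.
    assert df["term"].str.islower().all(), "All terms should be lower case."
    return df

sample = pd.DataFrame(
    {
        "term": ["aberration", "ebullient"],
        "mnemonic": ['"ab- (away) + errare (to wander): a wandering from the normal"', None],
    }
)
print(clean_mnemonics(sample))  # one cleaned row; the row with no mnemonic is dropped
```

One effect of moving the assertion below the transforms is that the lower-case check now runs only on rows that survive cleaning, instead of failing on rows that would have been dropped anyway.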
