Minor changes
chiffonng committed Oct 10, 2024
1 parent b78532c commit cd700b0
Showing 2 changed files with 11 additions and 5 deletions.
README.md: 11 changes (9 additions, 2 deletions)
@@ -1,8 +1,15 @@
 # Keyword Mnemonic Generation for English Words
 
-Vocabulary acquisition poses a significant challenge for language learners, particularly at medium and advanced levels, where the complexity and volume of new words can hinder retention. One promising solution is keyword mnemonics, which leverage associations between new vocabulary and memorable cues to enhance recall. Previous efforts to automate generating these mnemonics often lack diversity and structure in resulting mnemonics.
+Vocabulary acquisition poses a significant challenge for language learners, particularly at medium and advanced levels, where the complexity and volume of new words can hinder retention. One promising solution is mnemonics, which leverage associations between new vocabulary and memorable cues to enhance recall. Previous efforts to automate mnemonic generation have focused primarily on _shallow-encoding mnemonics_ (those based on the spelling or phonological features of a word) and are limited in their ability to produce diverse, contextually relevant mnemonics.
 
-This project explores an alternative approach by fine-tuning the LLaMA 3 (8B) language model using instruction tuning on a manually curated dataset of over 1,000 examples. Unlike prior methods that primarily focus on syllabic and phonetic mnemonics, this dataset is more representative of mnemonic types, including more etymological mnemonics, which research shows can deepen understanding and retention by linking new vocabulary to their roots and origins. The fine-tuned model will generate diverse, contextually relevant and coherent mnemonics.
+This project explores an alternative approach: instruction tuning the LLaMA 3 (8B) language model on a manually curated dataset of over 1,000 examples. Unlike the datasets used in prior work, this one includes more _deep-encoding mnemonics_, such as those based on morphology and etymology, or on associations with synonyms, antonyms, and related words and concepts.
+
+The fine-tuned model will generate diverse, contextually relevant, and coherent mnemonics.
+
+# Project goals
+
+- [ ] Research: Compare performance between tuned and untuned models.
+- [ ] Gradio: Create a web interface for the model (a skeleton is sketched below).
+
 # Setup

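The instruction-tuning setup the new README text describes is not part of this commit; as a rough illustration, a minimal sketch with Hugging Face TRL might look like the following. The data file, field names, prompt format, and hyperparameters are all assumptions, not taken from the repository.

```python
# Illustrative supervised fine-tuning sketch, assuming the `datasets` and `trl`
# packages. Nothing here is taken from the repository's actual training code.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical JSONL file with one {"term": ..., "mnemonic": ...} record per line.
dataset = load_dataset("json", data_files="data/mnemonics.jsonl", split="train")

def to_text(example):
    # Flatten each record into the single "text" field SFTTrainer reads by default.
    return {
        "text": (
            "Instruction: Generate a mnemonic for the word below.\n"
            f"Word: {example['term']}\n"
            f"Mnemonic: {example['mnemonic']}"
        )
    }

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3-8B",  # base model named in the README
    train_dataset=dataset.map(to_text),
    args=SFTConfig(output_dir="outputs"),
)
trainer.train()
```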
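For the Gradio goal, a starting skeleton might look like this; the function name, labels, and stubbed model call are placeholders rather than anything committed here.

```python
# Skeleton for the planned web interface, using the public `gradio` API.
import gradio as gr

def generate_mnemonic(term: str) -> str:
    # Placeholder: a real app would call the fine-tuned LLaMA 3 model here.
    return f"(mnemonic for '{term}' would appear here)"

demo = gr.Interface(
    fn=generate_mnemonic,
    inputs=gr.Textbox(label="English word"),
    outputs=gr.Textbox(label="Generated mnemonic"),
    title="Keyword Mnemonic Generator",
)

if __name__ == "__main__":
    demo.launch()
```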
src/data_pipeline/data_processing.py: 5 changes (2 additions, 3 deletions)
@@ -74,15 +74,14 @@ def load_clean_txt_csv_data(path: Path | str) -> pd.DataFrame:

     logger.info(f"Read {df.shape[0]} rows from txt/csv files.")
 
-    # Column names
-    assert df["term"].str.islower().all(), "All terms should be lower case."
-
     # Drop empty mnemonics
     df.dropna(subset=["mnemonic"], inplace=True)
     logger.info(f"From txt/csv files, kept {df.shape[0]} rows with mnemonics.")
 
     # Remove leading and trailing double quotes from mnemonics
     df["mnemonic"] = df["mnemonic"].str.strip('"')
+
+    assert df["term"].str.islower().all(), "All terms should be lower case."
     return df


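To see the cleaning logic this diff reorders in isolation, here is a self-contained sketch; the `clean_mnemonics` wrapper and the sample rows are illustrative, and only the three pandas operations come from the function above.

```python
import pandas as pd

def clean_mnemonics(df: pd.DataFrame) -> pd.DataFrame:
    # Drop rows with empty mnemonics.
    df = df.dropna(subset=["mnemonic"]).copy()
    # Remove leading and trailing double quotes from mnemonics.
    df["mnemonic"] = df["mnemonic"].str.strip('"')
    # As in the commit, validate terms after the transforms rather than before.
    assert df["term"].str.islower().all(), "All terms should be lower case."
    return df

sample = pd.DataFrame(
    {
        "term": ["aberration", "ebullient"],
        "mnemonic": ['"ab- (away) + errare (to wander): a wandering from the normal"', None],
    }
)
print(clean_mnemonics(sample))  # one cleaned row; the row with no mnemonic is dropped
```

One effect of moving the assertion below the transforms is that the lower-case check now runs only on rows that survive cleaning, instead of failing on rows that would have been dropped anyway.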
