
LLM Data Cleaner

LLM Data Cleaner automates the transformation of messy text columns into well-structured data using OpenAI models. It removes the need for complex regular expressions or manual parsing while ensuring the output conforms to a declared schema.

Why use it?

  • Less manual work – delegate repetitive cleaning tasks to a language model.
  • Consistent results – validate responses with Pydantic models.
  • Batch processing – send rows in chunks to respect API rate limits.

Installation

Requires Python 3.9+.

pip install git+https://github.com/codedthinking/llm_data_cleaner.git

Or with Poetry:

poetry add git+https://github.com/codedthinking/llm_data_cleaner.git

Step by step

  1. Create Pydantic models describing the cleaned values.
  2. Define a dictionary of instructions mapping column names to a prompt and schema.
  3. Instantiate DataCleaner with your OpenAI API key.
  4. Load your raw CSV file with pandas.
  5. Call clean_dataframe(df, instructions).
  6. Inspect the returned DataFrame, which contains new cleaned_* columns.
  7. Save or further process the cleaned data.

Example: inline models

from typing import Optional

import pandas as pd
from pydantic import BaseModel
from llm_data_cleaner import DataCleaner

# Each schema describes one cleaned row. Optional[...] keeps the example
# compatible with the stated Python 3.9 minimum (str | None requires 3.10+).
class AddressItem(BaseModel):
    index: int
    city: Optional[str]
    country: Optional[str]
    postal_code: Optional[str]

class TitleItem(BaseModel):
    index: int
    profession: Optional[str]

# Map each raw column to a prompt and the schema its cleaned values must follow.
instructions = {
    "address": {
        "prompt": "Extract city, country and postal code if present.",
        "schema": AddressItem,
    },
    "profession": {
        "prompt": "Normalize the profession to a standard job title.",
        "schema": TitleItem,
    },
}

cleaner = DataCleaner(api_key="YOUR_OPENAI_API_KEY")
raw_df = pd.DataFrame({
    "address": ["Budapest Váci út 1", "1200 Vienna Mariahilfer Straße 10"],
    "profession": ["dev", "data eng"]
})
cleaned = cleaner.clean_dataframe(raw_df, instructions)
print(cleaned)
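
The returned frame is a regular pandas DataFrame, so step 7 (saving or further processing) is plain pandas; the file name below is only an example.

# Persist the cleaned data, including the new cleaned_* columns.
cleaned.to_csv("cleaned_data.csv", index=False)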

Example: loading YAML instructions

from llm_data_cleaner import DataCleaner, load_yaml_instructions
import pandas as pd

# Load the column-to-prompt/schema mapping from a YAML file.
instructions = load_yaml_instructions("instructions.yaml")
# The {column_prompt} placeholder stands in for each column's prompt from the instructions.
cleaner = DataCleaner(api_key="YOUR_OPENAI_API_KEY", system_prompt="{column_prompt}")
raw_df = pd.read_csv("data.csv")
result = cleaner.clean_dataframe(raw_df, instructions)

load_yaml_instructions reads the same structure shown above from a YAML file, so cleaning rules can be shared and updated without modifying code.
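
A minimal sketch of what instructions.yaml could look like, assuming the prompts mirror the Python dictionary above and that schemas are declared as field names and types directly in the YAML; the exact schema syntax expected by load_yaml_instructions is an assumption here, so check the repository for the authoritative format.

# Hypothetical instructions.yaml -- the schema notation below is an assumption.
address:
  prompt: "Extract city, country and postal code if present."
  schema:
    index: int
    city: str
    country: str
    postal_code: str
profession:
  prompt: "Normalize the profession to a standard job title."
  schema:
    index: int
    profession: str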

Authors

  • Miklós Koren
  • Gergely Attila Kiss

Preferred citation

If you use LLM Data Cleaner in your research, please cite the project as specified in CITATION.cff.

License

MIT
