Classify any text dataset with one config file. Use OpenAI, Anthropic, or a free local model — and fall back to unsupervised clustering when you have no labels at all.
Point classifai at a CSV, define your categories (or skip them for clustering), and get back a labeled dataset plus an HTML report.

    python main.py --config config.example.yaml

It supports two modes that can run in the same pass:

| Mode | When to use |
|---|---|
| AI classification: OpenAI, Anthropic, or Ollama (free, local) | You know your categories |
| Clustering: HDBSCAN, K-Means, UMAP | You want to discover patterns |

    git clone https://github.com/JuanLara18/classifai.git
    cd classifai
    pip install -r requirements.txt

For LLM classification, add your API key:

    echo "OPENAI_API_KEY=sk-..." > .env

To run completely free and offline, use Ollama instead: no key is needed, just set `provider: ollama` in the config.

Example config that runs both modes in one job:

    # config.yaml
    input_file: "data/support_tickets.csv"
    output_file: "data/classified.csv"
    text_columns: [subject, body]
    clustering_perspectives:
      # Label tickets by department (LLM)
      department:
        type: "openai_classification"
        columns: [subject, body]
        target_categories: [Billing, Technical Support, Account, Other]
        output_column: "routed_to"
        llm_config:
          model: "gpt-4o-mini"
          api_key_env: "OPENAI_API_KEY"
      # Discover unknown patterns (no labels needed)
      topics:
        type: "clustering"
        algorithm: "hdbscan"
        columns: [body]
        output_column: "topic_cluster"

Run it:

    python main.py --config config.yaml

Output: `data/classified.csv` with two new columns (`routed_to`, `topic_cluster`) and an HTML report in `output/`.

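Once a run finishes, a quick sanity check is to count how the rows were distributed across labels. The sketch below uses only the standard library; the sample rows are made up for illustration and stand in for `data/classified.csv`, and `column_counts` is a hypothetical helper, not part of classifai.

```python
import csv
import io
from collections import Counter


def column_counts(csv_file, column):
    """Count how often each value appears in one column of an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


# Made-up rows standing in for data/classified.csv (not real output).
sample = io.StringIO(
    "subject,body,routed_to,topic_cluster\n"
    "Refund,I was charged twice,Billing,0\n"
    "Login,Cannot reset my password,Account,1\n"
    "Invoice,Wrong amount on invoice,Billing,0\n"
)

counts = column_counts(sample, "routed_to")
for label, n in counts.most_common():
    print(f"{label}: {n}")
```

To run it against a real output file, open the file with `open("data/classified.csv", newline="")` and pass the file object in place of `sample`.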
- Guaranteed valid labels: uses instructor + Pydantic, so the LLM always returns one of your categories. No regex, no parsing errors.
- Unique-value optimization: classifies each distinct text only once, then maps results back. Reduces API calls by up to 90% on real datasets.
- Multi-provider: OpenAI, Anthropic, or Ollama (local, free). Same config, different `provider:` line.
- Dual mode: run AI classification and clustering in the same job and compare results.
- Cost control: set `max_cost_per_run` to hard-stop before overspending.
- Resumable: checkpoints let you continue interrupted runs without re-classifying.

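The unique-value optimization can be illustrated with a few lines of plain Python. This is a minimal sketch of the general idea, not classifai's actual implementation: `classify_unique` and `toy_classifier` are hypothetical names, and the toy classifier stands in for a real LLM call.

```python
def classify_unique(texts, classify):
    """Classify each distinct text once, then map the labels back to every row."""
    unique_texts = list(dict.fromkeys(texts))  # de-duplicate, keep first-seen order
    label_for = {text: classify(text) for text in unique_texts}
    return [label_for[text] for text in texts]


calls = []  # records every classifier invocation so the saving is visible


def toy_classifier(text):
    """Hypothetical stand-in for an LLM call."""
    calls.append(text)
    return "Billing" if "invoice" in text.lower() else "Other"


rows = [
    "Invoice is wrong",
    "App crashes",
    "Invoice is wrong",
    "App crashes",
    "Invoice is wrong",
]
labels = classify_unique(rows, toy_classifier)
print(labels)
print(f"{len(calls)} classifier calls for {len(rows)} rows")
```

The saving depends entirely on how repetitive the data is: a column of free-form text with no duplicates gains nothing from this step.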
quickstart.ipynb: classify, cluster, and visualize in one notebook.
gpt-4o-mini with unique-value optimization:
| Rows | Unique texts | Estimated cost |
|---|---|---|
| 10,000 | ~2,000 | ~$0.05 |
| 100,000 | ~15,000 | ~$0.40 |
| 1,000,000 | ~80,000 | ~$2.00 |
Set provider: ollama to pay nothing.
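The table implies a roughly constant rate per unique text (about $0.05 for about 2,000 texts), which makes a back-of-the-envelope estimate easy before a run. The sketch below is only that arithmetic: the rate is derived from the table rather than from published pricing, and `estimate_cost` and `within_budget` are hypothetical helpers, not how classifai enforces `max_cost_per_run`.

```python
# Rough rate implied by the table: ~$0.05 for ~2,000 unique texts.
# Illustrative assumption only; real cost depends on text length and model pricing.
COST_PER_UNIQUE_TEXT = 0.05 / 2_000


def estimate_cost(unique_texts, cost_per_text=COST_PER_UNIQUE_TEXT):
    """Estimate the cost in dollars of classifying this many distinct texts."""
    return unique_texts * cost_per_text


def within_budget(unique_texts, max_cost_per_run):
    """Return True if the estimate does not exceed the budget."""
    return estimate_cost(unique_texts) <= max_cost_per_run


for n in (2_000, 15_000, 80_000):
    print(f"{n:>7,} unique texts -> ~${estimate_cost(n):.3f}")

print(within_budget(15_000, max_cost_per_run=0.50))
```

Counting unique values in the text columns first, then feeding that count into an estimate like this, gives a sensible starting value for `max_cost_per_run`.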
Supported file formats: CSV, Stata (.dta), and Excel (.xlsx).
classifai was built during a research project at Harvard Business School to classify BMW manufacturing maintenance records — thousands of work order descriptions in English and German, across taxonomies from 2 to 20 categories. The underlying data is confidential, but the methodology is exactly what's in this repo.
See CONTRIBUTING.md. Bug reports and new backends are especially welcome.
MIT — see LICENSE.