Token Merging for Training (TM4T)

A Python tool that shortens and optimizes Danbooru tags in image captions to save tokens during AI model training. It merges, filters, and consolidates tags while preserving their meaning, which reduces token usage and can improve training efficiency.

Features

Smart Tag Merging - Combines related tags (e.g., "short hair, black hair" → "short black hair")
Hierarchical Filtering - Removes generic tags when a more specific version exists (e.g., "breasts" is dropped when "large breasts" is present)
Blacklist Support - Filters out unwanted or irrelevant tags such as "commentary request"
Animal-Specific Processing - Handles animal character tags with specialized logic (e.g., "animal ears" is dropped when "dog ears" is present)
Color Optimization - Handles multicolored attributes (e.g., "black hair" and "white hair" are dropped when "multicolored hair" is present)
Batch Processing - Processes multiple text files at once
Token Estimation - Reports approximate token savings
YAML Configuration - Easily customizable rules and settings
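
To make the merging idea concrete, here is a minimal sketch. It is not the tool's actual code: the helper name `merge_by_noun` is hypothetical, it only handles two-word tags, it treats the last word as the noun, and it keeps adjectives in input order (the real tool appears to order them differently, as the "long white hair" line in the example output shows).

```python
def merge_by_noun(tags):
    """Merge two-word tags that share the same last word (hypothetical helper)."""
    groups = {}       # noun -> adjectives, in first-seen order
    passthrough = []  # tags that are not "<adjective> <noun>" pairs
    for tag in tags:
        words = tag.split()
        if len(words) == 2:
            adjective, noun = words
            groups.setdefault(noun, []).append(adjective)
        else:
            passthrough.append(tag)
    # Join every adjective collected for a noun into a single tag.
    merged = [" ".join(adjs + [noun]) for noun, adjs in groups.items()]
    return passthrough + merged


print(merge_by_noun(["1girl", "short hair", "black hair", "smile"]))
# ['1girl', 'smile', 'short black hair']
```

Three tags' worth of hair description collapse into one, which is where the token savings come from.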

Prerequisites

pyyaml

Installation

Clone the repository:

git clone https://github.com/seedmanc/token-merging-4-training.git
cd token-merging-4-training

Install dependencies:
```
pip install -r requirements.txt
```

Usage

Basic Usage

Process all text files in a directory:

python main.py "path/to/your/txt/captions"
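
For readers curious what batch processing with a dry run might look like, here is a small sketch. It assumes each caption file holds comma-separated tags; the function name `process_directory` and its `transform` parameter are hypothetical, and `dry` only mirrors the idea of the `--dry` flag rather than its real implementation.

```python
from pathlib import Path


def process_directory(captions_path, transform, dry=False):
    """Apply transform to the tags of every .txt file; return how many would change."""
    changed = 0
    for path in sorted(Path(captions_path).glob("*.txt")):
        original = path.read_text(encoding="utf-8")
        # Assumption: tags are separated by commas, with optional whitespace.
        tags = [t.strip() for t in original.split(",") if t.strip()]
        updated = ", ".join(transform(tags))
        if updated != original.strip():
            changed += 1
            if not dry:  # a dry run reports changes but leaves files alone
                path.write_text(updated, encoding="utf-8")
    return changed
```

Calling it with `dry=True` first gives a count of affected files without touching anything on disk.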

Also see the help (`python main.py --help`):

positional arguments:
  captions_path         Required.

optional arguments:
  -h, --help            show this help message and exit
  --dry                 Don't change the files
  --author AUTHOR       Replace author tag with --class-tokens + " style"
  --class-tokens CLASS_TOKENS
                        Replace --author tag with this. Defaults to --author w/o spaces or (...). Use --class-tokens= to remove author entirely.
  --brief               Reduce console spam
  --verbose
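
The interaction between `--author` and `--class-tokens` can be sketched as follows. This is one reading of the help text, not the tool's code: it assumes "w/o spaces or (...)" means that parenthesized parts and spaces are stripped from the author tag, and both function names are hypothetical.

```python
import re


def default_class_token(author):
    """Derive a class token from an author tag by dropping '(...)' parts and spaces."""
    without_parens = re.sub(r"\([^)]*\)", "", author)
    return without_parens.replace(" ", "")


def style_tag(author, class_tokens=None):
    """Return the tag that replaces the author tag, or None to remove it."""
    if class_tokens is None:   # --class-tokens not given: fall back to the default
        class_tokens = default_class_token(author)
    if not class_tokens:       # --class-tokens= (empty): remove the author entirely
        return None
    return f"{class_tokens} style"


print(style_tag("poporu (hukuroneko)"))            # poporu style
print(style_tag("poporu (hukuroneko)", "hukuro"))  # hukuro style
```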

Edit the YAML dictionaries as you see fit. In replace.yaml, a key is replaced by one of the values listed under it, but only if that value is found among the tags. For example, "adjusting eyewear, glasses" becomes "adjusting glasses, glasses", with the duplicate removed later in processing.

Expand the animals and colors dictionaries to ensure special processing of those categories. This mainly avoids entries such as "animal dog ears", drops animal features when the matching animal girl is already tagged, and removes individual colors when a multicolored tag is present.

Example console output:

python main.py C:\Users\USERNAME\Downloads\hukuro --author="poporu (hukuroneko)" --class-tokens=hukuro
FILE: __yak_kemono_friends_and_1_more_drawn_by_poporu_hukuroneko__da95e66e2af395a6c9c35f2eb732626f.txt
- commentary request
poporu (hukuroneko)  =>  hukuro style  b/c  --class-tokens
bow  =>  bowtie
brown bow  =>  brown bowtie
- ribbon  b/c  brown ribbon
- shirt  b/c  yellow shirt
- bowtie  b/c  brown bowtie
- horns  b/c  black horns
- breasts  b/c  large breasts
- animal ears  b/c  cow ears
- cow ears,cow horns  b/c  cow girl
- black horns,grey horns  b/c  multicolored
 - white hair
 - long hair
+ long white hair
- kemono friends 3,kemono friends  b/c  yak (kemono friends)
Saved ~43 tokens or 41%
['1girl', 'blush', 'brown bowtie', 'brown eyes', 'brown ribbon', 'cow girl', 'dress', 'extra ears', 'gloves', 'hair over one eye', 'highres', 'hukuro style', 'large breasts', 'long white hair', 'multicolored horns', 'short sleeves', 'smile', 'solo', 'twintails', 'yak (kemono friends)', 'yellow shirt']

Processing Pipeline

The tool applies transformations in this order:

Filtering - Remove blacklisted tags and clip series information
Hierarchy - Remove redundant generic tags when specific ones exist
Animal and color processing - Handle specific tag logic
Merging - Combine tags with the same noun but different adjectives
Artist conversion - Turn author tags into ready-to-use "style" class tokens for style LoRA training, or add one if none is present.
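
The ordered stages above can be thought of as a chain of functions that each take and return a tag list. The sketch below illustrates that shape with simplified stand-ins for two stages; the names `run_pipeline`, `drop_blacklisted`, and `drop_generic` are hypothetical and are not taken from the repository.

```python
def run_pipeline(tags, steps):
    """Apply each step (a function from tag list to tag list) in order."""
    for step in steps:
        tags = step(tags)
    return tags


# Simplified stand-ins for the Filtering and Hierarchy stages, for illustration only.
def drop_blacklisted(tags, blacklist=frozenset({"commentary request"})):
    return [t for t in tags if t not in blacklist]


def drop_generic(tags):
    # Drop a tag when another tag ends with it as a separate word,
    # e.g. "shirt" is redundant next to "yellow shirt".
    return [t for t in tags
            if not any(o != t and o.endswith(" " + t) for o in tags)]


tags = ["commentary request", "shirt", "yellow shirt", "smile"]
print(run_pipeline(tags, [drop_blacklisted, drop_generic]))
# ['yellow shirt', 'smile']
```

Keeping each stage a plain function makes the order easy to change and each rule easy to test on its own.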

Configuration

The tool uses YAML configuration files in the config/ directory:

`config/blacklist.yaml`

- virtual youtuber
- looking at viewer
- multiple girls
- commentary request
# ... more blacklisted tags

These tags are removed unconditionally.

`config/colors.yaml`

- red
- blue
- green 
# ... more colors

Removes color tags when a multicolored or two-tone tag for the same subject is present.
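
A minimal sketch of this rule, assuming tags have the form "<color> <subject>". The `COLORS` set is a stand-in for the contents of config/colors.yaml, and the function name is hypothetical.

```python
COLORS = {"red", "blue", "green", "black", "white", "grey"}  # stand-in for colors.yaml
MULTI = ("multicolored", "two-tone")


def drop_colors_if_multicolored(tags):
    """Drop '<color> <subject>' when a multicolored or two-tone '<subject>' tag exists."""
    # Subjects that already carry a multicolored/two-tone tag, e.g. "horns".
    subjects = {t.split(" ", 1)[1] for t in tags
                if " " in t and t.split(" ", 1)[0] in MULTI}
    kept = []
    for tag in tags:
        color, _, subject = tag.partition(" ")
        if color in COLORS and subject in subjects:
            continue  # redundant: the multicolored tag already covers this subject
        kept.append(tag)
    return kept


print(drop_colors_if_multicolored(
    ["black horns", "grey horns", "multicolored horns", "brown eyes"]))
# ['multicolored horns', 'brown eyes']
```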

`config/animals.yaml`

- cat
- dog
- wolf
- tiger
# ... more animals

Removes the generic "animal" part tag (e.g., "animal ears") when a specific animal's part is present (e.g., "cow ears"). Removes specific animal parts when the matching animal girl tag is present (e.g., "cow girl").
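
The two animal rules can be sketched like this. The `ANIMALS` set stands in for config/animals.yaml, the function name is hypothetical, and the sketch treats every "<animal> <something>" tag as a body part, which the real tool may not do.

```python
ANIMALS = {"cat", "dog", "wolf", "tiger", "cow"}  # stand-in for animals.yaml


def simplify_animal_tags(tags):
    """Apply both animal rules to a list of tags (illustrative only)."""
    # Animals for which an "<animal> girl" tag is present.
    girls = {a for a in ANIMALS if f"{a} girl" in tags}
    # Parts that appear with a specific animal, e.g. "ears" from "cow ears".
    specific_parts = set()
    for tag in tags:
        first, _, rest = tag.partition(" ")
        if first in ANIMALS and rest and rest != "girl":
            specific_parts.add(rest)
    kept = []
    for tag in tags:
        first, _, rest = tag.partition(" ")
        if first == "animal" and rest in specific_parts:
            continue  # "animal ears" is implied by "cow ears"
        if first in girls and rest and rest != "girl":
            continue  # "cow ears" is implied by "cow girl"
        kept.append(tag)
    return kept


print(simplify_animal_tags(
    ["animal ears", "cow ears", "cow horns", "cow girl", "smile"]))
# ['cow girl', 'smile']
```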

`config/replace.yaml`

eyewear:
   - glasses
   - goggles
   - sunglasses
   - monocle
# ... more generic tags paired with a list of concrete ones

Replaces the generic tag with one of the concrete ones when that concrete tag is also present. If the value is a string instead of a list, as in "one eye closed: wink", the replacement is unconditional.
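
Here is a rough sketch of how such rules could be applied. The `REPLACE` dict mirrors the structure of config/replace.yaml after loading, the function name is hypothetical, and choosing the first concrete tag found is an assumption about how "one of the concrete ones" is picked.

```python
import re

REPLACE = {  # stand-in for the loaded contents of replace.yaml
    "eyewear": ["glasses", "goggles", "sunglasses", "monocle"],
    "one eye closed": "wink",
}


def apply_replacements(tags, rules=REPLACE):
    """Rewrite generic words in tags; list values are conditional, strings are not."""
    result = list(tags)
    for generic, concrete in rules.items():
        if isinstance(concrete, str):
            target = concrete  # unconditional replacement
        else:
            present = [c for c in concrete if c in tags]
            if not present:
                continue       # no concrete tag found, leave the generic one
            target = present[0]
        # Match the generic term as whole words only.
        pattern = re.compile(rf"\b{re.escape(generic)}\b")
        result = [pattern.sub(target, t) for t in result]
    # Drop exact duplicates created by the rewrite while keeping order.
    return list(dict.fromkeys(result))


print(apply_replacements(["adjusting eyewear", "glasses", "one eye closed"]))
# ['adjusting glasses', 'glasses', 'wink']
```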

Contact/Support

Issues: GitHub Issues

For questions about usage or contributions, please open an issue on GitHub.

Acknowledgments

Inspired by the need to optimize token usage in AI LoRA training
Built for the *booru tagging community
