stylevec

Note 1: This README reflects the canonical state of the project. READMEs on other branches are not maintained and should be disregarded; refer to this one regardless of which branch you are on.

Note 2: This is the development repository. The files here are unorganized and not currently in a usable state. To follow the latest progress, monitor the research branches (the main branch is out of date).

Research project exploring style (not content) embeddings for short texts — can a speaker's stylistic fingerprint be extracted as a separable vector, independent of topic?

Disclaimer: Datasets used in this repository may be subject to copyrights held by other individuals or organizations. This project is intended solely for academic research purposes.

Setup

uv sync
uv run python download_base_models.py          # downloads Qwen3-0.6B into artifacts/base-models
uv run python download_base_models.py --modelscope  # use ModelScope mirror (China)

Build the Rust similarity utility when needed:

pip install maturin
cd tools/simlar && maturin develop

Repository Structure

| Directory | Description |
| --- | --- |
| `genshin/` | Data import, cleaning, paraphrase generation, content masking |
| `naive/` | Baseline approaches: residual vectors, prompt-residual, LDA, MLP+ArcFace |
| `hidden/` | Causal LM hidden layer probe experiments (Qwen3-0.6B) |
| `lora/` | LoRA fine-tuning approach (active research) |
| `paper_replication/` | StyleDistance (Patel et al., 2024) replication: roberta-base + LoRA + Triplet Loss |
| `shared/` | Shared training infrastructure: config, data loader, classifiers, evaluation |
| `tools/simlar/` | Rust/PyO3 utility for batched char n-gram Jaccard + normalized Levenshtein similarity |
| `docs/verifier/` | Algorithm notes and verification |
| `data/` | Local datasets and corpora |
| `artifacts/` | Pre-computed embeddings, checkpoints, plots, logs, and local model weights |

Data Pipeline

Genshin Parallel Corpus

283,972 raw dialogue records from 4,332 speakers, cleaned to ~258,868 valid entries. Each utterance is rewritten by an LLM (via OpenRouter API) into semantically equivalent, stylistically neutral standard Chinese, producing 262,196 pairs. The pipeline supports checkpoint resumption, concurrent requests, exponential-backoff rate limiting, and automatic quality validation (speaker consistency, line count parity, n-gram + Levenshtein similarity scoring).
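The similarity scoring used for quality validation can be illustrated with a minimal pure-Python sketch. This is not the project's Rust implementation in tools/simlar/; the function names, the bigram default, and normalizing the edit distance by the longer string's length are all assumptions made for illustration.

```python
def char_ngrams(text: str, n: int = 2) -> set[str]:
    """Set of character n-grams; a string shorter than n yields itself."""
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity between the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_levenshtein_similarity(a: str, b: str) -> float:
    """1 - distance / max length (assumed normalization), in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

A paraphrase that scores very high on both measures is probably an unmodified copy, while a very low score may indicate drift in meaning. Any concrete accept/reject thresholds would have to come from the pipeline itself.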

Content Masking System

To isolate style from content independently of the paraphrase approach, 5,021 Chinese words from the Genshin game lexicon (6,050 terms, 82.9% coverage) are mapped to 12 semantic category masks (person, location, quest, enemy, item, food, loot, weapon, artifact, animal, domain, group). This substitutes domain-specific content words while preserving syntactic structure and stylistic markers.
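As an illustration of the masking step, the sketch below replaces lexicon terms with category masks, matching longer terms first. The four lexicon entries, the `[CATEGORY]` mask format, and the `build_masker` helper are hypothetical; the real lexicon and mask tokens live in the genshin/ code.

```python
import re

# Hypothetical lexicon sample: surface form -> category mask.
LEXICON = {
    "派蒙": "[PERSON]",        # Paimon
    "蒙德": "[LOCATION]",      # Mondstadt
    "甜甜花酿鸡": "[FOOD]",    # Sweet Madame
    "甜甜花": "[ITEM]",        # Sweet Flower
}


def build_masker(lexicon: dict[str, str]):
    """Return a function that substitutes every lexicon term with its mask."""
    # Longest terms first, so "甜甜花酿鸡" is not split into "甜甜花" + "酿鸡".
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in terms))

    def mask(text: str) -> str:
        return pattern.sub(lambda m: lexicon[m.group(0)], text)

    return mask


mask = build_masker(LEXICON)
masked = mask("派蒙想在蒙德吃甜甜花酿鸡")
```

Function words and sentence-final particles are left untouched, which is the point: only the domain-specific content words are swapped out.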

Multilingual Text Pipeline (data/datasets/msynthstel/pipeline/datadelta)

A 75-language text quality pipeline used by the msynthstel data program: document collection (CulturaY/SkyPile, 20k docs for major languages, 1k for others) → sentence segmentation (Intl.Segmenter) → length filtering (15–512 tokens) → heuristic filtering (terminal punctuation, special-char ratio, repeated-word ratio, URL residuals) → MinHash near-dedup (5-gram character shingles, Jaccard threshold 0.75) → quality scoring (Qwen3-0.6B perplexity / LiteLLM API, pending). Currently ~2.3M deduplicated sentences across the top 6 languages (zh/en/ru/ja/fr/de); quality scoring not yet run.
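The MinHash near-dedup stage can be sketched in pure Python as follows. The shingle size (5) and threshold (0.75) come from the description above. The number of hash functions, the use of salted BLAKE2b as the hash family, the "keep the first occurrence" policy, and the all-pairs comparison are assumptions; a pipeline at this scale would more likely use LSH banding than compare every pair.

```python
import hashlib


def char_shingles(text: str, k: int = 5) -> set[str]:
    """Character k-grams; a text shorter than k yields a single shingle."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def _hash(seed: int, shingle: str) -> int:
    """One member of a hash family, selected by `seed` via the BLAKE2b salt."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, num_perm: int = 64, k: int = 5) -> list[int]:
    """Minimum hash value of the shingle set under each hash function."""
    shingles = char_shingles(text, k)
    return [min(_hash(seed, s) for s in shingles) for seed in range(num_perm)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedup(sentences: list[str], threshold: float = 0.75) -> list[str]:
    """Keep a sentence only if it is below the threshold against all kept ones."""
    kept, kept_sigs = [], []
    for sentence in sentences:
        sig = minhash_signature(sentence)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(sentence)
            kept_sigs.append(sig)
    return kept
```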

Literary Book Collection

Several Chinese and English literary works collected from Anna's Archive and Z-Library. Raw epub/mobi files have been acquired; text extraction and character-dialogue annotation are still pending.

Copyright status varies depending on the author and your jurisdiction.

AO3 Fanfiction Corpus

Archive of Our Own Chinese-language fanfiction crawled by kudos ranking across 34 fandoms (Genshin Impact, Harry Potter, Marvel, Jujutsu Kaisen, etc.), ~1,000 works per fandom. An additional HuggingFace AO3 random subset (64,000 works with rich metadata) provides a baseline corpus for general style/metadata experiments.

SynthSTEL Multilingual Translation

The SynthSTEL benchmark (40 style features × 100 pairs, 3,600 train + 400 test) translated from English into Chinese, Japanese, French, and Russian via LiteLLM. Style features were adapted per language: features meaningless in the target language (e.g., capitalization variants for CJK) were skipped; language-specific features (Chinese homophone-digit substitution, Japanese contracted forms, Russian colloquial reductions) received localized rewrite instructions.

Method Exploration

Phase 1: Static Embedding Extraction (Dec 2025 – Jan 2026)

Attempted to extract style vectors from frozen embedding models (Qwen3-embedding-0.6b/8b, embeddinggemma).

Residual method: style = embed(original) − embed(paraphrase)

| Model | Mode | Avg Consistency | Silhouette |
| --- | --- | --- | --- |
| qwen3-embedding:0.6b | raw residual | 0.0183 | −0.0085 |
| qwen3-embedding:0.6b | PCA projected | 0.0043 | −0.0055 |
| embeddinggemma:latest | raw residual | 0.0089 | −0.0085 |
| embeddinggemma:latest | PCA projected | 0.0024 | −0.0067 |
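The residual computation behind these numbers can be sketched as below. The embedding vectors are toy values, and reading "Avg Consistency" as the mean pairwise cosine similarity between one speaker's style vectors is an assumption; the project's evaluation code in shared/ may define it differently.

```python
import math
from itertools import combinations


def residual(original_vec, paraphrase_vec):
    """style = embed(original) - embed(paraphrase), element-wise."""
    return [o - p for o, p in zip(original_vec, paraphrase_vec)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def intra_speaker_consistency(style_vecs):
    """Mean pairwise cosine similarity (assumed definition of consistency)."""
    pairs = list(combinations(style_vecs, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy embeddings for two utterances by one speaker and their paraphrases.
originals = [[1.0, 2.0, 0.5], [0.0, 1.0, 1.5]]
paraphrases = [[0.5, 2.0, 0.0], [-0.5, 1.0, 1.0]]
style_vecs = [residual(o, p) for o, p in zip(originals, paraphrases)]
consistency = intra_speaker_consistency(style_vecs)
```

In this toy case both residuals point the same way, so consistency is 1. The table shows that real residuals from the frozen embedding models are close to orthogonal instead.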

Instruction-aware prompting: Tested 7 prompt variants by prepending instructions to input text (the Ollama/OpenRouter prompt field is silently ignored; instructions must be concatenated into the input string).

| Prompt | Model | Avg Consistency | Silhouette |
| --- | --- | --- | --- |
| baseline | 0.6b | 0.0171 | −0.0087 |
| style_v2 ("analyze linguistic style") | 0.6b | 0.0341 | −0.0147 |
| style_v2 | 8b | 0.7675 | −0.0372 |
| style_v4 ("identify the speaker") | 8b | 0.6946 | −0.0427 |

8b + style_v2 achieved 0.77 intra-speaker consistency, but silhouette remained negative: different speakers' style directions overlap heavily. Applying the residual method destroyed this signal (0.77 → 0.08).
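High consistency and negative silhouette can coexist, which the toy example below tries to make concrete. It uses 2-D unit vectors in place of real embeddings, and it assumes cosine distance for the silhouette and mean pairwise cosine for consistency; neither choice is confirmed by the project's evaluation code.

```python
import math
from itertools import combinations


def unit(angle_deg: float) -> tuple[float, float]:
    """2-D unit vector at the given angle, standing in for a style vector."""
    r = math.radians(angle_deg)
    return (math.cos(r), math.sin(r))


def cosine(u, v) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def mean_consistency(points, labels) -> float:
    """Mean pairwise cosine similarity between vectors of the same speaker."""
    sims = [cosine(points[i], points[j])
            for i, j in combinations(range(len(points)), 2)
            if labels[i] == labels[j]]
    return sum(sims) / len(sims)


def silhouette(points, labels) -> float:
    """Mean silhouette coefficient using cosine distance (1 - cosine)."""
    scores = []
    for i, p in enumerate(points):
        dists = {}  # speaker -> distances from p to that speaker's other vectors
        for j, q in enumerate(points):
            if j != i:
                dists.setdefault(labels[j], []).append(1.0 - cosine(p, q))
        own = dists.pop(labels[i], [])
        if not own or not dists:
            scores.append(0.0)  # singleton cluster or single speaker
            continue
        a = sum(own) / len(own)                            # cohesion
        b = min(sum(d) / len(d) for d in dists.values())   # nearest other speaker
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


labels = ["A", "A", "B", "B"]
interleaved = [unit(0), unit(20), unit(10), unit(30)]  # speakers overlap
separated = [unit(0), unit(10), unit(80), unit(90)]    # speakers apart
```

In the interleaved case each speaker's two vectors are only 20 degrees apart, so consistency is high, yet a vector's nearest neighbours belong to the other speaker and the silhouette is negative. Only the separated case yields a positive silhouette.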

LDA supervised dimensionality reduction:

| Scale | Dim | Train Sil | Test Sil |
| --- | --- | --- | --- |
| 6→5 speakers | n/a | 0.35 | −0.03 |
| 80→20 speakers | 64 | 0.019 | −0.015 |
| 80→20 speakers | 32 | 0.003 | −0.032 |
| 80→20 speakers | 16 | −0.075 | −0.068 |

LDA learns speaker-specific discriminants, not a generalizable style space.

MLP + ArcFace (141→36 speakers, stratified sampling):

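| Arch | Dim | Train Sil | Val Sil | Test Sil | Train Cons |
| --- | --- | --- | --- | --- | --- |
| linear | 32 | −0.198 | −0.277 | −0.146 | 0.981 |
| linear | 128 | −0.075 | −0.160 | −0.071 | 0.984 |
| 1h-512 | 128 | −0.110 | −0.246 | −0.131 | 0.995 |
| 2h-1k | 128 | −0.314 | −0.593 | −0.685 | 0.976 |
| 1h-2k | 64 | −0.225 | −0.332 | −0.190 | 0.994 |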

Universal mode collapse: consistency ~0.98–0.99 across train/val/test, silhouette uniformly negative. ArcFace's margin is too aggressive at this data scale: the 141-class loss has a floor of ~0, but the minimum actually reached was ~8.

Phase 1 conclusion: Frozen embedding models have been thoroughly stripped of style information during contrastive pretraining. No post-hoc projection, dimensionality reduction, or supervised classification can recover a separable style signal. The model must be fine-tuned to actively learn style separability.

Phase 2: Causal LM Hidden Layer Probes (Jan – Feb 2026)

Tested whether Qwen3-0.6B CausalLM intermediate hidden states (not contrastive-trained) retain extractable style information.

Setup: 29 layers (embedding + 28 transformer) × 4 pooling strategies (last token, attention-weighted, reverse-attention complement, reverse-attention inverse) × 5 classifiers (LDA + 4 MLP+ArcFace variants) = 580 independent evaluations. 8 training speakers, 4 held-out speakers, 1,200 sentences.
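The evaluation grid can be enumerated as in the sketch below. The pooling names follow the list above. Except for `lda` and `mlp_linear_d64` (the latter appears in the findings), the classifier labels are placeholders, since the other MLP+ArcFace variants are not named here.

```python
from itertools import product

# Index 0 is the embedding output, 1..28 are the transformer layers.
LAYERS = range(29)

POOLINGS = [
    "last_token",
    "attention_weighted",
    "reverse_attention_complement",
    "reverse_attention_inverse",
]

# "mlp_variant_*" are placeholder names, not the project's identifiers.
CLASSIFIERS = ["lda", "mlp_linear_d64", "mlp_variant_2", "mlp_variant_3", "mlp_variant_4"]

grid = list(product(LAYERS, POOLINGS, CLASSIFIERS))
print(len(grid))  # 29 * 4 * 5 evaluations
```

Each `(layer, pooling, classifier)` triple is one independent run, which is why per-layer differences can be compared directly across pooling strategies.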

Key findings:

All 580 val_sil scores are negative. Best result: reverse-attention inverse pooling at layer 1 + MLP-linear-d64, val_sil = −0.0705.
Extreme overfitting: MLP+ArcFace reaches train_sil +0.87 while val_sil crashes to −0.10. The model memorizes sentence identity, not speaker style.
Attention pooling collapse: Forward attention-weighted pooling locks consistency at 1.0000 from layer 3 onward — all vectors collapse to a single direction.
Layer differences are noise: Layers 2–28 val_sil fluctuate within a 0.011 band with no systematic trend (shallow vs. deep vs. middle).

Theoretical interpretation: Style information exists in causal LMs (they perform stylistic continuation and style transfer), but it is encoded in generation dynamics — each token step draws on style cues through the full 28-layer computation graph to modulate the conditional distribution. Style is a function, not a point. Static vector extraction from individual layers is a paradigm mismatch.

Phase 2 conclusion: Single-layer hidden state snapshots cannot capture style. Style is encoded in the computation, not in the representation.

Phase 3: LoRA Fine-tuning (Feb 2026 – present, active)

Forces style separability into the representation space by fine-tuning Qwen3-0.6B with LoRA adapters and ArcFace supervision.
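For readers unfamiliar with LoRA, the sketch below shows the low-rank update on a single linear layer in pure Python, using the rank=8 and alpha=16 values this README reports. The 16-dimensional toy layer and the helper names are assumptions; the actual model in lora/model.py applies adapters to the attention projections of Qwen3-0.6B.

```python
import random


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def lora_forward(W, A, B, x, rank=8, alpha=16):
    """y = W x + (alpha / rank) * B (A x); W stays frozen, A and B are trained."""
    base = matvec(W, x)               # frozen path
    update = matvec(B, matvec(A, x))  # low-rank path: (d_out x r)(r x d_in) x
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_count(d_in, d_out, rank=8):
    """Trainable parameters added to one adapted weight matrix."""
    return rank * (d_in + d_out)


rng = random.Random(0)
d_in = d_out = 16  # toy sizes, not the real hidden size
rank = 8
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]  # zero init: adapter starts as a no-op
x = [rng.gauss(0, 1) for _ in range(d_in)]

y_init = lora_forward(W, A, B, x)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen model exactly; training then moves only the `rank * (d_in + d_out)` adapter parameters per matrix.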

Architecture (lora/model.py):

Frozen Qwen3-0.6B encoder + LoRA adapters (rank=8, alpha=16, targeting q_proj/k_proj/v_proj/o_proj)
Optional LayerFusion: learned weighted combination of selected intermediate hidden states
Optional AttentionPooling: learned query vector replaces mean pooling
style_head (linear projection, no bias) + L2 normalization → 128-dim style vector
ArcFaceHead (s=30, m=0.3) for speaker classification
PKSampler: P speakers × K utterances per batch
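The ArcFace step at the end of this stack can be sketched as follows, with s=30 and m=0.3 as listed above. This is a simplified pure-Python illustration, not the project's `ArcFaceHead`: the function names are made up, and it omits the handling real implementations add for angles where θ + m exceeds π.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def arcface_logits(embedding, class_weights, target, s=30.0, m=0.3):
    """Scaled cosine logits, with an additive angular margin on the target class."""
    e = l2_normalize(embedding)
    logits = []
    for idx, w in enumerate(class_weights):
        cos_theta = sum(a * b for a, b in zip(e, l2_normalize(w)))
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard acos against rounding
        if idx == target:
            cos_theta = math.cos(math.acos(cos_theta) + m)  # cos(theta + m)
        logits.append(s * cos_theta)
    return logits


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for one example."""
    mx = max(logits)
    log_sum = mx + math.log(sum(math.exp(z - mx) for z in logits))
    return log_sum - logits[target]


# Toy case: an embedding exactly halfway between two class directions.
weights = [[1.0, 0.0], [0.0, 1.0]]
embedding = [1.0, 1.0]
loss_plain = cross_entropy(arcface_logits(embedding, weights, target=0, m=0.0), 0)
loss_margin = cross_entropy(arcface_logits(embedding, weights, target=0, m=0.3), 0)
```

Without the margin the two classes tie and the loss is ln 2. With the margin the target logit is pushed down, so the embedding has to move well inside its own class region before the loss drops, which is how the margin forces tighter speaker clusters.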

5-epoch results (batch=32, rank=8, alpha=16):

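| Epoch | Train Loss | Train Sil | Val Sil | Train Acc | Val Acc |
| --- | --- | --- | --- | --- | --- |
| 1 | 14.79 | −0.266 | −0.225 | 0.004 | 0.004 |
| 2 | 12.61 | −0.192 | −0.185 | 0.082 | 0.061 |
| 3 | 10.34 | −0.103 | −0.137 | 0.177 | 0.118 |
| 4 | 8.61 | −0.026 | −0.088 | 0.272 | 0.154 |
| 5 | 7.11 | +0.037 | −0.053 | 0.370 | 0.188 |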

Training silhouette turned positive for the first time across all experiments (previous phases: near-zero or train/val gap of 0.9+). Loss is decreasing steadily, accuracy climbing. Validation silhouette remains negative (−0.053), indicating generalization to unseen speakers is the current bottleneck. Post-training analysis of 354 speakers shows median separability of 0.027 with only 6 speakers above 0.1 — training is far from converged at 5 epochs.

Phase 3 status: The problem has shifted from "completely inseparable" to "learnable but not yet generalizing." Next steps: more epochs, higher rank, hyperparameter grid search (alpha, margin, lr), synthetic data augmentation, and multilingual joint training.

StyleDistance Replication (May 2026)

Reproduced StyleDistance (Patel et al., 2024) to establish a quantitative baseline:

Model: roberta-base + LoRA rank=8 (1.34M / 126M trainable params, 1.06%)
Data: SynthSTEL (40 features × 100 pairs each)
Training: batch=512, triplet margin=0.1, lr=1e-4, cosine schedule, remote AutoDL GPU
Result: Training loss 0.017 → 0.0004 over 10 epochs; validation loss converged at epoch 2 (0.057), consistent with the paper's early-stopping design (patience=1). Pending STEL/STEL-or-Content evaluation on the remote instance.
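The triplet objective with margin=0.1 can be written out as a minimal sketch. Using cosine distance here is an assumption made for illustration; the distance function in paper_replication/ and in the original paper should be checked before relying on it.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def triplet_loss(anchor, positive, negative, margin=0.1):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, cosine_distance(anchor, positive)
               - cosine_distance(anchor, negative) + margin)


anchor = [1.0, 0.0]
same_style = [1.0, 0.0]   # positive: same style as the anchor
other_style = [0.0, 1.0]  # negative: different style

satisfied = triplet_loss(anchor, same_style, other_style)
violated = triplet_loss(anchor, other_style, same_style)
```

A triplet contributes zero loss once the negative is at least `margin` farther from the anchor than the positive, so a small margin such as 0.1 lets the training loss fall close to zero quickly.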

Engineering Infrastructure

Shared training infrastructure in shared/: unified Config system (Device/Model/Data/Train/Eval), standardized DataLoader (raw/cached/full/core modes), PKSampler for balanced mini-batches, ArcFaceHead/LDA/MLP classifier interfaces, and silhouette + consistency evaluation functions. LoRA training pipeline supports bf16 mixed precision, torch.compile acceleration, fused AdamW, gradient clipping, checkpoint persistence, and TensorBoard logging. tools/simlar/ provides a Rust/PyO3 library for high-performance batched character n-gram Jaccard and normalized Levenshtein similarity. Cached embeddings live under artifacts/cache/.
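The P × K batching idea behind PKSampler can be sketched as below. This is an illustrative generator, not the class in shared/: the `pk_batches` name, dropping speakers with fewer than K utterances, and discarding a final incomplete group of speakers are all assumptions.

```python
import random
from collections import defaultdict


def pk_batches(labels, p, k, seed=0):
    """Yield lists of dataset indices, each covering P speakers x K utterances."""
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    for idx, speaker in enumerate(labels):
        by_speaker[speaker].append(idx)

    # Assumption: speakers with fewer than K utterances are skipped.
    eligible = [s for s, idxs in by_speaker.items() if len(idxs) >= k]
    rng.shuffle(eligible)

    # Walk the shuffled speakers in groups of P; a trailing partial group is dropped.
    for start in range(0, len(eligible) - p + 1, p):
        batch = []
        for speaker in eligible[start:start + p]:
            batch.extend(rng.sample(by_speaker[speaker], k))
        yield batch


labels = ["a"] * 5 + ["b"] * 5 + ["c"] * 5 + ["d"] * 5 + ["e"]
batches = list(pk_batches(labels, p=2, k=3))
```

Every batch then contains several utterances per speaker, so margin-based and contrastive objectives always see same-speaker and different-speaker pairs in the same step.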

License

GPL-3.0. See LICENSE.
