Bitcrusher is a lightweight REST API that serves FastText word embeddings with full out-of-vocabulary (OOV) support via subword n-gram hashing — the same way FastText does it natively.
The FastText Common Crawl model (600B tokens) was quantized using IVF+PQ (inverted file index + product quantization) and stored in SQLite. At query time, vectors are reconstructed from PQ codes and IVF centroids in memory — no multi-gigabyte model file, no warmup delay, and a fraction of the RAM footprint of the raw float32 model.
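As a rough illustration of the reconstruction step, the sketch below rebuilds an approximate vector from an IVF centroid and a list of PQ codes. It assumes the common IVF+PQ arrangement in which the PQ codes encode the residual between a vector and its IVF centroid. The `reconstruct` function, the `entry` shape, and the toy codebooks are hypothetical and are not taken from Bitcrusher's source.

```javascript
// Hypothetical data layout (an assumption, not Bitcrusher's actual schema):
//   coarseCentroids[i] -> full-length IVF centroid
//   codebooks[m][k]    -> k-th codeword of sub-quantizer m (length dim / M)
//   entry              -> { centroidId, codes: [one code per sub-quantizer] }
function reconstruct(entry, coarseCentroids, codebooks) {
  const centroid = coarseCentroids[entry.centroidId];
  const out = centroid.slice(); // start from the IVF centroid
  const subDim = centroid.length / codebooks.length;
  entry.codes.forEach((code, m) => {
    const codeword = codebooks[m][code];
    // Add the quantized residual for this slice of dimensions.
    for (let j = 0; j < subDim; j++) out[m * subDim + j] += codeword[j];
  });
  return out;
}

// Toy example: 4 dimensions, 2 sub-quantizers covering 2 dimensions each.
const coarseCentroids = [[1, 1, 1, 1], [-1, -1, -1, -1]];
const codebooks = [
  [[0, 0], [0.5, -0.5]],
  [[0.25, 0.25], [-0.25, 0]],
];
const approx = reconstruct({ centroidId: 1, codes: [1, 0] }, coarseCentroids, codebooks);
console.log(approx); // [-0.5, -1.5, -0.75, -0.75]
```

The real model uses 300-dimensional vectors; the 4-dimensional toy data only keeps the arithmetic easy to follow.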
OOV vectors are generated using the same FNV-1a subword n-gram algorithm as FastText, so results are consistent with the original implementation.
- ⚡ Instant Startup: Quantized data loads from SQLite once, then everything is served from memory
- 🗜️ IVF+PQ Quantization: Dramatically smaller memory footprint than raw float32 vectors
- 🎯 FastText Common Crawl: 600B-token model, 300-dimensional vectors
- 🔧 Full OOV Support: Subword n-gram hashing matches FastText's own OOV behavior
- 📦 Batch Endpoint: Up to 100 words per request
GET /health
Returns the service status. The service reports unhealthy until the quantized data has finished loading on startup.
GET /v1/word/{word}
Returns the 300-dimensional vector for a single word. OOV words receive a generated vector via subword hashing.
Constraints: Max 128 characters. The word is lowercased and trailing punctuation is stripped before lookup.
Response:
[0.1234, -0.5678, 0.9012, ...]
Error responses:
- 400 Bad Request: empty, whitespace-only, or too-long input
- 429 Too Many Requests: rate limit exceeded
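The input rules for GET /v1/word/{word} can be sketched as a small client-side helper, which is handy for predicting which key the server will look up. This is a minimal sketch under assumptions: the helper names are hypothetical, "punctuation" is taken to mean the Unicode punctuation category, and the 128-character limit is checked on the trimmed input. The service may differ on each of these points.

```javascript
// Hypothetical helpers; the punctuation set and the point at which the
// length limit is applied are assumptions, not documented behavior.
const MAX_WORD_LENGTH = 128;

function normalizeWord(raw) {
  // Lowercase, then strip any run of trailing Unicode punctuation.
  return raw.trim().toLowerCase().replace(/\p{P}+$/u, "");
}

function prepareWord(raw) {
  if (typeof raw !== "string" || raw.trim() === "") {
    return { ok: false, status: 400, reason: "empty or whitespace-only input" };
  }
  if ([...raw.trim()].length > MAX_WORD_LENGTH) {
    return { ok: false, status: 400, reason: "input longer than 128 characters" };
  }
  return { ok: true, word: normalizeWord(raw) };
}

console.log(prepareWord("Hello!!!").word);     // "hello"
console.log(prepareWord("   ").ok);            // false
console.log(prepareWord("a".repeat(129)).ok);  // false
```

When building the request URL, passing the word through `encodeURIComponent` keeps characters such as `/` or `?` from being read as part of the path or query.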
POST /v1/words
Content-Type: application/json
Returns vectors for multiple words in a single request. Duplicate words are deduplicated before lookup.
Constraints: Max 100 words per request, max 128 characters per word, max 32KB request body.
Request:
["hello", "world", "fasttext"]
Response:
[
{ "word": "hello", "vector": [0.1234, -0.5678, 0.9012, ...] },
{ "word": "world", "vector": [0.2345, -0.6789, 0.1123, ...] },
{ "word": "fasttext", "vector": [0.3456, -0.7890, 0.2234, ...] }
]
Error responses:
- 400 Bad Request: missing body, empty array, over the word limit, or an invalid word
- 429 Too Many Requests: rate limit exceeded
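For word lists longer than the batch limits allow, a client can split the work into several POST /v1/words requests. The sketch below deduplicates the input and packs it into batches that respect both the 100-word and the 32KB limits. `planBatches` is a hypothetical helper, and measuring the body as the UTF-8 length of the JSON array is an assumption about how the server counts bytes.

```javascript
// Hypothetical client-side helper: split words into request-sized batches.
function planBatches(words, maxWords = 100, maxBytes = 32 * 1024) {
  const unique = [...new Set(words)]; // the server deduplicates as well
  const encoder = new TextEncoder();
  const batches = [];
  let current = [];
  for (const word of unique) {
    const candidate = [...current, word];
    const size = encoder.encode(JSON.stringify(candidate)).length;
    if (current.length > 0 && (candidate.length > maxWords || size > maxBytes)) {
      batches.push(current); // current batch is full, start a new one
      current = [word];
    } else {
      current = candidate;
    }
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

const words = Array.from({ length: 250 }, (_, i) => `word${i}`).concat(["word0", "word1"]);
const batches = planBatches(words);
console.log(batches.map((b) => b.length)); // [100, 100, 50]
```

Each batch can then be sent as the JSON body of one request. Because duplicates are removed, it is safer to match results to inputs by the `word` field than by array position.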
All limits are per IP, sliding window (1 minute):
| Endpoint | Limit |
|---|---|
| Global (all routes) | 120 req/min |
| GET /v1/word/{word} | 120 req/min |
| POST /v1/words | 30 req/min |
Rejected requests receive 429 Too Many Requests with a Retry-After: 60 header.
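To make the "per IP, sliding window" wording concrete, here is a minimal sliding-window-log limiter. It is only a sketch of the general technique: the class name, the in-memory `Map`, and the example IP address are assumptions, and Bitcrusher's actual limiter may be implemented differently.

```javascript
// Hypothetical sliding-window-log limiter keyed by client IP.
class SlidingWindowLimiter {
  constructor(limit, windowMs) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.hits = new Map(); // key -> timestamps of accepted requests
  }

  allow(key, now = Date.now()) {
    // Keep only the requests that still fall inside the window.
    const recent = (this.hits.get(key) || []).filter((t) => now - t < this.windowMs);
    const allowed = recent.length < this.limit;
    if (allowed) recent.push(now);
    this.hits.set(key, recent);
    return allowed; // false corresponds to a 429 response
  }
}

// POST /v1/words: 30 requests per minute per IP.
const batchLimiter = new SlidingWindowLimiter(30, 60000);
let accepted = 0;
for (let i = 0; i < 31; i++) {
  if (batchLimiter.allow("203.0.113.7", 1000 + i)) accepted++;
}
console.log(accepted); // 30
console.log(batchLimiter.allow("203.0.113.7", 61000)); // true, the oldest hit has expired
```

On the client side, the simplest way to cooperate with this behavior is to wait for the number of seconds given in the Retry-After header before retrying a rejected request.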
# Single word
curl https://bitcrusher.neobit.gg/v1/word/hello
# Batch request
curl -X POST https://bitcrusher.neobit.gg/v1/words \
  -H "Content-Type: application/json" \
  -d '["hello", "world", "fasttext"]'

// Single word
const res = await fetch('https://bitcrusher.neobit.gg/v1/word/hello');
const vector = await res.json(); // float[]
// Batch request
const batchRes = await fetch('https://bitcrusher.neobit.gg/v1/words', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(['hello', 'world', 'fasttext'])
});
const results = await batchRes.json(); // { word: string, vector: float[] }[]

using System.Net.Http.Json; // GetFromJsonAsync, PostAsJsonAsync, ReadFromJsonAsync
using var client = new HttpClient();
// Single word
var vector = await client.GetFromJsonAsync<float[]>(
"https://bitcrusher.neobit.gg/v1/word/hello");
// Batch request
var response = await client.PostAsJsonAsync(
"https://bitcrusher.neobit.gg/v1/words",
new[] { "hello", "world", "fasttext" });
var results = await response.Content.ReadFromJsonAsync<WordVector[]>();

// In a top-level program, type declarations must come after all statements.
record WordVector(string Word, float[] Vector);

When a word isn't in the vocabulary, Bitcrusher generates a vector using the FastText subword approach:
- Boundary markers: Wraps the word as <word> (same as FastText)
- N-gram extraction: Generates all character n-grams of length 3–6
- FNV-1a hashing: Hashes each n-gram using FastText's own algorithm, bucketed into 2M slots
- Subword lookup: Retrieves the quantized vector for each matching bucket
- Averaging: Averages all found subword vectors into the final OOV vector
Words that are entirely punctuation or reduce to empty after normalization return a zero vector.
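The OOV steps above can be sketched in a few lines. The hash follows FastText's published FNV-1a variant (32-bit, with each byte treated as a signed value before the XOR), and the bucket count and n-gram lengths come from the description above. The `lookupBucket` callback is hypothetical and stands in for however Bitcrusher fetches a quantized subword vector, and iterating over code points rather than raw UTF-8 bytes is a simplifying assumption.

```javascript
const BUCKETS = 2000000; // "2M slots"
const MIN_N = 3;
const MAX_N = 6;

// FNV-1a (32-bit) over UTF-8 bytes. FastText casts each byte to a signed
// 8-bit value before XOR-ing, which only matters for non-ASCII input.
function fastTextHash(text) {
  let h = 2166136261;
  for (const byte of new TextEncoder().encode(text)) {
    const signed = (byte << 24) >> 24;
    h = (h ^ signed) >>> 0;
    h = Math.imul(h, 16777619) >>> 0;
  }
  return h;
}

// All character n-grams of length 3 to 6 of the word wrapped in < and >.
function charNgrams(word) {
  const chars = [...`<${word}>`];
  const grams = [];
  for (let start = 0; start < chars.length; start++) {
    for (let n = MIN_N; n <= MAX_N && start + n <= chars.length; n++) {
      grams.push(chars.slice(start, start + n).join(""));
    }
  }
  return grams;
}

// lookupBucket is a hypothetical callback: bucket index -> vector or undefined.
function oovVector(word, lookupBucket, dim = 300) {
  const sum = new Array(dim).fill(0);
  let found = 0;
  for (const gram of charNgrams(word)) {
    const vec = lookupBucket(fastTextHash(gram) % BUCKETS);
    if (!vec) continue; // skip buckets with no stored vector
    for (let i = 0; i < dim; i++) sum[i] += vec[i];
    found++;
  }
  // With no subword vectors found, the zero vector is returned.
  return found === 0 ? sum : sum.map((x) => x / found);
}

console.log(charNgrams("hello").length); // 14
console.log(charNgrams("hello")[0]);     // "<he"
```

An empty word produces only the two boundary markers, which is too short for any 3-gram, so the sketch returns a zero vector in that case as well.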
Vectors from the FastText Common Crawl model trained by Facebook Research (600B tokens).
Personal project, open for suggestions and improvements. Raise an issue or reach out at hey@edyburr.com.